Retrieving Events For Notifications Can Get Quite Slow
Introduction
When it comes to notifications, speed and reliability are crucial. However, in our experience with Element X Android, we've encountered several issues that can cause notifications to be delayed or even missed. In this article, we'll explore these issues and discuss potential solutions to improve the notification experience.
Issue 1: Only One Event Can Be Retrieved at a Time
The NotificationClient was designed to retrieve events one at a time. While this seems straightforward, it causes problems when time-sensitive events are waiting to be processed. For instance, when a call comes in, we may receive two events almost simultaneously: an m.call.encryption-keys event and an m.call.notify event. If the encryption keys event arrives first, we try to fetch it first, and the time-sensitive m.call.notify event has to wait its turn, which delays the call notification.
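To make the cost of one-at-a-time retrieval concrete, here is a minimal sketch of the arithmetic. The fetch durations are hypothetical values chosen for illustration, not measurements, and the helper name is not part of any SDK.

```python
from itertools import accumulate


def sequential_completion_times(fetch_durations_s):
    """Return when each fetch finishes if fetches run strictly one after another."""
    return list(accumulate(fetch_durations_s))


# Hypothetical durations (assumption): the keys event is slow, the notify event is fast.
queue = [("m.call.encryption-keys", 3.0), ("m.call.notify", 0.5)]
finish_times = sequential_completion_times([duration for _, duration in queue])

for (event_type, _), finished_at in zip(queue, finish_times):
    print(f"{event_type}: ready after {finished_at:.1f}s")
```

With these assumed numbers, the m.call.notify event is ready only after 3.5s, even though fetching it takes 0.5s on its own.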
Issue 2: Inefficient Flow for Retrieving Events
The current flow first tries to fetch the event through a /sync request and uses /context as a fallback. This approach has several issues. The /sync request has a 1s timeout to connect and a 1s timeout to retrieve the data, with 3 retries using exponential backoff. When the /sync request times out and is retried several times, the delays add up: in one case, we observed a delay of close to 20s.
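A rough upper bound on the time spent in /sync before the fallback starts can be worked out from the numbers above. The sketch below assumes that "3 retries" means three attempts after the initial one, and that the backoff starts at 1s and doubles each time; neither the base delay nor the factor is stated in the text, so treat both as placeholders.

```python
def worst_case_sync_delay(
    connect_timeout_s=1.0,
    read_timeout_s=1.0,
    retries=3,
    base_backoff_s=1.0,  # assumption: not given in the text
    backoff_factor=2.0,  # assumption: not given in the text
):
    """Upper bound on time spent if every /sync attempt hits both timeouts."""
    attempts = 1 + retries
    per_attempt_s = connect_timeout_s + read_timeout_s
    # One backoff pause before each retry: base, base*factor, base*factor^2, ...
    backoff_s = sum(base_backoff_s * backoff_factor**i for i in range(retries))
    return attempts * per_attempt_s + backoff_s


print(worst_case_sync_delay())                    # 4 * 2s + (1 + 2 + 4)s
print(worst_case_sync_delay(base_backoff_s=0.0))  # timeouts only, no pauses
```

Under these assumptions the timeouts alone account for 8s and the backoff pauses for another 7s, all before /context is even attempted.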
Issue 3: Delayed Event Retrieval
In some cases, we've observed that an event is successfully fetched, but the next event (received almost simultaneously) is fetched only after the next sync response arrives. This can result in significant delays, with one instance taking around 10s to fetch the next event.
Additional Delays: Android Device Locking and Firebase Cloud Messaging
When an Android device is locked, it enters 'deep sleep' mode and informs Firebase Cloud Messaging about this. This means that push notifications won't be real-time unless they have a TTL=0. At the moment, we believe that this value is set to 15s by default in Sygnal for all events. This can result in several notifications being batched in 'buckets' and sent at the same time after 5, 10, or 15s have elapsed.
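For reference, this is roughly what a per-message TTL looks like in a Firebase Cloud Messaging HTTP v1 request body. It is a sketch of the payload shape only, not Sygnal's actual code; the function name and the data keys in the usage lines are illustrative assumptions.

```python
import json


def build_fcm_message(device_token, data, ttl_seconds):
    """Build an FCM HTTP v1 request body with an explicit Android TTL.

    Sketch only: this is not how Sygnal constructs its requests.
    """
    return {
        "message": {
            "token": device_token,
            "data": data,
            "android": {
                "priority": "HIGH",
                # FCM v1 expects a duration string in seconds, e.g. "0s" or "15s".
                "ttl": f"{ttl_seconds}s",
            },
        }
    }


# Hypothetical token and data keys, for illustration only.
body = build_fcm_message("device-token", {"event_id": "$abc", "room_id": "!room"}, 0)
print(json.dumps(body, indent=2))
```

Whether TTL should be 0 for every event or only for time-sensitive ones such as calls is a separate decision; the sketch only shows where the value lives.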
Potential Solutions
To address these issues, we propose the following solutions:
- Allow Notification Events to be Fetched in Batches: This would enable us to retrieve multiple events at once, reducing the delay between events.
- Review Timeout/Retry Strategy for /sync Attempts: We should review the timeout and retry strategy for the /sync attempts to ensure that it's efficient and doesn't lead to significant delays.
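The first proposal can be sketched as follows. The text does not say how batching would be implemented, so this example simply runs several fetches concurrently; fetch_event is a hypothetical stand-in that sleeps to mimic network latency, and the delays are made-up values.

```python
import asyncio


async def fetch_event(event_id, delay_s):
    """Hypothetical stand-in for a single event fetch."""
    await asyncio.sleep(delay_s)
    return event_id


async def fetch_batch(requests):
    """Fetch all requested events concurrently; results keep the request order."""
    return await asyncio.gather(
        *(fetch_event(event_id, delay_s) for event_id, delay_s in requests)
    )


def run_demo():
    # Assumed delays: the slow keys event no longer blocks the fast notify event.
    requests = [("$keys", 0.03), ("$notify", 0.01)]
    return asyncio.run(fetch_batch(requests))


print(run_demo())
```

In this model the whole batch takes about as long as its slowest fetch rather than the sum of all fetches, which is the property the proposal is after.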
Q&A
Q: What is the current flow for retrieving events?
A: The current flow first tries to fetch the event through a /sync request and uses /context as a fallback. This approach has several issues, including a 1s timeout to connect and a 1s timeout to retrieve the data, with 3 retries using exponential backoff.
Q: Why is the /sync request taking so long?
A: The /sync request has a 1s timeout to connect and a 1s timeout to retrieve the data, with 3 retries using exponential backoff. This can lead to significant delays, especially when the /sync request times out and is retried several times.
Q: What is the impact of Android device locking on notifications?
A: When an Android device is locked, it enters 'deep sleep' mode and informs Firebase Cloud Messaging about this. This means that push notifications won't be real-time unless they have a TTL=0. At the moment, we believe that this value is set to 15s by default in Sygnal for all events.
Q: Why are notifications being batched in 'buckets'?
A: The batching appears to be tied to the TTL (time to live) value set in Sygnal. The TTL tells Firebase Cloud Messaging how long it may hold a notification that cannot be delivered immediately; only notifications with TTL=0 are delivered in real time to a locked device. We believe this value currently defaults to 15s, so notifications can be held and then delivered together after 5, 10, or 15s have elapsed.
Q: What is the proposed solution to address these issues?
A: The proposed solution involves allowing notification events to be fetched in batches and reviewing the timeout/retry strategy for the /sync attempts. This would enable us to retrieve multiple events at once, reducing the delay between events, and ensure that the /sync request is efficient and doesn't lead to significant delays.
Q: How will these changes impact the notification experience?
A: Fetching events in batches should reduce the delay between events that arrive together, and a tighter timeout/retry strategy should shorten the time lost when /sync fails. Together, these changes should make notifications, especially time-sensitive ones such as calls, arrive more promptly.
Q: What is the next step in implementing these changes?
A: The next step is to review the current implementation, make the changes needed to fetch notification events in batches, and revise the timeout/retry strategy for the /sync attempts. This will involve working with the development team to implement the changes and testing them to ensure that they meet the required standards.
Q: When can we expect to see these changes implemented?
A: The timeline for implementing these changes is still to be determined. However, we are working to prioritize these changes and implement them as soon as possible to improve the notification experience for users.