Sync stream is either missing echo of events or is sending them earlier than anticipated #7206
This rageshake is my ~22:00 UTC attempt: https://github.com/matrix-org/riot-web-rageshakes/issues/2501 (I think, I might have gotten the timestamp wrong).
this is valid behaviour though.
Yup, and it seems like Riot is okay with this. My attempt revealed no syncs with the event ID contained within, aside from a sync which had the second user's read receipt for the event. The event ID did not appear early or late.
Well, there are a boatload of errors in the logs which suggest that some replication updates aren't being processed correctly. They include things like:
My inclination would be to fix those, then redeploy and see if the problem is magically cured.
For matrix-org/synapse#7206. Intended to be removed (or possibly converted into a proper off-by-default setting) once the issue concludes. There's a bunch of null checking through defaults here because this is a very critical part of the whole stack. We default to writing errors to the console, but do not break the app for failing to log info about the sync. This may have adverse effects on performance for large accounts - use `localStorage.setItem("mx_skip_sync_logging", "true")` to disable the worst parts.
Logging for riot-web added here: matrix-org/matrix-js-sdk#1300. Might help narrow it down, maybe.
For the record, we rolled back matrix.org's synapse deployment on 2nd April while we investigated these problems. As of 11:06 UTC this morning, we have redeployed matrix.org with current develop. We've fixed the obvious exceptions being spewed by the replication connections, so I am optimistic that this problem has gone away; however, we never really got to the bottom of what was going on, so it's entirely possible there are still problems. In other words: feedback welcome as to whether this bug seems to be fixed or not.
... and the bug is still biting, so rolled back again as of 16:34 UTC
Right; I have reproduced this and tracked it down. It happens after the replication connection disconnects. After a disconnection, we perform a catchup, potentially processing multiple events as having the same stream id.
Steps to reproduce:
hopefully fixed by #7286
…7303) First some background: StreamChangeCache is used to keep track of what "entities" have changed since a given stream ID. So for example, we might use it to keep track of when the last to-device message for a given user was received [1], and hence whether we need to pull any to-device messages from the database on a sync [2].

Now, it turns out that StreamChangeCache didn't support more than one thing being changed at a given stream_id (this was part of the problem with #7206). However, it's entirely valid to send to-device messages to more than one user at a time.

As it turns out, this did in fact work, because *some* methods of StreamChangeCache coped ok with having multiple things changing on the same stream ID, and it seems we never actually use the methods which don't work on the stream change caches where we allow multiple changes at the same stream ID. But that feels horribly fragile, hence: let's update StreamChangeCache to properly support this, and add some typing and some more tests while we're at it.

[1]: https://github.com/matrix-org/synapse/blob/release-v1.12.3/synapse/storage/data_stores/main/deviceinbox.py#L301
[2]: https://github.com/matrix-org/synapse/blob/release-v1.12.3/synapse/storage/data_stores/main/deviceinbox.py#L47-L51
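A minimal sketch of the idea (not the actual Synapse implementation; the class and method names here are illustrative): a cache that records, for each entity, the last stream position at which it changed, while allowing several entities to share the same position. The failure mode described above comes from the naive version of this structure, which keeps only one entity per stream id, so a second change at the same id can be silently lost.

```python
# Illustrative sketch only -- not synapse.util.caches.stream_change_cache.
from collections import defaultdict


class SimpleStreamChangeCache:
    def __init__(self, earliest_known_stream_pos):
        # Positions at or before this are too old for the cache to answer about.
        self._earliest = earliest_known_stream_pos
        self._entity_to_pos = {}
        self._pos_to_entities = defaultdict(set)

    def entity_has_changed(self, entity, stream_pos):
        """Record that `entity` changed at `stream_pos`."""
        old = self._entity_to_pos.get(entity)
        if old is not None and old >= stream_pos:
            return
        if old is not None:
            self._pos_to_entities[old].discard(entity)
        self._entity_to_pos[entity] = stream_pos
        # A *set* per position: two entities changing at the same stream id
        # (e.g. a to-device message sent to several users) must both survive.
        self._pos_to_entities[stream_pos].add(entity)

    def has_entity_changed(self, entity, stream_pos):
        """True if `entity` may have changed since `stream_pos`."""
        if stream_pos <= self._earliest:
            return True  # too old to know for sure; assume it changed
        last = self._entity_to_pos.get(entity)
        return last is not None and last > stream_pos

    def get_entities_changed(self, entities, stream_pos):
        """Subset of `entities` that may have changed since `stream_pos`."""
        if stream_pos <= self._earliest:
            return set(entities)  # cannot narrow it down
        return {e for e in entities if self.has_entity_changed(e, stream_pos)}


# Two users receive a to-device message at the same stream id:
cache = SimpleStreamChangeCache(earliest_known_stream_pos=0)
cache.entity_has_changed("@alice:example.org", 5)
cache.entity_has_changed("@bob:example.org", 5)
print(cache.get_entities_changed(["@alice:example.org", "@bob:example.org"], 4))
# -> both users are reported as changed, rather than one being dropped
```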
Other parts of the code (such as the StreamChangeCache) assume that there will not be multiple changes with the same stream id. This code was introduced in matrix-org#7024, and I hope this fixes matrix-org#7206.
On March 31st (roughly) there was an uptick of "message stuck as latest message" reports to the riot clients (riot-web's upstream issue is element-hq/element-web#10032 for this and other cases).
The cause looks to be that Synapse is sometimes not sending events down `/sync` for events we've sent. There are also some reports of events just not being received on the other end, which could be related. Most reporters were on matrix.org, and the reports seemed to stop once matrix.org rolled back.

Reproduction steps appear to be time-dependent, and possibly only useful for matrix.org-sized homeservers:

1. Have a `/sync` request open
2. Send two messages (waiting until each `/send/m.room.message` completes)
3. Let the open `/sync` return, then make another `/sync` request
4. Note that the first event comes down `/sync`, but not the second (sometimes the other way around)
5. The missing event never comes down `/sync`, making it stuck on Riot

Most easily reproduced during high traffic volumes on matrix.org, though one case was also reproduced at ~22:00 UTC yesterday (but never again that night).
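The steps above could be scripted roughly as follows against the client-server API. This is only a sketch: the homeserver URL, access token and room ID are placeholders, and since the bug is timing-dependent the script may well not trigger it.

```python
# Rough reproduction sketch (placeholders must be filled in; not a guaranteed repro).
import time
import uuid
import requests

HS = "https://matrix.example.org"      # placeholder homeserver
TOKEN = "your_access_token_here"       # placeholder access token
ROOM = "!someroom:example.org"         # placeholder room ID
AUTH = {"Authorization": f"Bearer {TOKEN}"}


def sync(since=None, timeout_ms=0):
    params = {"timeout": timeout_ms}
    if since:
        params["since"] = since
    r = requests.get(f"{HS}/_matrix/client/r0/sync", headers=AUTH, params=params)
    r.raise_for_status()
    return r.json()


def send_message(body):
    txn = uuid.uuid4().hex
    r = requests.put(
        f"{HS}/_matrix/client/r0/rooms/{ROOM}/send/m.room.message/{txn}",
        headers=AUTH,
        json={"msgtype": "m.text", "body": body},
    )
    r.raise_for_status()
    return r.json()["event_id"]


# 1. Establish a sync token before sending anything.
since = sync()["next_batch"]

# 2. Send two messages, waiting for each /send/m.room.message to complete.
first = send_message("first")
second = send_message("second")

# 3-4. Sync from the saved token and check which of the two events came down.
time.sleep(1)
resp = sync(since=since, timeout_ms=5000)
timeline = resp.get("rooms", {}).get("join", {}).get(ROOM, {}).get("timeline", {})
seen = {ev["event_id"] for ev in timeline.get("events", [])}
print("first echoed: ", first in seen)
print("second echoed:", second in seen)  # the bug: sometimes one never appears
```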
My theory, at least, is that the request hits a busy synchrotron which advances the stream token past the second event to the third event, invoking amnesia.
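To illustrate that theory with a toy model (this is not Synapse code, and the token semantics are deliberately simplified): if the worker hands back a stream token that has already moved past an event it did not include in the response, an incremental sync from that token can never return the event.

```python
# Toy model of incremental sync tokens -- illustrative only.
events = {
    101: "$first:example.org",
    102: "$second:example.org",   # written slightly late / not yet visible to the worker
    103: "$third:example.org",
}


def incremental_sync(since, visible_positions):
    """Return (events, next_token) for positions > since that the worker can see."""
    delivered = [events[p] for p in sorted(visible_positions) if p > since]
    next_token = max(visible_positions)  # token advances to the newest position seen
    return delivered, next_token


# The busy worker sees positions 101 and 103, but not 102 yet.
batch, token = incremental_sync(since=100, visible_positions={101, 103})
print(batch, token)   # ['$first:example.org', '$third:example.org'] 103

# By the time the client syncs again, 102 exists -- but the token is already 103,
# so the second event is never returned: it looks "stuck" from the client's view.
batch, token = incremental_sync(since=token, visible_positions={101, 102, 103})
print(batch, token)   # [] 103
```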
@erikjohnston's conclusions appear to be that synapse might be sending the event down `/sync` ahead of the `/send/m.room.message` request completing?