cache sync results for a while after the request completes #3880
Comments
FTR, I did observe this behaviour on my personal homeserver a number of times. It was most evident when a client like riot-android decided it needed to resync the world: the logs showed requests coming in and stacking up, holding the lock quite effectively. After about 5-10 minutes the lock would clear and tens of requests would get flagged as processed, which took 10-30 seconds. Then the client would send a plain sync again, the logs would say there was no cached response, and it would try again. To break the loop, the client had to be lucky enough to have a request in flight just before the homeserver spent several seconds sending responses.
To be clear, it takes upwards of an hour for Riot to sync, despite individual sync requests completing on the order of 15 minutes. Usually what happens is that Riot cancels numerous syncs but leaves a few running; eventually, all currently-running requests laboriously complete, while new ones start over from scratch. This repeats until somehow one finally catches.
Right: 15m * ~5 = ~1h, so that fits this bug, and @turt2live's description of the observed behaviour precisely matches my theory. @Ralith: can you cherry-pick 0ca2857 and see if it magically fixes itself?
With that commit cherry-picked, I did not observe spurious retries on an initial sync, though I may have just gotten lucky on the timing. |
We should offer this as a config option, because per @erikjohnston: a) it's a pain for sytest, and b) giving people stale initial syncs makes their client think that all subsequent messages happened "now", so you end up being notified for everything in the last N minutes (both of which could potentially be fixed by caching the initial sync, but then updating it with an incremental sync if necessary). It also sucks up more memory.
For links: #8518 seems related here. We should make sure that is investigated before "fixing" this issue.
#8518 has been fixed, so this issue is unblocked :D
Both @lampholder and @Ralith have reported similar-sounding problems where initial /sync takes ~5x longer to complete from Riot than it should.
@lampholder mentioned that he tried calling initial /sync directly with curl against his HS (bypassing nginx) without a clientside timeout and it took 3 mins, but when called by Riot (which times out requests every 90s or so to avoid network glitches wedging a request) it takes ~15 mins. Meanwhile, @Ralith's underpowered server with bad IO takes 15 mins(!) to generate the sync when called directly, but >1h when called from Riot :/
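For reference, the "call /sync directly with no clientside timeout" test looks roughly like this in Python (a sketch equivalent to the curl invocation; `base_url` and `access_token` are placeholders):

```python
# Rough Python equivalent of "curl the HS directly, bypassing nginx, with no
# clientside timeout". base_url and access_token are placeholders.
import requests


def direct_initial_sync(base_url, access_token):
    resp = requests.get(
        f"{base_url}/_matrix/client/r0/sync",
        headers={"Authorization": f"Bearer {access_token}"},
        timeout=None,  # never give up clientside, unlike Riot's ~90s timeout
    )
    resp.raise_for_status()
    return resp.json()
```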
I haven't seen logs from either, but my theory is that if the client times out the /sync request after the server has generated the response, but before it has sent it to the client, then the next time the client retries, the /sync will be calculated from scratch, and the failure mode may continue.
It looks like the current behaviour on develop is to cache sync requests via response_cache, which blocks additional requests with the same sync parameters until any ongoing one completes, and then returns the same result to all of them:
(see synapse/handlers/sync.py, lines 219 to 225 at b1580f5)
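For anyone unfamiliar with the pattern, the behaviour described is roughly the following. This is a minimal asyncio sketch of request-collapsing, not Synapse's actual Twisted-based ResponseCache; the names are illustrative:

```python
# Sketch of "collapse concurrent identical requests": all callers with the
# same key share one in-flight computation; the entry is dropped as soon as
# that computation finishes, so the next request starts from scratch.
import asyncio


class RequestCollapser:
    def __init__(self):
        self._pending = {}  # key -> asyncio.Task for the in-flight computation

    async def wrap(self, key, compute, *args):
        task = self._pending.get(key)
        if task is None:
            # No in-flight request for this key: start one and share it.
            task = asyncio.ensure_future(compute(*args))
            self._pending[key] = task
            # Evict the entry the moment the computation completes -- this is
            # the develop behaviour described above.
            task.add_done_callback(lambda _: self._pending.pop(key, None))
        return await task
```

Everyone who arrives while the computation is in flight awaits the same task and gets the same result; the instant it completes the entry is gone, so a client that retries just after completion triggers a full recompute.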
However, on the matrix-org-hotfixes branch, @richvdh changed it to cache sync responses for an additional 2 minutes, so that any retrying clients get the cached copy rather than starting from scratch: 0ca2857. This is similar to the old /initialSync snapshot_cache from #457, which never seems to have made it over to /sync.
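The hotfix behaviour amounts to delaying that eviction rather than doing it immediately on completion. A hedged sketch of the idea, again illustrative rather than the actual 0ca2857 diff:

```python
# Sketch of the hotfix idea: keep the finished result around for a grace
# period so a client that timed out and retries gets the cached copy instead
# of triggering a full recompute.
import asyncio

CACHE_RETENTION_SECS = 120  # roughly the "additional 2 minutes" described above


class TimedRequestCollapser:
    def __init__(self):
        self._pending = {}  # key -> Task, kept briefly after completion

    async def wrap(self, key, compute, *args):
        task = self._pending.get(key)
        if task is None:
            task = asyncio.ensure_future(compute(*args))
            self._pending[key] = task
            # Schedule eviction for later instead of evicting on completion.
            task.add_done_callback(lambda _: self._evict_later(key))
        return await task

    def _evict_later(self, key):
        asyncio.get_running_loop().call_later(
            CACHE_RETENTION_SECS, self._pending.pop, key, None
        )
```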
I'm not seeing any additional /sync response cache?
If my theory is correct, it looks like this hotfix needs to be ported back into develop, and sytest fixed so that it doesn't fail if initial /sync returns stale data (perhaps we could add a null filter or something as a cachebuster).
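One possible shape for the cachebuster, assuming the response cache keys on the full request parameters including the filter; varying room.timeline.limit is just one suggestion, not anything taken from the hotfix:

```python
# Hypothetical cachebuster for sytest-style tests: vary a harmless filter
# field per call so the cache key changes while the semantics stay the same.
import itertools
import json

_counter = itertools.count()


def cachebusting_filter():
    # room.timeline.limit is a legitimate filter field; a huge, per-call-unique
    # value keeps "give me everything" semantics but changes the cache key.
    return json.dumps({"room": {"timeline": {"limit": 1_000_000 + next(_counter)}}})

# used as: GET /_matrix/client/r0/sync?filter=<url-encoded cachebusting_filter()>
```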
Another explanation for this could be that Twisted is correctly cancelling the ongoing /sync when the client goes away, thus throwing away the response it was computing. The same fix would mitigate that, however.