# Investigate why `/messages` is slow #13356

## Comments
The `after` timing can be influenced by slow `/state_ids` requests when processing the pulled events. Part of #13356
… two events (#13586)

Split off from #13561. Part of #13356. Mentioned in [internal doc](https://docs.google.com/document/d/1lvUoVfYUiy6UaHB6Rb4HicjaJAU40-APue9Q4vzuW3c/edit#bookmark=id.2tvwz3yhcafh).
We can follow up this PR with:

1. Only try to backfill from an event if we haven't tried recently -> #13622
2. When we decide to backfill that event again, process it in the background so it doesn't block and make `/messages` slow when we know it will probably fail again -> #13623
3. Generally track failures everywhere we try and fail to pull an event over federation -> #13700

Fix #13621. Part of #13356. Mentioned in [internal doc](https://docs.google.com/document/d/1lvUoVfYUiy6UaHB6Rb4HicjaJAU40-APue9Q4vzuW3c/edit#bookmark=id.qv7cj51sv9i5).
…se (#13879)

There is no need to grab thousands of backfill points when we only need 5 to make the `/backfill` request with. We need to grab a few extra in case the first few aren't visible in the history. Previously, we grabbed thousands of backfill points from the database, then sorted and filtered them in the app. Fetching the 4.6k backfill points for `#matrix:matrix.org` from the database takes ~50ms-~570ms, so it's not like this saves a lot of time 🤷. But it might save us more time now that `get_backfill_points_in_room`/`get_insertion_event_backward_extremities_in_room` are more complicated after #13635.

This PR moves the filtering and limiting to the SQL query so we just have less data to work with in the first place. Part of #13356
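As a rough sketch of the shape of that change (table and column names are illustrative, not the exact Synapse schema), the ordering and limiting move from Python into the SQL itself:

```python
# Sketch only: move the sort/limit into SQL so we fetch ~5 backfill
# points instead of thousands. Table/column names are illustrative and
# not necessarily the exact Synapse schema.

BACKFILL_POINTS_SQL = """
    SELECT eb.event_id, e.depth
    FROM event_backward_extremities AS eb
    INNER JOIN events AS e USING (event_id)
    WHERE eb.room_id = ?
    ORDER BY e.depth DESC
    LIMIT ?
"""

def get_backfill_points_in_room(txn, room_id: str, limit: int):
    # Grab only `limit` rows (callers ask for a few more than the 5 they
    # need, in case some aren't visible in the history) rather than
    # pulling every backfill point and filtering in the app.
    txn.execute(BACKFILL_POINTS_SQL, (room_id, limit))
    return txn.fetchall()
```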
Because we do the recording in `_check_sigs_and_hash_for_pulled_events_and_fetch` (previously named `_check_sigs_and_hash_and_fetch`), we will track signature failures for `backfill`, `get_room_state`, `get_event_auth`, and `get_missing_events` (all pulled-event scenarios). And we also record signature failures from `get_pdu`.

Part of #13700. Part of #13676 and #13356.

This PR will be especially important for #13816 so we can avoid the costly `_get_state_ids_after_missing_prev_event` down the line when `/messages` calls backfill.
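A minimal sketch of what that recording looks like at the signature-check boundary (names are simplified; `record_event_failed_pull_attempt` mirrors the storage method this line of work added, but treat the exact signature as an assumption):

```python
class InvalidEventSignatureError(Exception):
    """Stand-in for Synapse's signature-failure exception."""

async def verify_event_signature(pdu) -> None:
    """Stand-in for the real signature/hash verification; raises
    InvalidEventSignatureError when the check fails."""
    ...

async def check_sigs_for_pulled_event(store, room_id: str, pdu):
    # Every pulled event (backfill, get_room_state, get_event_auth,
    # get_missing_events, get_pdu) flows through a check like this, so a
    # bad signature gets recorded as a failed pull attempt in one place.
    try:
        await verify_event_signature(pdu)
    except InvalidEventSignatureError as cause:
        await store.record_event_failed_pull_attempt(
            room_id, pdu.event_id, str(cause)
        )
        raise
    return pdu
```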
### Progress

We've made some good progress on these slow 180s-type requests. You can see the big drop in these long-running requests around Sept. 26 after we shipped #13635 (made possible by all of the PRs before it). This metric (https://grafana.matrix.org/d/dYoRgTgVz/messages-timing?orgId=1&from=1663806404570&to=1665016004570&viewPanel=230) can be influenced by other servers being slow to respond. The effects don't really show for the …

### What's left?
Here is the latest trace now that we're down in the ~30s range: https://gist.github.com/MadLittleMods/c3ace249ce2aec77cb78d19a4f85a776. As you can see, if we can get rid of the time in the bolded numbers, that saves us ~20s and gets us down to ~10s.
…ure is invalid (#13816)

While #13635 stops us from doing the slow thing after we've already done it once, this PR stops us from doing one of the slow things in the first place.

Related to:

- #13622
- #13635
- #13676

Part of #13356. Follow-up to #13815, which tracks event signature failures.

With this PR, we avoid the call to the costly `_get_state_ids_after_missing_prev_event` because the signature failure now counts as a failed pull attempt, and we filter events based on the backoff before calling `_get_state_ids_after_missing_prev_event`. For example, this will save us 156s out of the 185s total that this `matrix.org` `/messages` request took. If you want to see the full Jaeger trace of this, you can drag and drop this `trace.json` into your own Jaeger: https://gist.github.com/MadLittleMods/4b12d0d0afe88c2f65ffcc907306b761

To explain this exact scenario around `/messages` -> backfill: we call `/backfill` and first check the signatures of the 100 events. We see bad signatures for `$luA4l7QHhf_jadH3mI-AyFqho0U2Q-IXXUbGSMq6h6M` and `$zuOn2Rd2vsC7SUia3Hp3r6JSkSFKcc5j3QTTqW_0jDw` (both member events). Then we process the 98 remaining events that have valid signatures, but one of them references `$luA4l7QHhf_jadH3mI-AyFqho0U2Q-IXXUbGSMq6h6M` as a `prev_event`. So we have to do the whole `_get_state_ids_after_missing_prev_event` rigmarole, which pulls in those same events, which fail again because the signatures are still invalid.

- `backfill`
  - `outgoing-federation-request` `/backfill`
  - `_check_sigs_and_hash_and_fetch`
    - `_check_sigs_and_hash_and_fetch_one` for each event received over backfill
      - ❗ `$luA4l7QHhf_jadH3mI-AyFqho0U2Q-IXXUbGSMq6h6M` fails with `Signature on retrieved event was invalid.`: `unable to verify signature for sender domain xxx: 401: Failed to find any key to satisfy: _FetchKeyRequest(...)`
      - ❗ `$zuOn2Rd2vsC7SUia3Hp3r6JSkSFKcc5j3QTTqW_0jDw` fails with `Signature on retrieved event was invalid.`: `unable to verify signature for sender domain xxx: 401: Failed to find any key to satisfy: _FetchKeyRequest(...)`
  - `_process_pulled_events`
    - `_process_pulled_event` for each validated event
      - ❗ Event `$Q0iMdqtz3IJYfZQU2Xk2WjB5NDF8Gg8cFSYYyKQgKJ0` references `$luA4l7QHhf_jadH3mI-AyFqho0U2Q-IXXUbGSMq6h6M` as a `prev_event`, which is missing, so we try to get it
        - `_get_state_ids_after_missing_prev_event`
          - `outgoing-federation-request` `/state_ids`
          - ❗ `get_pdu` for `$luA4l7QHhf_jadH3mI-AyFqho0U2Q-IXXUbGSMq6h6M`, which fails the signature check again
          - ❗ `get_pdu` for `$zuOn2Rd2vsC7SUia3Hp3r6JSkSFKcc5j3QTTqW_0jDw`, which fails the signature check
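Conceptually, the filtering step amounts to something like the sketch below (the backoff cadence and data shapes here are illustrative assumptions, not Synapse's exact formula):

```python
import time

BACKOFF_BASE_SECONDS = 5 * 60  # illustrative cadence, not Synapse's actual one

def filter_events_by_pull_backoff(event_ids, failed_pull_attempts):
    """Drop events whose recent pull attempts failed, so we never reach the
    costly _get_state_ids_after_missing_prev_event for them.

    `failed_pull_attempts` maps event_id -> (num_failures, last_attempt_ts),
    an assumed shape for the recorded failures.
    """
    now = time.time()
    allowed = []
    for event_id in event_ids:
        attempt = failed_pull_attempts.get(event_id)
        if attempt is not None:
            num_failures, last_attempt_ts = attempt
            # Exponential backoff: wait 5min, 10min, 20min, ... between retries.
            backoff = BACKOFF_BASE_SECONDS * (2 ** (num_failures - 1))
            if now - last_attempt_ts < backoff:
                continue  # still backing off; skip pulling this event for now
        allowed.append(event_id)
    return allowed
```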
Seems like it will make it an order of magnitude faster. Fix #14108. Part of #13356 (comment).
### Progress report

The benchmark …

Benchmark request: …

### What's left?

The big things: …

The smaller things: …

But all of that notable stuff only totals up to 3057ms, so there is still 373ms of random baggage.

In terms of waterfalls: …

There is a 160ms mystery gap after …

### What's achievable?

Feels doable to reduce the response time by another ~1s (I've starred some things above). And, for example, if we get another …
## Background

### Traces for reference
These include a bunch of the `/messages` instrumentation added:

- `/messages` (`RoomMessageListRestServlet`) on `matrix.org` Synapse trace - Jaeger trace, https://gist.github.com/MadLittleMods/dfa0dad8a32255b366461c76b55c8ecb
- `/messages` (`RoomMessageListRestServlet`) on `matrix.org` Synapse trace - Jaeger trace, https://gist.github.com/MadLittleMods/5b945a1733a414c61be6e393397c3f9d
- `/messages` (`RoomMessageListRestServlet`) on `matrix.org` Synapse trace - Jaeger trace, https://gist.github.com/MadLittleMods/d3f9c621707081755ff629000c60b3f3
- `/messages` (`RoomMessageListRestServlet`) on `matrix.org` Synapse trace - Jaeger trace, https://gist.github.com/MadLittleMods/4b12d0d0afe88c2f65ffcc907306b761

### Why is it slow?
Update: We now have a milestone tracking various known slow pieces to improve: https://github.com/matrix-org/synapse/milestone/11
This part is WIP. I'm just documenting some of the log diving I've done for `/messages` being slow. I still want to do this on some more requests, and hopefully get access to Jaeger to compare and investigate that way too.

#### 1. Backfill linearizer lock takes forever
- Room: `#matrix:matrix.org`
- Request: https://gitter.ems.host/_matrix/client/r0/rooms/!OGEhHVWSdvArJzumhm:matrix.org/messages?dir=b&from=t466554-18118302_0_0_0_0_0_0_0_0&limit=500&filter=%7B%22lazy_load_members%22:true%7D&foo=12345678
- Duration: `117.808s`
- Request ID: `GET-6890` on `gitter.ems.host` (log level set to `DEBUG`), see this comment.

Notes:

- The Synapse-side timeline below only accounts for ~58s of the `117.808s`. Where did the rest of the time go?

Timing summary from looking at the logs:
- `03:42:43.129` (`t+0s`): Received request
- `03:42:43.712` (`t+0s`): Waiting to acquire linearizer lock 'room_backfill'
- `03:43:26.806` (`t+43s`): Acquired linearizer lock 'room_backfill'
- `03:43:26.825` (`t+43s`): `_maybe_backfill_inner` pulled 4.6k events out of the database as potential backfill points
  - `get_oldest_event_ids_with_depth_in_room` only took 0.12s to get the 4.6k events
- `03:43:28.716` (`t+45s`): Asking t2bot.io for backfill
- `03:43:28.985` (`t+45s`): t2bot.io responded with 100 events
- `03:43:29.009` (`t+46s`): Starting to verify content hashes
- `03:43:29.105` (`t+46s`): Done verifying content hashes
- `03:43:29.106` (`t+46s`): Processing the pulled events
- `03:43:29.147` (`t+46s`): Done processing 100 backfilled events except for `$0hu5zprqu6nLC24ISIZL2tL7rpfALh7eOI9MI6CV_1E`
- `03:43:29.348` (`t+46s`): Done trying to de-outlier `$0hu5zprqu6nLC24ISIZL2tL7rpfALh7eOI9MI6CV_1E` (404 `/state_ids` from t2bot.io)
- `03:43:34.026` (`t+51s`): `_get_state_groups_from_groups` start
- `03:43:38.470` (`t+55s`): `_get_state_groups_from_groups` end
- `03:43:41.112` (`t+58s`): Response sent

This isn't a fluke; here is another request I went through the logs on (`GET-196`).
This time the duration matched the Synapse logs pretty well:

- `20:38:38.049` (`t+0s`): Received request
- `20:38:38.062` (`t+0s`): Waiting to acquire linearizer lock 'room_backfill'
- `20:39:23.622` (`t+45s`): Acquired linearizer lock 'room_backfill'
- `20:39:23.640` (`t+45s`): `_maybe_backfill_inner` pulled 4.6k events out of the database as potential backfill points
- `20:39:25.625` (`t+47s`): Asking t2bot.io for backfill
- `20:39:35.262` (`t+57s`): t2bot.io responded with 100 events
- `20:39:35.283` (`t+...`): Starting to verify content hashes
- `20:39:35.382` (`t+...`): Done verifying content hashes
- `20:39:35.383` (`t+...`): Processing the pulled events
- `20:39:35.424` (`t+...`): Done processing 100 backfilled events except for `$0hu5zprqu6nLC24ISIZL2tL7rpfALh7eOI9MI6CV_1E`
- `20:39:35.651` (`t+...`): Done trying to de-outlier `$0hu5zprqu6nLC24ISIZL2tL7rpfALh7eOI9MI6CV_1E` (404 `/state_ids` from t2bot.io)
- `20:39:43.668` (`t+65s`): Response sent
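For context on those 43-45s waits: backfill is linearized per room, so concurrent `/messages` requests for the same room queue up behind each other. A minimal sketch of that behaviour using a plain per-room `asyncio.Lock` (Synapse uses its own `Linearizer` utility; this just illustrates the queueing):

```python
import asyncio
from collections import defaultdict

# One lock per room: only one backfill runs for a room at a time, which is
# why the logs show "Waiting to acquire linearizer lock 'room_backfill'" at
# t+0s but "Acquired" only at t+43s.
room_backfill_locks: defaultdict[str, asyncio.Lock] = defaultdict(asyncio.Lock)

async def maybe_backfill(request_name: str, room_id: str) -> None:
    print(f"{request_name}: waiting to acquire 'room_backfill' for {room_id}")
    async with room_backfill_locks[room_id]:
        print(f"{request_name}: acquired 'room_backfill' for {room_id}")
        await asyncio.sleep(2)  # stand-in for the slow backfill work

async def main() -> None:
    # Two concurrent /messages requests for the same room: the second only
    # starts its backfill after the first finishes.
    await asyncio.gather(
        maybe_backfill("GET-1", "!room:example.org"),
        maybe_backfill("GET-2", "!room:example.org"),
    )

asyncio.run(main())
```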
#### 2. Loading tons of events
- Room: `#matrix:matrix.org`
- Request: https://gitter.ems.host/_matrix/client/r0/rooms/!OGEhHVWSdvArJzumhm:matrix.org/messages?dir=b&from=t466554-18118302_0_0_0_0_0_0_0_0&limit=500&filter=%7B%22lazy_load_members%22:true%7D&foo=12345678
- Request ID: `GET-5541`

Notes:

- Lots of `get_aggregation_groups_for_event` and `get_recent_references_for_event` calls (2k). Why?
- Lots of `state_events` in the room

Timing summary from looking at the logs:
- `02:09:51.026` (`t+0s`): Received request
- `02:09:51.959` (`t+1s`): `_maybe_backfill_inner` backfill pulled 4.6k events out of the database as potential backfill points
- `02:09:52.726` (`t+2s`): synapse.storage.databases.main.events_worker loading 79277 events (why?)
- `02:10:10.462` (`t+19s`): Also fetching redaction events
- `02:10:10.462` (`t+19s`): Loaded 79277 events
- `02:10:23.246` (`t+31s`): Done redacting 105 events (why?)
- `02:10:23.779` (`t+31s`): Asking t2bot.io for backfill
- `02:10:33.290` (`t+41s`): t2bot.io responded with 100 events
- `...` (`t+...`): 2k calls to `get_recent_references_for_event` and `get_aggregation_groups_for_event` (why?)
- `02:10:54.261` (`t+63s`): Response sent

#### 3. Slow `/state_ids` requests but also really slow processing

- **524 timeout: 123.26s**
  - https://matrix-client.matrix.org/_matrix/client/r0/rooms/!OGEhHVWSdvArJzumhm:matrix.org/messages?dir=b&from=t466554-18118302_0_0_0_0_0_0_0_0&limit=500&filter=%7B%22lazy_load_members%22:true%7D&foo=ss4cm
  - 524 timeout, 2022-07-22 @4:43, 123.26s
  - Jaeger trace: https://jaeger.proxy.matrix.org/trace/7c4b7fe54ba6f5ab
  - `/backfill` request
  - `/state_ids` request
- **524 timeout: 117.96s**
  - https://matrix-client.matrix.org/_matrix/client/r0/rooms/!OGEhHVWSdvArJzumhm:matrix.org/messages?dir=b&from=t466554-18118302_0_0_0_0_0_0_0_0&limit=500&filter=%7B%22lazy_load_members%22:true%7D&foo=0p14c
  - 524 timeout, 2022-07-22 @4:56, 117.96s
  - Jaeger trace: https://jaeger.proxy.matrix.org/trace/e67f019385c47fd9
  - `/backfill` request
  - `/state_ids` request
- **524 timeout: 115.64s**
  - https://matrix-client.matrix.org/_matrix/client/r0/rooms/!OGEhHVWSdvArJzumhm:matrix.org/messages?dir=b&from=t466554-18118302_0_0_0_0_0_0_0_0&limit=500&filter=%7B%22lazy_load_members%22:true%7D&foo=p8c09g
  - 524 timeout, 2022-07-22 @5:02:33, 115.64s
  - Jaeger trace: https://jaeger.proxy.matrix.org/trace/ef47a44ea445b3ae
  - `/backfill` request
  - `/state_ids` request
- **200 ok: 83.51s**
  - https://matrix-client.matrix.org/_matrix/client/r0/rooms/!OGEhHVWSdvArJzumhm:matrix.org/messages?dir=b&from=t466554-18118302_0_0_0_0_0_0_0_0&limit=500&filter=%7B%22lazy_load_members%22:true%7D&foo=cjfpw
  - 200 ok, 83.51s
  - Jaeger trace: https://jaeger.proxy.matrix.org/trace/ae7c694e57113282
  - `/backfill` request
  - `/state_ids` request
- **200 ok: 75.7s**
  - https://matrix-client.matrix.org/_matrix/client/r0/rooms/!OGEhHVWSdvArJzumhm:matrix.org/messages?dir=b&from=t466554-18118302_0_0_0_0_0_0_0_0&limit=500&filter=%7B%22lazy_load_members%22:true%7D&foo=wg6g8k
  - 2022-07-22 @5:27pm, 75.7s
  - Jaeger trace: https://jaeger.proxy.matrix.org/trace/d048d9f59c20555c
  - `/backfill` request
  - `/state_ids` request

### `/messages` timing script
Every 61 minutes (to be outside of the state cache expiry), this will call and time `/messages` for each room defined. It will do this with `?limit=` of `500`, `100`, and `20` to see if the amount of messages matters for the timing.

Run in the browser. Define `let MATRIX_TOKEN = 'xxx';` before running the script.

`matrix-messages-timing.js`
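The actual script is JavaScript run in the browser; for consistency with the other sketches here, a rough Python equivalent of the same measurement loop (the homeserver URL and room list are placeholders):

```python
import time
import requests

MATRIX_TOKEN = "xxx"  # fill in before running, like the browser script
HOMESERVER = "https://matrix-client.matrix.org"  # placeholder homeserver
ROOM_IDS = ["!OGEhHVWSdvArJzumhm:matrix.org"]  # rooms to time
LIMITS = [500, 100, 20]  # does the number of messages matter?

def time_messages(room_id: str, limit: int) -> float:
    """Time one /messages request and return the duration in seconds."""
    start = time.monotonic()
    response = requests.get(
        f"{HOMESERVER}/_matrix/client/r0/rooms/{room_id}/messages",
        params={
            "dir": "b",
            "limit": limit,
            "filter": '{"lazy_load_members":true}',
        },
        headers={"Authorization": f"Bearer {MATRIX_TOKEN}"},
        timeout=180,
    )
    response.raise_for_status()
    return time.monotonic() - start

while True:
    for room_id in ROOM_IDS:
        for limit in LIMITS:
            print(f"{room_id} limit={limit}: {time_messages(room_id, limit):.2f}s")
    # Every 61 minutes, to land outside the state cache expiry.
    time.sleep(61 * 60)
```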