You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
For Matrix Public Archive, we're getting a lot of 504 timeouts waiting for Matrix API requests to finish. Specifically /messages seems to be the slow culprit and I haven't really heard anything about optimizing it like /join. From a clean Synapse cache, /messages often takes 80s - 120s to respond and many times timing out itself from Synapse.
Here is the average data for #matrix:matrix.org from the gitter.ems.host homeserver which is sampled every 61 minutes to be outside of the state cache expiry, /_matrix/client/r0/rooms/${roomId}/messages?dir=b&from=t466554-18118302_0_0_0_0_0_0_0_0&limit=500&filter={"lazy_load_members":true}
?limit=500: 111.81s (492 events / 129 state events)
?limit=100: 119.504s (92 events / 50 state events)
?limit=20: 112.541s (20 events / 11 state events)
After Synapse has a warm cache (I am guessing), these requests take about 5s - 10s
What can we do to make the /messages requests faster? Has any thought already been put in this? For existing issues, I see that we might be backfilling when we shouldn't which aligns with the request being slow regardless how big the limit is, matrix-org/synapse#10742
Caching the response on our end in the archive doesn't make the request fast in the first place.
From the matrix.org Grafana, I looked into the RoomMessageListRestServlet timing and it all appears very fast (5ms - 500ms) and not what I experience. I am guessing it's not good at catching the outliers since the metric seems to just to total time divided by samples. But I would expect a lot more 1 second to 3 second timings even after a warm cache.
This part is WIP. Just documenting some of the log diving I've done for /messages being slow. Still want to do this on some more requests and hopefully get access to Jaeger to also compare and investigate that way too.
1. Backfill linearizer lock takes forever
Room: #matrix:matrix.org
API URL: https://gitter.ems.host/_matrix/client/r0/rooms/!OGEhHVWSdvArJzumhm:matrix.org/messages?dir=b&from=t466554-18118302_0_0_0_0_0_0_0_0&limit=500&filter=%7B%22lazy_load_members%22:true%7D&foo=12345678
20:39:35.424 (t+...) Done processing 100 backfilled events except for $0hu5zprqu6nLC24ISIZL2tL7rpfALh7eOI9MI6CV_1E
20:39:35.651 (t+...) Done trying to de-outlier $0hu5zprqu6nLC24ISIZL2tL7rpfALh7eOI9MI6CV_1E (404 /state_ids from t2bot.io)
20:39:43.668 (t+65s) Response sent
2. Loading tons of events
Room: #matrix:matrix.org
API URL: https://gitter.ems.host/_matrix/client/r0/rooms/!OGEhHVWSdvArJzumhm:matrix.org/messages?dir=b&from=t466554-18118302_0_0_0_0_0_0_0_0&limit=500&filter=%7B%22lazy_load_members%22:true%7D&foo=12345678
https://matrix-client.matrix.org/_matrix/client/r0/rooms/!OGEhHVWSdvArJzumhm:matrix.org/messages?dir=b&from=t466554-18118302_0_0_0_0_0_0_0_0&limit=500&filter=%7B%22lazy_load_members%22:true%7D&foo=cjfpw
200 ok
Every 61 minutes (to be outside of the state cache expiry), will call and time /messages for each room defined. And will do this with ?limit=500, 100, 20 to see if the amount of messages matters on the timing.
Run in the browser. Define let MATRIX_TOKEN = 'xxx'; before running the script.
matrix-messages-timing.js
letresultantTimingData={};asyncfunctionfetchAndTimeMessagesFromRoom(roomId,numMessagesToFetch){// via https://stackoverflow.com/a/8084248/796832constrequestId=(Math.random()+1).toString(36).substring(7);letres;lettimeBeforeRes;lettimeAfterRes;letstartTime;letendTime;try{startTime=Date.now();timeBeforeRes=performance.now();res=awaitfetch(`https://matrix-client.matrix.org/_matrix/client/r0/rooms/${roomId}/messages?dir=b&from=t466554-18118302_0_0_0_0_0_0_0_0&limit=${numMessagesToFetch}&filter=%7B%22lazy_load_members%22:true%7D&foo=${requestId}`,{headers: {'Authorization': `Bearer ${MATRIX_TOKEN}`}});}catch(err){console.error(`Error getting result for ${roomId} at ${numMessagesToFetch} messages`,err);}finally{endTime=Date.now();timeAfterRes=performance.now();}consttraceId=res?.headers?.['synapse-trace-id'];letdata={};if(res?.ok){data=awaitres.json();}consttimingEntry={ok: res?.ok||false,status: res?.status,timing: timeAfterRes-timeBeforeRes,numEventsReturned: data?.chunk?.length,numStateEventsReturned: data?.state?.length,
startTime,
endTime,
requestId,
traceId
};returntimingEntry;}(async()=>{constROOMS_TO_TEST={'matrix:matrix.org': '!OGEhHVWSdvArJzumhm:matrix.org','openwisp/general': '!RBzfoBeqYcCwLAAenz:gitter.im','ethereum/solidity-dev': '!poXqlbVpQfXKWGseLY:gitter.im','matrix-org/gitter': '!OvgDmYaPOFRRWnIdQe:half-shot.uk','MadLittleMods/new-room1': '!aMzLHLvScQCGKDNqCB:gitter.im'};constNUM_MESSAGES_TO_FETCH_MATRIX=[500,100,20];constONE_MINUTE_MS=60*1000;// The `_state_cache` in Synapse has 60 minute expiry so we chose something just biggerconstDELAY_BETWEEN_TRIALS=61*ONE_MINUTE_MS;constREPEAT=5;if(!MATRIX_TOKEN){console.error('MATRIX_TOKEN is not defined. Please define this before running this script');}for(leti=0;i<REPEAT;i++){for(letnumMessagesToFetchofNUM_MESSAGES_TO_FETCH_MATRIX){for(letroomKeyofObject.keys(ROOMS_TO_TEST)){constroomId=ROOMS_TO_TEST[roomKey];consttimingEntry=awaitfetchAndTimeMessagesFromRoom(roomId,numMessagesToFetch);// Default the result map so we don't run into null-pointer exception (NPE) issuesif(!resultantTimingData[roomKey]){resultantTimingData[roomKey]={};}if(!resultantTimingData[roomKey][numMessagesToFetch]){resultantTimingData[roomKey][numMessagesToFetch]=[];}resultantTimingData[roomKey][numMessagesToFetch].push(timingEntry);console.log(`New result for ${roomKey} at ${numMessagesToFetch} messages:`,timingEntry);}// Wait for the caches to clear out before running another trial against the same roomawaitnewPromise((resolve)=>{setTimeout(resolve,DELAY_BETWEEN_TRIALS);})}}})()
This issue has been migrated from #13356.
Background
Traces for reference
These include a bunch of the
/messages
instrumentation added:/messages
(RoomMessageListRestServlet
) onmatrix.org
Synapse trace - Jaeger trace, https://gist.github.com/MadLittleMods/dfa0dad8a32255b366461c76b55c8ecb/messages
(RoomMessageListRestServlet
) onmatrix.org
Synapse trace - Jaeger trace, https://gist.github.com/MadLittleMods/5b945a1733a414c61be6e393397c3f9d/messages
(RoomMessageListRestServlet
) onmatrix.org
Synapse trace - Jaeger trace, https://gist.github.com/MadLittleMods/d3f9c621707081755ff629000c60b3f3/messages
(RoomMessageListRestServlet
) onmatrix.org
Synapse trace - Jaeger trace, https://gist.github.com/MadLittleMods/4b12d0d0afe88c2f65ffcc907306b761Why is it slow?
Update: We now have a milestone tracking various known slow pieces to improve: https://github.com/matrix-org/synapse/milestone/11
This part is WIP. Just documenting some of the log diving I've done for
/messages
being slow. Still want to do this on some more requests and hopefully get access to Jaeger to also compare and investigate that way too.1. Backfill linearizer lock takes forever
#matrix:matrix.org
https://gitter.ems.host/_matrix/client/r0/rooms/!OGEhHVWSdvArJzumhm:matrix.org/messages?dir=b&from=t466554-18118302_0_0_0_0_0_0_0_0&limit=500&filter=%7B%22lazy_load_members%22:true%7D&foo=12345678
117.808s
GET-6890
gitter.ems.host
(setDEBUG
), see this comment.Notes:
117.808s
. Where did the rest of the time go?Timing summary from looking at the logs:
03:42:43.129
(t+0s
): Received request03:42:43.712
(t+0s
) Waiting to acquire linearizer lock 'room_backfill'03:43:26.806
(t+43s
) Acquired linearizer lock 'room_backfill'03:43:26.825
(t+43s
) _maybe_backfill_inner and pulled 4.6k events out of the database as potential backfill pointsget_oldest_event_ids_with_depth_in_room
only took .12s to get the 4.6k events03:43:28.716
(t+45s
) Asking t2bot.io for backill03:43:28.985
(t+45s
) t2bot.io responded with 100 events03:43:29.009
(t+46s
) Starting to verify content hashes03:43:29.105
(t+46s
) Done verifying content hashes03:43:29.106
(t+46s
) Processing the pulled events03:43:29.147
(t+46s
) Done processing 100 backfilled except for$0hu5zprqu6nLC24ISIZL2tL7rpfALh7eOI9MI6CV_1E
03:43:29.348
(t+46s
) Done trying to de-outlier$0hu5zprqu6nLC24ISIZL2tL7rpfALh7eOI9MI6CV_1E
(404/state_ids
from t2bot.io)03:43:34.026
(t+51s
)_get_state_groups_from_groups
start03:43:38.470
(t+55s
)_get_state_groups_from_groups
end03:43:41.112
(t+58s
) Response sentThis isn't a fluke, here is another request I went through the logs on (
GET-196
). This time the duration matched the Synapse logs pretty well:20:38:38.049
(t+0s
) Received request20:38:38.062
(t+0s
) Waiting to acquire linearizer lock 'room_backfill'20:39:23.622
(t+45s
) Acquired linearizer lock 'room_backfill'20:39:23.640
(t+45s
)_maybe_backfill_inner
and pulled 4.6k events out of the database as potential backfill points20:39:25.625
(t+47s
) Asking t2bot.io for backill20:39:35.262
(t+57s
) t2bot.io responded with 100 events20:39:35.283
(t+...
) Starting to verify content hashes20:39:35.382
(t+...
) Done verifying content hashes20:39:35.383
(t+...
) Processing the pulled events20:39:35.424
(t+...
) Done processing 100 backfilled events except for$0hu5zprqu6nLC24ISIZL2tL7rpfALh7eOI9MI6CV_1E
20:39:35.651
(t+...
) Done trying to de-outlier$0hu5zprqu6nLC24ISIZL2tL7rpfALh7eOI9MI6CV_1E
(404/state_ids
from t2bot.io)20:39:43.668
(t+65s
) Response sent2. Loading tons of events
#matrix:matrix.org
https://gitter.ems.host/_matrix/client/r0/rooms/!OGEhHVWSdvArJzumhm:matrix.org/messages?dir=b&from=t466554-18118302_0_0_0_0_0_0_0_0&limit=500&filter=%7B%22lazy_load_members%22:true%7D&foo=12345678
GET-5541
Notes:
get_aggregation_groups_for_event
ofget_recent_references_for_event
calls (2k). Why?state_events
in the roomTiming summary from looking at the logs:
02:09:51.026
(t+0s
) Received request02:09:51.959
(t+1s
)_maybe_backfill_inner
backfill and pulled 4.6k events out of the database as potential backfill points02:09:52.726
(t+2s
) synapse.storage.databases.main.events_worker Loading 79277 events (why?)02:10:10.462
(t+19s
) Also fetching redaction events02:10:10.462
(t+19s
) Loaded 79277 events02:10:23.246
(t+31s
) Done redacting 105 events (why?)02:10:23.779
(t+31s
) Asking t2bot.io for backill02:10:33.290
(t+41s
) t2bot.io responded with 100 events...
(t+...
) 2k calls toget_recent_references_for_event
andget_aggregation_groups_for_event
(why?)02:10:54.261
(t+63s
) Response sent3. Slow
/state_id
requests but also really slow processing524 timeout: 123.26s
https://matrix-client.matrix.org/_matrix/client/r0/rooms/!OGEhHVWSdvArJzumhm:matrix.org/messages?dir=b&from=t466554-18118302_0_0_0_0_0_0_0_0&limit=500&filter=%7B%22lazy_load_members%22:true%7D&foo=ss4cm
524 timeout
2022-07-22 @4-43
123.26s
https://jaeger.proxy.matrix.org/trace/7c4b7fe54ba6f5ab
/backfill
request/state_ids
524 timeout: 117.96s
https://matrix-client.matrix.org/_matrix/client/r0/rooms/!OGEhHVWSdvArJzumhm:matrix.org/messages?dir=b&from=t466554-18118302_0_0_0_0_0_0_0_0&limit=500&filter=%7B%22lazy_load_members%22:true%7D&foo=0p14c
524 timeout
2022-07-22 @4:56
117.96s
https://jaeger.proxy.matrix.org/trace/e67f019385c47fd9
/backfill
request/state_ids
request524 timeout: 115.64s
https://matrix-client.matrix.org/_matrix/client/r0/rooms/!OGEhHVWSdvArJzumhm:matrix.org/messages?dir=b&from=t466554-18118302_0_0_0_0_0_0_0_0&limit=500&filter=%7B%22lazy_load_members%22:true%7D&foo=p8c09g
524 timeout
2022-7-22 @5:02:33
115.64s
https://jaeger.proxy.matrix.org/trace/ef47a44ea445b3ae
/backfill
request/state_ids
request200 ok: 83.51s
https://matrix-client.matrix.org/_matrix/client/r0/rooms/!OGEhHVWSdvArJzumhm:matrix.org/messages?dir=b&from=t466554-18118302_0_0_0_0_0_0_0_0&limit=500&filter=%7B%22lazy_load_members%22:true%7D&foo=cjfpw
200 ok
83.51s
https://jaeger.proxy.matrix.org/trace/ae7c694e57113282
/backfill
request/state_ids
request200 ok: 75.7s
https://matrix-client.matrix.org/_matrix/client/r0/rooms/!OGEhHVWSdvArJzumhm:matrix.org/messages?dir=b&from=t466554-18118302_0_0_0_0_0_0_0_0&limit=500&filter=%7B%22lazy_load_members%22:true%7D&foo=wg6g8k
2022-07-22 @5-27pm
75.7s
https://jaeger.proxy.matrix.org/trace/d048d9f59c20555c
/backfill
request/state_ids
request/messages
timing scriptEvery 61 minutes (to be outside of the state cache expiry), will call and time
/messages
for each room defined. And will do this with?limit=
500
,100
,20
to see if the amount of messages matters on the timing.Run in the browser. Define
let MATRIX_TOKEN = 'xxx';
before running the script.matrix-messages-timing.js
The text was updated successfully, but these errors were encountered: