Investigate why /messages is slow #13356

Open
matrixbot opened this issue Dec 20, 2023 · 0 comments

matrixbot commented Dec 20, 2023

This issue has been migrated from #13356.


Background

For Matrix Public Archive, we're getting a lot of 504 timeouts waiting for Matrix API requests to finish. Specifically, /messages seems to be the slow culprit, and I haven't really heard anything about optimizing it the way /join has been. From a clean Synapse cache, /messages often takes 80s - 120s to respond, and it often times out on Synapse's side as well.

Here is the average data for #matrix:matrix.org from the gitter.ems.host homeserver, sampled every 61 minutes to stay outside of the state cache expiry: /_matrix/client/r0/rooms/${roomId}/messages?dir=b&from=t466554-18118302_0_0_0_0_0_0_0_0&limit=500&filter={"lazy_load_members":true}

  • ?limit=500: 111.81s (492 events / 129 state events)
  • ?limit=100: 119.504s (92 events / 50 state events)
  • ?limit=20: 112.541s (20 events / 11 state events)

After Synapse has a warm cache (I am guessing), these requests take about 5s - 10s.
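
For reference, the filter parameter in these URLs is just URL-encoded JSON; a hypothetical helper (illustrative only, not part of the timing script below) that builds the same request URL:

function buildMessagesUrl(homeserver, roomId, { from, limit }) {
  // `filter` becomes URL-encoded JSON, e.g. {"lazy_load_members":true}
  const params = new URLSearchParams({
    dir: 'b',
    from,
    limit: String(limit),
    filter: JSON.stringify({ lazy_load_members: true }),
  });
  return `${homeserver}/_matrix/client/r0/rooms/${encodeURIComponent(roomId)}/messages?${params}`;
}

// e.g. buildMessagesUrl('https://gitter.ems.host', '!OGEhHVWSdvArJzumhm:matrix.org', { from: 't466554-18118302_0_0_0_0_0_0_0_0', limit: 500 })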

What can we do to make the /messages requests faster? Has any thought already been put into this? Among existing issues, I see that we might be backfilling when we shouldn't, which aligns with the request being slow regardless of how big the limit is: matrix-org/synapse#10742

Caching the response on our end in the archive doesn't help, since it doesn't make the initial request fast in the first place.

From the matrix.org Grafana, I looked into the RoomMessageListRestServlet timing and it all appears very fast (5ms - 500ms), which is not what I experience. I am guessing it's not good at catching the outliers, since the metric seems to just be total time divided by samples. But I would expect a lot more 1 second to 3 second timings even after a warm cache.

-- @MadLittleMods, https://matrix.to/#/!vcyiEtMVHIhWXcJAfl:sw1v.org/$7ctkqN00-4HiYP5TT7bnfROfGXP5Lyt7yyWjrg8bGu4?via=matrix.org&via=element.io&via=beeper.com

Traces for reference

These include a bunch of the /messages instrumentation that was added.

Why is it slow?

Update: We now have a milestone tracking various known slow pieces to improve: https://github.com/matrix-org/synapse/milestone/11

This part is WIP; I'm just documenting some of the log diving I've done for /messages being slow. I still want to do this on some more requests, and hopefully get access to Jaeger to compare and investigate that way too.

1. Backfill linearizer lock takes forever

  • Room: #matrix:matrix.org
  • API URL: https://gitter.ems.host/_matrix/client/r0/rooms/!OGEhHVWSdvArJzumhm:matrix.org/messages?dir=b&from=t466554-18118302_0_0_0_0_0_0_0_0&limit=500&filter=%7B%22lazy_load_members%22:true%7D&foo=12345678
  • Request -> response duration: 117.808s
  • Logs in Kibana: GET-6890
    • For reference on adjusting the logging config for gitter.ems.host (set DEBUG), see this comment.

Notes:

  • Why does the linearizer lock take ~45s to acquire? (See the sketch after this list for what the lock does conceptually.)
  • Fetching the 4.6k backfill points out of the database doesn't seem to take a lot of time, but we shouldn't need to fetch that many in the first place
  • For me on the outside, the request took 117.808s. Where did the rest of the time go?
    • Delay in connecting to the server?
    • Delay in serializing the response?
    • Need to time the requests more precisely, noting when they started and ended, to compare against the Synapse logs
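
For intuition, the linearizer is a per-key FIFO lock: only one backfill per room runs at a time, so every queued /messages request waits behind the current holder. Synapse's actual implementation is Python; this is just a conceptual JS sketch with hypothetical names:

class Linearizer {
  constructor() {
    this.tails = new Map(); // key -> tail promise of that key's queue
  }

  // Resolves to a release() function once the caller holds the lock for `key`.
  async acquire(key) {
    const prevTail = this.tails.get(key) ?? Promise.resolve();
    let release;
    const myTurnDone = new Promise((resolve) => { release = resolve; });
    this.tails.set(key, prevTail.then(() => myTurnDone));
    await prevTail; // every waiter queues behind the current holder
    return release;
  }
}

const linearizer = new Linearizer();
async function backfill(roomId) {
  const release = await linearizer.acquire(`room_backfill:${roomId}`);
  try {
    // ... the actual backfill work; a slow run here stalls every queued request ...
  } finally {
    release();
  }
}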

Timing summary from looking at the logs:

  1. 03:42:43.129 (t+0s ) Received request
  2. 03:42:43.712 (t+0s ) Waiting to acquire linearizer lock 'room_backfill'
  3. 03:43:26.806 (t+43s) Acquired linearizer lock 'room_backfill'
  4. 03:43:26.825 (t+43s) _maybe_backfill_inner and pulled 4.6k events out of the database as potential backfill points
    • get_oldest_event_ids_with_depth_in_room only took 0.12s to get the 4.6k events
  5. 03:43:28.716 (t+45s) Asking t2bot.io for backfill
  6. 03:43:28.985 (t+45s) t2bot.io responded with 100 events
  7. 03:43:29.009 (t+46s) Starting to verify content hashes
  8. 03:43:29.105 (t+46s) Done verifying content hashes
  9. 03:43:29.106 (t+46s) Processing the pulled events
  10. 03:43:29.147 (t+46s) Done processing 100 backfilled events except for $0hu5zprqu6nLC24ISIZL2tL7rpfALh7eOI9MI6CV_1E
  11. 03:43:29.348 (t+46s) Done trying to de-outlier $0hu5zprqu6nLC24ISIZL2tL7rpfALh7eOI9MI6CV_1E (404 /state_ids from t2bot.io)
  12. 03:43:34.026 (t+51s) _get_state_groups_from_groups start
  13. 03:43:38.470 (t+55s) _get_state_groups_from_groups end
  14. 03:43:41.112 (t+58s) Response sent

This isn't a fluke; here is another request I went through the logs on (GET-196). This time the duration matched the Synapse logs pretty well:

  1. 20:38:38.049 (t+0s ) Received request
  2. 20:38:38.062 (t+0s ) Waiting to acquire linearizer lock 'room_backfill'
  3. 20:39:23.622 (t+45s) Acquired linearizer lock 'room_backfill'
  4. 20:39:23.640 (t+45s) _maybe_backfill_inner and pulled 4.6k events out of the database as potential backfill points
  5. 20:39:25.625 (t+47s) Asking t2bot.io for backfill
  6. 20:39:35.262 (t+57s) t2bot.io responded with 100 events
  7. 20:39:35.283 (t+...) Starting to verify content hashes
  8. 20:39:35.382 (t+...) Done verifying content hashes
  9. 20:39:35.383 (t+...) Processing the pulled events
  10. 20:39:35.424 (t+...) Done processing 100 backfilled events except for $0hu5zprqu6nLC24ISIZL2tL7rpfALh7eOI9MI6CV_1E
  11. 20:39:35.651 (t+...) Done trying to de-outlier $0hu5zprqu6nLC24ISIZL2tL7rpfALh7eOI9MI6CV_1E (404 /state_ids from t2bot.io)
  12. 20:39:43.668 (t+65s) Response sent
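
To answer the "where did the rest of the time go?" question, the client-side wall-clock start/end needs to be recorded alongside the duration so requests can be lined up against these log timestamps (modulo clock skew). The full timing script at the bottom of this issue records these fields; in miniature (hypothetical helper):

async function timedFetch(url, options) {
  const startTime = new Date();  // wall clock, comparable to Synapse log times
  const t0 = performance.now();  // monotonic, for an accurate duration
  const res = await fetch(url, options);
  const t1 = performance.now();
  console.log(`start=${startTime.toISOString()} end=${new Date().toISOString()} duration=${((t1 - t0) / 1000).toFixed(3)}s status=${res.status}`);
  return res;
}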

2. Loading tons of events

  • Room: #matrix:matrix.org
  • API URL: https://gitter.ems.host/_matrix/client/r0/rooms/!OGEhHVWSdvArJzumhm:matrix.org/messages?dir=b&from=t466554-18118302_0_0_0_0_0_0_0_0&limit=500&filter=%7B%22lazy_load_members%22:true%7D&foo=12345678
  • Request -> response duration: ~60s
  • Logs in Kibana: GET-5541

Notes:

  • Tons of get_aggregation_groups_for_event and get_recent_references_for_event calls (2k of them). Why?
  • Why did we load 79k events?
    • This lines up with the number of state_events in the room
  • Why did we redact some events? Is this redacting events in our response or in general?
  • Why didn't these things happen in the other calls above?

Timing summary from looking at the logs:

  1. 02:09:51.026 (t+0s ) Received request
  2. 02:09:51.959 (t+1s ) _maybe_backfill_inner backfill and pulled 4.6k events out of the database as potential backfill points
  3. 02:09:52.726 (t+2s) synapse.storage.databases.main.events_worker Loading 79277 events (why?)
  4. 02:10:10.462 (t+19s) Also fetching redaction events
  5. 02:10:10.462 (t+19s) Loaded 79277 events
  6. 02:10:23.246 (t+31s) Done redacting 105 events (why?)
  7. 02:10:23.779 (t+31s) Asking t2bot.io for backfill
  8. 02:10:33.290 (t+41s) t2bot.io responded with 100 events
  9. ... (t+...) 2k calls to get_recent_references_for_event and get_aggregation_groups_for_event (why?)
  10. 02:10:54.261 (t+63s) Response sent

3. Slow /state_ids requests but also really slow processing

524 timeout: 123.26s

https://matrix-client.matrix.org/_matrix/client/r0/rooms/!OGEhHVWSdvArJzumhm:matrix.org/messages?dir=b&from=t466554-18118302_0_0_0_0_0_0_0_0&limit=500&filter=%7B%22lazy_load_members%22:true%7D&foo=ss4cm
524 timeout
2022-07-22 @4:43

123.26s
https://jaeger.proxy.matrix.org/trace/7c4b7fe54ba6f5ab

  • 19.24s /backfill request
  • 53.45s /state_ids request

524 timeout: 117.96s

https://matrix-client.matrix.org/_matrix/client/r0/rooms/!OGEhHVWSdvArJzumhm:matrix.org/messages?dir=b&from=t466554-18118302_0_0_0_0_0_0_0_0&limit=500&filter=%7B%22lazy_load_members%22:true%7D&foo=0p14c
524 timeout
2022-07-22 @4:56

117.96s
https://jaeger.proxy.matrix.org/trace/e67f019385c47fd9

  • 921ms /backfill request
  • 47.24s /state_ids request

524 timeout: 115.64s

https://matrix-client.matrix.org/_matrix/client/r0/rooms/!OGEhHVWSdvArJzumhm:matrix.org/messages?dir=b&from=t466554-18118302_0_0_0_0_0_0_0_0&limit=500&filter=%7B%22lazy_load_members%22:true%7D&foo=p8c09g
524 timeout
2022-07-22 @5:02:33

{
    "ok": false,
    "timing": 99185.70000004768,
    "startTime": 1658527353492,
    "endTime": 1658527452680,
    "requestId": "p8c09g"
}

115.64s
https://jaeger.proxy.matrix.org/trace/ef47a44ea445b3ae

  • 14.49s /backfill request
  • 54.47s /state_ids request

200 ok: 83.51s

https://matrix-client.matrix.org/_matrix/client/r0/rooms/!OGEhHVWSdvArJzumhm:matrix.org/messages?dir=b&from=t466554-18118302_0_0_0_0_0_0_0_0&limit=500&filter=%7B%22lazy_load_members%22:true%7D&foo=cjfpw
200 ok

{
    "ok": true,
    "status": 200,
    "timing": 83771.90000009537,
    "numEventsReturned": 491,
    "numStateEventsReturned": 129,
    "startTime": 1658527887072,
    "endTime": 1658527970847,
    "requestId": "cjfpw",
    "traceId": "ae7c694e57113282"
}

83.51s
https://jaeger.proxy.matrix.org/trace/ae7c694e57113282

  • 526ms /backfill request
  • 42.77s /state_ids request

200 ok: 75.7s

https://matrix-client.matrix.org/_matrix/client/r0/rooms/!OGEhHVWSdvArJzumhm:matrix.org/messages?dir=b&from=t466554-18118302_0_0_0_0_0_0_0_0&limit=500&filter=%7B%22lazy_load_members%22:true%7D&foo=wg6g8k
2022-07-22 @5:27pm

{
    "ok": true,
    "status": 200,
    "timing": 76131.40000009537,
    "numEventsReturned": 491,
    "numStateEventsReturned": 129,
    "startTime": 1658528812471,
    "endTime": 1658528888604,
    "requestId": "wg6g8k"
    "traceId": "d048d9f59c20555c"
}

75.7s
https://jaeger.proxy.matrix.org/trace/d048d9f59c20555c

  • 549ms /backfill request
  • 32.58s /state_ids request

/messages timing script

Every 61 minutes (to stay outside of the state cache expiry), the script calls and times /messages for each room defined, with ?limit= set to 500, 100, and 20 to see whether the number of messages requested affects the timing.

Run in the browser. Define let MATRIX_TOKEN = 'xxx'; before running the script.

matrix-messages-timing.js
let resultantTimingData = {};
  
async function fetchAndTimeMessagesFromRoom(roomId, numMessagesToFetch) {
  // via https://stackoverflow.com/a/8084248/796832
  const requestId = (Math.random() + 1).toString(36).substring(7);

  let res;
  let timeBeforeRes;
  let timeAfterRes;
  let startTime;
  let endTime;
  try {
    startTime = Date.now();
    timeBeforeRes = performance.now();
    res = await fetch(`https://matrix-client.matrix.org/_matrix/client/r0/rooms/${roomId}/messages?dir=b&from=t466554-18118302_0_0_0_0_0_0_0_0&limit=${numMessagesToFetch}&filter=%7B%22lazy_load_members%22:true%7D&foo=${requestId}`, {
      headers: {
        'Authorization': `Bearer ${MATRIX_TOKEN}`
      }
    });
  } catch(err) {
    console.error(`Error getting result for ${roomId} at ${numMessagesToFetch} messages`, err);
  } finally {
    endTime = Date.now();
    timeAfterRes = performance.now();
  }

  // `res.headers` is a Headers object, so use .get() to read the header
  const traceId = res?.headers.get('synapse-trace-id');
  let data = {};
  if (res?.ok) {
    data = await res.json();
  }

  const timingEntry = {
    ok: res?.ok || false,
    status: res?.status,
    timing: timeAfterRes - timeBeforeRes,
    numEventsReturned: data?.chunk?.length,
    numStateEventsReturned: data?.state?.length,
    startTime,
    endTime,
    requestId,
    traceId
  };

  return timingEntry;
}

(async () => {
  const ROOMS_TO_TEST = {
    'matrix:matrix.org': '!OGEhHVWSdvArJzumhm:matrix.org',
    'openwisp/general': '!RBzfoBeqYcCwLAAenz:gitter.im',
    'ethereum/solidity-dev': '!poXqlbVpQfXKWGseLY:gitter.im',
    'matrix-org/gitter': '!OvgDmYaPOFRRWnIdQe:half-shot.uk',
    'MadLittleMods/new-room1': '!aMzLHLvScQCGKDNqCB:gitter.im'
  };

  const NUM_MESSAGES_TO_FETCH_MATRIX = [
    500,
    100,
    20
  ];

  const ONE_MINUTE_MS = 60 * 1000;
  // The `_state_cache` in Synapse has 60 minute expiry so we chose something just bigger
  const DELAY_BETWEEN_TRIALS = 61 * ONE_MINUTE_MS;

  const REPEAT = 5;

  if (!MATRIX_TOKEN) {
    console.error('MATRIX_TOKEN is not defined. Please define this before running this script');
    return;
  }

  for (let i = 0; i < REPEAT; i++) {
    for (let numMessagesToFetch of NUM_MESSAGES_TO_FETCH_MATRIX) {
      for (let roomKey of Object.keys(ROOMS_TO_TEST)) {
        const roomId = ROOMS_TO_TEST[roomKey];

        const timingEntry = await fetchAndTimeMessagesFromRoom(roomId, numMessagesToFetch);


        // Default the result map so we don't run into null-pointer exception (NPE) issues
        if (!resultantTimingData[roomKey]) {
          resultantTimingData[roomKey] = {};
        }

        if (!resultantTimingData[roomKey][numMessagesToFetch]) {
          resultantTimingData[roomKey][numMessagesToFetch] = [];
        }

        resultantTimingData[roomKey][numMessagesToFetch].push(timingEntry);

        console.log(`New result for ${roomKey} at ${numMessagesToFetch} messages:`, timingEntry);
      }

      // Wait for the caches to clear out before running another trial against the same room
      await new Promise((resolve) => {
        setTimeout(resolve, DELAY_BETWEEN_TRIALS);
      });
    }
  }
})();

// Average the successful timings for a given room and limit
function getAvg(roomId, numMessages) {
    const okTimingEntries = resultantTimingData[roomId][numMessages]
        .filter((timingEntry) => timingEntry.ok);
    return okTimingEntries
        .reduce((total, timingEntry) => (total + timingEntry.timing), 0) / okTimingEntries.length;
}
}

getAvg('matrix:matrix.org', "500")