Investigate why /messages is slow #13356

Open
matrixbot opened this issue Dec 20, 2023 · 0 comments

matrixbot commented Dec 20, 2023

This issue has been migrated from #13356.


Background

For Matrix Public Archive, we're getting a lot of 504 timeouts waiting for Matrix API requests to finish. Specifically, /messages seems to be the slow culprit, and I haven't really heard anything about optimizing it the way /join has been. From a clean Synapse cache, /messages often takes 80s - 120s to respond, and it often times out on Synapse's side as well.

Here is the average data for #matrix:matrix.org from the gitter.ems.host homeserver, sampled every 61 minutes to stay outside of the state cache expiry: /_matrix/client/r0/rooms/${roomId}/messages?dir=b&from=t466554-18118302_0_0_0_0_0_0_0_0&limit=500&filter={"lazy_load_members":true}

  • ?limit=500: 111.81s (492 events / 129 state events)
  • ?limit=100: 119.504s (92 events / 50 state events)
  • ?limit=20: 112.541s (20 events / 11 state events)

After Synapse has a warm cache (I am guessing), these requests take about 5s - 10s.
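
For reference, the filter parameter in these URLs is just URL-encoded JSON; a hypothetical helper (illustrative only, not part of the timing script below) that builds the same request URL:

function buildMessagesUrl(homeserver, roomId, { from, limit }) {
  // `filter` becomes URL-encoded JSON, e.g. {"lazy_load_members":true}
  const params = new URLSearchParams({
    dir: 'b',
    from,
    limit: String(limit),
    filter: JSON.stringify({ lazy_load_members: true }),
  });
  return `${homeserver}/_matrix/client/r0/rooms/${encodeURIComponent(roomId)}/messages?${params}`;
}

// e.g. buildMessagesUrl('https://gitter.ems.host', '!OGEhHVWSdvArJzumhm:matrix.org', { from: 't466554-18118302_0_0_0_0_0_0_0_0', limit: 500 })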

What can we do to make the /messages requests faster? Has any thought already been put into this? Among existing issues, I see that we might be backfilling when we shouldn't, which aligns with the request being slow regardless of how big the limit is: matrix-org/synapse#10742

Caching the response on our end in the archive doesn't help, since it doesn't make the initial request fast in the first place.

From the matrix.org Grafana, I looked into the RoomMessageListRestServlet timing and it all appears very fast (5ms - 500ms), which is not what I experience. I am guessing it's not good at catching the outliers, since the metric seems to just be total time divided by samples. But I would expect a lot more 1 second to 3 second timings even after a warm cache.

-- @MadLittleMods, https://matrix.to/#/!vcyiEtMVHIhWXcJAfl:sw1v.org/$7ctkqN00-4HiYP5TT7bnfROfGXP5Lyt7yyWjrg8bGu4?via=matrix.org&via=element.io&via=beeper.com

Traces for reference

These include a bunch of the /messages instrumentation that was added.

Why is it slow?

Update: We now have a milestone tracking various known slow pieces to improve: https://github.com/matrix-org/synapse/milestone/11

This part is WIP; I'm just documenting some of the log diving I've done for /messages being slow. I still want to do this on some more requests, and hopefully get access to Jaeger to compare and investigate that way too.

1. Backfill linearizer lock takes forever

  • Room: #matrix:matrix.org
  • API URL: https://gitter.ems.host/_matrix/client/r0/rooms/!OGEhHVWSdvArJzumhm:matrix.org/messages?dir=b&from=t466554-18118302_0_0_0_0_0_0_0_0&limit=500&filter=%7B%22lazy_load_members%22:true%7D&foo=12345678
  • Request -> response duration: 117.808s
  • Logs in Kibana: GET-6890
    • For reference on adjusting the logging config for gitter.ems.host (set DEBUG), see this comment.

Notes:

  • Why does the linearizer lock take ~45s to acquire? (See the sketch after this list for what the lock does conceptually.)
  • Fetching the 4.6k backfill points out of the database doesn't seem to take a lot of time, but we shouldn't need to fetch that many in the first place
  • For me on the outside, the request took 117.808s. Where did the rest of the time go?
    • Delay in connecting to the server?
    • Delay in serializing the response?
    • Need to time the requests more precisely, noting when they started and ended, to compare against the Synapse logs
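
For intuition, the linearizer is a per-key FIFO lock: only one backfill per room runs at a time, so every queued /messages request waits behind the current holder. Synapse's actual implementation is Python; this is just a conceptual JS sketch with hypothetical names:

class Linearizer {
  constructor() {
    this.tails = new Map(); // key -> tail promise of that key's queue
  }

  // Resolves to a release() function once the caller holds the lock for `key`.
  async acquire(key) {
    const prevTail = this.tails.get(key) ?? Promise.resolve();
    let release;
    const myTurnDone = new Promise((resolve) => { release = resolve; });
    this.tails.set(key, prevTail.then(() => myTurnDone));
    await prevTail; // every waiter queues behind the current holder
    return release;
  }
}

const linearizer = new Linearizer();
async function backfill(roomId) {
  const release = await linearizer.acquire(`room_backfill:${roomId}`);
  try {
    // ... the actual backfill work; a slow run here stalls every queued request ...
  } finally {
    release();
  }
}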

Timing summary from looking at the logs:

  1. 03:42:43.129 (t+0s ) Received request
  2. 03:42:43.712 (t+0s ) Waiting to acquire linearizer lock 'room_backfill'
  3. 03:43:26.806 (t+43s) Acquired linearizer lock 'room_backfill'
  4. 03:43:26.825 (t+43s) _maybe_backfill_inner and pulled 4.6k events out of the database as potential backfill points
    • get_oldest_event_ids_with_depth_in_room only took 0.12s to get the 4.6k events
  5. 03:43:28.716 (t+45s) Asking t2bot.io for backfill
  6. 03:43:28.985 (t+45s) t2bot.io responded with 100 events
  7. 03:43:29.009 (t+46s) Starting to verify content hashes
  8. 03:43:29.105 (t+46s) Done verifying content hashes
  9. 03:43:29.106 (t+46s) Processing the pulled events
  10. 03:43:29.147 (t+46s) Done processing 100 backfilled events except for $0hu5zprqu6nLC24ISIZL2tL7rpfALh7eOI9MI6CV_1E
  11. 03:43:29.348 (t+46s) Done trying to de-outlier $0hu5zprqu6nLC24ISIZL2tL7rpfALh7eOI9MI6CV_1E (404 /state_ids from t2bot.io)
  12. 03:43:34.026 (t+51s) _get_state_groups_from_groups start
  13. 03:43:38.470 (t+55s) _get_state_groups_from_groups end
  14. 03:43:41.112 (t+58s) Response sent

This isn't a fluke; here is another request I went through the logs on (GET-196). This time the duration matched the Synapse logs pretty well:

  1. 20:38:38.049 (t+0s ) Received request
  2. 20:38:38.062 (t+0s ) Waiting to acquire linearizer lock 'room_backfill'
  3. 20:39:23.622 (t+45s) Acquired linearizer lock 'room_backfill'
  4. 20:39:23.640 (t+45s) _maybe_backfill_inner and pulled 4.6k events out of the database as potential backfill points
  5. 20:39:25.625 (t+47s) Asking t2bot.io for backfill
  6. 20:39:35.262 (t+57s) t2bot.io responded with 100 events
  7. 20:39:35.283 (t+...) Starting to verify content hashes
  8. 20:39:35.382 (t+...) Done verifying content hashes
  9. 20:39:35.383 (t+...) Processing the pulled events
  10. 20:39:35.424 (t+...) Done processing 100 backfilled events except for $0hu5zprqu6nLC24ISIZL2tL7rpfALh7eOI9MI6CV_1E
  11. 20:39:35.651 (t+...) Done trying to de-outlier $0hu5zprqu6nLC24ISIZL2tL7rpfALh7eOI9MI6CV_1E (404 /state_ids from t2bot.io)
  12. 20:39:43.668 (t+65s) Response sent
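
To answer the "where did the rest of the time go?" question, the client-side wall-clock start/end needs to be recorded alongside the duration so requests can be lined up against these log timestamps (modulo clock skew). The full timing script at the bottom of this issue records these fields; in miniature (hypothetical helper):

async function timedFetch(url, options) {
  const startTime = new Date();  // wall clock, comparable to Synapse log times
  const t0 = performance.now();  // monotonic, for an accurate duration
  const res = await fetch(url, options);
  const t1 = performance.now();
  console.log(`start=${startTime.toISOString()} end=${new Date().toISOString()} duration=${((t1 - t0) / 1000).toFixed(3)}s status=${res.status}`);
  return res;
}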

2. Loading tons of events

  • Room: #matrix:matrix.org
  • API URL: https://gitter.ems.host/_matrix/client/r0/rooms/!OGEhHVWSdvArJzumhm:matrix.org/messages?dir=b&from=t466554-18118302_0_0_0_0_0_0_0_0&limit=500&filter=%7B%22lazy_load_members%22:true%7D&foo=12345678
  • Request -> response duration: ~60s
  • Logs in Kibana: GET-5541

Notes:

  • Tons of get_aggregation_groups_for_event and get_recent_references_for_event calls (2k of them). Why?
  • Why did we load 79k events?
    • This lines up with the number of state_events in the room
  • Why did we redact some events? Is this redacting events in our response or in general?
  • Why didn't these things happen in the other calls above?

Timing summary from looking at the logs:

  1. 02:09:51.026 (t+0s ) Received request
  2. 02:09:51.959 (t+1s ) _maybe_backfill_inner backfill and pulled 4.6k events out of the database as potential backfill points
  3. 02:09:52.726 (t+2s) synapse.storage.databases.main.events_worker Loading 79277 events (why?)
  4. 02:10:10.462 (t+19s) Also fetching redaction events
  5. 02:10:10.462 (t+19s) Loaded 79277 events
  6. 02:10:23.246 (t+31s) Done redacting 105 events (why?)
  7. 02:10:23.779 (t+31s) Asking t2bot.io for backfill
  8. 02:10:33.290 (t+41s) t2bot.io responded with 100 events
  9. ... (t+...) 2k calls to get_recent_references_for_event and get_aggregation_groups_for_event (why?)
  10. 02:10:54.261 (t+63s) Response sent

3. Slow /state_ids requests but also really slow processing

524 timeout: 123.26s

https://matrix-client.matrix.org/_matrix/client/r0/rooms/!OGEhHVWSdvArJzumhm:matrix.org/messages?dir=b&from=t466554-18118302_0_0_0_0_0_0_0_0&limit=500&filter=%7B%22lazy_load_members%22:true%7D&foo=ss4cm
524 timeout
2022-07-22 @4:43

123.26s
https://jaeger.proxy.matrix.org/trace/7c4b7fe54ba6f5ab

  • 19.24s /backfill request
  • 53.45s /state_ids request

524 timeout: 117.96s

https://matrix-client.matrix.org/_matrix/client/r0/rooms/!OGEhHVWSdvArJzumhm:matrix.org/messages?dir=b&from=t466554-18118302_0_0_0_0_0_0_0_0&limit=500&filter=%7B%22lazy_load_members%22:true%7D&foo=0p14c
524 timeout
2022-07-22 @4:56

117.96s
https://jaeger.proxy.matrix.org/trace/e67f019385c47fd9

  • 921ms /backfill request
  • 47.24s /state_ids request

524 timeout: 115.64s

https://matrix-client.matrix.org/_matrix/client/r0/rooms/!OGEhHVWSdvArJzumhm:matrix.org/messages?dir=b&from=t466554-18118302_0_0_0_0_0_0_0_0&limit=500&filter=%7B%22lazy_load_members%22:true%7D&foo=p8c09g
524 timeout
2022-07-22 @5:02:33

{
    "ok": false,
    "timing": 99185.70000004768,
    "startTime": 1658527353492,
    "endTime": 1658527452680,
    "requestId": "p8c09g"
}

115.64s
https://jaeger.proxy.matrix.org/trace/ef47a44ea445b3ae

  • 14.49s /backfill request
  • 54.47s /state_ids request

200 ok: 83.51s

https://matrix-client.matrix.org/_matrix/client/r0/rooms/!OGEhHVWSdvArJzumhm:matrix.org/messages?dir=b&from=t466554-18118302_0_0_0_0_0_0_0_0&limit=500&filter=%7B%22lazy_load_members%22:true%7D&foo=cjfpw
200 ok

{
    "ok": true,
    "status": 200,
    "timing": 83771.90000009537,
    "numEventsReturned": 491,
    "numStateEventsReturned": 129,
    "startTime": 1658527887072,
    "endTime": 1658527970847,
    "requestId": "cjfpw",
    "traceId": "ae7c694e57113282"
}

83.51s
https://jaeger.proxy.matrix.org/trace/ae7c694e57113282

  • 526ms /backfill request
  • 42.77s /state_ids request

200 ok: 75.7s

https://matrix-client.matrix.org/_matrix/client/r0/rooms/!OGEhHVWSdvArJzumhm:matrix.org/messages?dir=b&from=t466554-18118302_0_0_0_0_0_0_0_0&limit=500&filter=%7B%22lazy_load_members%22:true%7D&foo=wg6g8k
2022-07-22 @5:27pm

{
    "ok": true,
    "status": 200,
    "timing": 76131.40000009537,
    "numEventsReturned": 491,
    "numStateEventsReturned": 129,
    "startTime": 1658528812471,
    "endTime": 1658528888604,
    "requestId": "wg6g8k"
    "traceId": "d048d9f59c20555c"
}

75.7s
https://jaeger.proxy.matrix.org/trace/d048d9f59c20555c

  • 549ms /backfill request
  • 32.58s /state_ids request

/messages timing script

Every 61 minutes (to stay outside of the state cache expiry), the script calls and times /messages for each room defined, with ?limit= set to 500, 100, and 20 to see whether the number of messages requested affects the timing.

Run in the browser. Define let MATRIX_TOKEN = 'xxx'; before running the script.

matrix-messages-timing.js
let resultantTimingData = {};
  
async function fetchAndTimeMessagesFromRoom(roomId, numMessagesToFetch) {
  // via https://stackoverflow.com/a/8084248/796832
  const requestId = (Math.random() + 1).toString(36).substring(7);

  let res;
  let timeBeforeRes;
  let timeAfterRes;
  let startTime;
  let endTime;
  try {
    startTime = Date.now();
    timeBeforeRes = performance.now();
    res = await fetch(`https://matrix-client.matrix.org/_matrix/client/r0/rooms/${roomId}/messages?dir=b&from=t466554-18118302_0_0_0_0_0_0_0_0&limit=${numMessagesToFetch}&filter=%7B%22lazy_load_members%22:true%7D&foo=${requestId}`, {
      headers: {
        'Authorization': `Bearer ${MATRIX_TOKEN}`
      }
    });
  } catch(err) {
    console.error(`Error getting result for ${roomId} at ${numMessagesToFetch} messages`, err);
  } finally {
    endTime = Date.now();
    timeAfterRes = performance.now();
  }

  // `res.headers` is a Headers object, so use .get() to read the header
  const traceId = res?.headers.get('synapse-trace-id');
  let data = {};
  if (res?.ok) {
    data = await res.json();
  }

  const timingEntry = {
    ok: res?.ok || false,
    status: res?.status,
    timing: timeAfterRes - timeBeforeRes,
    numEventsReturned: data?.chunk?.length,
    numStateEventsReturned: data?.state?.length,
    startTime,
    endTime,
    requestId,
    traceId
  };

  return timingEntry;
}

(async () => {
  const ROOMS_TO_TEST = {
    'matrix:matrix.org': '!OGEhHVWSdvArJzumhm:matrix.org',
    'openwisp/general': '!RBzfoBeqYcCwLAAenz:gitter.im',
    'ethereum/solidity-dev': '!poXqlbVpQfXKWGseLY:gitter.im',
    'matrix-org/gitter': '!OvgDmYaPOFRRWnIdQe:half-shot.uk',
    'MadLittleMods/new-room1': '!aMzLHLvScQCGKDNqCB:gitter.im'
  };

  const NUM_MESSAGES_TO_FETCH_MATRIX = [
    500,
    100,
    20
  ];

  const ONE_MINUTE_MS = 60 * 1000;
  // The `_state_cache` in Synapse has 60 minute expiry so we chose something just bigger
  const DELAY_BETWEEN_TRIALS = 61 * ONE_MINUTE_MS;

  const REPEAT = 5;

  if (!MATRIX_TOKEN) {
    console.error('MATRIX_TOKEN is not defined. Please define this before running this script');
    return;
  }

  for (let i = 0; i < REPEAT; i++) {
    for (let numMessagesToFetch of NUM_MESSAGES_TO_FETCH_MATRIX) {
      for (let roomKey of Object.keys(ROOMS_TO_TEST)) {
        const roomId = ROOMS_TO_TEST[roomKey];

        const timingEntry = await fetchAndTimeMessagesFromRoom(roomId, numMessagesToFetch);


        // Default the result map so we don't run into null-pointer exception (NPE) issues
        if (!resultantTimingData[roomKey]) {
          resultantTimingData[roomKey] = {};
        }

        if (!resultantTimingData[roomKey][numMessagesToFetch]) {
          resultantTimingData[roomKey][numMessagesToFetch] = [];
        }

        resultantTimingData[roomKey][numMessagesToFetch].push(timingEntry);

        console.log(`New result for ${roomKey} at ${numMessagesToFetch} messages:`, timingEntry);
      }

      // Wait for the caches to clear out before running another trial against the same room
      await new Promise((resolve) => {
        setTimeout(resolve, DELAY_BETWEEN_TRIALS);
      });
    }
  }
})();

// Average the successful timings for a given room and limit
function getAvg(roomId, numMessages) {
    const okTimingEntries = resultantTimingData[roomId][numMessages]
        .filter((timingEntry) => timingEntry.ok);
    return okTimingEntries
        .reduce((total, timingEntry) => (total + timingEntry.timing), 0) / okTimingEntries.length;
}
}

getAvg('matrix:matrix.org', "500")