Skip to content
This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

Backfill gets clogged up with fake prev_events when using the MSC2716 /batch_send endpoint #11091

Closed
Tracked by #10737
MadLittleMods opened this issue Oct 14, 2021 · 0 comments · Fixed by #11114
Closed
Tracked by #10737
Assignees
Labels
A-Federation T-Defect Bugs, crashes, hangs, security vulnerabilities, or other reported issues.

Comments

@MadLittleMods
Copy link
Contributor

MadLittleMods commented Oct 14, 2021

Spawning from #11027 (comment)


When using the experimental MSC2716 /batch_send endpoint, we're using a fake prev_event to make the chain of historical state and messages float off on their own in the DAG and nudges federated homeservers to ask for the state at the start of the chunk that we generated.

Each time one of these fake prev_events is processed by the homeserver, it's seen as a backward extremity which we attempt to fetch when trying to do backfill federation things but will never resolve. This is problematic as we will have thousands of batches per room to backfill a big room for Gitter and will have to deal with thousands of backward extremities here. Since we use get_oldest_event_ids_with_depth_in_room and oldest backward extremities will always be our historical events. Assuming we're paginating from the bottom (where current_depth > historical extremity depth), we will always pick the 5 unresolvable fake events in the room to backfill from. This would stop us from backfilling any other events in the room.

Potential solutions

  • Ignore fake event ID's and never put as a backward extremity
    • I don't know how we do this for remote homeservers though.
  • Back-off backfilling a certain event when they fail to fetch
  • Can we just have no prev_events for a given event? How does the first event in the room work?

The solution might be intertwined with #10764.

Reproduction case

Untested for sure (just theoretical): Send more than 5 batches of historical messages. Try to backfill messages from another server.

There is a Complement test which does many batches but doesn't test the backfill aspect, matrix-org/complement#212

@MadLittleMods MadLittleMods added A-Federation T-Defect Bugs, crashes, hangs, security vulnerabilities, or other reported issues. labels Oct 14, 2021
MadLittleMods added a commit that referenced this issue Oct 21, 2021
Fix #11091

We have to allow creation of events with no prev_events
but do have auth_events.

And since the historical member events are outliers
with no prev_events to resolve them, we want to avoid
putting them as backward extremeties.
@MadLittleMods MadLittleMods self-assigned this Feb 4, 2022
MadLittleMods added a commit that referenced this issue Feb 7, 2022
…vers (MSC2716) (#11114)

Fix #11091
Fix #10764 (side-stepping the issue because we no longer have to deal with `fake_prev_event_id`)

 1. Made the `/backfill` response return messages in `(depth, stream_ordering)` order (previously only sorted by `depth`)
    - Technically, it shouldn't really matter how `/backfill` returns things but I'm just trying to make the `stream_ordering` a little more consistent from the origin to the remote homeservers in order to get the order of messages from `/messages` consistent ([sorted by `(topological_ordering, stream_ordering)`](https://github.com/matrix-org/synapse/blob/develop/docs/development/room-dag-concepts.md#depth-and-stream-ordering)).
    - Even now that we return backfilled messages in order, it still doesn't guarantee the same `stream_ordering` (and more importantly the [`/messages` order](https://github.com/matrix-org/synapse/blob/develop/docs/development/room-dag-concepts.md#depth-and-stream-ordering)) on the other server. For example, if a room has a bunch of history imported and someone visits a permalink to a historical message back in time, their homeserver will skip over the historical messages in between and insert the permalink as the next message in the `stream_order` and totally throw off the sort.
       - This will be even more the case when we add the [MSC3030 jump to date API endpoint](matrix-org/matrix-spec-proposals#3030) so the static archives can navigate and jump to a certain date.
       - We're solving this in the future by switching to [online topological ordering](matrix-org/gomatrixserverlib#187) and [chunking](#3785) which by its nature will apply retroactively to fix any inconsistencies introduced by people permalinking
 2. As we're navigating `prev_events` to return in `/backfill`, we order by `depth` first (newest -> oldest) and now also tie-break based on the `stream_ordering` (newest -> oldest). This is technically important because MSC2716 inserts a bunch of historical messages at the same `depth` so it's best to be prescriptive about which ones we should process first. In reality, I think the code already looped over the historical messages as expected because the database is already in order.
 3. Making the historical state chain and historical event chain float on their own by having no `prev_events` instead of a fake `prev_event` which caused backfill to get clogged with an unresolvable event. Fixes #11091 and #10764
 4. We no longer find connected insertion events by finding a potential `prev_event` connection to the current event we're iterating over. We now solely rely on marker events which when processed, add the insertion event as an extremity and the federating homeserver can ask about it when time calls.
    - Related discussion, #11114 (comment)


Before | After
--- | ---
![](https://user-images.githubusercontent.com/558581/139218681-b465c862-5c49-4702-a59e-466733b0cf45.png) | ![](https://user-images.githubusercontent.com/558581/146453159-a1609e0a-8324-439d-ae44-e4bce43ac6d1.png)



#### Why aren't we sorting topologically when receiving backfill events?

> The main reason we're going to opt to not sort topologically when receiving backfill events is because it's probably best to do whatever is easiest to make it just work. People will probably have opinions once they look at [MSC2716](matrix-org/matrix-spec-proposals#2716) which could change whatever implementation anyway.
> 
> As mentioned, ideally we would do this but code necessary to make the fake edges but it gets confusing and gives an impression of “just whyyyy” (feels icky). This problem also dissolves with online topological ordering.
>
> -- #11114 (comment)

See #11114 (comment) for the technical difficulties
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
A-Federation T-Defect Bugs, crashes, hangs, security vulnerabilities, or other reported issues.
Projects
None yet
1 participant