This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

Improve API/event validation in synapse #8445

Open
richvdh opened this issue Oct 1, 2020 · 8 comments
Labels
A-Validation 500 (mostly) errors due to lack of event/parameter validation T-Task Refactoring, removal, replacement, enabling or disabling functionality, other engineering tasks.

Comments

@richvdh
Member

richvdh commented Oct 1, 2020

Background

We've recently encountered a number of bugs in which malformed (or at least, unexpected) data has caused Synapse or clients to misbehave in some way. These bugs stem from the fact that, faced with a given datastructure, you cannot rely on it having the expected format.

Such bugs are disruptive, and in extreme cases could progress beyond "denial of service" to "security threat", and it might be possible to avoid them by validating data at the point of entry to Synapse. We've recently discussed this in some depth within the core Synapse team; this issue serves to record some of our thoughts on the topic, including promising areas for further development.

Introduction

There are actually multiple reasons why it would be useful to improve validation of data within Synapse. These include:

  • As above, avoiding potential bugs
  • Simplifying code by removing isinstance checks everywhere
  • Forcing clients to follow the specification rather than using "whatever works in Synapse", thus improving compatibility with other server implementations which do enforce the spec (or conversely: not forcing other implementations to grandfather in Synapse's misfeatures for compatibility with clients)
  • Making synapse return 400 errors rather than "Internal Server Errors" if clients pass invalid parameters would:
    • Give better feedback to client developers: a 500 error gives the developer little information about their mistake.
    • Allow server administrators to distinguish between "my server is having problems" and "somebody passed invalid parameters to my server"
    • Allow load-balancers to distinguish between "this process is having problems" and "somebody passed invalid parameters". Notably, Cloudflare will stop sending requests to an origin which returns too many 500 errors.
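To make the distinction concrete, here is a minimal sketch (not Synapse's actual internals; `InvalidParamError` and `parse_limit` are hypothetical names) of rejecting a bad client parameter with a 400 and a spec-style error body, rather than letting a `TypeError` surface as a 500:

```python
# Hypothetical sketch: reject bad client parameters with an HTTP 400
# and a Matrix-style errcode instead of letting an unhandled exception
# bubble up as an Internal Server Error.

class InvalidParamError(Exception):
    """Maps to an HTTP 400 with a Matrix-style error body."""

    def __init__(self, msg: str):
        super().__init__(msg)
        self.code = 400
        self.body = {"errcode": "M_INVALID_PARAM", "error": msg}


def parse_limit(args: dict) -> int:
    """Parse an optional 'limit' query parameter, defaulting to 10."""
    raw = args.get("limit", "10")
    if not isinstance(raw, str) or not raw.isdigit():
        raise InvalidParamError("'limit' must be a non-negative integer")
    return int(raw)
```

A load-balancer or admin seeing the resulting 400 can then tell "bad request" apart from "server on fire".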

There are multiple related, but different, things we mean when we talk about validation. At a high level, these can be broken down into:

  • validating API request parameters (both CS and SS API); and
  • validating events (and other such data types, like EDUs etc).

Let's consider these in turn.

API validation

Given that we already have JSON schema specifications for our APIs, this is theoretically relatively straightforward (see, for example c39941c, which is a proof-of-concept applying this to a single endpoint), though it's certain to uncover a large number of places where clients are relying on non-spec-compliant behaviour. We also have to be wary, especially on the SS API, to ensure backwards compatibility.

Doing this would certainly reduce the number of false 500 errors, which as above brings a number of advantages. It might also reduce the occurrences of bugs due to bad events (e.g. non-string display names) since updated Synapses will correctly reject them on the CS API; however, it will absolutely not fix those bugs, since such malformed data could still be received from buggy or malicious servers over federation.

Event validation

Validating events, particularly those received over federation, is quite hard as:

  1. you need to decide what to do with events that fail validation,
  2. retaining backwards compatibility means it’s hard to reject/drop/decline to handle events that fail strict validation.

For example, an event with malformed m.relates_to data can’t just be dropped as, according to the authentication rules of all room versions to date, it is a valid event, even though its payload (that Synapse still interprets) is invalid, since annotations were added after the current room versions were specced.

The main problem we have currently is that events contain a mix of properties which are validated on receipt (auth_events, prev_events, etc) alongside a bunch of untrusted data that we cannot assume nor assert the types of. Then, when we come to access event data, we need to remember to add checks for any untrusted data. While room versions allow us to add stricter schemas, relying on that approach will always be a case of playing catchup as we’ll want to use new features in older room versions where possible. (See also MSC2801 on this subject.)

Ideally, therefore, we’d try to add some tooling to make it easier to statically assert (or at least tell easily in the code) whether the fields you’re accessing on events (and other data types) have been validated or not. It should hopefully be possible to do something with mypy here, though it will require some experimentation.
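One possible shape for such tooling (an assumption, not an agreed design; names like `ValidatedDisplayName` are illustrative) is to use `typing.NewType` so that only a validation helper can produce the typed value, and mypy flags any code path that passes raw event content straight through:

```python
# Sketch: mark validated data at the type level so mypy can distinguish
# it from untrusted event content. The 256-character limit mirrors the
# kind of bound a validator might enforce; it is illustrative here.
from typing import Any, NewType

ValidatedDisplayName = NewType("ValidatedDisplayName", str)


def validate_displayname(raw: Any) -> ValidatedDisplayName:
    """The only way to obtain a ValidatedDisplayName is through this check."""
    if not isinstance(raw, str) or len(raw) > 256:
        raise ValueError("displayname must be a string of at most 256 chars")
    return ValidatedDisplayName(raw)


def render(name: ValidatedDisplayName) -> str:
    # Downstream code declares it needs validated input; passing a bare
    # str (i.e. unvalidated event content) here is a mypy error.
    return f"Display name: {name}"
```

At runtime `NewType` is free (it's the identity function), so this costs nothing on the hot path.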

Conclusions

In summary there are a few paths going forward:

  1. validate schemas of APIs;
  2. validate a basic schema for events (and other data types), potentially using room versions to allow us to enforce stricter schemas while retaining backwards compatibility; and
  3. experiment with mypy and other tooling (as well as code changes) to try and highlight the difference between validated and unvalidated data in events, and to enforce correct validation when using untrusted data.

Both 1 and 2 would be good things to do, and may reduce the number of occurrences of bugs, but won’t actually fix the class of bugs we’re seeing due to unexpected formats of various keys in events. However, while paths 1 and 2 are something that we know how to do, path 3 has a lot more unknowns attached to it, but ultimately is the only option that will fully prevent the class of bugs we’re seeing.

@ShadowJonathan
Contributor

ShadowJonathan commented Oct 2, 2020

Small note about 3: mypy could be a good fit for enforcing internal consistency of data access after validation checks have passed, by annotating return types with TypedDict or @attr.s classes; other areas of the program can then work with data structures that are guaranteed validated and enforced. Also, after validation, the passed data structure (but not its content) should be immutable, and only transformed into database-ready types or dicts (for example) for serialisation. (#8376 could be relevant to this point)
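A rough illustration of the suggestion (field and function names are hypothetical; Synapse already depends on `attrs`): parse untrusted input into a frozen `@attr.s` class once at the boundary, then pass the immutable object around:

```python
# Sketch: validate untrusted m.relates_to content into a frozen attrs
# class at the point of entry; downstream code gets guaranteed types
# and cannot mutate the structure.
import attr


@attr.s(frozen=True, auto_attribs=True)
class RelatesTo:
    rel_type: str
    event_id: str


def parse_relates_to(content: dict) -> RelatesTo:
    raw = content.get("m.relates_to")
    if not isinstance(raw, dict):
        raise ValueError("m.relates_to must be an object")
    rel_type = raw.get("rel_type")
    event_id = raw.get("event_id")
    if not isinstance(rel_type, str) or not isinstance(event_id, str):
        raise ValueError("rel_type and event_id must be strings")
    return RelatesTo(rel_type=rel_type, event_id=event_id)
```

Attempting to assign to a field of the frozen instance raises `attr.exceptions.FrozenInstanceError`, which gives the immutability property mentioned above.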

Additionally; Should we use 400 when schema validation fails, and 422 (Unprocessable Entity) when event validation fails? Should that be reflected back into the spec?

(422 according to the mozilla documentation: "The HyperText Transfer Protocol (HTTP) 422 Unprocessable Entity response status code indicates that the server understands the content type of the request entity, and the syntax of the request entity is correct, but it was unable to process the contained instructions.")

@clokep clokep added the A-Validation 500 (mostly) errors due to lack of event/parameter validation label Oct 2, 2020
@bmarty

bmarty commented Feb 16, 2021

I will not create a dedicated issue, but as an example: the Synapse CS API will accept a state event from a client with an unspecified value. For instance, using "foo" for the "join_rule" field of https://matrix.org/docs/spec/client_server/r0.6.1#m-room-join-rules

This breaks compatibility on some clients (e.g. on Element Android this event is ignored and not displayed in the timeline), and I'm not sure what happens server-side regarding the room properties, or what fallback we should display to the user in this case.

At least on the CS API I would expect to get a 400 error when trying to send such a malformed event.
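A minimal sketch of the missing check (function name hypothetical; the allowed values are those defined by the r0.6.1 spec linked above):

```python
# Sketch: reject m.room.join_rules state events whose join_rule value
# is not one the spec defines, so the client gets a 400 instead of the
# event being silently accepted.
KNOWN_JOIN_RULES = {"public", "knock", "invite", "private"}


def check_join_rules_event(content: dict) -> None:
    join_rule = content.get("join_rule")
    if join_rule not in KNOWN_JOIN_RULES:
        raise ValueError(
            f"Unknown join_rule {join_rule!r}; "
            f"expected one of {sorted(KNOWN_JOIN_RULES)}"
        )
```

Newer room versions add values (e.g. restricted), so in practice the allowed set would need to depend on the room version.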

@ShadowJonathan
Contributor

I'd also like to mention pydantic here as an interesting option, as it provides speedy (C-extension-backed) validation of objects.
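For a flavour of what that looks like (field names illustrative; requires the `pydantic` package): type checking happens at construction time, and invalid input raises before any handler code runs:

```python
# Small pydantic sketch: a model declares the expected types, and
# constructing it with invalid input raises ValidationError with
# structured details about which field failed.
from pydantic import BaseModel, ValidationError


class ProfileUpdate(BaseModel):
    displayname: str


try:
    ProfileUpdate(displayname=["not", "a", "string"])
except ValidationError as e:
    # errors() reports the offending field location and reason.
    print("rejected:", e.errors()[0]["loc"])
```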

@erikjohnston erikjohnston added the T-Task Refactoring, removal, replacement, enabling or disabling functionality, other engineering tasks. label Jul 26, 2021
@clokep
Member

clokep commented Aug 12, 2021

> (see, for example c39941c, which is a proof-of-concept applying this to a single endpoint)

I had done some testing with this and found it to be quite a bit slower (from memory about 100x slower). I had put together a benchmark script at https://gist.github.com/clokep/17064b16a9b471d09feb3e61851886af

Results from running this on my 2016 MBP with Python 3.9.6 (times in nanoseconds):

| Test    | Manual (ns) | JSON Schema (ns) |
| ------- | ----------- | ---------------- |
| Empty   | 305         | 14200            |
| Error   | 309         | 39800            |
| Correct | 294         | 22600            |
Raw results

```
.....................
WARNING: the benchmark result may be unstable
* the standard deviation (3.38 us) is 24% of the mean (14.2 us)
* the maximum (31.5 us) is 121% greater than the mean (14.2 us)

Try to rerun the benchmark with more runs, values and/or loops.
Run 'python -m pyperf system tune' command to reduce the system jitter.
Use pyperf stats, pyperf dump and pyperf hist to analyze results.
Use --quiet option to hide these warnings.

json_schema_empty: Mean +- std dev: 14.2 us +- 3.4 us
.....................
manual_empty: Mean +- std dev: 305 ns +- 21 ns
.....................
WARNING: the benchmark result may be unstable
* the standard deviation (5.16 us) is 13% of the mean (39.8 us)

Try to rerun the benchmark with more runs, values and/or loops.
Run 'python -m pyperf system tune' command to reduce the system jitter.
Use pyperf stats, pyperf dump and pyperf hist to analyze results.
Use --quiet option to hide these warnings.

json_schema_wrong_type: Mean +- std dev: 39.8 us +- 5.2 us
.....................
WARNING: the benchmark result may be unstable
* the maximum (475 ns) is 54% greater than the mean (309 ns)

Try to rerun the benchmark with more runs, values and/or loops.
Run 'python -m pyperf system tune' command to reduce the system jitter.
Use pyperf stats, pyperf dump and pyperf hist to analyze results.
Use --quiet option to hide these warnings.

manual_wrong_type: Mean +- std dev: 309 ns +- 27 ns
.....................
json_schema_good: Mean +- std dev: 22.6 us +- 1.4 us
.....................
manual_good: Mean +- std dev: 294 ns +- 11 ns
```
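The shape of such a comparison can be sketched with just the stdlib (the real script is in the gist linked above; `generic_check` below is a toy stand-in for jsonschema, not its actual implementation):

```python
# Rough reconstruction of the benchmark's shape: a hand-written type
# check versus a generic validator that walks a schema dict on every
# call, which is where much of the per-call overhead comes from.
import timeit


def manual_check(body):
    return isinstance(body, dict) and isinstance(body.get("limit", 0), int)


def generic_check(body, schema):
    # Toy recursive validator, standing in for a real JSON Schema library.
    if schema["type"] == "object":
        if not isinstance(body, dict):
            return False
        return all(
            generic_check(body[k], sub)
            for k, sub in schema.get("properties", {}).items()
            if k in body
        )
    return isinstance(body, {"integer": int, "string": str}[schema["type"]])


SCHEMA = {"type": "object", "properties": {"limit": {"type": "integer"}}}
body = {"limit": 10}
print("manual :", timeit.timeit(lambda: manual_check(body), number=100_000))
print("generic:", timeit.timeit(lambda: generic_check(body, SCHEMA), number=100_000))
```

Even this toy version shows a gap; a full JSON Schema implementation does far more work per call, hence the ~100x figure above.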

@DMRobertson
Contributor

> I had done some testing with this and found it to be quite a bit slower (from memory about 100x slower).

@clokep I think your benchmark just does the validation and nothing else? I'd be interested to see how those numbers compare when they're part of request handling as a whole.

@clokep
Member

clokep commented Aug 12, 2021

> > I had done some testing with this and found it to be quite a bit slower (from memory about 100x slower).
>
> @clokep I think your benchmark just does the validation and nothing else? I'd be interested to see how those numbers compare when they're part of request handling as a whole.

Yes, that's what I was attempting to benchmark. 😄 (It also is completely fake -- the schemas of most of our endpoints are much more complicated.) As I had said in chat, I didn't spend too much time on this; I was attempting to get a gut feeling for whether it would make sense to sink a lot more time into it or not.

DMRobertson pushed a commit that referenced this issue Aug 20, 2021
* Validate device_keys for C-S /keys/query requests

Closes #10354

A small, not particularly critical fix. I'm interested in seeing if we
can find a more systematic approach though. #8445 is the place for any discussion.
@Kay-kay2019

Working it

@jaywink
Member

jaywink commented Apr 14, 2022

One case I ran into: Synapse accepts an object in the `allow` field of join rules, when it should be an array of objects.
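A minimal sketch of the missing check (function name hypothetical): verify that `allow`, when present, is a list whose entries are objects, rather than a bare object:

```python
# Sketch: the `allow` key of a restricted join rule must be an array
# of objects; reject a bare object (or any non-list) with an error
# that can be surfaced as a 400.
def check_allow(content: dict) -> None:
    allow = content.get("allow")
    if allow is None:
        return  # `allow` is optional
    if not isinstance(allow, list) or not all(
        isinstance(entry, dict) for entry in allow
    ):
        raise ValueError("'allow' must be an array of objects")
```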
