Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Scope the connections to devices and not to a hash of the access token #51

Closed
sandhose opened this issue Mar 30, 2023 · 15 comments
Closed
Assignees
Labels
Milestone

Comments

@sandhose
Copy link
Member

sandhose commented Mar 30, 2023

It looks like some stuff is scoped to a hash of the access token, which will become a problem when we have refreshing access tokens, either via MSC2918 or with OIDC.

func HashedTokenFromRequest(req *http.Request) (hashAccessToken string, accessToken string, err error) {
// return a hash of the access token
ah := req.Header.Get("Authorization")
if ah == "" {
return "", "", fmt.Errorf("missing Authorization header")
}
accessToken = strings.TrimPrefix(ah, "Bearer ")
// important that this is a cryptographically secure hash function to prevent
// preimage attacks where Eve can use a fake token to hash to an existing device ID
// on the server.
hash := sha256.New()
hash.Write([]byte(accessToken))
return hex.EncodeToString(hash.Sum(nil)), accessToken, nil
}

It should be using the device_id instead, which are stable even when the token refreshes

Maybe related to #22?

Some (internal) discussions about that:

@kegsay
Copy link
Member

kegsay commented Mar 30, 2023

MSC3970 changes the scope of transaction IDs to be based on device ID instead of access token. how might this impact SS or not?

native SS implementations will treat it however homeservers do it today. Native SS doesn't modify the event in any special way. That being said, the proxy implementation DOES edit the transaction ID because that's how it can de-duplicate room data as it's multi-tenant with other users. It currently does use the access_token as a device ID, and this assumption is baked into most of the proxy as it predates the ability to get the device ID from /whoami

how does auth work with the SS proxy and if the access token changes (i.e. a new one issued from a refresh token) will it have any impact?

it would be treated as a new device for the same user, causing a new poller to be created and an initial sync to be performed. The proxy will still be able to serve up data for the user's account instantly due to having some data from the previous token, so in theory it should Just Work with the caveat that it might take fractionally longer to sync the first time you use the new token, though I have not tested this. I assume the old token will 401/403, which will then cause the old poll loop to terminate which will then tear down the old connection. ah, though the to-device table is keyed off SHA256(access_token) so there will be a race condition where if a new to-device event arrived and before it got delivered to the client the token was invalidated, then it will be permanently lost. which isn't great. We'd need to key off a static identifier e.g SHA256(user_id $DELIMITER device_id)

@DMRobertson
Copy link
Contributor

Note that this is an internal-only change, transparent to downstream clients and upstream homeservers.

We'd need to key off a static identifier e.g SHA256(user_id $DELIMITER device_id)

Alternatively we could just store user_id $DELIMITER device_id directly. This has the benefit of making debugging slightly easier.

For this to work, the homeserver must consistently return a user_id with the same capitalisation (legacy mxids etc grrr). We also want the user id and device ids to have bounded length. Not sure if device ids have a maximum length in the spec? If not, we can truncate the device ID to a sensibly-sized prefix.

@kegsay kegsay added this to the v1 milestone Apr 5, 2023
@DMRobertson DMRobertson self-assigned this Apr 11, 2023
@DMRobertson
Copy link
Contributor

DMRobertson commented Apr 11, 2023

We also want the user id and device ids to have bounded length.

On matrix IDs:

The length of a user ID, including the @ sigil and the domain, MUST NOT exceed 255 characters.

We'd have to choose DELIMITER to be something compatible with the historical user ids, which means it'll need to be a nonprinting ascii character.

Not sure if device ids have a maximum length in the spec?

Not that I can see. A quick look on Matrix.org shows that almost all (99.9995%) of device IDs are <= 43 characters (codepoints) long. But a large chunk of those are for signing keys, which aren't ever going to be used for syncing. Excluding singing key devices, almost all (99.9990%) devices are <= 16 characters (codepoints) long.

DMRobertson pushed a commit that referenced this issue Apr 11, 2023
Useful for #51, small enough to include in isolation
@DMRobertson

This comment was marked as outdated.

@DMRobertson

This comment was marked as outdated.

@DMRobertson

This comment was marked as outdated.

@DMRobertson
Copy link
Contributor

DMRobertson commented Apr 12, 2023

Notes:

  • The proxy doesn't care about the distinction between user id and hs device id. All the other tables are keyed off one text field so it really will be easier to concate $USER$DELIMITER$DEVICE.
  • One-off migration at startup to rewrite all tables. Write it in go code, make /whoami calls as needed.
  • In a refresh tokens world we do need to be aware that tokens will expire underneath us. So we basically need to solve Deal with out-of-sync room state, e.g. due to the last device logging out or leaving the room #18 as a prerequisite.
    • We hold onto the since tokens. Try an incremental sync afterwards, if it's not gappy then all good; if it is gappy, drop everything and initial sync.
    • Problem here: the proxy assumes that if we have room state in the proxy's DB, then it is up to date. Everything is based on the assumption of a single contiguous room timeline.
    • If we commit to refreshing access tokens (and/or to fixing Deal with out-of-sync room state, e.g. due to the last device logging out or leaving the room #18) then that assumption is broken: the proxy might have a stale snapshot of room state. (We might get away with this if there have been only a few (10) events in the room and the poller doesn't get a gappy sync response---but that is not guaranteed. Could bump up that limit.)
    • We suspect that we already have problems along these lines.

Could do:

  • Whenever a poller sees a state block in an incremental sync v2 response, fetch all of its events from the proxy's database by event IDs. If they all have NIDs they have already been processed by the proxy, and all is well. If there are some events missing a NID, prepend them to the timeline. (The timeline order will be wrong, but at least both the proxy and the client will now see up to date state---we've self-healed.) This needs to all happen within a single database transaction. This is also a bad place to be in so needs to be loudly logged and Sentried. something something eventually consistent.
  • The above doesn't account for missing message events. One might think that the proxy could backfill these (prev_batch etc) but that a) would be slow and b) would fetch events in topological ordering, not stream ordering. That breaks the assumptions of the state update logic.

@DMRobertson
Copy link
Contributor

@sandhose pointed out to me that matrix-org/matrix-spec-proposals#3970 has landed---transaction IDs are scoped to devices (meaning userID + deviceID) and not "client session". I think the proxy was already effectively doing this, but will need to check and probably write a test for this.

@DMRobertson
Copy link
Contributor

Problem: how do we distinguish between a device expiry and a token expiry?

If we don't, we'll end up keeping device-scoped data forever, in particular, to-device messages. This will be A Problem.

@sandhose
Copy link
Member Author

I think one potential solution would be to make the proxy OIDC-native.
The general idea is that when there is an incoming access token, it can do the introspection itself to know the mxid + device_id, and exchange that token for a longer-lived token that refreshes independently from the client access token, but where the session is still linked to it. If the client session ends, the derived token will also be revoked.

The MAS side is not really ready for that though, so it depends how much of a problem this ends up being? We could also imagine a (temporary) hacky solution, where MAS calls some special endpoint on the proxy when a device gets deleted?

@DMRobertson
Copy link
Contributor

Problem: how do we distinguish between a device expiry and a token expiry?

Maybe it's sufficient to look for soft or hard logouts, and expire entire devices on hard logouts?

@sandhose
Copy link
Member Author

Long story short: soft-logout is not really usable in your case (and with Matrix refreshing tokens) and for a simple reason: the client should be able to track whenever their access tokens is about to expire, and the soft-logout would happen when it tries to refresh the token, which of course the proxy doesn't know about. The old, expired access tokens won't have a soft_logout: true in the error response.

@hughns
Copy link
Member

hughns commented May 26, 2023

Some alternatives that we can suggest for determining when device state can be purged:

  • having the HS explicitly call the proxy to tell it that a device has logged out
  • in the case of the proxy having more than one device for a user: periodically query the /devices endpoint to get list of active devices for a user and purge any that are no longer valid
  • purge any device after a period of inactivity (this might be sensible as a fallback anyhow?)
  • random thought: is the any "public" information like E2EE device keys that could be used to infer the validity of another user's device?

In the future OIDC model we have discussed where the proxy would exchange the access token received from the client with its own access/refresh token it would be handled nicely: when the original client is logged out the corresponding (exchanged) access token would also be invalidated. So, the proxy could purge data once its own access/refresh token is no longer valid.

@kegsay
Copy link
Member

kegsay commented Jul 12, 2023

Why is this issue still open? We do scope to devices now. Anything else should be a separate issue (e.g cleaning up stale devices)

@kegsay
Copy link
Member

kegsay commented Jul 12, 2023

#127 was already made, so closing this.

@kegsay kegsay closed this as completed Jul 12, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants