Scope the connections to devices and not to a hash of the access token #51

sandhose · 2023-03-30T09:35:53Z

It looks like some stuff is scoped to a hash of the access token, which will become a problem when we have refreshing access tokens, either via MSC2918 or with OIDC.

sliding-sync/internal/request.go

Lines 11 to 24 in 2c91fcb

    
           func HashedTokenFromRequest(req *http.Request) (hashAccessToken string, accessToken string, err error) { 
        
           	// return a hash of the access token 
        
           	ah := req.Header.Get("Authorization") 
        
           	if ah == "" { 
        
           		return "", "", fmt.Errorf("missing Authorization header") 
        
           	} 
        
           	accessToken = strings.TrimPrefix(ah, "Bearer ") 
        
           	// important that this is a cryptographically secure hash function to prevent 
        
           	// preimage attacks where Eve can use a fake token to hash to an existing device ID 
        
           	// on the server. 
        
           	hash := sha256.New() 
        
           	hash.Write([]byte(accessToken)) 
        
           	return hex.EncodeToString(hash.Sum(nil)), accessToken, nil 
        
           }

It should be using the device_id instead, which are stable even when the token refreshes

Maybe related to #22?

Some (internal) discussions about that:

kegsay · 2023-03-30T10:15:12Z

MSC3970 changes the scope of transaction IDs to be based on device ID instead of access token. how might this impact SS or not?

native SS implementations will treat it however homeservers do it today. Native SS doesn't modify the event in any special way. That being said, the proxy implementation DOES edit the transaction ID because that's how it can de-duplicate room data as it's multi-tenant with other users. It currently does use the access_token as a device ID, and this assumption is baked into most of the proxy as it predates the ability to get the device ID from /whoami

how does auth work with the SS proxy and if the access token changes (i.e. a new one issued from a refresh token) will it have any impact?

it would be treated as a new device for the same user, causing a new poller to be created and an initial sync to be performed. The proxy will still be able to serve up data for the user's account instantly due to having some data from the previous token, so in theory it should Just Work with the caveat that it might take fractionally longer to sync the first time you use the new token, though I have not tested this. I assume the old token will 401/403, which will then cause the old poll loop to terminate which will then tear down the old connection. ah, though the to-device table is keyed off SHA256(access_token) so there will be a race condition where if a new to-device event arrived and before it got delivered to the client the token was invalidated, then it will be permanently lost. which isn't great. We'd need to key off a static identifier e.g SHA256(user_id $DELIMITER device_id)

DMRobertson · 2023-03-30T10:52:54Z

Note that this is an internal-only change, transparent to downstream clients and upstream homeservers.

We'd need to key off a static identifier e.g SHA256(user_id $DELIMITER device_id)

Alternatively we could just store user_id $DELIMITER device_id directly. This has the benefit of making debugging slightly easier.

For this to work, the homeserver must consistently return a user_id with the same capitalisation (legacy mxids etc grrr). We also want the user id and device ids to have bounded length. Not sure if device ids have a maximum length in the spec? If not, we can truncate the device ID to a sensibly-sized prefix.

DMRobertson · 2023-04-11T15:16:20Z

We also want the user id and device ids to have bounded length.

On matrix IDs:

The length of a user ID, including the @ sigil and the domain, MUST NOT exceed 255 characters.

We'd have to choose DELIMITER to be something compatible with the historical user ids, which means it'll need to be a nonprinting ascii character.

Not sure if device ids have a maximum length in the spec?

Not that I can see. A quick look on Matrix.org shows that almost all (99.9995%) of device IDs are <= 43 characters (codepoints) long. But a large chunk of those are for signing keys, which aren't ever going to be used for syncing. Excluding singing key devices, almost all (99.9990%) devices are <= 16 characters (codepoints) long.

Useful for #51, small enough to include in isolation

DMRobertson · 2023-04-12T14:40:20Z

Notes:

The proxy doesn't care about the distinction between user id and hs device id. All the other tables are keyed off one text field so it really will be easier to concate $USER$DELIMITER$DEVICE.
One-off migration at startup to rewrite all tables. Write it in go code, make /whoami calls as needed.
In a refresh tokens world we do need to be aware that tokens will expire underneath us. So we basically need to solve Deal with out-of-sync room state, e.g. due to the last device logging out or leaving the room #18 as a prerequisite.
- We hold onto the since tokens. Try an incremental sync afterwards, if it's not gappy then all good; if it is gappy, drop everything and initial sync.
- Problem here: the proxy assumes that if we have room state in the proxy's DB, then it is up to date. Everything is based on the assumption of a single contiguous room timeline.
- If we commit to refreshing access tokens (and/or to fixing Deal with out-of-sync room state, e.g. due to the last device logging out or leaving the room #18) then that assumption is broken: the proxy might have a stale snapshot of room state. (We might get away with this if there have been only a few (10) events in the room and the poller doesn't get a gappy sync response---but that is not guaranteed. Could bump up that limit.)
- We suspect that we already have problems along these lines.

Could do:

Whenever a poller sees a state block in an incremental sync v2 response, fetch all of its events from the proxy's database by event IDs. If they all have NIDs they have already been processed by the proxy, and all is well. If there are some events missing a NID, prepend them to the timeline. (The timeline order will be wrong, but at least both the proxy and the client will now see up to date state---we've self-healed.) This needs to all happen within a single database transaction. This is also a bad place to be in so needs to be loudly logged and Sentried. something something eventually consistent.
- This solves the problem in Deal with out-of-sync room state, e.g. due to the last device logging out or leaving the room #18 in a completely different way. The trade-off is that we are accepting that rooms might have slightly out of sync state. But that seems very pragmatic. Allows us to keep the since token. Slogan: "support gappy state" (but not gappy messages).
The above doesn't account for missing message events. One might think that the proxy could backfill these (prev_batch etc) but that a) would be slow and b) would fetch events in topological ordering, not stream ordering. That breaks the assumptions of the state update logic.

DMRobertson · 2023-04-28T18:48:15Z

@sandhose pointed out to me that matrix-org/matrix-spec-proposals#3970 has landed---transaction IDs are scoped to devices (meaning userID + deviceID) and not "client session". I think the proxy was already effectively doing this, but will need to check and probably write a test for this.

DMRobertson · 2023-05-22T13:32:19Z

Problem: how do we distinguish between a device expiry and a token expiry?

If we don't, we'll end up keeping device-scoped data forever, in particular, to-device messages. This will be A Problem.

sandhose · 2023-05-22T15:27:13Z

I think one potential solution would be to make the proxy OIDC-native.
The general idea is that when there is an incoming access token, it can do the introspection itself to know the mxid + device_id, and exchange that token for a longer-lived token that refreshes independently from the client access token, but where the session is still linked to it. If the client session ends, the derived token will also be revoked.

The MAS side is not really ready for that though, so it depends how much of a problem this ends up being? We could also imagine a (temporary) hacky solution, where MAS calls some special endpoint on the proxy when a device gets deleted?

DMRobertson · 2023-05-22T16:57:57Z

Problem: how do we distinguish between a device expiry and a token expiry?

Maybe it's sufficient to look for soft or hard logouts, and expire entire devices on hard logouts?

sandhose · 2023-05-26T09:54:22Z

Long story short: soft-logout is not really usable in your case (and with Matrix refreshing tokens) and for a simple reason: the client should be able to track whenever their access tokens is about to expire, and the soft-logout would happen when it tries to refresh the token, which of course the proxy doesn't know about. The old, expired access tokens won't have a soft_logout: true in the error response.

hughns · 2023-05-26T10:04:58Z

Some alternatives that we can suggest for determining when device state can be purged:

having the HS explicitly call the proxy to tell it that a device has logged out
in the case of the proxy having more than one device for a user: periodically query the /devices endpoint to get list of active devices for a user and purge any that are no longer valid
purge any device after a period of inactivity (this might be sensible as a fallback anyhow?)
random thought: is the any "public" information like E2EE device keys that could be used to infer the validity of another user's device?

In the future OIDC model we have discussed where the proxy would exchange the access token received from the client with its own access/refresh token it would be handled nicely: when the original client is logged out the corresponding (exchanged) access token would also be invalidated. So, the proxy could purge data once its own access/refresh token is no longer valid.

kegsay · 2023-07-12T18:05:26Z

Why is this issue still open? We do scope to devices now. Anything else should be a separate issue (e.g cleaning up stale devices)

kegsay · 2023-07-12T18:06:01Z

#127 was already made, so closing this.

kegsay added this to the v1 milestone Apr 5, 2023

DMRobertson self-assigned this Apr 11, 2023

DMRobertson mentioned this issue Apr 11, 2023

Remove unreachable err handling in setupConnection #64

Merged

DMRobertson pushed a commit that referenced this issue Apr 11, 2023

Have WhoAmI extract the device_id

846197e

Useful for #51, small enough to include in isolation

DMRobertson mentioned this issue Apr 11, 2023

Have WhoAmI extract the device_id #65

Merged

This comment was marked as outdated.

Sign in to view

DMRobertson mentioned this issue Apr 12, 2023

Deal with out-of-sync room state, e.g. due to the last device logging out or leaving the room #18

Closed

pmaier1 added the roadmap label Apr 26, 2023

This was referenced Apr 26, 2023

Support refreshing tokens #87

Closed

Decouple access token tracking from device tracking #89

Closed

DMRobertson mentioned this issue May 10, 2023

Migrate existing data to new schema on startup #98

Merged

DMRobertson mentioned this issue May 23, 2023

Detect when a device expires (not just an access token) and cleanup #127

Open

kegsay closed this as completed Jul 12, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Scope the connections to devices and not to a hash of the access token #51

Scope the connections to devices and not to a hash of the access token #51

sandhose commented Mar 30, 2023 •

edited

Loading

kegsay commented Mar 30, 2023

DMRobertson commented Mar 30, 2023

DMRobertson commented Apr 11, 2023 •

edited

Loading

This comment was marked as outdated.

This comment was marked as outdated.

This comment was marked as outdated.

DMRobertson commented Apr 12, 2023 •

edited

Loading

DMRobertson commented Apr 28, 2023

DMRobertson commented May 22, 2023

sandhose commented May 22, 2023

DMRobertson commented May 22, 2023

sandhose commented May 26, 2023

hughns commented May 26, 2023

kegsay commented Jul 12, 2023

kegsay commented Jul 12, 2023

Scope the connections to devices and not to a hash of the access token #51

Scope the connections to devices and not to a hash of the access token #51

Comments

sandhose commented Mar 30, 2023 • edited Loading

kegsay commented Mar 30, 2023

DMRobertson commented Mar 30, 2023

DMRobertson commented Apr 11, 2023 • edited Loading

This comment was marked as outdated.

This comment was marked as outdated.

This comment was marked as outdated.

DMRobertson commented Apr 12, 2023 • edited Loading

DMRobertson commented Apr 28, 2023

DMRobertson commented May 22, 2023

sandhose commented May 22, 2023

DMRobertson commented May 22, 2023

sandhose commented May 26, 2023

hughns commented May 26, 2023

kegsay commented Jul 12, 2023

kegsay commented Jul 12, 2023

sandhose commented Mar 30, 2023 •

edited

Loading

DMRobertson commented Apr 11, 2023 •

edited

Loading

DMRobertson commented Apr 12, 2023 •

edited

Loading