Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ending a session gracefully #31

Open
meejah opened this issue Sep 12, 2022 · 26 comments
Open

Ending a session gracefully #31

meejah opened this issue Sep 12, 2022 · 26 comments

Comments

@meejah
Copy link
Member

meejah commented Sep 12, 2022

Consider magic-wormhole connections as having two layers: the generic setup of a channel (or many subchannels in Dilation) and an "application layer" on top (such as file-transfer).

Most "application layer" protocols will want a way to gracefully end their session.
(I use "session" here to include Dilation because those may span many TCP connections).

The file-transfer protocol has an implicit end-of-session: any offer that is exchanged is the only one, so no more mailbox communication will happen after that.

However, the next-generation file transfer (see magic-wormhole/magic-wormhole-protocols#23 ) for example needs some way to indicate that the entire session is over.

I expect all Dilation protocols will have a similar need. In Dilation specifically, we could choose to declare that closing the control-channel is this "session is over" signal. However: that doesn't work for non-Dilated protocols; and the control-channel may never exist (because of network conditions). So, we need a mailbox-level message here.

Another consideration (thanks @warner) is synchronization around close signals. In subchannels in Dilation, there is a "half-closed" state where side A has sent all the data they intend to, and sent a CLOSE message. So side A knows side B will see all their data before the CLOSE. Similarly, since side A will wait for side B's CLOSE message, they now they'll see all of side B's data. Maybe it is possible to extend this synchronization to the mailbox itself.

Currently, the w.close() API sends a CLOSE message to the relay, which responds with a CLOSED (but doesn't, for example, send a CLOSED to the other side). So an option could be to re-use this signal -- amend the protocol so that either side could receive a CLOSED message that isn't in response to a CLOSE. This kind of breaks the symmetry of close/closed though.

I think an explicit mailbox-level message + phase should be added to the protocol. This can be used by both "classic" Transit-using (or even mailbox-only) protocols and by Dilation. It also allows proper recover in "journaled" mode, I believe.

The rough idea is this:

  • one side (call it A) sends its final mailbox message
  • it then sends a phase=closing mailbox message with body {"closing": true}
  • after sending such a message, that side MUST NOT post any more mailbox messages (that is, no more number-phase application messages)
  • upon seeing the phase=closing message, side B posts any remaining mailbox messages as then a final phase=closing message with body {"closed": true}.
  • side B MUST close its mailbox immediately after this (sends CLOSE to mailbox-server, gets CLOSED back) -- it knows side A will produce no new messages
  • side A receives any additional mailbox messages from side B (maybe none)
  • side A gets a phase=closing with {"closed": true} from side B
  • side A now knows it has all mailbox messages, and MUST immediately close its mailbox (because it knows the other side has done this also)
  • the session is now concluded
@meejah
Copy link
Member Author

meejah commented Sep 12, 2022

(If we agree this sounds plausible, I will instead express it as a PR against the protocol specs alongside Python implementations)

@piegamesde
Copy link
Member

I'd like to leave this up to the application level protocol how to deal with session closing, as the needs are really diverse. Between rendezvous messages, Dilation and Transit, there are seven different combinations of communication channels that may be used, of which five are practicable (using Dilation and Transit connections at once makes no sense). Each application should decide on one communication channel as their "source of truth", and implement a proper closing mechanism there.

However, I agree that there is need to close a mailbox for all participants (instead of having a per-client counter and only closing after the last one is gone), see for example #19

@meejah
Copy link
Member Author

meejah commented Sep 13, 2022

"Close a mailbox for all participants" is precisely the use-case being addressed here.

(And as a bonus, it should serve as a "source of truth" for many protocols -- of course, other protocols could choose a different way to shut down if desired).

@meejah
Copy link
Member Author

meejah commented Oct 27, 2022

I'd like to leave this up to the application level protocol how to deal with session closing, [..]

The reason I don't want to do this is precisely because of possible confusion, and because I believe all application protocols need this feature. Therefore, it makes sense to have a built-in way to do it.

The mailbox is the correct communication channel because it's the only one that definitely exists. Transit or Dilation may never be established. Since an application protocol that only uses the mailbox is possible, it has to be the mailbox channel.

Note that the existing file-transfer application protocol does the session-closing implicitly: only a single offer is allowed, and only a single file-transfer will occur. Thus, the Transit connection never needs to stay open and there is no ambiguity as to when the each side should close the mailbox. There is a remaining problem, though: if a transfer is cancelled. There is no explicit way to signal this, and the protocol suffers some ambiguous behavior in those cases. An explicit "close" signal like the one described here could answer that need.

For Dilation the above is no longer true: the session may stay open for an arbitrary amount of time.

The only other application protocols I know of are magic-folder and tahoe-lafs Grid invitations: both of those also have a non-ambiguous end, so don't strictly need this mechanism. However, they do not have a way to cancel an ongoing invite -- this could be used for that.

@piegamesde
Copy link
Member

The rendezvous server already enforces that at most two parties may be in one room. Therefore I'd like to expand the semantics some more on that premise:

  • If one side sends CLOSE the mailbox should actually be closed (instead of waiting for the other side to CLOSE it too). The other side should be notified about that, either through CLOSED, a custom message or some error message.
    • This should not cause any breakage with current implementations, if anything it might even fix some hanging/deadlock bugs
  • We could even make it so that a closed mailbox cannot be reopened (for some reasonable amount of time), this would solve the "cancel the invitation" problem. (Actually, this should probably be done on the nameplate level, because then the original file transfer would profit from it too).

The only other application protocols I know of are magic-wormhole and tahoe-lafs Grid invitations

There are also my port forwarding and transfer v2 experiments (the former being released but marked as experimental feature), both automatically close the wormhole as soon as a transit connection is established and use that for subsequent error handling and everything else.

@meejah
Copy link
Member Author

meejah commented Oct 27, 2022

Changing the semantics of CLOSE I think has a problem with giving both sides a chance to send any "final" messages. It's more like pulling the ethernet out of the wall to close TCP.

That is, if side A sends a CLOSE it may be that side B has messages in-flight that will now never get delivered (that is, side A never has a chance to let side B finish). The phase approach outlined above does give each side a chance to synchronize and come to agreement about closing (i.e. there's an end-to-end messaging flow instead of one side arbitrarily shutting down the mailbox). In spirit, this is like TCP shutdown with FIN then ACK -- the "fin" is the "phase=closing" from side A (which then enters a state similar to closing in TCP) and the "fin-ack" is the "phase=closing" from side B back to side A (TCP is somewhat more complex than this, so it's just an analogy).

Note that applications don't have to use the above proposed flow as the only way to shut down. This would still work with the port-forwarding flow you outline, I think? That is, you follow the above phased close once Transit is established. (In fact, you could ignored the phased close entirely and have each side implicitly know when to CLOSE their side of the mailbox -- that is, when Transit is established).

@meejah
Copy link
Member Author

meejah commented Oct 27, 2022

(Note that I'm not totally against using CLOSE and/or changing its semantics, but I am against making those semantics that one side can just immediately shut off the mailbox w/o any end-to-end flow / acknowledgement).

@piegamesde
Copy link
Member

Hm, do you have an example of a protocol that might make use of that close behavior on the happy path? I understand your proposal, but it adds some significant complexity to the protocol so I need a compelling use-case to convince me it is worth the cost.

@meejah
Copy link
Member Author

meejah commented Oct 27, 2022

Dilated File Transfer

edit: to expand, there isn't a way to "end the session" in the proposed Dilated File Transfer application protocol. This would be it.

@piegamesde
Copy link
Member

Actually, Dilated file transfer can use that same mechanism on the Dilation level: closing the control channel will already do the closing/closed dance and then once the control channel is closed the Wormhole may safely be closed without need for further negotiation.

@meejah
Copy link
Member Author

meejah commented Oct 27, 2022

Right, the control-channel could be used (however, this is more general -- and covers the cases where we've never actually succeeded at opening a Dilation channel -- which may never succeed). That is, the only communications channel the definitely exists is the mailbox.

@piegamesde
Copy link
Member

cases where we've never actually succeeded at opening a Dilation channel -- which may never succeed

That's why I was specifically talking about the happy path. On the unhappy path, I think just closing in the face and triggering errors on the other side is fine, because it would have been an error anyways.

For example, in a dilated file transfer session one side decides that it wants to close. If there are any ongoing transfers those will be awaited, but it will reject all new offers and make no send offers either. Once the last transfer completed, it can simply close the connection (ignoring graceful shutdown of the Dilation control channel for now). If the other side was to make an offer attempt concurrently it would simply get a connection lost error, which would be fine and expected.

@meejah
Copy link
Member Author

meejah commented Oct 27, 2022

Again, I'm not saying we can't use the control-connection (in fact, that was my first attempt at this) but it's not as general (i.e. it doesn't help mailbox-only protocols). (It's Dilation-only -- and of course depends on that channel being established in the first place.

Application-protocols inventing baroque ways of shutdown is precisely why I wanted a more-general way (we've already got 3 different ways to shut down with just 2 protocols under consideration here).

The "ongoing transfers awaited" part is why I believe we need an intermediate state between "open" and "closed" (i.e. "closing", where no new messages / offers / etc are sent but ongoing things including in-flight mailbox messages can be completed).

There's a few options proposed, maybe it makes sense to summarize the pro/con of them and missing features etc. Are there any besides:

  • unilateral mailbox close from either side (i.e. change semantics of CLOSE)
  • custom application-protocol messaging
    • implicit (somehow implied by other messaging / state of the protocol)
    • close control subchannel (Dilation-only)
    • (custom message?)
  • a "phase" approach (phase=closing)

There's another edge-case here which might need more-careful wording of the spec: side A has decided to stop the Dilated File Transfer session (i.e. entered phase=closing), but there's at least one ongoing transfer. Then, one side loses their dilation network connection. So, you should be allowed to do more dilation-* phase messaging (the text above does already prohibit just number-phase messaging) once you're in "closing" phase. For Dilation we need to be clear about control messages (i.e. a side in phase=closing can't send any more?).

This allows the subchannel with the ongoing transfer to be re-established (via a further dilation-* generation) and that transfer can conclude properly. Then the other side enters phase=closing etc etc and it shuts down gracefully by sending CLOSE on all subchannels (including control) and awaiting the ACK.

@meejah
Copy link
Member Author

meejah commented Oct 28, 2022

Maybe this is easier to analyze as a pull-request or better write-up than a github comment ... do we at least agree that a unilateral CLOSE doesn't cover a happy-path shutdown of Dilated File Transfer..?

@piegamesde
Copy link
Member

Could we make the feature optional so that applications may use either depending on what they need? I'd be in for the change of giving the current CLOSE signal stronger properties, and adding a second shutdown mechanism for other use cases.

@meejah
Copy link
Member Author

meejah commented Oct 31, 2022

By "giving CLOSE stronger properties" you mean changing the protocol so that either side can unilaterally close a mailbox?

@meejah
Copy link
Member Author

meejah commented Nov 1, 2022

What problem does that solve?

@piegamesde
Copy link
Member

When the send side abandons before ever seeing the other side and then the other side joins, it will be stuck currently. This could be avoided. Probably also helps fixing #19.

@meejah
Copy link
Member Author

meejah commented Nov 1, 2022

It could still be stuck with the CLOSE change (e.g. if it "abandoned" in a way such that it didn't actually get to send a CLOSE). So it would help in case the first side "explicitly cancels" before any other side shows up (which of course a phase=closing would also handle) but not if that side simply "went away" for e.g. network reasons.

I don't think the change helps with 19, because neither Bob nor Carol fully join the mailbox if I understand it correctly (so they wouldn't have a chance to send CLOSE).

My problem here is this: changing the behavior of CLOSE is a backwards-incompatible protocol change. If we're going to introduce a new protocol version we should handle more of the cases correctly (because this change doesn't help happy-path or cancel-path Dilation shutdown, for which an answer is required).

@piegamesde
Copy link
Member

Actually I think there might be some issues about a close phase that we need to think through first:

  • All phases except pake are encrypted, but a close message may come before any encryption could be established. Should it be encrypted or not?
  • How will the message ordering work? For example, if a side receives phases up to 3 and then close, how will it know whether to wait for a 4 before forwarding the close or not?

@meejah
Copy link
Member Author

meejah commented Nov 2, 2022

The "phase" itself isn't encrypted ... but yeah, what to do with the payload. Could we special-case this part to say that if you receive any phase=closing message before the pake has concluded, cancel everything? We don't need the half-close behavior here because there cannot be any application-phase messages and so also don't need the thing proposed below. But also a special-case isn't ideal.

Ordering is indeed an interesting question. One solution might be to include information in the phase=closing body declaring the number of the last message sent (e.g. "last-message": 4 for your example above).

But, let's see: a client re-connects to a mailbox and receives:

  • 1
  • 2
  • closing with {"closing": True}
  • 3
  • 4

Can it still operate correctly?

With the extra information knowing that 4 is the last application message, it seems straightforward (it knows to await that message before completely closing itself). Without that, it could send its own closing message at any time (presuming it doesn't have any more messages to send itself). Without careful words in the spec, it could also consider the 3 and 4 messages to be protocol errors (being sent after the other side entered closing) so we should be clear either way about that.

The other side won't fully close (i.e the second phase=closing message with {"closed": True} until it sees the first side send its own closing. However that could at any point in the above (right before 3 for example, but also before 2).

I'll try writing a more formal spec with all the above considerations as well, then we can see if any other edge-cases etc come to mind.

@piegamesde
Copy link
Member

Going back to this, I'd like to explore a solution that works with the close message some more. And I think we can make it work so that the mailbox itself can be "half-closed".

The reason I want the the Mailbox server to be directly involved (compared to just sending yet another message over the Mailbox to indicate closing), is that otherwise a client might disconnect before sending a first message, leaving the other side hanging. But if the Mailbox server is involved, it can act like if the disconnected side closed the Mailbox and thus correctly notify the other side.

@meejah
Copy link
Member Author

meejah commented Mar 22, 2023

Adding another phase doesn't necessarily rule out "the server being involved". Anyway that sounds like an important case to capture no matter what the ultimate solution is.

@meejah
Copy link
Member Author

meejah commented Mar 26, 2023

p.s. since the phase is unencrypted, using the "phase" approach can allow "the server to be involved" it it needs to be. For example, it could enforce the requirement that a side who has sent a "phase=closing" must not send any more number-phase messages.

After re-reading this, I think we could/should encrypt the close message-body (that tells the other side the highest number-phase message sent) while still allowing the server to do the proper things:

  • on "early" close, before we get to encrypted bodies, the server can observe this and know this mailbox is in "abandoned before PAKE completed" state (I don't know that it needs to do anything special, though: if the other peer does arrive, it'll know it was abandoned due to a "phase=closing" message existing).
  • on "happy path" the server can enforce the restriction that the side sending "phase=closing" MUST NOT send any new messages

Note that for "early close" we can't encrypt a message-body -- but also we don't need one, because there's no 'highest message' to communicate. However, that should be made clear in any spec.

@meejah
Copy link
Member Author

meejah commented Mar 26, 2023

Considering the state diagram in #36 I'm not sure we'd need to do anything at all on the server -- it can still consider a mailbox "open" while the peers do their two phase=closing dances.

However, it could split open into three states: "open", "half-open", "closed" (keeping "done" as well) to distinguish the above situations. "half-open" is when one peer has sent "phase=closing", and "closed" is when the other peer has sent its "phase=closing". Then "done" is reached after both sides do "close".

We didn't talk about versioning yet. Maybe this needs a version signal...

Lets consider back/forward interop without an explicit version signal:
If the peers are old, we'd skip straight to "closed" when either side does a "close" I suppose. If the peers are mixed: not sure. The new side will issue a "phase=closing" so we'd get to "half-open" but then the old side will never do a "phase=closing" so the new-side will never learn when its peer closes. A new-server could almost just issue the correct message, but it can't encrypt the body (so can't properly encode the "highest number"). So, we could either make clients not depend on getting that, or put a versioning signal in (so a new side knows if the other side knows about phase=closing or not).

@meejah
Copy link
Member Author

meejah commented Mar 30, 2023

Meeting today, @piegamesde and I tentatively decided to pursue an approach limited to Dilation: essentially use application-layer messages similar to the "phase=closing" or the "close/closing/closed" ideas above.

This avoids changing the core protocol and gives us some operational experience with the concept -- while still leaving open the possibility of bringing this up to the core protocol level later. (Since Dilation is still marked "experimental" it's somewhat less disruptive as well).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants