This repository has been archived by the owner on May 6, 2020. It is now read-only.

cc-proxy/cc-shim high availability #4

Open
sameo opened this issue Jan 24, 2017 · 17 comments

@sameo

sameo commented Jan 24, 2017

From @sameo on December 2, 2016 17:36

If cc-proxy crashes:

  1. all cc-shim instances terminate.
  2. cc-proxy will not be able to restore its internal state after restarting.

We need to work on:

  1. Have cc-shim retry connecting to the proxy when the socket closes or disappears
  2. Have cc-proxy rebuild all its state when restarting, based on the stored information

Copied from original issue: intel/cc-oci-runtime#505

@sameo
Author

sameo commented Jan 24, 2017

From @laijs on December 4, 2016 23:46

virtio-serial is not a packet-based transport, so it seems hard to find the message header when cc-proxy reconnects to hyperstart.

dlespiau pushed a commit to dlespiau/clearcontainers-proxy that referenced this issue May 5, 2017
We close the shim connection when something bad happens:
  - We receive an error trying to write to the socket (most likely
  because the shim died or exited).
  - We hit another kind of unrecoverable error for that client (we
  don't have one right now but we will in the near future)

We now want to progress a bit on the recovery side of things. If the
shim dies, we want to allow it to reconnect and re-claim the session.
This commit does just that.

This is tested by a subsequent unit test: TestShimSendStdinAfterExeccmd

Updates: clearcontainers#4

Signed-off-by: Damien Lespiau <[email protected]>
@dvoytik

dvoytik commented Jul 14, 2017

Hi @dlespiau, are you doing anything related to this feature? If not then I'd like to hack on this if you don't mind.

@dlespiau
Contributor

Hi,

I'm doing the low level part of this, framing on top of the Host<->VM serial link so the proxy can recover the start of a frame when reconnecting to a running VM.

I haven't started on the task to save an on-disk state that the proxy can read from when starting again though. You could take that part.

@dvoytik

dvoytik commented Jul 18, 2017

Hi @dlespiau,

That's awesome!
I've started experimenting with exactly the on-disk store/restore of the state, as it's the most obvious part for me. Okay. When I have something substantial to show, I'll post a WIP PR here.

Cheers.

@jodh-intel
Contributor

Thanks @dvoytik! Feel free to create an issue and assign to yourself (and maybe reference this issue) so it's clear to the whole team that that is something you're working on.

@dvoytik

dvoytik commented Jul 18, 2017

@jodh-intel, done. Although I can't assign it to myself.

@jodh-intel
Contributor

@dvoytik - thanks - assigned.

dvoytik pushed a commit to dvoytik/proxy that referenced this issue Aug 18, 2017
Introduce the high availability feature of cc-proxy by implementing
store/restore of proxy's state to/from disk. This feature depends
on the ability of shim to reconnect to cc-proxy if connection is lost.

Fixes clearcontainers#4.

Signed-off-by: Dmitry Voytik <[email protected]>
@sboeuf
Contributor

sboeuf commented Oct 12, 2017

@dlespiau any chance you still have some work in progress on re-syncing a lost frame between the proxy and the VM serial port?

@dlespiau
Contributor

Unfortunately, that work was wiped out when I dd'ed /dev/urandom to my hard drive :/

@sboeuf
Contributor

sboeuf commented Oct 12, 2017

@dlespiau no worries, that's what I was expecting :p
That's what you do when you move on to something else!

@sboeuf
Contributor

sboeuf commented Oct 12, 2017

@dlespiau BTW, we have a public IRC channel, #clearcontainers on freenode. Come discuss containers if you're interested ;)

@jodh-intel
Contributor

@sboeuf - could you outline what you know about this problem?

@sboeuf
Contributor

sboeuf commented Oct 13, 2017

@jodh-intel I'll go further and try to cover all the cases, and how our components should be modified.
The case is simple: Clear Containers is running, meaning all components (runtime, shim, proxy, and the agent in the VM) are up. When the proxy crashes, the shim, runtime, and agent detect the proxy disconnection while they are trying to communicate with it.

Here is what each component should do upon this detection:

  1. Shim
  • Try to reconnect for some time (already handled by the PR "connect: Try to re-connect to proxy", shim#54).
  • Buffer all the inputs and signals that cannot be forwarded to the proxy while it is being restarted. Once the connection is established again, the shim should send everything that has been buffered.
  • Save the last command so that it can be re-sent after reconnecting to the proxy. Otherwise it will be lost.
  2. Agent
  • Gracefully handle the reconnection of the proxy, so that we continue properly from where we left off.
  • Buffer all outputs supposed to go through STDOUT/STDERR, so that the agent can send them after the proxy reconnects. This is specific to the IO channel.
  • Save the last command that was executed while the proxy was crashing, and save its result. The idea is that when the proxy reconnects, it will send the last command again because it never got the result (the re-send is actually triggered by the shim or the runtime when they reconnect). For that reason, the agent should inspect the command sent by the proxy after it reconnects and, if it matches the last command, not execute it but send the saved result instead, to avoid running the same command a second time. That way, the proxy still receives the result of this command.
  • In the same way, we should always save the last outputs that we send. This allows the agent to resend the result when the same command is submitted again from the shim or runtime, without re-running the command for real, which could produce different results.
  3. Runtime
  • Try to reconnect to the proxy.
  • Re-send the failing command. We should not report the command as failed when the failure is caused by the proxy crash, only when it is caused by a real agent error.
  4. Proxy
  • Save the most recent state as soon as a modification occurs, basically every time a new token/session ID is created because the runtime asked for one. The proxy needs the exact map of tokens and session IDs when it recovers, so that it can directly receive the outputs coming from the agent (the last one that never made it through, plus the buffered ones).

@sameo @grahamwhaley @jodh-intel I might have missed a few corner cases, but I'd like to get your input on this. This is pretty important, since we need to agree before we can open the corresponding issues and start the implementation.

@jodh-intel
Contributor

Hi @sboeuf - thanks for this. If you don't mind, I'll merge the above with my notes and put it into a draft design (clearcontainers/runtime#683) doc showing (a) what we have today and (b) what we want in the future...

@jodh-intel
Contributor

@sboeuf - I've now raised a doc PR including your comments above:

@sboeuf
Contributor

sboeuf commented Oct 16, 2017

@jodh-intel great, thanks!

@sboeuf
Contributor

sboeuf commented Oct 16, 2017

But I'd like to get some feedback about it too. Does that make sense to everyone?
