864e2c2
to
f296baf
Compare
@sameo, @grahamwhaley, @mcastelino, @dvoytik, @sboeuf, @devimc, @chavafg - please take a look and comment on this (very early) draft.
Create a high availability (HA) proposal document. Fixes clearcontainers#683. Signed-off-by: James O. D. Hunt <[email protected]> Contributions-by: Sebastien Boeuf <[email protected]>
f296baf
to
bcc2134
Compare
kubernetes qa-passed 👍
LGTM
currently stops, the hypervisor will be left running consuming a large
amount of CPU due to the agent attempting to reconnect to the proxy.
The reconnect behaviour is correct, but there is no timeout in the case
where the proxy needs to be manually stopped by an administrator for example.
I agree with that, a simple timeout here could make things easier. Something like 30 or 60 seconds with no connection from the host would trigger the end of the agent and the end of the VM would follow (we need to make sure that's the way agent service is set).
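To make the timeout idea concrete, here is a minimal Python sketch of a reconnect loop with an overall deadline. It is not the agent's actual code: `reconnect_with_deadline`, `ReconnectTimeout`, and the `connect` callable are hypothetical names, and the exponential backoff between attempts is an addition of this sketch (aimed at the CPU usage mentioned in the diff), not something the proposal specifies.

```python
import time


class ReconnectTimeout(Exception):
    """Raised when no connection attempt succeeds before the deadline."""


def reconnect_with_deadline(connect, deadline_s=30.0, initial_delay_s=0.1,
                            max_delay_s=5.0, clock=time.monotonic,
                            sleep=time.sleep):
    """Call connect() until it succeeds or deadline_s seconds have elapsed.

    connect is a hypothetical callable that returns a connection or raises
    OSError. clock and sleep are injectable so the loop can be tested
    without waiting in real time.
    """
    start = clock()
    delay = initial_delay_s
    attempts = 0
    while True:
        attempts += 1
        try:
            return connect()
        except OSError:
            pass  # the peer is not reachable yet; fall through and retry
        remaining = deadline_s - (clock() - start)
        if remaining <= 0:
            raise ReconnectTimeout(
                f"no connection after {attempts} attempts in {deadline_s}s")
        # Sleep between attempts instead of spinning, doubling the delay
        # up to max_delay_s and never sleeping past the deadline.
        sleep(min(delay, remaining))
        delay = min(delay * 2, max_delay_s)
```

In this sketch the caller decides what a `ReconnectTimeout` means. Following the comment above, the agent would exit on it, and the VM would stop only if the agent service is configured that way.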
A couple general comments: I've included some introductory sentences to follow section headings that didn't have them. I didn't suggest any title changes, but they could be more clear (e.g. "Current Situation" section and some of its subsections). I rewrote some areas extensively, please check to make sure I didn't change the meaning. Thanks!
## Overview

This document summarises the current failure behaviour of a Clear Containers
Lines 28-29 rewrite suggestion: "This document summarizes how a Clear Container system behaves when it fails and provides proposals to make it more highly available."
system along with proposals for making it more highly available.

## Requirements
I would suggest an introductory sentence before this list summarizing what these requirements accomplish, though I am unclear on what these requirements are for. Any suggestions?
## Requirements

- Ability for the Clear Containers system to be robust against all failure scenarios.
The Clear Containers system must be robust against all failure scenarios.
- Ability for the Clear Containers system to be robust against all failure scenarios.

- Ensure no single point of failure.
Ensures no single point of failure.
- Ensure no single point of failure.

- Ensure all failure scenarios are reported by the logging mechanisms.
Logging mechanisms report all failure scenarios.
### Scenarios that need testing
Suggested introductory line: Scenarios that need testing consist of disconnects, `ENOSPC`, `ENOMEM`, limits, and logging.
#### `ENOSPC`

Ensure all components handle a lack of disk space in a sane manner (by
Lines 182-183 suggested rewrite: Ensure all components handle a lack of disk space in a sane manner (i.e. reporting an error back to the caller).
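As an illustration of "reporting an error back to the caller", the following Python sketch wraps a file write and turns `ENOSPC` into a specific exception. The names `write_file` and `StorageFullError` are hypothetical, and the injectable `write` parameter exists only so the out-of-space path can be exercised in a test without filling a disk.

```python
import errno
import os


class StorageFullError(Exception):
    """Reported to the caller when a write fails with ENOSPC."""

    def __init__(self, path, cause):
        super().__init__(
            f"no space left on device while writing {path!r}: {cause}")
        self.path = path
        self.cause = cause


def write_file(path, data, write=os.write):
    """Write data (bytes) to path, surfacing ENOSPC as StorageFullError.

    write defaults to os.write; a test can pass a stand-in that raises
    OSError(errno.ENOSPC, ...) to simulate a full disk.
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        view = memoryview(data)
        while view:
            # os.write may write fewer bytes than requested, so loop.
            written = write(fd, view)
            view = view[written:]
        os.fsync(fd)
    except OSError as exc:
        if exc.errno == errno.ENOSPC:
            raise StorageFullError(path, exc) from exc
        raise  # other I/O errors propagate unchanged
    finally:
        os.close(fd)
```

On Linux, writing to `/dev/full` fails with `ENOSPC`, which may be a convenient way to exercise the same path against a real file descriptor.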
#### `ENOMEM`

Ensure all components handle a lack of memory in a sane manner (by
Lines 187-188 suggested rewrite: Ensure all components handle a lack of memory in a sane manner (i.e. reporting an error back to the caller).
#### Limits

Test what happens when:
Lines 192-199 suggested rewrite:
Test what happens for the following scenarios:
- Cannot create any more processes.
- Cannot create any more network connections.
- Cannot use any more file descriptors.
- Cannot create any more locks.
- Cannot create any more files.
- Cannot create any more inodes.
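One of the limits listed above, file descriptors, can be provoked on purpose in a test. The sketch below assumes a POSIX system, since it uses Python's `resource` module. The function name `probe_fd_exhaustion` and the default limit of 32 are illustrative choices, not part of the proposal.

```python
import errno
import os
import resource


def probe_fd_exhaustion(soft_limit=32):
    """Lower RLIMIT_NOFILE, open files until it fails, return the errno.

    The original limit is restored and every descriptor opened here is
    closed before returning. Returns None if no error was observed.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    # Never ask for a soft limit above the hard limit.
    if hard == resource.RLIM_INFINITY:
        new_soft = soft_limit
    else:
        new_soft = min(soft_limit, hard)
    held = []
    resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, hard))
    try:
        # Bounded loop: at most new_soft + 1 opens can be needed.
        for _ in range(new_soft + 1):
            held.append(os.open(os.devnull, os.O_RDONLY))
        return None
    except OSError as exc:
        return exc.errno
    finally:
        for fd in held:
            os.close(fd)
        resource.setrlimit(resource.RLIMIT_NOFILE, (soft, hard))


if __name__ == "__main__":
    code = probe_fd_exhaustion()
    print(errno.errorcode.get(code, code))
```

A test for a real component would run the component under the lowered limit instead of opening `os.devnull`, then check that the failure is reported rather than ignored.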
#### Logging

- Ensure all components log full error details to ensure problem
Lines 203-204 suggested rewrite: Ensure all components log full error details so that you can fully determine problems.
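As a rough illustration of what "full error details" could mean for an operating-system error, this Python sketch logs the component, the failed operation, the numeric and symbolic errno, the message, the file involved, and the traceback. The helper name `log_os_error` and the field layout are assumptions made for the example. The proposal does not define a log format.

```python
import errno
import logging


def log_os_error(logger, component, operation, exc):
    """Log an OSError with enough detail to diagnose it afterwards."""
    logger.error(
        "component=%s operation=%s errno=%s (%s) message=%s filename=%s",
        component,
        operation,
        exc.errno,
        errno.errorcode.get(exc.errno, "UNKNOWN"),  # symbolic name, e.g. ENOENT
        exc.strerror,
        exc.filename,
        exc_info=exc,  # appends the traceback to the log record
    )


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    log = logging.getLogger("example")
    try:
        open("/nonexistent-directory/state.json")
    except OSError as err:
        log_os_error(log, "proxy", "open", err)
```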
Please address the issues marked by @klynnrif
This work is superseded by Kata Containers so closing this for now.
…ebug-output kata-env: Fix display of debug options
Create a high availability (HA) proposal document.
Fixes #683.
Signed-off-by: James O. D. Hunt [email protected]
Contributions-by: Sebastien Boeuf [email protected]