This repository has been archived by the owner on May 6, 2020. It is now read-only.

docs: Add an HA proposal doc #725

Closed
jodh-intel wants to merge 1 commit

jodh-intel
Contributor

Create a high availability (HA) proposal document.

Fixes #683.

Signed-off-by: James O. D. Hunt [email protected]
Contributions-by: Sebastien Boeuf [email protected]

@jodh-intel
Contributor Author

@sameo, @grahamwhaley, @mcastelino, @dvoytik, @sboeuf, @devimc, @chavafg - please take a look and comment on this (very early) draft.

Create a high availability (HA) proposal document.

Fixes clearcontainers#683.

Signed-off-by: James O. D. Hunt <[email protected]>
Contributions-by: Sebastien Boeuf <[email protected]>
@clearcontainersbot

kubernetes qa-passed 👍

Contributor

@sboeuf sboeuf left a comment

LGTM

currently stops, the hypervisor will be left running consuming a large
amount of CPU due to the agent attempting to reconnect to the proxy.
The reconnect behaviour is correct, but there is no timeout for cases where,
for example, an administrator needs to stop the proxy manually.

I agree with that; a simple timeout here could make things easier. Something like 30 or 60 seconds with no connection from the host would cause the agent to exit, and the VM would then shut down as a result (we need to make sure the agent service is configured that way).
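
As a rough illustration of the timeout suggested above, the following Python sketch retries a connection until a deadline passes and then gives up, sleeping between attempts so the retry loop does not spin the CPU. The function name `reconnect_until_deadline`, its parameters, and the injectable `clock` and `sleep` hooks are hypothetical and exist only for this example; this is not how the agent is implemented.

```python
import time
from typing import Callable


def reconnect_until_deadline(
    try_connect: Callable[[], bool],
    deadline_s: float = 30.0,        # assumed value, taken from the "30 or 60 seconds" suggestion
    retry_interval_s: float = 1.0,   # assumed pause between attempts
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call try_connect() until it succeeds or deadline_s seconds have elapsed.

    Returns True on a successful connection and False once the deadline passes,
    at which point the caller (for example the agent) could exit.
    """
    give_up_at = clock() + deadline_s
    while True:
        if try_connect():
            return True
        remaining = give_up_at - clock()
        if remaining <= 0:
            return False
        # Sleep rather than retrying immediately, so waiting costs almost no CPU.
        sleep(min(retry_interval_s, remaining))
```

A caller would pass its own connection attempt as `try_connect` and exit the process when the function returns `False`.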

@klynnrif klynnrif left a comment

A couple of general comments: I've included some introductory sentences to follow section headings that didn't have them. I didn't suggest any title changes, but some could be clearer (e.g. the "Current Situation" section and some of its subsections). I rewrote some areas extensively, so please check that I didn't change the meaning. Thanks!


## Overview

This document summarises the current failure behaviour of a Clear Containers

Lines 28-29 rewrite suggestion: "This document summarizes how a Clear Container system behaves when it fails and provides proposals to make it more highly available."

system along with proposals for making it more highly available.

## Requirements


I would suggest an introductory sentence before this list summarizing what these requirements accomplish, though I am unclear on what these requirements are for. Any suggestions?


- Ability for the Clear Containers system to be robust against all failure scenarios.

The Clear Containers system must be robust against all failure scenarios.


- Ensure no single point of failure.

Ensures no single point of failure.


- Ensure all failure scenarios are reported by the logging mechanisms.

Logging mechanisms report all failure scenarios.



### Scenarios that need testing


Suggested introductory line: Scenarios that need testing consist of disconnects, ENOSPC, ENOMEM, limits, and logging.


#### `ENOSPC`

Ensure all components handle a lack of disk space in a sane manner (by

Lines 182-183 suggested rewrite: Ensure all components handle a lack of disk space in a sane manner (i.e. reporting an error back to the caller).
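
A minimal sketch of what "reporting an error back to the caller" could look like for `ENOSPC`, written in Python for brevity. `StorageError`, `write_state`, and the `opener` parameter are hypothetical names introduced only for this illustration; none of them come from the Clear Containers code base.

```python
import errno
import os


class StorageError(Exception):
    """Hypothetical error type surfaced to the caller."""


def write_state(path, data, opener=open):
    """Write data to path, turning ENOSPC into an explicit StorageError."""
    try:
        with opener(path, "wb") as f:
            f.write(data)
            f.flush()
            # Push the data to disk so a delayed write error is raised
            # inside this try block rather than at some later point.
            os.fsync(f.fileno())
    except OSError as e:
        if e.errno == errno.ENOSPC:
            # Report the condition upwards instead of crashing or ignoring it.
            raise StorageError(
                f"no space left on device while writing {path}"
            ) from e
        raise  # any other I/O error is propagated unchanged
```

The `opener` argument exists only so that a test can inject a function that raises `ENOSPC` without actually filling a disk.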


#### `ENOMEM`

Ensure all components handle a lack of memory in a sane manner (by

Lines 187-188 suggested rewrite: Ensure all components handle a lack of memory in a sane manner (i.e. reporting an error back to the caller).


#### Limits

Test what happens when:

Lines 192-199 suggested rewrite:

Test what happens for the following scenarios:

  • Cannot create any more processes.
  • Cannot create any more network connections.
  • Cannot use any more file descriptors.
  • Cannot create any more locks.
  • Cannot create any more files.
  • Cannot create any more inodes.
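
One possible way to exercise the file descriptor case from the list above is to lower the soft `RLIMIT_NOFILE` limit and open descriptors until the kernel refuses with `EMFILE`. The sketch below does this with Python's standard `resource` module (Unix only). The helper name `fds_until_emfile` is hypothetical, and the number it returns depends on how many descriptors the process already holds.

```python
import errno
import os
import resource


def fds_until_emfile(soft_limit):
    """Lower the soft fd limit, open descriptors until EMFILE, then clean up.

    Returns how many descriptors could be opened before the limit was hit.
    The original limit is always restored before returning.
    """
    old_soft, old_hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    resource.setrlimit(resource.RLIMIT_NOFILE, (soft_limit, old_hard))
    opened = []
    try:
        while True:
            opened.append(os.open(os.devnull, os.O_RDONLY))
    except OSError as e:
        if e.errno != errno.EMFILE:
            raise  # an unexpected failure, not the limit being tested
        return len(opened)
    finally:
        # Release everything and restore the limit so later code is unaffected.
        for fd in opened:
            os.close(fd)
        resource.setrlimit(resource.RLIMIT_NOFILE, (old_soft, old_hard))
```

A real test would run the component under the lowered limit and check that it reports the failure cleanly; the loop here only demonstrates how the condition can be provoked.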


#### Logging

- Ensure all components log full error details to ensure problem

Lines 203-204 suggested rewrite: Ensure all components log full error details so that you can fully determine problems.

@rcaballeromx rcaballeromx left a comment

Please address the issues marked by @klynnrif

@jodh-intel
Contributor Author

This work is superseded by Kata Containers so closing this for now.

@jodh-intel jodh-intel closed this Dec 6, 2017
mcastelino pushed a commit to mcastelino/runtime that referenced this pull request Dec 6, 2018
…ebug-output

kata-env: Fix display of debug options

5 participants