[Feature] Implement etcd data validation #13

Open
Tracked by #1
shreyas-s-rao opened this issue Nov 6, 2023 · 0 comments
Labels
kind/enhancement Enhancement, improvement, extension lifecycle/stale Nobody worked on this for 6 months (will further age)

Comments

@shreyas-s-rao
Collaborator

How to categorize this issue?

/kind enhancement

What would you like to be added:
Steward should validate the etcd DB: detect data corruption, check etcd DB file locks (and recover from them), detect volume mismatches (and update the EtcdMember status accordingly), and return well-defined, non-overlapping error codes.
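
One way to keep the error codes non-overlapping is to map the outcome of all checks to exactly one code through a fixed precedence. The following is only a rough sketch: the type names, code names, and the precedence order are hypothetical and are not taken from steward or etcd-backup-restore.

```go
package main

import "fmt"

// ValidationCode is a hypothetical result code for etcd data validation.
// The names below are illustrative assumptions, not steward's actual codes.
type ValidationCode int

const (
	CodeOK ValidationCode = iota
	CodeVolumeMismatch
	CodeDBLocked
	CodeDBCorrupt
	CodeRevisionLag
)

// String returns a short, human-readable name for the code.
func (c ValidationCode) String() string {
	switch c {
	case CodeOK:
		return "ok"
	case CodeVolumeMismatch:
		return "volume-mismatch"
	case CodeDBLocked:
		return "db-locked"
	case CodeDBCorrupt:
		return "db-corrupt"
	case CodeRevisionLag:
		return "revision-lag"
	default:
		return "unknown"
	}
}

// Findings holds the raw outcome of each individual check (hypothetical).
type Findings struct {
	VolumeMismatch bool
	DBLocked       bool
	DBCorrupt      bool
	RevisionLag    bool
}

// Classify maps a set of findings to exactly one code. The precedence
// (volume, lock, corruption, revision) is an assumption for illustration;
// the point is that every combination of findings yields a single code.
func Classify(f Findings) ValidationCode {
	switch {
	case f.VolumeMismatch:
		return CodeVolumeMismatch
	case f.DBLocked:
		return CodeDBLocked
	case f.DBCorrupt:
		return CodeDBCorrupt
	case f.RevisionLag:
		return CodeRevisionLag
	default:
		return CodeOK
	}
}

func main() {
	// Two findings at once still produce a single, unambiguous code.
	fmt.Println(Classify(Findings{DBLocked: true, DBCorrupt: true}))
}
```

With a fixed precedence, callers can switch on a single code instead of inspecting several overlapping error conditions.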

Changes to be made from the existing data validation logic in etcd-backup-restore:

  • Do not perform revision checks for multi-node clusters. A single-node etcd is guaranteed to have a DB revision greater than the latest snapshot revision in the backup store, but this does not hold for multi-node etcd clusters, where a member can lag behind for various reasons (network delays, member updates, member restarts, member restorations, etc.). The revision check succeeds only if the etcd DB revision is greater than the latest snapshot revision in the backup store, so it should not be performed for multi-node etcd clusters.
  • Revision checks are still necessary for single-node etcds. If the DB is not corrupt but lags behind the backup store in revisions, a partial restoration should be performed, as described in [Feature] Implement failure-tolerant etcd restoration #11, and a full restoration should be avoided.
  • The additional revision check upon a WAL flush should not be performed for multi-node etcd clusters, since it adds no benefit to the data validation flow.
  • Explore the use of the data validation/verification checks already provided by upstream etcd, and keep steward code as lean as possible.
  • Explore the use of the data corruption alarm to determine data validity; re-use [Feature] Handle etcd alarms #6 if necessary.
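
The revision-check rules in the list above could be sketched roughly as follows. This is a minimal illustration, not steward code: `RevisionAction`, the `Action` values, and the parameter names are hypothetical, and treating equal revisions as "not lagging" is an assumption.

```go
package main

import "fmt"

// Action is a hypothetical follow-up step decided by the revision check.
type Action string

const (
	ActionNone           Action = "none"
	ActionPartialRestore Action = "partial-restore"
)

// RevisionAction decides what to do based on the etcd DB revision and the
// latest snapshot revision in the backup store. It assumes the DB has
// already been found to be free of corruption.
func RevisionAction(memberCount int, dbRevision, latestSnapshotRevision int64) Action {
	// Multi-node clusters: a member may legitimately lag behind the backup
	// store, so the revision check is skipped entirely.
	if memberCount > 1 {
		return ActionNone
	}
	// Single-node etcd: a DB that lags behind the backup store calls for a
	// partial restoration rather than a full one.
	// Assumption: equal revisions are treated as not lagging.
	if dbRevision < latestSnapshotRevision {
		return ActionPartialRestore
	}
	return ActionNone
}

func main() {
	fmt.Println(RevisionAction(3, 10, 100)) // multi-node: check skipped
	fmt.Println(RevisionAction(1, 10, 100)) // single-node, lagging DB
}
```

Keeping the decision in one small pure function makes the single-node and multi-node behaviour easy to test without a running etcd.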

Why is this needed:
Part of #1

@shreyas-s-rao shreyas-s-rao added the kind/enhancement Enhancement, improvement, extension label Nov 6, 2023
@gardener-robot gardener-robot added the lifecycle/stale Nobody worked on this for 6 months (will further age) label Jul 15, 2024