[BUG] Failed to restore snapshot - snapshot missing hash #749
@emaildanwilson @jvassev thanks for opening the issue. Are you folks seeing this on the newly released patch version? Will look into this, since we haven't observed this behavior yet on
@emaildanwilson @jvassev The etcd and backup-restore versions you're using are no longer compatible; please see the release notes of v0.25.0. You're using an older version of the etcd custom image (which uses a base image) with a newer version of
To make it work, you have two options:
If this works, you can close the issue. If it doesn't, please let us know :)
Thank you! We'll make the change and see if the issue still happens.
@anveshreddy18 Thanks for the advice! I decided to downgrade
@jvassev @emaildanwilson, I'd say that's a good enough starting point along with the other files in the
I would also suggest you take a look at
We have quite a major release coming up as well (in the next couple weeks).
Let us know if
We strongly urge that you move away from
/assign
Hi @jvassev @emaildanwilson ,
It looks like etcd failed to append
We have to detect a corrupt snapshot, or a snapshot missing its hash (taken via snapshot), early rather than waiting until restoration. I have the following methods in mind:
This method will work, but starting an embedded etcd while taking a full snapshot and waiting for the restoration to complete can be somewhat time-consuming. Still, it would safeguard us against the scenario where restoration fails due to a corrupt snapshot.
But I'm not sure how to calculate the hash of the db up to revision x. Is there an API call available for that? I guess the HashKV API call won't work here: the value returned by HashKV up to revision x can't equal the hash of a snapshot taken up to revision x (with the appended hash removed), because HashKV calculates the hash of all MVCC key-values, whereas a snapshot is a copy of the whole etcd db, which also contains cluster information, so the hashes will not match.
Running with the suggested versions:
Occasionally this happens:
I think it may have to do with abrupt node restarts, since we mostly see this on spot instances. Also including the full log from an attempted restore:
Describe the bug:
Incomplete full backups can get stored in S3, and the automatic restore then fails on startup.
Expected behavior:
If a backup is missing the hash, it should never get copied to S3, or the restore should be able to skip incomplete backups via an option.
How To Reproduce (as minimally and precisely as possible):
We don't have specific reproduction steps, but we hit this issue almost daily when using spot instances. We suspect that an incomplete file is written and then packaged up and sent to S3, possibly after a restart, but that's just a guess based on the behavior we're seeing.
Logs:
Backup files in S3 and their sizes. Note that the latest full backup is smaller than the previous one:
command to create the restore
restore error message
If run with skip-hash-check=true it segfaults:
Screenshots (if applicable):
Environment (please complete the following information):
Anything else we need to know?:
backup sidecar configuration:
If we manually delete the bad full backup then the restore completes successfully.
@jvassev can provide more details if needed.