Atomic migrations #9616
Thinking about it more, option (4) is the best. Option (1) is the easiest.
@zmanian, @jackzampolin -- I guess in the past this was not a problem because we were always exporting the state (genesis) and it took time anyway.
You could do (1) and (4): do (1) first, and then (4) if you think you can get it done in a reasonable amount of time. If not, you can at least fall back on (1).
I think it is helpful to clarify that migrations are only non-atomic if KVStores are being added, renamed, or deleted. That will only happen if modules are added or removed, due to the multi-store design. If there are no new modules and just migration code, the migrations will be atomic.
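For context, here is a minimal sketch of the kind of store-level change that falls into that non-atomic category. It assumes a typical `app.go` with an `App` type embedding `*baseapp.BaseApp`; the module names and upgrade height are placeholders.

```go
package app

import (
	storetypes "github.com/cosmos/cosmos-sdk/store/types"
	upgradetypes "github.com/cosmos/cosmos-sdk/x/upgrade/types"
)

// setStoreUpgrades wires the store-level changes that only happen when
// modules are added or removed. "newmodule", "legacymodule" and the height
// are hypothetical placeholders.
func (app *App) setStoreUpgrades() {
	const upgradeHeight = 1_000_000 // placeholder

	storeUpgrades := storetypes.StoreUpgrades{
		Added:   []string{"newmodule"},    // a brand-new KVStore
		Deleted: []string{"legacymodule"}, // a KVStore being dropped
	}

	// The upgrade store loader mounts/unmounts the listed stores when the node
	// restarts at the upgrade height; value-only migrations inside existing
	// stores don't need any of this.
	app.SetStoreLoader(upgradetypes.UpgradeStoreLoader(upgradeHeight, &storeUpgrades))
}
```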
Yeah, the way this was achieved in the past was via state exports. This is a good and important observation. We definitely need a checkpoint and recovery mechanism. Can we get some kind of primitive snapshot functionality into cosmovisor before the Cosmos Hub upgrade?
@zmanian I think the easiest approach for now is for operators to just take a backup of their data directory prior to upgrading.
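For illustration, a rough sketch of what such a backup amounts to if done programmatically; the paths are placeholders, and copying the directory with normal OS tooling works just as well.

```go
package main

import (
	"io"
	"os"
	"path/filepath"
)

// copyDir recursively copies src into dst, creating directories as needed.
func copyDir(src, dst string) error {
	return filepath.Walk(src, func(path string, info os.FileInfo, err error) error {
		if err != nil {
			return err
		}
		rel, err := filepath.Rel(src, path)
		if err != nil {
			return err
		}
		target := filepath.Join(dst, rel)
		if info.IsDir() {
			return os.MkdirAll(target, info.Mode())
		}
		in, err := os.Open(path)
		if err != nil {
			return err
		}
		defer in.Close()
		out, err := os.OpenFile(target, os.O_CREATE|os.O_WRONLY|os.O_TRUNC, info.Mode())
		if err != nil {
			return err
		}
		defer out.Close()
		_, err = io.Copy(out, in)
		return err
	})
}

func main() {
	home, _ := os.UserHomeDir()
	// Placeholder paths: back up the node's data directory before upgrading.
	if err := copyDir(filepath.Join(home, ".gaia", "data"), filepath.Join(home, ".gaia", "data-backup")); err != nil {
		panic(err)
	}
}
```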
As a practical matter this would mean telling people not to use cosmovisor during the upgrade, and I've been asked to tell people to use cosmovisor.
I think they could/should still use it. In the case of an error, they'd have to wipe the resulting data directory and restart cosmovisor?
Yes, there's no straightforward way to roll back in case of an upgrade failure, so we need to wipe the data and use a backup or resync. Maybe one quick fix we can add to cosmovisor is to take a backup if the disk space is available, or via an ENV setting that defaults to taking a backup.
Or maybe simply,
ACK
@ethanfrey I wonder if you have any insights on some way we might be able to provide a clean rollback? I do agree, though, that the safest thing is always to do a backup, so having that as the default, as @anilcse is suggesting, seems wise.
Also, are we rolling back the panic on empty minimum gas prices which caused the upgrade failures, @anilcse @robert-zaremba? It seems like we should just emit a warning until the next release.
This just got merged.
Let's start with option (1): a full backup controlled by a flag.
BTW: some upgrades can take a lot of memory (e.g. if we were to update all bank-managed storage) if we keep the cache wrapper.
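To make that trade-off concrete, here is an illustrative sketch (not SDK internals; `migrateEverything` is a hypothetical placeholder) of why the cache wrapper costs memory: every write the migration makes is buffered in the branched store until it is flushed, which is exactly what keeps the migration atomic for existing stores.

```go
package app

import (
	sdk "github.com/cosmos/cosmos-sdk/types"
)

// migrateEverything stands in for the actual module migrations; placeholder.
func migrateEverything(ctx sdk.Context) error { return nil }

// runMigrationAtomically branches the multistore in memory, runs the
// migrations against the branch, and only flushes on success. This is what
// makes the migration atomic for existing stores, and also why a migration
// that rewrites a lot of state (e.g. every bank balance) keeps all of those
// writes in RAM until writeCache() is called.
func runMigrationAtomically(ctx sdk.Context) error {
	cacheCtx, writeCache := ctx.CacheContext()

	if err := migrateEverything(cacheCtx); err != nil {
		// Nothing reached disk; dropping the branch is the rollback.
		return err
	}

	writeCache() // flush all buffered writes at once
	return nil
}
```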
Oh, @aaronc - I've just noticed that you added "Cosmovisor" to the title and label. I think it's not fully related to cosmovisor; however, it will be partially solved with:
And fully solved with ADR-40 snapshots.
Right, sorry about that.
## Description

Ref: #9616 (comment)
Depends: #8590

This PR adds a full backup option for cosmovisor. `UNSAFE_SKIP_BACKUP` is a newly introduced `env` setting:

- If `false` (default, **recommended**), cosmovisor will try to take a backup and then upgrade. If the backup fails, it halts the process there and does not attempt the upgrade.
- If `true`, cosmovisor will try to upgrade without any backup. This setting makes it hard to recover from a failed upgrade: node operators either need to sync from a healthy node or use a snapshot from others.
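A hedged sketch of the control flow that description implies, not the actual cosmovisor implementation: back up by default, abort the upgrade if the backup fails, and only skip the backup when explicitly asked to. `backupDataDir` and `performUpgrade` are hypothetical helpers.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

func upgradeWithBackup(dataDir string) error {
	// Unset or unparsable values fall through to false, the recommended default.
	skip, _ := strconv.ParseBool(os.Getenv("UNSAFE_SKIP_BACKUP"))

	if !skip {
		if err := backupDataDir(dataDir); err != nil {
			// Halt here: without a backup we don't attempt the upgrade at all.
			return fmt.Errorf("backup failed, aborting upgrade: %w", err)
		}
	}
	return performUpgrade(dataDir)
}

// Hypothetical helpers, stubbed out so the sketch compiles.
func backupDataDir(dataDir string) error  { return nil } // copy dataDir aside, as sketched earlier
func performUpgrade(dataDir string) error { return nil } // switch binaries and restart the node
```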
Closing as this is touched on in the current store discussions.
The issue is still present, so I think we should keep it open until resolved.
Summary
Today, migrations are not always atomic. If a migration fails, we don't have an easy way to roll back the changes.
Problem Definition
In-place migrations are a great addition to the SDK. However, if not tested carefully, they can cause serious problems for a node admin.
When running a migration, changes are written to disk.
If a migration fails during one of the operations listed above, the node is left in a corrupted, possibly unmanageable state. The only way to recover is to sync from another healthy node or restore a backup.
Proposal
We need to find a friendlier mechanism for handling backups.
ADR-40 will make this easy because it uses DB-level checkpoints, but that won't be implemented in 0.44.
A few proposals:
For Admin Use