
Atomic migrations #9616

Closed
3 of 4 tasks
robert-zaremba opened this issue Jun 30, 2021 · 20 comments
Labels
C:Cosmovisor Issues and PR related to Cosmovisor C:Store C:x/upgrade

Comments

@robert-zaremba
Collaborator

robert-zaremba commented Jun 30, 2021

Summary

Today, migrations are not always atomic. If a migration fails, we don't have an easy solution to roll back the changes.

Problem Definition

In-place migrations are a great addition to the SDK. However, if not tested carefully, they can cause extensive problems for a node admin.
When running a migration, changes are written to disk:

  • if we add a new store, it's committed immediately;
  • if someone does not use a cache-wrapped store with an open-and-commit phase (e.g. because it would consume too much memory), every write goes straight to disk.

If a migration fails during one of these operations, the node is left in a corrupted, possibly unmanageable state. The only way to recover is to sync from another healthy node or restore a backup.
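
For illustration, a minimal sketch of the cache-wrap pattern mentioned above, assuming a standard upgrade handler; the helper name is hypothetical, while `CacheContext` is the SDK API that buffers writes until the returned write function is called:

```go
package upgradesketch

import sdk "github.com/cosmos/cosmos-sdk/types"

// runMigrationAtomically buffers all migration writes in memory and only
// flushes them to the underlying store if the migration succeeds.
func runMigrationAtomically(ctx sdk.Context, migrate func(sdk.Context) error) error {
	// CacheContext returns a context backed by an in-memory cache store.
	cacheCtx, writeCache := ctx.CacheContext()

	if err := migrate(cacheCtx); err != nil {
		// Nothing was written to disk; the buffered changes are simply dropped.
		return err
	}

	// Commit the buffered writes to the parent store in one step.
	writeCache()
	return nil
}
```

The trade-off is the one named in the second bullet above: every buffered write lives in memory until `writeCache()` runs, which can be prohibitive for large migrations.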

Proposal

We need a friendlier mechanism for handling backups.

ADR-40 will make this easy because it uses DB-level checkpoints, but that won't be implemented in 0.44.

A few proposals:

  1. add a flag to Cosmovisor to copy the DB (this will roughly double the disk requirements; see the sketch after this list)
  2. add more options to the migration process
  3. filesystem-level rollback based on a journal
  4. implement an ADR-40-based checkpoint mechanism in the store (this would limit the number of supported databases and would require a DB migration for many nodes)
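
A minimal sketch of option (1), assuming Cosmovisor simply copies the node's data directory before switching binaries; the function and paths are illustrative, not the actual Cosmovisor implementation:

```go
package cosmovisorsketch

import (
	"io"
	"os"
	"path/filepath"
)

// backupDataDir recursively copies the data directory to a backup location
// before the upgrade runs. Note this roughly doubles the disk requirements.
func backupDataDir(src, dst string) error {
	return filepath.Walk(src, func(path string, info os.FileInfo, err error) error {
		if err != nil {
			return err
		}
		rel, err := filepath.Rel(src, path)
		if err != nil {
			return err
		}
		target := filepath.Join(dst, rel)
		if info.IsDir() {
			return os.MkdirAll(target, info.Mode())
		}
		in, err := os.Open(path)
		if err != nil {
			return err
		}
		defer in.Close()
		out, err := os.OpenFile(target, os.O_CREATE|os.O_WRONLY|os.O_TRUNC, info.Mode())
		if err != nil {
			return err
		}
		defer out.Close()
		_, err = io.Copy(out, in)
		return err
	})
}
```

On failure, the operator (or Cosmovisor itself) can delete the corrupted data directory and move the backup back into place.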

For Admin Use

  • Not duplicate issue
  • Appropriate labels applied
  • Appropriate contributors tagged
  • Contributor assigned/self-assigned
@robert-zaremba
Collaborator Author

Thinking about it more, option (4) is the best; option (1) is the easiest.

@robert-zaremba
Collaborator Author

@zmanian, @jackzampolin -- I guess in the past this was not a problem because we were always exporting the state (genesis), and that took time anyway.

@robert-zaremba robert-zaremba self-assigned this Jun 30, 2021
@alexanderbez
Contributor

You could do (1) and (4): do (1) first, and then (4) if you think you can complete it in a reasonable amount of time. If not, you can at least fall back on (1).

@aaronc
Member

aaronc commented Jun 30, 2021

I think it is helpful to clarify that migrations are only non-atomic if KVStores are being added, renamed, or deleted. That will only happen if modules are added or removed, due to the multistore design. If there are no new modules and just migration code, the migrations will be atomic.
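
To make the non-atomic case concrete: adding a module changes the multistore itself through `StoreUpgrades`, which are applied when the store is loaded, before and outside any migration cache wrap. A hedged sketch, with a hypothetical module name:

```go
package appsketch

import (
	"github.com/cosmos/cosmos-sdk/baseapp"
	storetypes "github.com/cosmos/cosmos-sdk/store/types"
	upgradetypes "github.com/cosmos/cosmos-sdk/x/upgrade/types"
)

// setStoreLoaderForUpgrade registers the store changes for an upgrade.
// The new KVStore is created and committed at store load time, so a later
// failure in the migration handlers cannot roll it back.
func setStoreLoaderForUpgrade(app *baseapp.BaseApp, upgradeHeight int64) {
	storeUpgrades := storetypes.StoreUpgrades{
		Added: []string{"newmodule"}, // hypothetical new module store key
	}
	app.SetStoreLoader(upgradetypes.UpgradeStoreLoader(upgradeHeight, &storeUpgrades))
}
```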

@zmanian
Member

zmanian commented Jul 2, 2021

Yeah, the way this was achieved in the past was via state exports. This is a good and important observation. We definitely need a checkpoint-and-recovery mechanism.

Can we get some kind of primitive snapshot functionality into cosmovisor before the cosmoshub upgrade?

@alexanderbez
Contributor

@zmanian I think the easiest approach for now is for operators to just take a backup of their data directory prior to upgrading.

@zmanian
Member

zmanian commented Jul 2, 2021

As a practical matter this would mean telling people not to use cosmovisor during the upgrade and I've been asked to tell people to use cosmovisor.

@alexanderbez
Contributor

alexanderbez commented Jul 2, 2021

I think they could/should still use it. In the case of an error, they'd have to wipe the resulting data directory and restart cosmovisor?

@anilcse
Collaborator

anilcse commented Jul 2, 2021

> I think they could/should still use it. In the case of an error, they'd have to wipe the resulting data directory and restart cosmovisor?

Yes, there's no straightforward way to roll back in case of upgrade failures, so the operator needs to wipe the data and use a backup or resync.

Maybe one quick fix we can add to Cosmovisor: take a backup if the disk space is available, controlled via an env setting that defaults to true. If it's set to false, Cosmovisor can just try upgrading, and in case of failure node operators can fix it manually later. In the other case, where a backup is available, on failure Cosmovisor will use skip-upgrade and restart with the old binary.

@anilcse
Collaborator

anilcse commented Jul 2, 2021

Or maybe simply, `UNSAFE_UPGRADE` (sketched below):

  • If true, Cosmovisor will try to upgrade without any backup.
  • If false (default), Cosmovisor will try to take a backup and then upgrade. If taking the backup fails, it will halt the process there and won't attempt the upgrade.
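
A rough sketch of how the proposed env gate could work inside Cosmovisor; the names follow the proposal above and are not the shipped implementation (the setting that was eventually merged is `UNSAFE_SKIP_BACKUP`, see the PR referenced later in this thread):

```go
package cosmovisorsketch

import (
	"fmt"
	"os"
	"strconv"
)

// upgradeWithBackup sketches the proposed UNSAFE_UPGRADE behavior: by default,
// take a backup first and halt if the backup fails; if the operator opts out,
// attempt the upgrade directly with no safety net.
func upgradeWithBackup(dataDir string, backup func(string) error, upgrade func() error) error {
	// An unset or unparsable value counts as false, i.e. the safe default.
	unsafe, _ := strconv.ParseBool(os.Getenv("UNSAFE_UPGRADE"))

	if !unsafe {
		if err := backup(dataDir); err != nil {
			// Halt: without a backup there is no safe way to attempt the upgrade.
			return fmt.Errorf("backup failed, refusing to upgrade: %w", err)
		}
	}
	return upgrade()
}
```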

@alexanderbez
Contributor

ACK

@aaronc
Member

aaronc commented Jul 2, 2021

@ethanfrey I wonder if you have any insights on some way we might be able to provide a clean rollback?

I do agree, though, that the safest thing is always to do a backup, so having that as the default, as @anilcse is suggesting, seems wise.

@aaronc aaronc changed the title Atomic migrations Atomic Cosmovisor migrations Jul 2, 2021
@aaronc
Member

aaronc commented Jul 2, 2021

Also, are we rolling back the panic on empty minimum gas prices which caused the upgrade failures, @anilcse @robert-zaremba? Seems like we should just emit a warning until the next release.

@amaury1093
Contributor

> Also, are we rolling back the panic on empty minimum gas prices which caused the upgrade failures, @anilcse @robert-zaremba? Seems like we should just emit a warning until the next release.

This just got merged.

@robert-zaremba
Collaborator Author

Let's start with option (1): a full backup controlled by a flag.
Later we will use the snapshot mechanism.

@robert-zaremba
Collaborator Author

BTW: some upgrades can take a lot of memory (e.g. if we were to update all bank-managed storage) if we keep the cache wrapper.

@ryanchristo ryanchristo added this to the cosmovisor v1.0 milestone Jul 7, 2021
@robert-zaremba robert-zaremba changed the title Atomic Cosmovisor migrations Atomic migrations Jul 30, 2021
@robert-zaremba robert-zaremba removed this from the cosmovisor v0.1 milestone Jul 30, 2021
@robert-zaremba robert-zaremba added this to the v0.44 milestone Jul 30, 2021
@robert-zaremba
Collaborator Author

robert-zaremba commented Jul 30, 2021

Oh, @aaronc -- I've just noticed that you added "Cosmovisor" to the title and label. I think it's not fully related to Cosmovisor; however, it will be partially solved with:

And fully solved with ADR-40 snapshots.

@aaronc
Member

aaronc commented Jul 30, 2021

Right, sorry about that.

mergify bot pushed a commit that referenced this issue Aug 5, 2021

## Description

Ref: #9616 (comment)

depends: #8590


This PR adds a full backup option for Cosmovisor.
`UNSAFE_SKIP_BACKUP` is a newly introduced env setting:
- if `false` (default, **recommended**), Cosmovisor will try to take a backup and then upgrade. If taking the backup fails, it will halt the process there and won't attempt the upgrade.
- if `true`, Cosmovisor will try to upgrade without any backup. This setting makes it hard to recover from a failed upgrade; node operators either need to sync from a healthy node or use a snapshot from others.

@github-actions github-actions bot added the stale label Sep 14, 2021
@amaury1093 amaury1093 added pinned and removed stale labels Sep 21, 2021
@amaury1093 amaury1093 reopened this Sep 21, 2021
@tac0turtle tac0turtle added R:0.46 and removed pinned labels May 9, 2022
@tac0turtle tac0turtle moved this to 📝 Todo in Cosmos-SDK May 12, 2022
@tac0turtle tac0turtle removed the R:0.46 label May 27, 2022
@tac0turtle tac0turtle removed this from the v0.46 milestone May 27, 2022
@tac0turtle
Member

Closing as this is touched on in the current store discussions.

Repository owner moved this from 📝 Todo to 👏 Done in Cosmos-SDK Oct 21, 2022
@robert-zaremba
Collaborator Author

The issue is still present, so I think we should keep it open until resolved.

@tac0turtle tac0turtle removed this from Cosmos-SDK Nov 16, 2023