Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support for pre-upgrade command in Cosmovisor #9973

Closed
aaronc opened this issue Aug 19, 2021 · 15 comments · Fixed by #10056
Closed

Support for pre-upgrade command in Cosmovisor #9973

aaronc opened this issue Aug 19, 2021 · 15 comments · Fixed by #10056
Assignees
Labels
C:Cosmovisor Issues and PR related to Cosmovisor

Comments

@aaronc
Copy link
Member

aaronc commented Aug 19, 2021

Ref:

As described in #9692 (comment):

To summarize, we will have convention where cosmovisor executes a $ pre-upgrade command (note, this command doesn't exist in the SDK -- the application defines it) prior to starting the app. We then evaluate the following exit status codes:

0: success; continue
1: assume command does not exist; continue
30: assume command exists but failed; fail
31: assume command exists but we should retry; keep retrying until 0 or 30 is returned

I would also add that pre-upgrade handler should basically look like what we're doing now for "pre-upgrades" of the multi-store: https://github.com/cosmos/gaia/blob/main/app/app.go#L428-L439. It would check for an upgrade name and skip-heights. So the pre-upgrade command will need to take --unsafe-skip-upgrades just like start does.

Also, it's an open question whether pre-upgrade should read the upgrade info file off the disk or get that passed to it by cosmovisor.

I think we should document how to do this in the SDK and provide any known migrations needed for things like app.toml in the SDK, but I don't think we can really generalized this pre-upgrade command. Apps will need to implement it for the upgrades they have planned.

@robert-zaremba
Copy link
Collaborator

solution1: #9692 (comment)

@robert-zaremba
Copy link
Collaborator

Alternative Design

Instead of setting the upgrade process fully in the app, I propose to make it more flexible by extending the UpgradeInfo structure by adding PreUpgrade attribute:

type UpgradeInfo struct {
    Name string
    Info string
    PreUpgrade PreUpgradeInstructions
}

This is how the the PreUpgradeInstructions can look like:

{
  run: "script.sh" | "appd pre-upgrade  more-options",
  download: {
     "linux/amd64": "https://github.com/cosmos/cosmos-sdk/raw/master/cosmovisor/testdata/repo/chain3-zip_dir/autod.zip?checksum=sha256:8951f52a0aea8617de0ae459a20daf704c29d259c425e60d520e363df0f166b4"
  },
  description: "some more information about the upgrade process for devops"
}
  • run will tell cosmovisor (or an admin) what to run. It can be a shell command (if we support only POSIX shells), or an updated app binary, as in the original solution.
  • download record is optional has the same as the binaries we use to encode binaries to download in UpdateInfo.Info.
  • description: optional, more information for devops.

NOTE: we use a concrete type instead of string to be specific about the content.

This solution will solve other use-cases where we will need more advanced upgrade scenarios (eg moving files or downloading external libraries).

We can implement it in phases - and only start with run, and then ad download and description.

I was presenting this solution during the Cosmovisor WG call last Friday and got ack as a superior then the "basic" one proposed originally.

CAVEAT: if we agree to use this proposal as the solution for #9894 then we will need to include a small update in the upcoming v0.44 security release to enable the UpgradeInfo extension.

@marbar3778 , @alexanderbez , @alessio -- looking for your feedback so we can quickly push it and combine with the v0.44 if we agree so.

@robert-zaremba
Copy link
Collaborator

Another way of handling it would be to overload the existing UpgradeInfo.Info attribute. We can handle multiple versions of it's content in the cosmovisor. However we won't have a nice struct validation in the app.

@robert-zaremba
Copy link
Collaborator

robert-zaremba commented Aug 30, 2021

Important note: it seams we can change the UpgradeInfo even so it's part of the protobuf. We can do that because it's not a part of the Msg server nor a Query server. However it's a part of the legacy router (which we are deprecating). So the extending Msg restriction doesn't apply here.

@robert-zaremba
Copy link
Collaborator

@jgimeno , @spoo-bar , @tychoish , @alexanderbez -- here is the draft implementation: #10032

@robert-zaremba robert-zaremba self-assigned this Aug 30, 2021
@tac0turtle
Copy link
Member

The approach @robert-zaremba outlined seems like a good approach.

@spoo-bar
Copy link
Contributor

Now that #10032 also allows for post upgrade scripts, maybe the same exit status codes could be used for post-upgrade also?

0: success; continue
1: assume command does not exist; continue
30: assume command exists but failed; fail
31: assume command exists but we should retry; keep retrying until 0 or 30 is returned

But regarding 30, since the upgradation of binary is already done, saying the upgrade failed doesn't sound right.
Is there a better approach to handle this case? Should the upgrade and postupgrade be atomic?

@anilcse
Copy link
Collaborator

anilcse commented Aug 31, 2021

Alternative Design

Instead of setting the upgrade process fully in the app, I propose to make it more flexible by extending the UpgradeInfo structure by adding PreUpgrade attribute:

type UpgradeInfo struct {
    Name string
    Info string
    PreUpgrade PreUpgradeInstructions
}

This is how the the PreUpgradeInstructions can look like:

{
  run: "script.sh" | "appd pre-upgrade  more-options",
  download: {
     "linux/amd64": "https://github.com/cosmos/cosmos-sdk/raw/master/cosmovisor/testdata/repo/chain3-zip_dir/autod.zip?checksum=sha256:8951f52a0aea8617de0ae459a20daf704c29d259c425e60d520e363df0f166b4"
  },
  description: "some more information about the upgrade process for devops"
}
  • run will tell cosmovisor (or an admin) what to run. It can be a shell command (if we support only POSIX shells), or an updated app binary, as in the original solution.
  • download record is optional has the same as the binaries we use to encode binaries to download in UpdateInfo.Info.
  • description: optional, more information for devops.

NOTE: we use a concrete type instead of string to be specific about the content.

This solution will solve other use-cases where we will need more advanced upgrade scenarios (eg moving files or downloading external libraries).

We can implement it in phases - and only start with run, and then ad download and description.

I was presenting this solution during the Cosmovisor WG call last Friday and got ack as a superior then the "basic" one proposed originally.

CAVEAT: if we agree to use this proposal as the solution for #9894 then we will need to include a small update in the upcoming v0.44 security release to enable the UpgradeInfo extension.

@marbar3778 , @alexanderbez , @alessio -- looking for your feedback so we can quickly push it and combine with the v0.44 if we agree so.

ACK, I believe we can keep download information in the script as well. Coz, what needs to be done with downloaded file might be a question for Devs. Some might be interested to execute the binary, some might need to move it to specific directory etc.

@robert-zaremba
Copy link
Collaborator

ACK, I believe we can keep download information in the script as well.

@anilcse - to some extend. The idea is that we firstly download the script (if download is not empty) and then we execute it. We don't want to put an ugly and long script in the run attribute with rules for different OSes.

@anilcse
Copy link
Collaborator

anilcse commented Aug 31, 2021

ACK, I believe we can keep download information in the script as well.

@anilcse - to some extend. The idea is that we firstly download the script (if download is not empty) and then we execute it. We don't want to put an ugly and long script in the run attribute with rules for different OSes.

Understood, that sounds good 👍

@spoo-bar
Copy link
Contributor

@robert-zaremba Any ideas on how cosmovisor should handle errors by post upgrade scripts?
The binary has already been upgraded, so cosmovisor cant say the upgrade failed.

@robert-zaremba
Copy link
Collaborator

Decision for the v0.44 release.

Since the Alternative Design (solution 2) requires a small proto update (extending the x/upgrade Plan structure) we don't want to rush it include it into v0.44.0. Instead:

The advantages of the solution-2 is that:

  • we don't need to implement ops responsibility in the app binary
  • we provide more flexibility for different setups / chains.
  • It goes in line with a broader

@robert-zaremba
Copy link
Collaborator

Any ideas on how cosmovisor should handle errors by post upgrade scripts?

@spoo-bar - let's don't do post upgrade in this PR / issue. I suggest to continue the design discussion in #10047

Also for the future communication, we need to agree, if we want to present the outlined solution from this PR (app pre-upgrade) as a long term one or as a temporal one (note: with my proposal it's still possible to reuse app pre-upgrade command).

@alessio
Copy link
Contributor

alessio commented Sep 2, 2021

I was presenting this solution during the Cosmovisor WG call last Friday and got ack as a superior then the "basic" one proposed originally.

What makes this design superior to the "basic"? Instead of a sketched design, I'd appreciate it if you could provide the following:

  • use case
  • list of acceptance criteria (preferable in the Given/When/Then language)

Thanks!

@robert-zaremba
Copy link
Collaborator

Here is a discussion for the next cosmovisor improvements: #9928

We didn't deprecate the app pre-upgrade concept. We were rather thinking of upgrading the x/upgrade Plan / UpdateInfo structure to instrument all kind of tools responsible for handling updates. This way we have more generic solutions with instructions for upgrade as well as other features planned for future cosmovisor versions (eg library swaps).
So instead of hardcoding a rule in cosmovisor of "always try to run app pre-upgrade", this instruction will be part of the upgrade plan.

Acceptance criteria for the suggested Plan upgrade is:

  • given upgrade plan, a tool developer or a dev ops have all information for doing a successful chain upgrade
  • given upgrade plan, tool developers can independently design other orchestration tool
  • given upgrade plan, when more complex upgrade is needed, a proposal can include more instructions (eg aforementioned library swap).

We were discussing it during the WG session: https://hackmd.io/jeYE8fwDTlCdFhMYUW9mpQ

Hope this clarifies your confusions.

mergify bot pushed a commit that referenced this issue Sep 23, 2021
<!--
The default pull request template is for types feat, fix, or refactor.
For other templates, add one of the following parameters to the url:
- template=docs.md
- template=other.md
-->

## Description

Ref: #9973, #10056 

<!-- Add a description of the changes that this PR introduces and the files that
are the most critical to review. -->

* Skipping backup when skipping upgrade at specified height
* adding env var `PREUPGRADE_MAX_RETRIES` which specifies the limit on how many times to retry preupgrade on error code `31`

---

### Author Checklist

*All items are required. Please add a note to the item if the item is not applicable and
please add links to any relevant follow up issues.*

I have...

- [ ] included the correct [type prefix](https://github.com/commitizen/conventional-commit-types/blob/v3.0.0/index.json) in the PR title
- [ ] added `!` to the type prefix if API or client breaking change
- [ ] targeted the correct branch (see [PR Targeting](https://github.com/cosmos/cosmos-sdk/blob/master/CONTRIBUTING.md#pr-targeting))
- [ ] provided a link to the relevant issue or specification
- [ ] followed the guidelines for [building modules](https://github.com/cosmos/cosmos-sdk/blob/master/docs/building-modules)
- [ ] included the necessary unit and integration [tests](https://github.com/cosmos/cosmos-sdk/blob/master/CONTRIBUTING.md#testing)
- [ ] added a changelog entry to `CHANGELOG.md`
- [ ] included comments for [documenting Go code](https://blog.golang.org/godoc)
- [ ] updated the relevant documentation or specification
- [ ] reviewed "Files changed" and left comments if necessary
- [ ] confirmed all CI checks have passed

### Reviewers Checklist

*All items are required. Please add a note if the item is not applicable and please add
your handle next to the items reviewed if you only reviewed selected items.*

I have...

- [ ] confirmed the correct [type prefix](https://github.com/commitizen/conventional-commit-types/blob/v3.0.0/index.json) in the PR title
- [ ] confirmed `!` in the type prefix if API or client breaking change
- [ ] confirmed all author checklist items have been addressed 
- [ ] reviewed state machine logic
- [ ] reviewed API design and naming
- [ ] reviewed documentation is accurate
- [ ] reviewed tests and test coverage
- [ ] manually tested (if applicable)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C:Cosmovisor Issues and PR related to Cosmovisor
Projects
None yet
6 participants