Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Release post-mortem #274

Open
Veetaha opened this issue Oct 5, 2023 · 21 comments
Open

Release post-mortem #274

Veetaha opened this issue Oct 5, 2023 · 21 comments
Labels
A-infra Area: Infrastructure and CI magic :sparkles: C-discussion Category: Discussion / Design decision. See also rust-marker/design

Comments

@Veetaha
Copy link
Contributor

Veetaha commented Oct 5, 2023

0.3.0 release was done, but there were a number of problems with it.

  • The crates-io publishing job failed. I wonder why is that. It managed to publish part of the crates, but choked on marker_utils complaining that the publishing identity doesn't have rights to publish that package. @xFrednet could you elaborate how that could happen and what measures were taken to remediate this?
  • The release CI job couldn't push commits to master due to the merge queue enabled on the repo. I didn't test this process with a merge queue enabled in the repo, and I see that it was fixed. git push --force-with-lease didn't resolve the issue and should be removed from the CI template. I suppose you tweaked the repo settings somehow to unblock this?
@Veetaha Veetaha added the S-needs-triage Status: This issue needs triage label Oct 5, 2023
@xFrednet
Copy link
Member

xFrednet commented Oct 5, 2023

The crates-io publishing job failed.

I forgot to add the rust-marker/crate-owners group to the marker_utils crate. 😅 That's fixed now. I decided to push the remaining packages myself for this release, as I was uncertain how cargo release would handle pushing crate versions that have already been published.

I suppose you tweaked the repo settings somehow to unblock this?

I tried several configurations, but it seems like the "Require pull request", "Require status check" and "Require merge queues" block pushes completely, even force pushes. One reason could be, that rust-marker-ci only has write access and is not registered as a repo admin. On the other hand, I've listed rust-marker-ci as a user that's allowed to force push.

Anyways, my non-elegant solution was to just remove those branch protection options completely, run the action and then add them back. Not sure how to solve this properly. Maybe we have to go through PRs and wait for them to be manually merged as well? 🤔


There are two more things I've noticed:

  1. The "🚧 Development v0.4.0-dev" commit had a red CI. This actually makes sense, since the "GitHub Action" test tries to fetch the binaries from release, but they haven't been compiled yet.

    I simply rerun the action, once all binaries were uploaded. This made master green again. I don't see a simple way to solve this. We can't really wait with the pushes, until the binaries are online, since the binaries are build with those commits. Rerunning the CI was easy, just something to keep in mind :)

  2. Not related to the github release process, but more releases in general. I noticed a bug in marker_uitest shortly before the release, while playing around with the marker-example-lints repo. I think it would be good to define some smoke tests (And maybe partially automate them) to run before every release. It could be a simple:

    • Check that the ui_tests of Marker's lint-crate-template pass after a clean build
    • Check Marker doesn't crash on common crates, like tokio, syn, clap, bevy, etc...

I still think this process went pretty well. It was cool to see how everything worked, once the tags were pushed. I've already updated the template repo, to use the new CI as well: rust-marker/lint-crate-template#5

This has been a fantastic release, in big part due to your help. ❤️

On a related note, would you like to join the rust-marker organization? This would come with no additional responsibilities, it might just make cooperation a bit easier. I've emailed you on your commit email, a couple of days ago. This is not meant to put any pressure on you, just thought I'd mention it here, in case you don't check that email anymore :)

@xFrednet xFrednet added A-infra Area: Infrastructure and CI magic :sparkles: C-discussion Category: Discussion / Design decision. See also rust-marker/design and removed S-needs-triage Status: This issue needs triage labels Oct 5, 2023
@Veetaha
Copy link
Contributor Author

Veetaha commented Oct 5, 2023

The "🚧 Development v0.4.0-dev" commit had a red CI

This problem is easily solvable with making 404 errors retriable. We could have an optional env variable for the installation script that configures the duration of retries. We'll set that to a much higher value on our CI that would give the CI job say 30 minutes of retries before it gives up and fails the CI.


Maybe we have to go through PRs and wait for them to be manually merged as well? 🤔

That may be too complex to implement for little benefit. Why complex? Because this extends the path that the release needs to go through. We can't push the commits and tag them in the same CI job, and it would also increase the risk of release process failing in the middle and leaving the repo in a half-consistent state (actually, I would, for example, like to return a single git push --atomic into the release steps on CI, I separated the branch and tag pushes because I wanted to avoid accidental overwrites of master, but with --force-with-lease as you showed that makes it much safer).
It would also make the release process longer since you need to wait for the PR checks to pass as well...
So I would like to avoid going through a PR for the release process.

I suggest disabling the merge queue requirement. If there is a way to disable it only for rust-marker-ci, then that would be perfect, if not, I don't think that's going to be a big problem since to be able to push to master you need to have write access to the repo, and only a few people have that.

I think it would be good to define some smoke tests

I'd start with adding the smoke tests you mentioned into the PR CI. I believe they should be cheap enough for this. Remember that GitHub gives us loads of compute time for free, we may want to squeeze all of that unless such smoke tests would run on the critical path of merging the PR (i.e. be the longest job on CI). If the CI time becomes unacceptable the heavy jobs should be moved to a later stage, but I don't think we are close to any of multi-hour tests at this point yet, so let's enjoy this while we can.

in case you don't check that email anymore

You're right, I don't check that email usually. Thanks for the invite I don't see where to accept it. I'll try to dedicate some time to marker, but won't promise anything. I'm currently quite busy on work trying to deliver some features as an AWS partner for AWS re:Invent scheduled for the end of November, but hopefully not 24/7, so will see.

@xFrednet
Copy link
Member

xFrednet commented Oct 5, 2023

We could have an optional env variable for the installation script that configures the duration of retries.

That might be a good and simple solution :)

So I would like to avoid going through a PR for the release process.

Yeah, I agree with that reasoning. There should be a setting to allow force pushes, my guess is, that I just didn't configure it correctly 🤔

I suggest disabling the merge queue requirement.

Sure, we can test that 👍

You're right, I don't check that email usually. Thanks for the invite I don't see where to accept it. I'll try to dedicate some time to marker, but won't promise anything. I'm currently quite busy on work trying to deliver some features as an AWS partner for AWS re:Invent scheduled for the end of November, but hopefully not 24/7, so will see.

Happy to hear that! It was just a message asking if you want to join, now I sent you the official invite. If you have some time for Marker, it would be highly appreciated, but no pressure. You already invested a lot! And that sounds like a significant project, congratulations :D

@xFrednet
Copy link
Member

I'm currently looking through the settings, and found this setting.

image

I've now added rust-marker-ci to allow it to bypass the PR rule. I've also changed the merge strategy to be "squash and merge". Let's see how that feels for some time.

@Veetaha
Copy link
Contributor Author

Veetaha commented Oct 12, 2023

Looks much better to me
image

Although if you look under the covers, then you'll see a dump of hard work 😸. It tells me that GitHub definitely doesn't know how their own product is used even if it is dogfooded. But I don't care, I still think this is a lesser evil.

image

@xFrednet
Copy link
Member

It's sad that the PR description is lost. Especially in Clippy's changelog PRs, I always have a lot of fun with that :D.

Regardless, I agree that it looks nice 👍

@xFrednet
Copy link
Member

It looks like the --force-push and --atomic thing wasn't the right solution either:

https://github.com/rust-marker/marker/actions/runs/6888949297/job/18738942106

Oh well, I'll temporary disable the rules for the release again

@Veetaha
Copy link
Contributor Author

Veetaha commented Nov 16, 2023

Just replace --force-with-lease with --force. --force-with-lease doesn't work for tags, because --force-with-lease thinks that the tag that we have locally that was changed is out of date, when it's actually intentionally different from remote

@xFrednet
Copy link
Member

The crates.io push failed again, apparently I forgot to add our CI to marker_uitest as well 🤦

One more error came up, regarding the retry count: https://github.com/rust-marker/marker/actions/runs/6889225423/job/18739805576

Would you mind taking a look at that?

@Veetaha
Copy link
Contributor Author

Veetaha commented Nov 16, 2023

Yes, here is a quick fix #313

@Veetaha
Copy link
Contributor Author

Veetaha commented Nov 16, 2023

Btw. did something change in the merge settings or did github break something? I see that merge commits are created:
image

@xFrednet
Copy link
Member

Ahh, it probably changed, when I unprotected the branch for the release, I'll fix it 👍

@xFrednet xFrednet reopened this Nov 16, 2023
@xFrednet
Copy link
Member

xFrednet commented Nov 16, 2023

(Didn't mean to close it, wrong button)

... and the merges are fixed again

@Veetaha
Copy link
Contributor Author

Veetaha commented Nov 16, 2023

This fixed the problem, driver build on ARM succeeded. Also, cargo-release failed to publish because all crates are already published. I suppose you published them manually, but if you fixed the crates.io permissions then that workflow would've published the missing crates during a retry.

@xFrednet
Copy link
Member

Thank you for the quick fix! :D

I saw that it skipped the released crated, that's good to know for the next time. Though, the permissions should be right for the next time.

@xFrednet
Copy link
Member

Btw. I've noticed that the install scripts CI on master is failing after a release: https://github.com/rust-marker/marker/actions/runs/6984551418/job/19007538285

A simple rerun after the binaries are available fixes it. I guess we can also set the environment value here, that we use to make the action script retry multiple times 🤔

@xFrednet
Copy link
Member

Oh and the crates.io step failed again. This time with a login error 🤔

Updating crates.io index
    Packaged 25 files, 119.3KiB (30.2KiB compressed)
   Uploading cargo_marker v0.4.1 (/home/runner/work/marker/marker/cargo-marker)
error: failed to publish to registry at https://crates.io/

Caused by:
  the remote server responded with an error (status 403 Forbidden): must be logged in to perform that action
Error: Process completed with exit code 101

https://github.com/rust-marker/marker/actions/runs/6984551469/job/19007537021

It's kind of weird, since the CI looks the same, for this and the previous command:

- run: cargo release publish --exclude cargo_marker --execute --no-confirm --no-verify --allow-branch '*'
env:
CARGO_REGISTRY_TOKEN: ${{ secrets.CRATES_IO_TOKEN }}
- run: cargo release publish --package cargo_marker --execute --no-confirm --no-verify --allow-branch '*'
env:
CARGO_REGISTRY_TOKEN: ${{ secrets.CRATES_IO_TOKEN }}

I've now manually published cargo_marker v0.4.1

@xFrednet
Copy link
Member

Just running the release CI without any changes to the branch protection still fails:

https://github.com/rust-marker/marker/actions/runs/6989462105/job/19018026259

These are the settings for the branch protection

image

Even after disabling the merge queue it still fails:

https://github.com/rust-marker/marker/actions/runs/6989472085/job/19018045540

I've now changed the branch pattern, with the disabled merge queue, to basically disable all branch protection. That works and saves me from re-enabling all requirements.

https://github.com/rust-marker/marker/actions/runs/6989476868


Just an idea, if we can't figure out, why pushing to master fails, we could maybe have the CI create a new branch, like release-4.2.1 which the commits and tags are pushed to. Then we would manually merge that branch on to master 🤔 It's not ideal, but might be a workaround.

@xFrednet
Copy link
Member

Grrr, I messed up with the v0.5.0 release by setting the changelog headline to [v0.5.0] instead of [0.5.0] this triggered a release of v0.4.3, which is really bad. I basically broke everyone relaying on v0.4.3... Really sorry about this. Is there a way that we can prevent force pushing on these tags, while keeping the v0.x shifting one?

I'll repush the tags for the proper v0.4.3 release. That's not nice, but I think it's better than leaving it this broken. (I'm still a bit mad at myself)

@Veetaha
Copy link
Contributor Author

Veetaha commented Dec 28, 2023

Yeah, there can be a check in the release scripts that inspects the version tag the "release" workflow was run with. It should fail if the version tag already exists and uses a different commit sha.

It's not a big of a deal, for now you can just re-run the 0.4.3 release from its hotfix branch and it will re-build the binaries and update the tags. The release flow just needs some hardening for such mistakes

@xFrednet
Copy link
Member

Everything should be back in order now

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-infra Area: Infrastructure and CI magic :sparkles: C-discussion Category: Discussion / Design decision. See also rust-marker/design
Projects
None yet
Development

No branches or pull requests

2 participants