terraform: fix security rule reconciliation on Azure #3454
Conversation
No, if the posted CI run succeeds, this solution is cleaner in general and is likely to cause fewer problems in the future.
I'm not sure where the best place for the migration script would be (see Additional Information in the description).
echo "CLI VERSION:" | ||
echo $(./build/constellation version) | ||
CLI=$(realpath ./build/constellation) | ||
bazel run --test_timeout=14400 //e2e/internal/upgrade:upgrade_test -- --want-worker "$WORKERNODES" --want-control "$CONTROLNODES" --target-image "$IMAGE" --target-kubernetes "$KUBERNETES" --target-microservices "$MICROSERVICES" --cli "$CLI" |
The old code was broken because an empty string ("") substituted for a flag value breaks the flag parsing.
Furthermore, the upgrade CLI (with a potentially simulated patch version) should be used. Previously, the target embedded its own CLI, which was always tagged with vNEXT-pre. This doesn't work for simulating a patch upgrade.
> This doesn't work for simulating a patch upgrade.

I thought the idea of `simulatedTargetVersion` is supposed to solve this problem. If we only use the built CLI to migrate the config, I don't think `simulatedTargetVersion` works as intended. We should either fix this or remove it.
Also, if we pass the CLI path as a flag, then this should be the flow of this e2e test, and we should remove the CLI dependency of the test in Bazel. What I'd prefer, though, is to just also use `simulatedTargetVersion` here and remove the flag again.
I decided to keep the explicit CLI passing, so that the identical CLI is used for `config migrate` and the rest of the upgrade workflow.
```
@@ -297,10 +297,10 @@ func getCLIPath(cliPathFlag string) (string, error) {
	pathCLI := os.Getenv("PATH_CLI")
	var relCLIPath string
	switch {
	case pathCLI != "":
		relCLIPath = pathCLI
	case cliPathFlag != "":
```
A flag should take precedence over the environment variable.
If you are not doing it already: changing the PR from draft to open is a good point to actually merge the fixup commits and present a neat commit history.
We should also document the migration steps, which are not strictly required but at least recommended, in docs/docs/reference/migration.md.
echo "CLI VERSION:" | ||
echo $(./build/constellation version) | ||
CLI=$(realpath ./build/constellation) | ||
bazel run --test_timeout=14400 //e2e/internal/upgrade:upgrade_test -- --want-worker "$WORKERNODES" --want-control "$CONTROLNODES" --target-image "$IMAGE" --target-kubernetes "$KUBERNETES" --target-microservices "$MICROSERVICES" --cli "$CLI" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This doesn't work for simulating a patch upgrade.
I thought the idea of simulatedTargetVersion
is supposed to solve this problem. If we only use the build CLI to migrate the config, I don't thing simulatedTargetVersion
works as intended. We should either fix this or remove it.
Also if we pass the cli path as a flag, then this should be the flow of this e2e test and we remove the cli dependency of the test in bazel. What I'd prefer though is to just also use the simulatedTargetVersion
here and remove the flag again.
…uired due to inline version.txt manipulation. This reverts commit 9cf08db.
As discussed with @3u13r, we decided to build the target CLI inside the upgrade job itself, because separating this created several consistency issues for the simulated patch version scenario. Moreover, the extra job didn't save any time, because the CLI is built inside the Bazel upgrade target anyway.
docs/docs/reference/migration.md (outdated):

```md
### Azure

* During the upgrade, security rules are migrated, and it's recommended that the old ones are cleaned up manually by the user, though this step is not strictly required. The below script shows how to programmatically delete the old rules through the Azure CLI:
```
I'd recommend not making this optional, because otherwise we couldn't change the security rule name back and remove the priority offset in Terraform.
```md
### Azure

* During the upgrade, security rules are migrated and the old ones need to be cleaned up manually by the user. The below script shows how to delete them through the Azure CLI:
```
Once approved, I will backport this to the v2.19 docs
LGTM
```md
- To keep using an existing UAMI, add the `Owner` permission with the scope of your `resourceGroup`.
- Otherwise, simply [create new Constellation IAM credentials](../workflows/config.md#creating-an-iam-configuration) and use the created UAMI.
- To migrate the authentication for an existing cluster on Azure to an UAMI with the necessary permissions:
* The `provider.azure.appClientID` and `provider.azure.appClientSecret` fields are no longer supported and should be removed.
```
I don't think we currently enforce any rules regarding `*` vs `-`; they seem to be interchangeable. Did your linter change this, or do you have a reason for this preference? I know that the docs currently favor `*` (i.e., it is used more). I think we can revert this for now, so that the last commit that changed the line isn't updated, and revisit this discussion later.
We agreed on markdownlint, which (unfortunately) picks the character of the topmost list item on the page. So if you wanted to keep the diff small but still use the linter, you would replace the first item with `-` in this case.
Upgrade test passed again: https://github.com/edgelesssys/constellation/actions/runs/11615051562
* fix security rule reconciliation on azure
* fix simulated patch version upgrade
Context
In #3257, the migration to defining the network security rules as separate resources was started.
Unfortunately, the concurrent usage of inline rules and separate security rule resources causes problems (see https://registry.terraform.io/providers/hashicorp/azurerm/latest/docs/resources/network_security_group). Terraform cannot resolve which declarative state of the rules should be applied if both inline rules and rule resources exist, making it a race as to which resource overwrites the other.
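For illustration, a minimal sketch of the conflicting pattern; the resource names, rules, and values below are hypothetical, not Constellation's actual configuration. Both the inline block and the standalone resource claim ownership of the NSG's rule set, so each apply can overwrite the other's state:

```hcl
# Hypothetical example of the conflicting pattern (names/values illustrative).
resource "azurerm_network_security_group" "nsg" {
  name                = "example-nsg"
  location            = "westeurope"
  resource_group_name = "example-rg"

  # Inline rule: this block asserts ownership of the NSG's complete rule set.
  security_rule {
    name                       = "kube-api"
    priority                   = 100
    direction                  = "Inbound"
    access                     = "Allow"
    protocol                   = "Tcp"
    source_port_range          = "*"
    destination_port_range     = "6443"
    source_address_prefix      = "*"
    destination_address_prefix = "*"
  }
}

# Standalone rule on the same NSG: a second owner of the same rule set.
# Whichever resource is reconciled last wins, which is the observed race.
resource "azurerm_network_security_rule" "nodeports" {
  name                        = "nodeports"
  priority                    = 200
  direction                   = "Inbound"
  access                      = "Allow"
  protocol                    = "Tcp"
  source_port_range           = "*"
  destination_port_range      = "30000-32767"
  source_address_prefix       = "*"
  destination_address_prefix  = "*"
  resource_group_name         = "example-rg"
  network_security_group_name = azurerm_network_security_group.nsg.name
}
```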
This bug was first observed as part of the weekly Azure TDX Terraform provider test, but it did indeed also fail during the release test with the CLI (https://github.com/edgelesssys/constellation/actions/runs/11402745527/job/31778136903).
The observed behavior was that the Kube API server was not reachable anymore, because the network security rule was missing.
This bug applies also to Azure SEV-SNP, but since the bug is a concurrency race, its manifestation is not deterministic. In fact, the correct behavior happened to occur during the release test on SEV-SNP (https://github.com/edgelesssys/constellation/actions/runs/11402745527/job/31778136329#step:19:874).
The problem can occur on any cluster on Azure, and every `terraform apply` could cause this bug.

Proposed change(s)
By leaving the inline security rules block of the network security group empty, this resource does not reconcile the rules.
Thus, the solution is to fully migrate to separate resources for the security rules.
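As a rough sketch of the migrated pattern (again with hypothetical names and values, not the actual Constellation module): the NSG omits inline `security_rule` blocks entirely, leaving the standalone `azurerm_network_security_rule` resources as the single owner of the rule set:

```hcl
# Hypothetical sketch of the fix: no inline security_rule blocks on the NSG.
resource "azurerm_network_security_group" "nsg" {
  name                = "example-nsg"
  location            = "westeurope"
  resource_group_name = "example-rg"
  # No inline rules here, so this resource no longer reconciles the rule set.
}

# All rules live in standalone resources; each rule has exactly one owner.
resource "azurerm_network_security_rule" "kube_api" {
  name                        = "kube-api"
  priority                    = 100
  direction                   = "Inbound"
  access                      = "Allow"
  protocol                    = "Tcp"
  source_port_range           = "*"
  destination_port_range      = "6443"
  source_address_prefix       = "*"
  destination_address_prefix  = "*"
  resource_group_name         = "example-rg"
  network_security_group_name = azurerm_network_security_group.nsg.name
}
```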
I propose to do a patch release where users from 2.18 are required to upgrade to 2.19.1 directly.
Additional info
I tested setting up a new cluster with this change, and subsequent applies didn't result in any Terraform diff, as expected.
The cluster remained reachable.
Upgrade test from v2.18.0 -> v2.19.1
✅ Azure TDX
✅ Azure SEV-SNP
For upgrades from v2.18.0 on Azure, the old security rules are orphaned and should be manually cleaned up by the user; the cleanup script is documented in docs/docs/reference/migration.md.
Checklist