terraform: fix security rule reconciliation on Azure #3454

elchead · 2024-10-22T13:53:15Z

Context

In #3257, the migration to defining the network security rules as separate resources was started.
Unfortunately, the concurrent usage of inline and security rule resources causes problems (see https://registry.terraform.io/providers/hashicorp/azurerm/latest/docs/resources/network_security_group). Terraform cannot resolve which declarative state of the rules should be applied if both inline and resources exist, thus making it a concurrency race which resource overwrites the others.

This bug was first observed as part of the weekly Azure TDX Terraform provider test, but it did indeed also fail during the release test with the CLI (https://github.com/edgelesssys/constellation/actions/runs/11402745527/job/31778136903).
The observed behavior was that the Kube API server was not reachable anymore, because the network security rule was missing.
This bug applies also to Azure SEV-SNP, but since the bug is a concurrency race, its manifestation is not deterministic. In fact, the correct behavior happened to occur during the release test on SEV-SNP (https://github.com/edgelesssys/constellation/actions/runs/11402745527/job/31778136329#step:19:874).

The problem can occur on any cluster on Azure and every terraform apply could cause this bug.

Proposed change(s)

By leaving the inline security rules block of the network security group empty, this resource does not reconcile the rules.
Thus, the solution is to fully migrate to separate resources for the security rules.

I propose to do a patch release where users from 2.18 are required to upgrade to 2.19.1 directly.

remove the inline network security rules on Azure

Additional info

I tested setting up a new cluster with this change and subsequent applies didn’t result in any terraform diff as expected.
The cluster remained reachable.

Upgrade test from v2.18.0 -> v2.19.1
✅ Azure TDX
✅ Azure SEV-SNP

For upgrades from v2.18.0 on Azure, the old security rules are orphaned and should be manually cleaned up by the user:

#!/bin/bash
name="<insert>"  # the name provided in the config
id="<insert>"    # the cluster id

rules=(
  "kubernetes"
  "bootstrapper"
  "verify"
  "recovery"
  "join"
  "debugd"
  "konnectivity"
)

for rule in "${rules[@]}"; do
  echo "Deleting rule: $rule"
  az network nsg rule delete \
    --resource-group "$name-rg" \
    --nsg-name "$name-$id" \
    --name "$rule"
done

echo "All specified rules have been deleted."

Checklist

Run the E2E tests that are relevant to this PR's changes
Update docs
Add labels (e.g., for changelog category)
Is PR title adequate for changelog?
Link to Milestone

netlify · 2024-10-22T13:53:43Z

✅ Deploy Preview for constellation-docs ready!

Name	Link
🔨 Latest commit	`e31af63`
🔍 Latest deploy log	https://app.netlify.com/sites/constellation-docs/deploys/6723ac5a6150500008359a35
😎 Deploy Preview	https://deploy-preview-3454--constellation-docs.netlify.app
📱 Preview on mobile	Toggle QR Code... Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify site configuration.

elchead · 2024-10-22T14:22:42Z

@3u13r I understand that your intention for the dangling konnectivity rule in #3257 was to enforce that the other rules are deleted, but this should not be necessary afais. Or do you see any issue?

3u13r · 2024-10-22T14:25:38Z

I understand that your intention for the dangling konnectivity rule in #3257 was to enforce that the other rules are deleted, but this should not be necessary afais. Or do you see any issue?

No, if the posted CI runs succeeds this solution is cleaner in general and is likely to cause less problems in the future.

elchead · 2024-10-24T10:03:23Z

I'm not sure where the best place for the migration script would be (see Additional Information in the description).

elchead · 2024-10-24T10:07:29Z

.github/workflows/e2e-upgrade.yml

+          echo "CLI VERSION:"
+          echo $(./build/constellation version)
+          CLI=$(realpath ./build/constellation)
+          bazel run --test_timeout=14400 //e2e/internal/upgrade:upgrade_test -- --want-worker "$WORKERNODES" --want-control "$CONTROLNODES" --target-image "$IMAGE" --target-kubernetes "$KUBERNETES" --target-microservices "$MICROSERVICES" --cli "$CLI"


the old code was broken because an empty string "" as substitution for a flag breaks the flag parsing

Furthermore, the upgrade CLI (with potentially simulated patch version) should be used. Previously, the target embedded its own CLI that was always tagged with vNEXT-pre. This doesn't work for simulating a patch upgrade.

This doesn't work for simulating a patch upgrade.

I thought the idea of simulatedTargetVersion is supposed to solve this problem. If we only use the build CLI to migrate the config, I don't thing simulatedTargetVersion works as intended. We should either fix this or remove it.

Also if we pass the cli path as a flag, then this should be the flow of this e2e test and we remove the cli dependency of the test in bazel. What I'd prefer though is to just also use the simulatedTargetVersion here and remove the flag again.

I decided to keep the explicit cli passing, such that the identical CLI is used for "config migrate" and the rest of the upgrade workflow

elchead · 2024-10-24T10:10:28Z

e2e/internal/upgrade/upgrade.go

@@ -297,10 +297,10 @@ func getCLIPath(cliPathFlag string) (string, error) {
 	pathCLI := os.Getenv("PATH_CLI")
 	var relCLIPath string
 	switch {
-	case pathCLI != "":
-		relCLIPath = pathCLI
 	case cliPathFlag != "":


a flag should take precedence over ENV

3u13r · 2024-10-24T12:20:30Z

If you are not already on it, changing PR from draft to open is a good point to actually merge the fixup commits and present a neat commit history.

3u13r

We should also document the not necessarily required but at least recommended migration steps in docs/docs/reference/migration.md

3u13r · 2024-10-25T08:48:10Z

.github/workflows/e2e-upgrade.yml

+          echo "CLI VERSION:"
+          echo $(./build/constellation version)
+          CLI=$(realpath ./build/constellation)
+          bazel run --test_timeout=14400 //e2e/internal/upgrade:upgrade_test -- --want-worker "$WORKERNODES" --want-control "$CONTROLNODES" --target-image "$IMAGE" --target-kubernetes "$KUBERNETES" --target-microservices "$MICROSERVICES" --cli "$CLI"


This doesn't work for simulating a patch upgrade.

I thought the idea of simulatedTargetVersion is supposed to solve this problem. If we only use the build CLI to migrate the config, I don't thing simulatedTargetVersion works as intended. We should either fix this or remove it.

Also if we pass the cli path as a flag, then this should be the flow of this e2e test and we remove the cli dependency of the test in bazel. What I'd prefer though is to just also use the simulatedTargetVersion here and remove the flag again.

…uired due to inline version.txt manipulation This reverts commit 9cf08db.

elchead · 2024-10-25T11:03:44Z

As discussed with @3u13r, we decided to build the target CLI inside the upgrade job itself, because separating this created several consistency issues for the simulated patch version scenario
The main reason being that the upgrade tool itself didn't have the patched version, but referred to its own embedded version to determine some target versions.

Moreover, the extra job didn't save any time because the CLI is built inside the bazel upgrade target too.

elchead · 2024-10-28T07:17:28Z

docs/docs/reference/migration.md

+
+### Azure
+
+* During the upgrade, security rules are migrated and the old ones are recommended to be cleaned up manually by the user. This step is recommended but not strictly required. The below script shows how to programatically delete the old rules through the Azure CLI:


I'd recommend to not make this optional, because we could not change the security rule name back and remove the priority offset in Terraform otherwise.

elchead · 2024-10-28T07:19:23Z

docs/docs/reference/migration.md

+
+### Azure
+
+* During the upgrade, security rules are migrated and the old ones need to be cleaned up manually by the user. The below script shows how to delete them through the Azure CLI:


Once approved, I will backport this to the v2.19 docs

docs/docs/reference/migration.md

3u13r

LGTM

3u13r · 2024-10-31T17:19:58Z

docs/docs/reference/migration.md

- To keep using an existing UAMI, add the `Owner` permission with the scope of your `resourceGroup`.
- Otherwise, simply [create new Constellation IAM credentials](../workflows/config.md#creating-an-iam-configuration) and use the created UAMI.
- To migrate the authentication for an existing cluster on Azure to an UAMI with the necessary permissions:
+* The `provider.azure.appClientID` and `provider.azure.appClientSecret` fields are no longer supported and should be removed.


I don't think we currently enforce any rules regarding * vs -. They seem to be interchangeable. Did your linter change this, or do you have a reason for this preference? I know that docs currently favor * (i.e. are used more). I think we can revert this for now to not have the last commit that changed the line updated and revisit this discussion later.

We agreed on markdownlint, which (unfortunately) picks the character of the topmost list item on the page. So if you wanted to keep the diff small but still use the linter, you would replace the first item with - in this case.

elchead · 2024-11-04T07:57:08Z

Upgrade test passed again: https://github.com/edgelesssys/constellation/actions/runs/11615051562

* fix security rule reconciliation on azure * fix simulated patch version upgrade

elchead changed the title ~~terraform: fix security rule reconcilation on azure~~ terraform: fix security rule reconcilation on Azure Oct 22, 2024

elchead requested review from 3u13r and burgerdev October 22, 2024 14:02

elchead added bug fix Fixing a bug needs backport This PR needs to be backported to a previous release labels Oct 22, 2024

elchead added this to the v2.20.0 milestone Oct 22, 2024

elchead force-pushed the as/fix-azure-upgrade branch from 1a3a5a6 to ba7c6d4 Compare October 23, 2024 09:00

elchead force-pushed the as/fix-azure-upgrade branch from 44f57f9 to ad268a3 Compare October 24, 2024 10:04

elchead commented Oct 24, 2024

View reviewed changes

elchead force-pushed the as/fix-azure-upgrade branch from ad268a3 to 2777da4 Compare October 24, 2024 10:12

elchead marked this pull request as ready for review October 24, 2024 12:03

elchead force-pushed the as/fix-azure-upgrade branch from 3ffed5a to 884bd3e Compare October 24, 2024 12:32

fix security rule reconcilation on azure

e1b9a3a

elchead force-pushed the as/fix-azure-upgrade branch 3 times, most recently from 2334a78 to f887c8e Compare October 24, 2024 12:36

fix simulated patch version upgrade

9cf08db

elchead force-pushed the as/fix-azure-upgrade branch from f887c8e to 9cf08db Compare October 24, 2024 12:36

3u13r requested changes Oct 25, 2024

View reviewed changes

elchead added 2 commits October 25, 2024 12:51

build the upgrade cli inside the upgrade job

e0af999

Revert changes of "fix simulated patch version upgrade" no longer req…

27fc9a5

…uired due to inline version.txt manipulation This reverts commit 9cf08db.

elchead requested a review from 3u13r October 25, 2024 11:09

use assert for upgrade check

2b6db25

elchead force-pushed the as/fix-azure-upgrade branch from ca8f679 to 2b6db25 Compare October 25, 2024 11:36

add to migration docs

296ad2b

elchead requested a review from thomasten as a code owner October 28, 2024 07:15

elchead commented Oct 28, 2024

View reviewed changes

fixup! add to migration docs

5b3a173

elchead commented Oct 28, 2024

View reviewed changes

3u13r requested changes Oct 31, 2024

View reviewed changes

docs/docs/reference/migration.md Outdated Show resolved Hide resolved

docs/docs/reference/migration.md Outdated Show resolved Hide resolved

docs/docs/reference/migration.md Outdated Show resolved Hide resolved

docs/docs/reference/migration.md Outdated Show resolved Hide resolved

leo feedback

ac3b05d

elchead requested a review from 3u13r October 31, 2024 15:30

3u13r requested changes Oct 31, 2024

View reviewed changes

docs/docs/reference/migration.md Outdated Show resolved Hide resolved

elchead force-pushed the as/fix-azure-upgrade branch 3 times, most recently from 41313fd to 9f47e9a Compare October 31, 2024 15:38

fixup! leo feedback

14d3abc

elchead force-pushed the as/fix-azure-upgrade branch from 9f47e9a to 14d3abc Compare October 31, 2024 15:38

elchead requested a review from 3u13r October 31, 2024 15:38

thomasten reviewed Oct 31, 2024

View reviewed changes

docs/docs/reference/migration.md Outdated Show resolved Hide resolved

fixup! fixup! leo feedback

e31af63

3u13r approved these changes Oct 31, 2024

View reviewed changes

thomasten approved these changes Nov 3, 2024

View reviewed changes

elchead changed the title ~~terraform: fix security rule reconcilation on Azure~~ terraform: fix security rule reconciliation on Azure Nov 4, 2024

elchead merged commit 54058ee into main Nov 4, 2024
13 checks passed

elchead deleted the as/fix-azure-upgrade branch November 4, 2024 07:59

elchead mentioned this pull request Nov 5, 2024

docs: add migrate instructions to v2.19 #3469

Merged

5 tasks

elchead added a commit that referenced this pull request Nov 5, 2024

terraform: fix security rule reconciliation on Azure (#3454)

5196de1

* fix security rule reconciliation on azure * fix simulated patch version upgrade

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

terraform: fix security rule reconciliation on Azure #3454

terraform: fix security rule reconciliation on Azure #3454

elchead commented Oct 22, 2024 •

edited

Loading

netlify bot commented Oct 22, 2024 •

edited

Loading

elchead commented Oct 22, 2024

3u13r commented Oct 22, 2024

elchead commented Oct 24, 2024

elchead Oct 24, 2024

elchead Oct 24, 2024

3u13r Oct 25, 2024

elchead Oct 25, 2024

elchead Oct 24, 2024

3u13r commented Oct 24, 2024

3u13r left a comment

3u13r Oct 25, 2024

elchead commented Oct 25, 2024

elchead Oct 28, 2024

elchead Oct 28, 2024

3u13r left a comment

3u13r Oct 31, 2024

thomasten Nov 3, 2024

elchead commented Nov 4, 2024


		### Azure

		* During the upgrade, security rules are migrated and the old ones are recommended to be cleaned up manually by the user. This step is recommended but not strictly required. The below script shows how to programatically delete the old rules through the Azure CLI:

terraform: fix security rule reconciliation on Azure #3454

terraform: fix security rule reconciliation on Azure #3454

Conversation

elchead commented Oct 22, 2024 • edited Loading

Context

Proposed change(s)

Additional info

Checklist

netlify bot commented Oct 22, 2024 • edited Loading

✅ Deploy Preview for constellation-docs ready!

elchead commented Oct 22, 2024

3u13r commented Oct 22, 2024

elchead commented Oct 24, 2024

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

3u13r commented Oct 24, 2024

3u13r left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

elchead commented Oct 25, 2024

Choose a reason for hiding this comment

Choose a reason for hiding this comment

3u13r left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

elchead commented Nov 4, 2024

elchead commented Oct 22, 2024 •

edited

Loading

netlify bot commented Oct 22, 2024 •

edited

Loading