Improve Await logic #2824

mjeffryes · 2024-02-15T02:36:02Z

Bug fixes and enhancements to the way Pulumi Kubernetes handles resource readiness.

Surface more helpful errors when a resource doesn't become ready in time.
More options on the pulumi.com/skipAwait annotation to skip await only on certain operations.
A pulumi.com/waitFor annotation that for custom readiness criteria.
Experimental support for default readiness criteria for all resources.

To Triage

Give feedback

Populate errors using events #133

area/await-logic area/inner-dev-loop area/resource-management kind/enhancement resolution/fixed
Deployment awaiters do not use FailedCreate information #191

area/await-logic area/inner-dev-loop area/resource-management kind/enhancement
Simplify Deployment awaiter #451

area/await-logic kind/engineering resolution/wont-fix
Simplify StatefulSet awaiter #452

area/await-logic kind/engineering resolution/wont-fix
Simplify Service awaiter #453

area/await-logic kind/engineering resolution/wont-fix
Simplify Ingress awaiter #454

area/await-logic kind/engineering resolution/wont-fix
Kubernetes resources scaled to 0 replicas fail to update #605

area/await-logic customer/feedback kind/bug
Unhandled case - Service without selector + manual Endpoint #799

area/await-logic area/community kind/bug
Ingress await logic doesn't handle ssl-redirect case for aws-alb-ingress-controller #408

area/await-logic kind/bug
Cannot use PVC (eternal wait) #895

area/await-logic kind/bug resolution/fixed
panic when interrupting deletion #3157

area/await-logic impact/panic kind/bug resolution/fixed
AWS ALB Ingress controller has spurious errors with SSL Redirection #1359

area/await-logic kind/bug resolution/duplicate
Destroying a resource fails with "timed out waiting to be Ready" #1418

kind/enhancement resolution/fixed
Ingress awaiter may not be backward compatible #1655

area/await-logic kind/bug
Options

Tests

Give feedback

Unittests for await logic (Read) #2798

kind/engineering
Unittests for await logic (Update) #2799

kind/engineering
Unittests for await logic (Delete) #2800

kind/engineering resolution/fixed
Options

Tasks

Give feedback

Support user-specified readiness/await logic #1260

area/await-logic kind/enhancement resolution/fixed
Circular Dependencies, such as when using myService.metadata.name references #1875

area/await-logic kind/bug kind/question
Add await logic for Patch resources #2455

area/await-logic kind/enhancement
Add await logic for DaemonSets #609

area/await-logic customer/lighthouse kind/enhancement resolution/fixed
Design/DEcision doc about whether to rely on the API server for status rather than doing it ourselves. #2829

kind/task resolution/fixed
Add a default status awaiter #2996

area/await-logic kind/enhancement resolution/fixed
Options

Won't Do

Give feedback

Decide how skipAwait should interact with deletion #1232

area/await-logic kind/bug kind/design needs-design
skipawait per-operation (create, update, delete) #2551

area/await-logic impact/performance kind/enhancement needs-design
Options

The text was updated successfully, but these errors were encountered:

This fixes 3 issues: 1. The biggest problem was that `relatedResource` only considered two resources related if they had equal generations. As far as I can tell this is almost never the case, at least not where we're currently using `PodAggregator` (Job/Pod and DaemonSet/Pod relationships) because Pods usually don't have a generation. This also doesn't make sense for ReplicaSet/Deployment relationships because the generations can differ. This PR relaxes that restriction, so we now surface informational messages that were previously being discarded during updates. 2. `isOwnedBy` was using heuristics on GVK and names to determine ownership but this can return false positives. This PR makes the logic exact by matching on UIDs. This is relevant to generic await logic, because we'll want to accurately report status messages for arbitrary Parent→Child relationships (including custom resources) so we need something more reliable than the name. 3. `relatedResource` and `isOwnedBy` were using different logic to determine whether one resource owns another. We can safely remove `relatedResource` since it is equivalent to `isOwnedBy` without the constraint on generation. Along the way I also removed `ResourceID` since it wasn't providing much value over plain unstructured resources. `statefulset_test.go` and `deployment_test.go` have large diffs but these are just mechanical changes to simply set UIDs correctly on the test resources. Example of DaemonSet output before this change: ``` + kubernetes:apps/v1:DaemonSet ds creating (0s) DaemonSet status.observedGeneration not found + kubernetes:apps/v1:DaemonSet ds creating (0s) Current: 0/1 + kubernetes:apps/v1:DaemonSet ds creating (0s) Available: 0/1 @ updating......... + kubernetes:apps/v1:DaemonSet ds creating (5s) All replicas scheduled as expected. Replicas: 1 ``` After this change the pod's status is reported: ``` + kubernetes:apps/v1:DaemonSet ds creating (0s) DaemonSet status.observedGeneration not found + kubernetes:apps/v1:DaemonSet ds creating (0s) Current: 0/1 + kubernetes:apps/v1:DaemonSet ds creating (0s) Available: 0/1 + kubernetes:apps/v1:DaemonSet ds creating (0s) warning: [Pod await-daemonset-xww9v]: containers with unready status: [nginx] @ updating......... + kubernetes:apps/v1:DaemonSet ds creating (6s) All replicas scheduled as expected. Replicas: 1 ``` Refs #2824.

There's no functional difference between our "update" and "create" await logic -- in every case, our update awaiters simply invoke the create awaiter, or vice versa. (The only minor exception to this is `untilCoreV1ResourceQuotaUpdated`, which has a little logic to [short-circuit] but otherwise still calls into the corresponding create awaiter.) This PR consolidates the create/update configs into one. This will make it more straightforward to glue together our legacy awaiters. The first commit replaces `updateAwaitConfig` with `createAwaitConfig`, and the second commit is just renaming. The third commit adds test coverage for the `Update` path, essentially copy-pasted from the `Creation` tests. This will help guarantee that subsequent PRs preserve existing behavior. Fixes #2799. Refs #2824. [short-circuit]: https://github.com/pulumi/pulumi-kubernetes/blob/06b05ce7a3fb5358bdc03a1d105c42a6c695363f/provider/pkg/await/awaiters.go#L686-L691

Our current await logic for services expects the service to have _some_ endpoints, and all of those endpoints must be ready. This doesn't allow for the valid case where the service doesn't have any endpoints, and in those cases we will end up waiting forever. This changes our logic to instead expect 0 non-ready endpoints. This is equivalent to the current behavior when there are non-zero endpoints while also allowing the zero-endpoint case. We also short-circuit the endpoint logic when the service has no selector, since the endpoint will need to be manually configured. Fixes #605. Fixes #799. Refs #2824.

We already have "generic" await logic for deletion -- if we don't have an explicit delete-awaiter defined for a particular GVK, then we always run some logic that waits for the resource to 404. As it turns out, all of our custom delete-awaiters do essentially the same 404 check, modulo differences in the messages we log. This makes the deletion code flow a good starting place to introduce more generic/unified await logic. If you look at the code deleted in the second commit you get a good sense of the current issues with our awaiters: each custom awaiter is responsible for establishing its own watchers; determining its default timeout; performing the same 404 check; etc. There's a lot of duplication and subtle differences in behavior that leads to issues like #1232. As part of this change we start decomposing our await logic into more composable pieces. Note that I'm replacing our deletion code path with these new pieces because I don't think we lose much by changing the logged messages, but for the create/update code paths we'll need some glue to preserve existing await behavior. At a high level: 1. We determine what condition (`Satisfier`) to wait for during deletion. There is always a condition even if it's a no-op. Deletion is simple because there are only two possibilities -- "skip" and "wait for 404" -- but with create/update and user-defined conditions it will get more interesting. 2. We wait for the condition `Satisfier` and can combine it with arbitrary `Observers`. This lets us do things like log additional information while we're waiting, e.g. #3135. The underlying machinery responsible for handling timeouts, informers, etc. is all hidden behind the `Source`. Implementing new await logic is essentially just a matter of defining a new `Satisifer` which understands how to evaluate an unstructured resource. A number of unit tests are included as well as an E2E regression test to ensure we respect the `skipAwait` annotation. The existing delete-await tests are mostly unchanged except for tweaks to inject a `Condition` instead of an `awaitSpec`. Some watcher-specific tests were no longer relevant and were removed, however the functionality is still implemented/tested as part of `Awaiter`. Fixes #3157. Fixes #1418. Refs #2824.

This adds generic await logic that will be feature-flagged behind an environment variable `PULUMI_K8S_AWAIT_ALL`. When the environment variable is set, creates and updates will await as follows: 1. If the resource has the `skipAwait` annotation we will no-op. 2. If the resource has pre-existing await logic we will use it. 3. If the `PULUMI_K8S_AWAIT_ALL` environment variable isn't set, we will no-op. 4. If the `PULUMI_K8S_AWAIT_ALL` environment variable is set we will await according to the cli-utils [status] package. 1-3 preserve existing behavior; only 4 is net-new. (We're currently using the status package for our DaemonSet awaiter without issue.) Importantly, the package fails open in cases where a resource's readiness can't be determined. This means resources not following well-known conventions (like the Ready condition) will continue to no-op instead of waiting forever. User-defined await logic will be helpful in some of those situations. Refs #2824. Fixes #2996. [status]: https://pkg.go.dev/sigs.k8s.io/cli-utils/pkg/kstatus/status

This adds implementations for user-defined await logic. Two styles are supported, both equivalent to `kubectl`: * `condition=Type[=Value]` * `jsonpath={expr}[=Value]` The user can specify these via annotations, e.g.: * `pulumi.com/waitFor: condition=Foo` * `pulumi.com/waitFor: jsonpath={.webhooks[0].clientConfig.caBundle}` `kubectl --wait for=jsonpath=` supports a more relaxed syntax than `kubectl get -o jsonpath=` which I found to be somewhat buggy/surprising in how it parsed expressions, so for now we only support the stricter `get` syntax. Some code related to evaluating custom conditions is vendored from `kubectl` to stay as close as possible to upstream behavior. We use the same `k8s.io/client-go/util/jsonpath` library used by `kubectl` for evaluating JSONPath expressions. An E2E test is included which shows how this feature allows cert-manager to be successfully deployed after applying two `waitFor` annotations. (Something we would eventually want baked in to pulumi-kubernetes-cert-manager.) Fixes #1260. Refs #2824.

This adds a new `EventAggregator` and hooks it up to all of our new awaiters. This has the effect of surfacing Warning events to the user while any await logic is ongoing, including all of our existing/legacy awaiters. For example, attempting to unseal an invalid secret will show a temporary warning: ```yaml type: kubernetes:yaml/v2:ConfigGroup properties: yaml: | apiVersion: bitnami.com/v1alpha1 kind: SealedSecret metadata: name: mysecret namespace: default annotations: pulumi.com/waitFor: condition=Synced spec: encryptedData: foo: AgBy3i4OJSWK+PiTySYZZA9rO43cGDEq... ``` Interactive: ``` + └─ kubernetes:bitnami.com/v1alpha1:SealedSecret secret:default/mysecret creating (7s) warning: [sealedsecret/mysecret] ErrUnsealFailed: Failed to unseal: illegal base64 data at input byte 32 (foo) ``` Non-interactive: ``` + kubernetes:bitnami.com/v1alpha1:SealedSecret secret:default/mysecret creating (0s) warning: Has no .status.conditions + kubernetes:bitnami.com/v1alpha1:SealedSecret secret:default/mysecret creating (0s) Resource has condition Synced=False (want True) + kubernetes:bitnami.com/v1alpha1:SealedSecret secret:default/mysecret creating (0s) warning: [sealedsecret/mysecret] ErrUnsealFailed: Failed to unseal: illegal base64 data at input byte 32 (foo) @ updating............ ``` Fixes #133. Refs #2824.

mjeffryes assigned rquitales Feb 15, 2024

EronWright added the kind/epic Large new features or investments label Feb 26, 2024

mjeffryes unassigned rquitales Apr 28, 2024

mjeffryes assigned blampe May 6, 2024

mjeffryes modified the milestone: 0.107 Jul 24, 2024

This was referenced Jul 24, 2024

Fix pod info reporting #3128

Merged

Refresh shouldn't fail if resources aren't ready #3129

Open

mjeffryes added the resolution/fixed This issue was fixed label Sep 9, 2024

mjeffryes closed this as completed Sep 9, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improve Await logic #2824

Improve Await logic #2824

mjeffryes commented Feb 15, 2024 •

edited

Loading

To Triage

Tests

Tasks

Won't Do

Improve Await logic #2824

Improve Await logic #2824

Comments

mjeffryes commented Feb 15, 2024 • edited Loading

To Triage

Tests

Tasks

Won't Do

mjeffryes commented Feb 15, 2024 •

edited

Loading