Improve Await logic #2824
blampe added a commit that referenced this issue on Jul 25, 2024
This fixes 3 issues:

1. The biggest problem was that `relatedResource` only considered two resources related if they had equal generations. As far as I can tell this is almost never the case, at least not where we're currently using `PodAggregator` (Job/Pod and DaemonSet/Pod relationships), because Pods usually don't have a generation. This also doesn't make sense for ReplicaSet/Deployment relationships because the generations can differ. This PR relaxes that restriction, so we now surface informational messages that were previously being discarded during updates.

2. `isOwnedBy` was using heuristics on GVK and names to determine ownership, but this can return false positives. This PR makes the logic exact by matching on UIDs. This is relevant to generic await logic, because we'll want to accurately report status messages for arbitrary Parent→Child relationships (including custom resources), so we need something more reliable than the name.

3. `relatedResource` and `isOwnedBy` were using different logic to determine whether one resource owns another. We can safely remove `relatedResource` since it is equivalent to `isOwnedBy` without the constraint on generation.

Along the way I also removed `ResourceID` since it wasn't providing much value over plain unstructured resources.

`statefulset_test.go` and `deployment_test.go` have large diffs, but these are just mechanical changes to set UIDs correctly on the test resources.

Example of DaemonSet output before this change:

```
+  kubernetes:apps/v1:DaemonSet  ds  creating (0s)  DaemonSet status.observedGeneration not found
+  kubernetes:apps/v1:DaemonSet  ds  creating (0s)  Current: 0/1
+  kubernetes:apps/v1:DaemonSet  ds  creating (0s)  Available: 0/1
@  updating.........
+  kubernetes:apps/v1:DaemonSet  ds  creating (5s)  All replicas scheduled as expected. Replicas: 1
```

After this change the pod's status is reported:

```
+  kubernetes:apps/v1:DaemonSet  ds  creating (0s)  DaemonSet status.observedGeneration not found
+  kubernetes:apps/v1:DaemonSet  ds  creating (0s)  Current: 0/1
+  kubernetes:apps/v1:DaemonSet  ds  creating (0s)  Available: 0/1
+  kubernetes:apps/v1:DaemonSet  ds  creating (0s)  warning: [Pod await-daemonset-xww9v]: containers with unready status: [nginx]
@  updating.........
+  kubernetes:apps/v1:DaemonSet  ds  creating (6s)  All replicas scheduled as expected. Replicas: 1
```

Refs #2824.
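To make the UID-based ownership check concrete, here is a minimal sketch (not the provider's actual code; the function name and setup are illustrative) of matching a child to its parent via `ownerReferences` UIDs instead of GVK/name heuristics:

```go
package main

import (
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
)

// isOwnedBy reports whether child lists owner's UID in its ownerReferences.
// Matching on UID avoids false positives from resources that merely share a
// kind and name with the real owner.
func isOwnedBy(child, owner *unstructured.Unstructured) bool {
	if owner.GetUID() == "" {
		return false // can't match reliably without a UID
	}
	for _, ref := range child.GetOwnerReferences() {
		if ref.UID == owner.GetUID() {
			return true
		}
	}
	return false
}

func main() {
	owner := &unstructured.Unstructured{Object: map[string]interface{}{}}
	owner.SetUID("1234-abcd")

	child := &unstructured.Unstructured{Object: map[string]interface{}{}}
	child.SetOwnerReferences([]metav1.OwnerReference{{UID: "1234-abcd"}})

	fmt.Println(isOwnedBy(child, owner)) // true
}
```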
blampe added a commit that referenced this issue on Aug 9, 2024
There's no functional difference between our "update" and "create" await logic -- in every case, our update awaiters simply invoke the create awaiter, or vice versa. (The only minor exception to this is `untilCoreV1ResourceQuotaUpdated`, which has a little logic to [short-circuit] but otherwise still calls into the corresponding create awaiter.)

This PR consolidates the create/update configs into one. This will make it more straightforward to glue together our legacy awaiters.

The first commit replaces `updateAwaitConfig` with `createAwaitConfig`, and the second commit is just renaming. The third commit adds test coverage for the `Update` path, essentially copy-pasted from the `Creation` tests. This will help guarantee that subsequent PRs preserve existing behavior.

Fixes #2799. Refs #2824.

[short-circuit]: https://github.com/pulumi/pulumi-kubernetes/blob/06b05ce7a3fb5358bdc03a1d105c42a6c695363f/provider/pkg/await/awaiters.go#L686-L691
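As a rough illustration of the consolidation (all type and field names below are hypothetical, not the provider's actual ones), a single config and awaiter can drive both code paths instead of an update entry that just forwards to the create awaiter:

```go
package await

import "k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"

// awaitConfig carries everything an awaiter needs (resource, timeout, etc.).
type awaitConfig struct {
	currentOutputs *unstructured.Unstructured
}

// awaitSpec registers one awaiter used by both the create and update paths.
type awaitSpec struct {
	await func(awaitConfig) error
}

var deploymentSpec = awaitSpec{
	await: untilDeploymentReady,
}

func untilDeploymentReady(cfg awaitConfig) error {
	// ... watch the Deployment in cfg.currentOutputs until it's ready ...
	return nil
}
```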
blampe added a commit that referenced this issue on Aug 13, 2024
Our current await logic for services expects the service to have _some_ endpoints, and all of those endpoints must be ready. This doesn't allow for the valid case where the service doesn't have any endpoints, and in those cases we will end up waiting forever.

This changes our logic to instead expect 0 non-ready endpoints. This is equivalent to the current behavior when there are non-zero endpoints, while also allowing the zero-endpoint case.

We also short-circuit the endpoint logic when the service has no selector, since the endpoints will need to be manually configured.

Fixes #605. Fixes #799. Refs #2824.
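A hedged sketch of the readiness rule described above, assuming a plain client-go `Service`/`Endpoints` pair; this is illustrative, not the provider's exact check:

```go
package await

import (
	corev1 "k8s.io/api/core/v1"
)

// serviceEndpointsReady returns true when no endpoint address is still listed
// in NotReadyAddresses. A Service with no endpoints at all is considered
// ready, which covers the zero-endpoint case that previously hung forever.
func serviceEndpointsReady(svc *corev1.Service, eps *corev1.Endpoints) bool {
	if len(svc.Spec.Selector) == 0 {
		// No selector: endpoints are managed manually, so don't wait on them.
		return true
	}
	for _, subset := range eps.Subsets {
		if len(subset.NotReadyAddresses) > 0 {
			return false
		}
	}
	return true
}
```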
blampe added a commit that referenced this issue on Aug 21, 2024
We already have "generic" await logic for deletion -- if we don't have an explicit delete-awaiter defined for a particular GVK, then we always run some logic that waits for the resource to 404. As it turns out, all of our custom delete-awaiters do essentially the same 404 check, modulo differences in the messages we log. This makes the deletion code flow a good starting place to introduce more generic/unified await logic. If you look at the code deleted in the second commit you get a good sense of the current issues with our awaiters: each custom awaiter is responsible for establishing its own watchers; determining its default timeout; performing the same 404 check; etc. There's a lot of duplication and subtle differences in behavior that leads to issues like #1232. As part of this change we start decomposing our await logic into more composable pieces. Note that I'm replacing our deletion code path with these new pieces because I don't think we lose much by changing the logged messages, but for the create/update code paths we'll need some glue to preserve existing await behavior. At a high level: 1. We determine what condition (`Satisfier`) to wait for during deletion. There is always a condition even if it's a no-op. Deletion is simple because there are only two possibilities -- "skip" and "wait for 404" -- but with create/update and user-defined conditions it will get more interesting. 2. We wait for the condition `Satisfier` and can combine it with arbitrary `Observers`. This lets us do things like log additional information while we're waiting, e.g. #3135. The underlying machinery responsible for handling timeouts, informers, etc. is all hidden behind the `Source`. Implementing new await logic is essentially just a matter of defining a new `Satisifer` which understands how to evaluate an unstructured resource. A number of unit tests are included as well as an E2E regression test to ensure we respect the `skipAwait` annotation. The existing delete-await tests are mostly unchanged except for tweaks to inject a `Condition` instead of an `awaitSpec`. Some watcher-specific tests were no longer relevant and were removed, however the functionality is still implemented/tested as part of `Awaiter`. Fixes #3157. Fixes #1418. Refs #2824.
blampe added a commit that referenced this issue on Aug 21, 2024
This adds generic await logic that will be feature-flagged behind an environment variable, `PULUMI_K8S_AWAIT_ALL`. When the environment variable is set, creates and updates will await as follows:

1. If the resource has the `skipAwait` annotation we will no-op.
2. If the resource has pre-existing await logic we will use it.
3. If the `PULUMI_K8S_AWAIT_ALL` environment variable isn't set, we will no-op.
4. If the `PULUMI_K8S_AWAIT_ALL` environment variable is set, we will await according to the cli-utils [status] package.

1-3 preserve existing behavior; only 4 is net-new. (We're currently using the status package for our DaemonSet awaiter without issue.)

Importantly, the package fails open in cases where a resource's readiness can't be determined. This means resources not following well-known conventions (like the Ready condition) will continue to no-op instead of waiting forever. User-defined await logic will be helpful in some of those situations.

Refs #2824. Fixes #2996.

[status]: https://pkg.go.dev/sigs.k8s.io/cli-utils/pkg/kstatus/status
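A sketch of how steps 1-4 might look using the cli-utils kstatus package; `status.Compute` and `status.CurrentStatus` are the real kstatus API, while the helper functions below are placeholders for the provider's own checks:

```go
package await

import (
	"os"

	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"sigs.k8s.io/cli-utils/pkg/kstatus/status"
)

// ready decides whether obj should be considered ready, following the
// ordering described in the commit message above.
func ready(obj *unstructured.Unstructured) (bool, error) {
	if hasSkipAwait(obj) { // 1. skipAwait annotation: no-op.
		return true, nil
	}
	if hasLegacyAwaiter(obj) { // 2. Pre-existing await logic wins.
		return legacyReady(obj)
	}
	if os.Getenv("PULUMI_K8S_AWAIT_ALL") == "" { // 3. Flag unset: no-op.
		return true, nil
	}
	// 4. Generic readiness via kstatus. Per the commit message this fails
	// open: resources kstatus can't interpret are reported as Current.
	res, err := status.Compute(obj)
	if err != nil {
		return false, err
	}
	return res.Status == status.CurrentStatus, nil
}

// Placeholder stubs standing in for the provider's real checks.
func hasSkipAwait(obj *unstructured.Unstructured) bool {
	return obj.GetAnnotations()["pulumi.com/skipAwait"] == "true"
}
func hasLegacyAwaiter(obj *unstructured.Unstructured) bool     { return false }
func legacyReady(obj *unstructured.Unstructured) (bool, error) { return true, nil }
```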
blampe added a commit that referenced this issue on Aug 21, 2024
This adds implementations for user-defined await logic. Two styles are supported, both equivalent to `kubectl`:

* `condition=Type[=Value]`
* `jsonpath={expr}[=Value]`

The user can specify these via annotations, e.g.:

* `pulumi.com/waitFor: condition=Foo`
* `pulumi.com/waitFor: jsonpath={.webhooks[0].clientConfig.caBundle}`

`kubectl wait --for=jsonpath=` supports a more relaxed syntax than `kubectl get -o jsonpath=`; I found that relaxed syntax to be somewhat buggy/surprising in how it parsed expressions, so for now we only support the stricter `get` syntax.

Some code related to evaluating custom conditions is vendored from `kubectl` to stay as close as possible to upstream behavior. We use the same `k8s.io/client-go/util/jsonpath` library used by `kubectl` for evaluating JSONPath expressions.

An E2E test is included which shows how this feature allows cert-manager to be successfully deployed after applying two `waitFor` annotations. (Something we would eventually want baked in to pulumi-kubernetes-cert-manager.)

Fixes #1260. Refs #2824.
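A hedged sketch of evaluating a `jsonpath={expr}[=Value]` wait condition with the same `k8s.io/client-go/util/jsonpath` library; the function shape and the treatment of missing paths are simplified assumptions, not the provider's exact code:

```go
package await

import (
	"fmt"

	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/client-go/util/jsonpath"
)

// jsonPathSatisfied reports whether expr resolves on obj and, if want is
// non-empty, whether the first result equals it.
func jsonPathSatisfied(obj *unstructured.Unstructured, expr, want string) (bool, error) {
	jp := jsonpath.New("waitFor")
	if err := jp.Parse(expr); err != nil {
		return false, fmt.Errorf("invalid jsonpath %q: %w", expr, err)
	}
	results, err := jp.FindResults(obj.Object)
	if err != nil || len(results) == 0 || len(results[0]) == 0 {
		return false, nil // not found yet; keep waiting
	}
	if want == "" {
		return true, nil // existence of the field is enough
	}
	got := fmt.Sprintf("%v", results[0][0].Interface())
	return got == want, nil
}
```

For example, `jsonPathSatisfied(obj, "{.webhooks[0].clientConfig.caBundle}", "")` would wait until the CA bundle field is populated.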
blampe added a commit that referenced this issue on Aug 22, 2024
This adds a new `EventAggregator` and hooks it up to all of our new awaiters. This has the effect of surfacing Warning events to the user while any await logic is ongoing, including all of our existing/legacy awaiters.

For example, attempting to unseal an invalid secret will show a temporary warning:

```yaml
type: kubernetes:yaml/v2:ConfigGroup
properties:
  yaml: |
    apiVersion: bitnami.com/v1alpha1
    kind: SealedSecret
    metadata:
      name: mysecret
      namespace: default
      annotations:
        pulumi.com/waitFor: condition=Synced
    spec:
      encryptedData:
        foo: AgBy3i4OJSWK+PiTySYZZA9rO43cGDEq...
```

Interactive:

```
+  └─ kubernetes:bitnami.com/v1alpha1:SealedSecret  secret:default/mysecret  creating (7s)  warning: [sealedsecret/mysecret] ErrUnsealFailed: Failed to unseal: illegal base64 data at input byte 32 (foo)
```

Non-interactive:

```
+  kubernetes:bitnami.com/v1alpha1:SealedSecret  secret:default/mysecret  creating (0s)  warning: Has no .status.conditions
+  kubernetes:bitnami.com/v1alpha1:SealedSecret  secret:default/mysecret  creating (0s)  Resource has condition Synced=False (want True)
+  kubernetes:bitnami.com/v1alpha1:SealedSecret  secret:default/mysecret  creating (0s)  warning: [sealedsecret/mysecret] ErrUnsealFailed: Failed to unseal: illegal base64 data at input byte 32 (foo)
@  updating............
```

Fixes #133. Refs #2824.
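A minimal sketch (assuming a plain client-go clientset, and not the provider's actual `EventAggregator`) of surfacing Warning events for a resource while an await is in progress: watch core/v1 Events filtered to the object's UID and log each Warning.

```go
package await

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/fields"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
)

// streamWarnings logs Warning events for the object with the given UID until
// ctx is cancelled. Real code would deduplicate and route these through the
// engine's diagnostics instead of printing them.
func streamWarnings(ctx context.Context, cs kubernetes.Interface, namespace string, uid types.UID) error {
	sel := fields.Set{
		"involvedObject.uid": string(uid),
		"type":               corev1.EventTypeWarning,
	}.AsSelector().String()

	w, err := cs.CoreV1().Events(namespace).Watch(ctx, metav1.ListOptions{FieldSelector: sel})
	if err != nil {
		return err
	}
	defer w.Stop()

	for ev := range w.ResultChan() {
		if e, ok := ev.Object.(*corev1.Event); ok {
			fmt.Printf("warning: [%s/%s] %s: %s\n", e.InvolvedObject.Kind, e.InvolvedObject.Name, e.Reason, e.Message)
		}
	}
	return nil
}
```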
Labels
Bug fixes and enhancements to the way Pulumi Kubernetes handles resource readiness.
- `pulumi.com/skipAwait` annotation to skip await only on certain operations
- `pulumi.com/waitFor` annotation for custom readiness criteria

To Triage
- `FailedCreate` information #191

Tests

Tasks
- `myService.metadata.name` references #1875

Won't Do