Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve Await logic #2824

Closed
17 of 25 tasks
mjeffryes opened this issue Feb 15, 2024 · 0 comments
Closed
17 of 25 tasks

Improve Await logic #2824

mjeffryes opened this issue Feb 15, 2024 · 0 comments
Assignees
Labels
kind/epic Large new features or investments resolution/fixed This issue was fixed

Comments

@mjeffryes
Copy link
Member

mjeffryes commented Feb 15, 2024

Bug fixes and enhancements to the way Pulumi Kubernetes handles resource readiness.

  • Surface more helpful errors when a resource doesn't become ready in time.
  • More options on the pulumi.com/skipAwait annotation to skip await only on certain operations.
  • A pulumi.com/waitFor annotation that for custom readiness criteria.
  • Experimental support for default readiness criteria for all resources.

To Triage

Preview Give feedback
  1. area/await-logic area/inner-dev-loop area/resource-management kind/enhancement resolution/fixed
    blampe
  2. area/await-logic area/inner-dev-loop area/resource-management kind/enhancement
  3. area/await-logic kind/engineering resolution/wont-fix
    mjeffryes
  4. area/await-logic kind/engineering resolution/wont-fix
    mjeffryes
  5. area/await-logic kind/engineering resolution/wont-fix
    mjeffryes
  6. area/await-logic kind/engineering resolution/wont-fix
    mjeffryes
  7. area/await-logic customer/feedback kind/bug
    blampe
  8. area/await-logic area/community kind/bug
    blampe
  9. area/await-logic kind/bug
  10. area/await-logic kind/bug resolution/fixed
    blampe
  11. area/await-logic impact/panic kind/bug resolution/fixed
    blampe
  12. area/await-logic kind/bug resolution/duplicate
    blampe
  13. kind/enhancement resolution/fixed
    blampe
  14. area/await-logic kind/bug

Tests

Preview Give feedback
  1. kind/engineering
    blampe
  2. kind/engineering
    blampe
  3. kind/engineering resolution/fixed
    EronWright

Tasks

Preview Give feedback
  1. area/await-logic kind/enhancement resolution/fixed
    blampe
  2. area/await-logic kind/bug kind/question
  3. area/await-logic kind/enhancement
  4. area/await-logic customer/lighthouse kind/enhancement resolution/fixed
    blampe
  5. kind/task resolution/fixed
    rquitales
  6. area/await-logic kind/enhancement resolution/fixed
    blampe

Won't Do

Preview Give feedback
  1. area/await-logic kind/bug kind/design needs-design
  2. area/await-logic impact/performance kind/enhancement needs-design
@EronWright EronWright added the kind/epic Large new features or investments label Feb 26, 2024
@mjeffryes mjeffryes modified the milestone: 0.107 Jul 24, 2024
blampe added a commit that referenced this issue Jul 25, 2024
This fixes 3 issues:

1. The biggest problem was that `relatedResource` only considered two
resources related if they had equal generations. As far as I can tell
this is almost never the case, at least not where we're currently using
`PodAggregator` (Job/Pod and DaemonSet/Pod relationships) because Pods
usually don't have a generation. This also doesn't make sense for
ReplicaSet/Deployment relationships because the generations can differ.
This PR relaxes that restriction, so we now surface informational
messages that were previously being discarded during updates.
2. `isOwnedBy` was using heuristics on GVK and names to determine
ownership but this can return false positives. This PR makes the logic
exact by matching on UIDs. This is relevant to generic await logic,
because we'll want to accurately report status messages for arbitrary
Parent→Child relationships (including custom resources) so we need
something more reliable than the name.
3. `relatedResource` and `isOwnedBy` were using different logic to
determine whether one resource owns another. We can safely remove
`relatedResource` since it is equivalent to `isOwnedBy` without the
constraint on generation.

Along the way I also removed `ResourceID` since it wasn't providing much
value over plain unstructured resources.

`statefulset_test.go` and `deployment_test.go` have large diffs but
these are just mechanical changes to simply set UIDs correctly on the
test resources.

Example of DaemonSet output before this change:
```
 +  kubernetes:apps/v1:DaemonSet ds creating (0s) DaemonSet status.observedGeneration not found
 +  kubernetes:apps/v1:DaemonSet ds creating (0s) Current: 0/1
 +  kubernetes:apps/v1:DaemonSet ds creating (0s) Available: 0/1
@ updating.........
 +  kubernetes:apps/v1:DaemonSet ds creating (5s) All replicas scheduled as expected. Replicas: 1
```

After this change the pod's status is reported:
```
 +  kubernetes:apps/v1:DaemonSet ds creating (0s) DaemonSet status.observedGeneration not found
 +  kubernetes:apps/v1:DaemonSet ds creating (0s) Current: 0/1
 +  kubernetes:apps/v1:DaemonSet ds creating (0s) Available: 0/1
 +  kubernetes:apps/v1:DaemonSet ds creating (0s) warning: [Pod await-daemonset-xww9v]: containers with unready status: [nginx]
@ updating.........
 +  kubernetes:apps/v1:DaemonSet ds creating (6s) All replicas scheduled as expected. Replicas: 1
```

Refs #2824.
blampe added a commit that referenced this issue Aug 9, 2024
There's no functional difference between our "update" and "create" await logic
-- in every case, our update awaiters simply invoke the create awaiter, or vice
versa.

(The only minor exception to this is `untilCoreV1ResourceQuotaUpdated`, which
has a little logic to [short-circuit] but otherwise still calls into the
corresponding create awaiter.)

This PR consolidates the create/update configs into one. This will make it more
straightforward to glue together our legacy awaiters.

The first commit replaces `updateAwaitConfig` with `createAwaitConfig`, and the
second commit is just renaming.

The third commit adds test coverage for the `Update` path, essentially
copy-pasted from the `Creation` tests. This will help guarantee that subsequent
PRs preserve existing behavior.

Fixes #2799.
Refs #2824.

[short-circuit]:
https://github.com/pulumi/pulumi-kubernetes/blob/06b05ce7a3fb5358bdc03a1d105c42a6c695363f/provider/pkg/await/awaiters.go#L686-L691
blampe added a commit that referenced this issue Aug 13, 2024
Our current await logic for services expects the service to have _some_
endpoints, and all of those endpoints must be ready. This doesn't allow
for the valid case where the service doesn't have any endpoints, and in
those cases we will end up waiting forever.

This changes our logic to instead expect 0 non-ready endpoints. This is
equivalent to the current behavior when there are non-zero endpoints
while also allowing the zero-endpoint case.

We also short-circuit the endpoint logic when the service has no
selector, since the endpoint will need to be manually configured.

Fixes #605.
Fixes #799.
Refs #2824.
blampe added a commit that referenced this issue Aug 21, 2024
We already have "generic" await logic for deletion -- if we don't have an
explicit delete-awaiter defined for a particular GVK, then we always run some
logic that waits for the resource to 404.

As it turns out, all of our custom delete-awaiters do essentially the same 404
check, modulo differences in the messages we log. This makes the deletion code
flow a good starting place to introduce more generic/unified await logic.

If you look at the code deleted in the second commit you get a good sense of
the current issues with our awaiters: each custom awaiter is responsible for
establishing its own watchers; determining its default timeout; performing the
same 404 check; etc. There's a lot of duplication and subtle differences in
behavior that leads to issues like
#1232.

As part of this change we start decomposing our await logic into more
composable pieces. Note that I'm replacing our deletion code path with these
new pieces because I don't think we lose much by changing the logged messages,
but for the create/update code paths we'll need some glue to preserve existing
await behavior.

At a high level:
1. We determine what condition (`Satisfier`) to wait for during deletion. There
  is always a condition even if it's a no-op. Deletion is simple because there
  are only two possibilities -- "skip" and "wait for 404" -- but with
  create/update and user-defined conditions it will get more interesting.
2. We wait for the condition `Satisfier` and can combine it with arbitrary
  `Observers`. This lets us do things like log additional information while we're
  waiting, e.g. #3135.

The underlying machinery responsible for handling timeouts, informers, etc. is
all hidden behind the `Source`. Implementing new await logic is essentially
just a matter of defining a new `Satisifer` which understands how to evaluate
an unstructured resource.

A number of unit tests are included as well as an E2E regression test to ensure
we respect the `skipAwait` annotation. The existing delete-await tests are
mostly unchanged except for tweaks to inject a `Condition` instead of an
`awaitSpec`. Some watcher-specific tests were no longer relevant and were
removed, however the functionality is still implemented/tested as part of
`Awaiter`.

Fixes #3157.
Fixes #1418.
Refs #2824.
blampe added a commit that referenced this issue Aug 21, 2024
This adds generic await logic that will be feature-flagged behind an
environment variable `PULUMI_K8S_AWAIT_ALL`.

When the environment variable is set, creates and updates will await as
follows:

  1. If the resource has the `skipAwait` annotation we will no-op.
  2. If the resource has pre-existing await logic we will use it.
  3. If the `PULUMI_K8S_AWAIT_ALL` environment variable isn't set, we will
     no-op.
  4. If the `PULUMI_K8S_AWAIT_ALL` environment variable is set we will
     await according to the cli-utils [status] package.

1-3 preserve existing behavior; only 4 is net-new.

(We're currently using the status package for our DaemonSet awaiter without
issue.)

Importantly, the package fails open in cases where a resource's readiness can't
be determined. This means resources not following well-known conventions (like
the Ready condition) will continue to no-op instead of waiting forever.
User-defined await logic will be helpful in some of those situations.

Refs #2824.
Fixes #2996.

[status]: https://pkg.go.dev/sigs.k8s.io/cli-utils/pkg/kstatus/status
blampe added a commit that referenced this issue Aug 21, 2024
This adds implementations for user-defined await logic.

Two styles are supported, both equivalent to `kubectl`:
* `condition=Type[=Value]`
* `jsonpath={expr}[=Value]`

The user can specify these via annotations, e.g.:
* `pulumi.com/waitFor: condition=Foo`
* `pulumi.com/waitFor: jsonpath={.webhooks[0].clientConfig.caBundle}`

`kubectl --wait for=jsonpath=` supports a more relaxed syntax than `kubectl get
-o jsonpath=` which I found to be somewhat buggy/surprising in how it parsed
expressions, so for now we only support the stricter `get` syntax.

Some code related to evaluating custom conditions is vendored from `kubectl` to
stay as close as possible to upstream behavior.

We use the same `k8s.io/client-go/util/jsonpath` library used by `kubectl` for
evaluating JSONPath expressions.

An E2E test is included which shows how this feature allows cert-manager to be
successfully deployed after applying two `waitFor` annotations. (Something we
would eventually want baked in to pulumi-kubernetes-cert-manager.)

Fixes #1260.
Refs #2824.
blampe added a commit that referenced this issue Aug 22, 2024
This adds a new `EventAggregator` and hooks it up to all of our new awaiters.

This has the effect of surfacing Warning events to the user while any await
logic is ongoing, including all of our existing/legacy awaiters.

For example, attempting to unseal an invalid secret will show a
temporary warning:

```yaml
  type: kubernetes:yaml/v2:ConfigGroup
    properties:
      yaml: |
        apiVersion: bitnami.com/v1alpha1
        kind: SealedSecret
        metadata:
          name: mysecret
          namespace: default
          annotations:
            pulumi.com/waitFor: condition=Synced
        spec:
          encryptedData:
            foo: AgBy3i4OJSWK+PiTySYZZA9rO43cGDEq...
```

Interactive:
```
 +      └─ kubernetes:bitnami.com/v1alpha1:SealedSecret  secret:default/mysecret  creating (7s)     warning: [sealedsecret/mysecret] ErrUnsealFailed: Failed to unseal: illegal base64 data at input byte 32 (foo)
```

Non-interactive:
```
 +  kubernetes:bitnami.com/v1alpha1:SealedSecret secret:default/mysecret creating (0s) warning: Has no .status.conditions
 +  kubernetes:bitnami.com/v1alpha1:SealedSecret secret:default/mysecret creating (0s) Resource has condition Synced=False (want True)
 +  kubernetes:bitnami.com/v1alpha1:SealedSecret secret:default/mysecret creating (0s) warning: [sealedsecret/mysecret] ErrUnsealFailed: Failed to unseal: illegal base64 data at input byte 32 (foo)
@ updating............
```

Fixes #133.
Refs #2824.
@mjeffryes mjeffryes added the resolution/fixed This issue was fixed label Sep 9, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/epic Large new features or investments resolution/fixed This issue was fixed
Projects
None yet
Development

No branches or pull requests

4 participants