
cluster-autoscaler.kubernetes.io/safe-to-evict: "false" annotation prevents GKE nodepool from scaling down #13984

Closed
paoloyx opened this issue May 17, 2023 · 16 comments · Fixed by #14035
Assignees: dprotaso
Labels: kind/question (Further information is requested), triage/needs-user-input (Issues which are waiting on a response from the reporter)
Milestone: v1.11.0

Comments

@paoloyx

paoloyx commented May 17, 2023

Describe the bug
It's not a bug but a support request.

The GKE autoscaler tells me that the cluster-autoscaler.kubernetes.io/safe-to-evict: "false" annotation is preventing the node pool from being scaled down. The annotation is found on these pods:

  • activator-xxxxx
  • autoscaler-xxxxx
  • domainmapping-webhook-xxxxx

Expected behavior
I'm running pods with multiple replicas and a PDB configured with minAvailable (made possible by knative/operator#1125), so I would expect the node to be drained. I'm definitely doing this for the activator pod.

Is the cluster-autoscaler.kubernetes.io/safe-to-evict: "false" annotation mandatory? Is there a way to configure it differently so the node pool can scale down properly?
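
For reference, this is roughly how the annotation shows up on one of those pods (an abridged sketch of kubectl get pod -o yaml output; the pod name and image path are placeholders/assumptions, not taken from my cluster):

apiVersion: v1
kind: Pod
metadata:
  name: activator-xxxxx                # placeholder name, matching the list above
  namespace: knative-serving
  annotations:
    # inherited from the Deployment's pod template; this is the value the GKE
    # autoscaler reports as blocking scale-down
    cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
spec:
  containers:
  - name: activator
    image: gcr.io/knative-releases/knative.dev/serving/cmd/activator   # assumed image path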

To Reproduce
N/A

Knative release version
The Knative operator is running in the cluster at version 1.7.1; Knative Serving is also at version 1.7.1.

Additional context
The GKE cluster is a Standard cluster running Kubernetes version 1.23.17-gke.1700.

@paoloyx paoloyx added the kind/question Further information is requested label May 17, 2023
@dprotaso
Member

@paoloyx I closed the operator issue as a dupe and pulled in the body. We can sort things out here, and if this really is a conflicting config with the operator PDB then we can transfer the issue over to that repo.

@dprotaso
Member

dprotaso commented May 17, 2023

Is the cluster-autoscaler.kubernetes.io/safe-to-evict: "false" annotation mandatory?

This was added long ago (#4329), so this might be a case where the annotations aren't necessary when there is a Pod Disruption Budget.

Can you elaborate on what you expect to see when a node is being drained?

@dprotaso dprotaso added the triage/needs-user-input Issues which are waiting on a response from the reporter label May 17, 2023
@paoloyx
Author

paoloyx commented May 18, 2023

@dprotaso thanks for handling the duplicate; I wasn't sure where I was expected to open the issue.

I'll certainly try to elaborate, but please bear in mind that I'm not very expert in Knative jobs/services, that is, in what the duties of the activator, autoscaler, or other services are.

A premise about used configuration

We're using version 1.7.1 of the operator to install Knative into a 1.23.17-gke.1700 GKE Standard cluster. For example, the serving part is installed with this CR:

apiVersion: operator.knative.dev/v1beta1
kind: KnativeServing
metadata:
  name: knative-serving
  namespace: knative-serving
spec:
  version: 1.7.1
  high-availability:
    replicas: 2
  config:
    features:
      kubernetes.podspec-fieldref: "enabled"
    gc:
      min-non-active-revisions: "0"
      max-non-active-revisions: "0"
      retain-since-create-time: "disabled"
      retain-since-last-active-time: "disabled"
  deployments:
  - name: autoscaler
    replicas: 1
  - name: autoscaler-hpa
    replicas: 1
  - name: controller
    replicas: 1
  - name: domain-mapping 
    replicas: 1
  - name: domainmapping-webhook
    replicas: 1
  - name: net-istio-controller
    replicas: 1
  - name: net-istio-webhook
    replicas: 1
  podDisruptionBudgets:
  - name: activator-pdb
    minAvailable: 50%
  - name: webhook-pdb
    minAvailable: 50%

This probably seems like a cumbersome configuration.
The rationale is that before knative/operator#1125, the PDB configuration of some services prevented nodes from being drained, as there were no allowed disruptions.
Thanks to knative/operator#1125 I was able to run the affected services (i.e. activator and webhook) with two replicas and a PDB definition that allows at least one disruption (i.e. minAvailable: 50%, so one pod out of two can be evicted when a node pool is drained one node at a time). A sketch of the PDB this override should render to is shown below.
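
For illustration, this is roughly what the activator-pdb should end up as after the override above (a sketch; the selector label is an assumption based on the activator Deployment, not copied from the rendered manifest):

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: activator-pdb
  namespace: knative-serving
spec:
  minAvailable: 50%
  selector:
    matchLabels:
      app: activator   # assumption: the PDB selects the activator pods by this label

With two replicas and minAvailable: 50%, one pod must stay up, so exactly one disruption is allowed at any time.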

Scale down is still prevented

As of now I assume the PDB configuration is correct and allows nodes to be drained, so the node pool can be scaled down by the GKE autoscaler.

The autoscaler's noScaleDown events can be seen in the Google console, and state that the following pods prevent scale-down with "no.scale.down.node.pod.not.safe.to.evict.annotation" because they carry the cluster-autoscaler.kubernetes.io/safe-to-evict: "false" annotation (see the command sketch after this list):

  • activator-xxxxx --> only this workload also has the PDB
  • autoscaler-xxxxx
  • domainmapping-webhook-xxxxx
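
To see which pods on a specific node carry the blocking annotation, something like the following works (a sketch; NODE_NAME is a placeholder for the node the autoscaler refuses to drain):

kubectl get pods --all-namespaces --field-selector spec.nodeName=NODE_NAME \
  -o custom-columns='NAMESPACE:.metadata.namespace,NAME:.metadata.name,SAFE-TO-EVICT:.metadata.annotations.cluster-autoscaler\.kubernetes\.io/safe-to-evict'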

Conclusion

I do not know whether the cluster-autoscaler.kubernetes.io/safe-to-evict: "false" annotation could be made optional, so as to let cluster autoscalers (in my case the GKE one) properly scale down node pools.

Hope I've given all the necessary details; let me know @dprotaso, thank you very much!

@paoloyx
Author

paoloyx commented May 18, 2023

Just wanted to add some other details about another cluster where we're experiencing the same "issue": there are other pods beyond the ones I reported before, namely:

  • webhook-xxxx
  • operator-webhook-xxxx

So, in general, it seems (or at least my suspicion is) that it all comes down to the cluster-autoscaler.kubernetes.io/safe-to-evict: "false" annotation that is applied to a variety of workloads, as you pointed out @dprotaso.

Thanks, again, for looking at this one

@paoloyx
Author

paoloyx commented May 23, 2023

@dprotaso Hi, just checking whether the input I provided is enough or whether you still need more information from my side. I'll be more than glad to provide it if there is anything more specific you need, thank you so much.

@dprotaso
Member

The original intent is that the activator, autoscaler, webhook, etc. should always be running, otherwise you'll have a disruption in traffic and in creating Knative Services.

It seems like these annotations were added before PDBs were stable in Kubernetes. The use of cluster-autoscaler.kubernetes.io/safe-to-evict: "false" confuses me, since at some point we do need to drain nodes (i.e. the cluster updating nodes on GKE). It seems the workaround requires manual intervention to get the nodes to drain?

I do not know whether the cluster-autoscaler.kubernetes.io/safe-to-evict: "false" annotation could be made optional, so as to let cluster autoscalers (in my case the GKE one) properly scale down node pools.

Based on what I've been reading [1,2] it seems like a PDB is the better way to control pod counts when there is a disruption (scaling down nodes) by safely reshuffling pods onto ready nodes. I think the Knative operator should be removing safe-to-evict when the PDB is being overridden for a component.

I'm inclined to remove safe-to-evict and add PDBs for all the control-plane components - but I'd like to get feedback from other users running Knative in production (a sketch of that direction follows the references below).

[1] https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/FAQ.md
[2] https://kubernetes.io/docs/concepts/workloads/pods/disruptions/
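
For concreteness, a minimal sketch of what that direction could look like for a single component, assuming the webhook pods keep an app: webhook label (names, image path, and the minAvailable value are illustrative, not the actual change):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: webhook
  namespace: knative-serving
spec:
  replicas: 2
  selector:
    matchLabels:
      app: webhook
  template:
    metadata:
      labels:
        app: webhook
      annotations: {}    # previously: cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
    spec:
      containers:
      - name: webhook
        image: gcr.io/knative-releases/knative.dev/serving/cmd/webhook   # assumed image path
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: webhook-pdb
  namespace: knative-serving
spec:
  minAvailable: 50%      # with two replicas this still leaves one allowed disruption per drain
  selector:
    matchLabels:
      app: webhook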

@dprotaso
Member

dprotaso commented May 23, 2023

Looks like Tekton removed the annotation as well - tektoncd/pipeline#4124 - to unblock node scale-down. But they removed their PodDisruptionBudgets - tektoncd/pipeline#3787 - because those were preventing draining in non-HA configurations.

@paoloyx
Author

paoloyx commented May 24, 2023

@dprotaso

The original intent is that the activator, autoscaler, webhook, etc. should always be running, otherwise you'll have a disruption in traffic and in creating Knative Services.

It seems like these annotations were added before PDBs were stable in Kubernetes.

It looks like they seemed a reasonable choice at the time to prevent disruptions. Yes, I can understand why those annotations were introduced back then.

The use of cluster-autoscaler.kubernetes.io/safe-to-evict: "false" confuses me, since at some point we do need to drain nodes (i.e. the cluster updating nodes on GKE). It seems the workaround requires manual intervention to get the nodes to drain?

Exactly, the whole node pool upgrade process eventually stalls and a manual operation is needed to roll out the stuck Knative component.

I'm inclined to remove safe-to-evict and add PDBs for all the control-plane components - but I'd like to get feedback from other users running Knative in production.
Ok, thank you for your reasoning; I agree that PDBs could be the right way to handle disruptions, especially when running in HA.

Let's wait for some feedback, as you suggest; that's perfectly reasonable.
While waiting for an approved solution, we'd like to use something like what is explained in Override replicas, labels and annotations.

So for example we could do

apiVersion: operator.knative.dev/v1beta1
kind: KnativeServing
metadata:
  name: knative-serving
  namespace: knative-serving
spec:
  high-availability:
    replicas: 2
  deployments:
  - name: autoscaler
    annotations:
      cluster-autoscaler.kubernetes.io/safe-to-evict: "true"

In your opinion, could this work with version 1.7.1 of both the operator and the serving components?

@paoloyx
Author

paoloyx commented May 24, 2023

EDIT

The "solution" explained below means the cluster is no longer prevented from tearing down Knative workloads, so it is probably not a good one, as it could cause service disruption. I think we need proper PDB support, as outlined in the previous posts.


So, after a quick test of using annotations in the KnativeServing CR (see [1]), I can confirm that it seems to work as a temporary solution to override the cluster-autoscaler.kubernetes.io/safe-to-evict: "false" annotations that are blocking scale-down.

  • Spun up a local k8s cluster with k3d (1.25, but I think any recent version is fine)
  • Installed the 1.7.1 operator from [2]
  • Created a CR like the following
apiVersion: operator.knative.dev/v1beta1
kind: KnativeServing
metadata:
  name: knative-serving
  namespace: knative-serving
spec:
  version: 1.7.1
  high-availability:
    replicas: 2
  deployments:
  - name: webhook
    annotations:
      cluster-autoscaler.kubernetes.io/safe-to-evict: "true"
  podDisruptionBudgets:
  - name: activator-pdb
    minAvailable: 50%
  - name: webhook-pdb
    minAvailable: 50%

The outcome is that

  • Every pod runs in replica 2, as expected
kubectl get pods -n knative-serving
NAME                                                    READY   STATUS      RESTARTS   AGE
autoscaler-hpa-755fcd98f6-rng6x                         1/1     Running     0          5m51s
autoscaler-hpa-755fcd98f6-6b5m6                         1/1     Running     0          5m51s
storage-version-migration-serving-serving-1.7.1-sjpcb   0/1     Completed   0          5m51s
controller-8566d48f88-sbtxs                             1/1     Running     0          5m54s
controller-8566d48f88-2wpsv                             1/1     Running     0          5m54s
domainmapping-webhook-56cf45569d-drtw7                  1/1     Running     0          5m53s
domain-mapping-68c84654d5-vcr54                         1/1     Running     0          5m53s
domain-mapping-68c84654d5-m5b4f                         1/1     Running     0          5m53s
domainmapping-webhook-56cf45569d-l7xm5                  1/1     Running     0          5m53s
webhook-7fd7f67bdc-k8d5b                                1/1     Running     0          5m52s
webhook-7fd7f67bdc-ggxdg                                1/1     Running     0          5m52s
autoscaler-754fb5b97d-lwd95                             1/1     Running     0          5m54s
autoscaler-754fb5b97d-bfwpt                             1/1     Running     0          5m54s
activator-7b9c99b8f6-l6zd8                              1/1     Running     0          5m54s
activator-7b9c99b8f6-grxdm                              1/1     Running     0          5m54s
  • The HorizontalPodAutoscalers are configured as expected
kubectl get hpa -n knative-serving
NAME        REFERENCE              TARGETS    MINPODS   MAXPODS   REPLICAS   AGE
activator   Deployment/activator   0%/100%    2         21        2          6m47s
webhook     Deployment/webhook     10%/100%   2         6         2          6m45s
  • The PodDisruptionBudgets are configured as expected (we have allowedDisruptions available i.e. we can drain nodes)
 kubectl get pdb -n knative-serving
NAME            MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
webhook-pdb     50%             N/A               1                     8m19s
activator-pdb   50%             N/A               1                     8m22s
  • The annotations are effectively overridden for the webhook, while they are left unaffected for other workloads, e.g. the activator
❯ kubectl get deploy -n knative-serving webhook -o jsonpath='{.spec.template.metadata.annotations}'
{"cluster-autoscaler.kubernetes.io/safe-to-evict":"true"}
❯ kubectl get deploy -n knative-serving activator -o jsonpath='{.spec.template.metadata.annotations}'
{"cluster-autoscaler.kubernetes.io/safe-to-evict":"false"}

[1] Override replicas, labels and annotations
[2] Knative-operator 1.7.1

@vadasambar

vadasambar commented May 25, 2023

Note that cluster-autoscaler has supported checking PDBs when it evicts pods from a node (in the process of draining the node to scale it down) since Kubernetes 1.6.x (ref).

Here's how cluster-autoscaler behaves:

  1. check if cluster-autoscaler.kubernetes.io/safe-to-evict is set to true
  2. if it is set to true, check if evicting the pod disrupts PDBs
    1. if it does, block draining of the node
    2. if it doesn't, continue with draining of the node
  3. if it is not set to true
    1. if it is set to false
      1. block draining of the node

Code refs:

File: simulator/drain.go
095: 	pods, daemonSetPods, blockingPod, err = drain.GetPodsForDeletionOnNodeDrain(
096: 		pods,
097: 		pdbs,
098: 		deleteOptions.SkipNodesWithSystemPods,
099: 		deleteOptions.SkipNodesWithLocalStorage,
100: 		deleteOptions.SkipNodesWithCustomControllerPods,
101: 		listers,
102: 		int32(deleteOptions.MinReplicaCount),
103: 		timestamp)
104: 	...
109: 	if pdbBlockingPod, err := checkPdbs(pods, pdbs); err != nil {
110: 		return []*apiv1.Pod{}, []*apiv1.Pod{}, pdbBlockingPod, err
111: 	}

https://github.com/kubernetes/autoscaler/blob/d1740df93b76d99bf9302ad0c62978deb0ec1d5b/cluster-autoscaler/simulator/drain.go#L95-L111

If you check GetPodsForDeletionOnNodeDrain

File: utils/drain/drain.go
110: 		safeToEvict := hasSafeToEvictAnnotation(pod)
...
129: 		if !safeToEvict && !terminal {
...
145: 			if hasNotSafeToEvictAnnotation(pod) {
146: 				return []*apiv1.Pod{}, []*apiv1.Pod{}, &BlockingPod{Pod: pod, Reason: NotSafeToEvictAnnotation}, fmt.Errorf("pod annotated as not safe to evict present: %s", pod.Name)
147: 			}
148: 		}

https://github.com/kubernetes/autoscaler/blob/d1740df93b76d99bf9302ad0c62978deb0ec1d5b/cluster-autoscaler/utils/drain/drain.go#L108-L148

Here's definition for hasSafeToEvictAnnotation and hasNotSafeToEvictAnnotation:

File: utils/drain/drain.go
315: // This checks if pod has PodSafeToEvictKey annotation
316: func hasSafeToEvictAnnotation(pod *apiv1.Pod) bool {
317: 	return pod.GetAnnotations()[PodSafeToEvictKey] == "true"
318: }
319: 
320: // This checks if pod has PodSafeToEvictKey annotation set to false
321: func hasNotSafeToEvictAnnotation(pod *apiv1.Pod) bool {
322: 	return pod.GetAnnotations()[PodSafeToEvictKey] == "false"
323: }

https://github.com/kubernetes/autoscaler/blob/d1740df93b76d99bf9302ad0c62978deb0ec1d5b/cluster-autoscaler/utils/drain/drain.go#L315-L323

Links are based on d1740df93b76d99bf9302ad0c62978deb0ec1d5b which is the latest commit on master at the time of writing this comment.

(screenshot: https://github.com/kubernetes/autoscaler/commits/master)

@dprotaso dprotaso added this to the v1.11.0 milestone May 27, 2023
@dprotaso
Member

/assign @dprotaso

I'm going to drop the annotation

@paoloyx
Author

paoloyx commented May 31, 2023

@dprotaso thank you so much.

Just one last question, because I'm not familiar with Knative releases and backporting to older versions: in order to get this update, should we upgrade our 1.7.1 version of Serving to the next minor one (i.e. the 1.11.0 milestone)?

Or should we expect an (eventual) backport of the "fix" to the 1.7 branch via another patch release?

@dprotaso
Member

I'm inclined to backport these changes - but I might wait to include some other fixes.

We only support the last two releases - so we would only backport to 1.9 & 1.10

@dprotaso
Member

I'll leave this issue open to track the backport

@dprotaso dprotaso reopened this May 31, 2023
@dprotaso
Member

dprotaso commented Jun 6, 2023

Point releases are out - I'm going to close this out

https://github.com/knative/serving/releases/tag/knative-v1.10.2
https://github.com/knative/serving/releases/tag/knative-v1.9.4

thanks for surfacing this problem @paoloyx

@dprotaso dprotaso closed this as completed Jun 6, 2023
@paoloyx
Author

paoloyx commented Jun 7, 2023

Thanks to you @dprotaso for your work
