KEP-3063: dra: pre-scheduled pods #4063
Conversation
Force-pushed e5b95b1 to b18ef8f.
overall lgtm, but I think this falls under sig-node more than sig-scheduling.
Yes. Both kubelet and pkg/controller/resourceclaim are SIG Node. Do you want me to find a different reviewer also for kubernetes/kubernetes#118209? If I remember correctly, you are also moonlighting as reviewer for kcm, so I wasn't sure how much you might want to be involved in that.
I'm not an approver in kcm, so you would still need another person. But also, the changes you proposed still belong to SIG Node: https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/resourceclaim/OWNERS
@@ -1972,6 +2022,10 @@ exist at all must not be allowed to run. Instead, a suitable event must be
emitted which explains the problem. Such a situation can occur as part of
downgrade scenarios.

In addition, kubelet informs about the claim status with a
`ResourceClaimsReady` `PodConditionType`. It's `false` when NodePrepareResource
still needs to be called. It's `true` when that has succeeded for all claims.
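For illustration, here is a minimal Go sketch of how such a condition could be set on the Pod status, assuming the `ResourceClaimsReady` type name from the diff above; the helper function and reason strings are hypothetical, not actual kubelet code (and the condition was later dropped again, see below):

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// ResourceClaimsReady is the condition type proposed in the diff above;
// it was later removed from the KEP, so treat it as illustrative only.
const ResourceClaimsReady corev1.PodConditionType = "ResourceClaimsReady"

// setResourceClaimsReady is a hypothetical helper showing the all-or-nothing
// semantics: true once NodePrepareResource has succeeded for every claim of
// the pod, false while any call is still pending.
func setResourceClaimsReady(pod *corev1.Pod, allPrepared bool) {
	status := corev1.ConditionFalse
	reason := "WaitingForNodePrepareResource" // made-up reason string
	if allPrepared {
		status = corev1.ConditionTrue
		reason = "AllClaimsPrepared" // made-up reason string
	}
	cond := corev1.PodCondition{
		Type:               ResourceClaimsReady,
		Status:             status,
		Reason:             reason,
		LastTransitionTime: metav1.Now(),
	}
	for i, c := range pod.Status.Conditions {
		if c.Type == ResourceClaimsReady {
			pod.Status.Conditions[i] = cond
			return
		}
	}
	pod.Status.Conditions = append(pod.Status.Conditions, cond)
}

func main() {
	pod := &corev1.Pod{}
	setResourceClaimsReady(pod, false)
	fmt.Println(pod.Status.Conditions[0].Type, pod.Status.Conditions[0].Status)
}
```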
Any thoughts on whether such a condition is useful? I'm undecided myself.
My thinking: it's all-or-nothing. So, if something is interested in a particular claim, it's not helpful. If you want the side effect of a metric changing to see how long it's taking Pods to get to this point, it might be.
Maybe we can provide another instrumentation approach though. For example, one or more of:
- increment a per-node counter when a Pod is accepted with pending claims, and increment a different counter when claims are resolved (a rough sketch of this option follows after the list)
- put the `.status` in the ResourceClaim instead, with an entry for each associated Pod; teach tooling to look there
- fire an Event [per claim, per Pod] to note that a resource claim has succeeded and is ready
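As referenced in the first item, a rough sketch of the per-node counter option, using plain Prometheus client counters; the metric names and helper functions are made up for illustration and do not exist in kubelet:

```go
package main

import (
	"github.com/prometheus/client_golang/prometheus"
)

// Hypothetical per-node counters for the first option above; neither metric
// exists today, the names are invented for illustration.
var (
	podsAcceptedWithPendingClaims = prometheus.NewCounterVec(
		prometheus.CounterOpts{
			Name: "kubelet_pods_accepted_with_pending_resource_claims_total",
			Help: "Pods admitted while at least one ResourceClaim was not yet prepared.",
		},
		[]string{"node"},
	)
	podsWithClaimsResolved = prometheus.NewCounterVec(
		prometheus.CounterOpts{
			Name: "kubelet_pods_resource_claims_resolved_total",
			Help: "Pods whose pending ResourceClaims have all been prepared.",
		},
		[]string{"node"},
	)
)

func init() {
	prometheus.MustRegister(podsAcceptedWithPendingClaims, podsWithClaimsResolved)
}

// Comparing the two counters (e.g. via a PromQL difference or rate) shows how
// many pods are currently waiting and how long claims typically take to resolve.
func recordPodAccepted(node string)    { podsAcceptedWithPendingClaims.WithLabelValues(node).Inc() }
func recordClaimsResolved(node string) { podsWithClaimsResolved.WithLabelValues(node).Inc() }

func main() {
	recordPodAccepted("node-1")
	recordClaimsResolved("node-1")
}
```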
(thinking ahead)
Could we ever have a situation where a Pod that is already running might make a resource claim?
For example, a Pod that is using one cryptographic key identifies that it needs to make another (key rotation) and so requests a second one, by directly calling create for the ResourceClaim API.
In that case, we wouldn't expect to update `.status` for the Pod; we might not even have the associated Pod object identity available, depending on how it has authenticated.
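To make the scenario concrete, here is a hedged sketch of what such a direct create from inside a running Pod could look like, using the dynamic client and assuming the `resource.k8s.io/v1alpha2` API version that was current for DRA at the time of this discussion; the claim name and resource class are made up:

```go
package main

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/rest"
)

func main() {
	// In-cluster credentials of the running pod (its service account).
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := dynamic.NewForConfigOrDie(cfg)

	// GVR for ResourceClaim; v1alpha2 may differ in later releases.
	gvr := schema.GroupVersionResource{
		Group:    "resource.k8s.io",
		Version:  "v1alpha2",
		Resource: "resourceclaims",
	}

	// Hypothetical claim for a second cryptographic key (key rotation).
	claim := &unstructured.Unstructured{Object: map[string]interface{}{
		"apiVersion": "resource.k8s.io/v1alpha2",
		"kind":       "ResourceClaim",
		"metadata": map[string]interface{}{
			"name": "rotated-signing-key", // made-up name
		},
		"spec": map[string]interface{}{
			"resourceClassName": "crypto-keys", // made-up resource class
		},
	}}

	// Nothing here references the Pod object; the API server only sees the
	// service account identity, which is the point made above.
	_, err = client.Resource(gvr).Namespace("default").
		Create(context.TODO(), claim, metav1.CreateOptions{})
	if err != nil {
		panic(err)
	}
}
```

As the code notes, the request carries only the Pod's service account identity, which is why the associated Pod object might not be known.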
I believe the only real use case that we have right now is "notify the user why the pod is not starting". An event satisfies that requirement.
A condition or dedicated fields in the claim status seem more about "notify some component so that it can take some action" - we don't have use cases for that right now. My preference is to not add any conditions or fields until we know more about how they would be used.
> Could we ever have a situation where a Pod that is already running might make a resource claim?
I suspect that this would be very tricky to implement. Right now, the `pod.spec.resourceClaims` are fixed. We would have to allow modifying that and then find a way to tell the container runtime to update the containers, ideally without having to restart them.
Mutating some existing claim is already possible, to some extent. For example, if the claim causes some directory to be mounted, the resource driver can change the content of what was mounted. However, the same caveat applies: modifying the container sandbox in response to claim changes is not supported.
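For completeness, a small sketch of the event-based notification mentioned above ("notify the user why the pod is not starting"), using client-go's `EventRecorder`; the reason and message strings are hypothetical, not the wording kubelet actually uses:

```go
package dra

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/tools/record"
)

// emitClaimNotReadyEvent shows the "notify the user why the pod is not
// starting" option: a warning event on the Pod. The reason string is made up
// for illustration; real DRA events may use different wording.
func emitClaimNotReadyEvent(recorder record.EventRecorder, pod *corev1.Pod, claimName string) {
	recorder.Eventf(pod, corev1.EventTypeWarning, "ResourceClaimNotReady",
		"waiting for ResourceClaim %q to be allocated and reserved for this pod", claimName)
}
```

Such an event would show up under `kubectl describe pod`, which is the debuggability signal discussed in the next comment.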
I removed the condition part. That needs more thought about use cases.
One of the debuggability challenges today is that users and developers don't know why a pod has been scheduled but is not running. The scenario of unallocated or unreserved claims makes the situation even more complicated. But I agree this needs more thought.
/cc
Force-pushed b18ef8f to 5b393bb.
/assign @klueska
Would this need another PRR, @johnbelamaric?
My two cents: this is a relatively minor change that doesn't impact any of the PRR aspects. The same API calls are made as before, they just come from kube-controller-manager instead of kube-scheduler. The actual content (= size of PodSchedulingContext) is smaller than normal.
Should we also include the updates to the kubelet plugin API in this or open a separate PR for that?
Given the focus of this PR, I prefer a separate PR for the kubelet changes.
@klueska, yep, on it. I somehow thought this was updated already.
This is an error scenario, but dealing with it automatically when it arises is still useful and not too hard, so it should be worthwhile.
Force-pushed 5b393bb to 96c54f5.
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: dchen1107, pohly
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.
One-line PR description: pods with `spec.nodeName` set while claims are not reserved is an error scenario, but dealing with it automatically when it arises is still useful and not too hard, so it should be worthwhile.
Issue link: DRA: control plane controller ("classic DRA") #3063
Other comments:
See dra: pre-scheduled pods kubernetes#118209 for the kube-controller-manager implementation
/cc @alculquicondor
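As a closing illustration of the scenario in the PR description, here is a hypothetical predicate for the pods the kube-controller-manager change has to handle; only the field accesses are real API, the function itself is a sketch:

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// podNeedsClaimReservation sketches which pods the kube-controller-manager
// change (kubernetes#118209) has to care about: the pod was placed directly
// via spec.nodeName (bypassing kube-scheduler) and references ResourceClaims
// that may not be allocated or reserved yet. The actual reservation work
// (allocating the claim and adding the pod to its status.reservedFor list)
// is done by the resourceclaim controller per the KEP.
func podNeedsClaimReservation(pod *corev1.Pod) bool {
	return pod.Spec.NodeName != "" && len(pod.Spec.ResourceClaims) > 0
}

func main() {
	pod := &corev1.Pod{}
	pod.Spec.NodeName = "node-1"
	pod.Spec.ResourceClaims = []corev1.PodResourceClaim{{Name: "gpu"}}
	fmt.Println(podNeedsClaimReservation(pod)) // true
}
```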