Scheduling Pods with dynamic resources and using the Pod .spec.nodeName field fails #114005
Comments
/triage accepted

One possible solution would be to have a new controller which specifically watches for pods in this state (nodeName set and referencing claims). That controller then must trigger delayed allocation (if needed) by creating a PodScheduling object with selectedNode set. After allocation it must add the pod to the reservedFor list. kubelet will immediately try to run the pod once it sees it because of the nodeName field; it will refuse to run it until eventually all claims are allocated and reserved.

Obviously, using a scheduler which is not aware of dynamic resources is problematic. There is no guarantee that the chosen node even has the hardware that is needed for the pod, or that it isn't already in use. It would be good to hear from whoever has this use case whether the solution outlined above would work in practice. The alternative is to build support for dynamic resource allocation into that other scheduler.
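For illustration only, here is a rough sketch of the kind of object such a controller might create to trigger delayed allocation for a pod that was placed directly via spec.nodeName. The kind, API version and field names follow the alpha resource.k8s.io API as I recall it and should be treated as assumptions, not a reference:

```yaml
# Hypothetical PodScheduling object created by a controller that watches for
# pods with spec.nodeName set and pending resource claims. API version and
# field names are assumptions based on the alpha API, not verified.
apiVersion: resource.k8s.io/v1alpha1
kind: PodScheduling
metadata:
  name: my-pod          # conventionally the same name/namespace as the pod
  namespace: default
spec:
  selectedNode: node-1  # the node that was already forced via spec.nodeName
  potentialNodes:
  - node-1
```

After the driver allocates the claims, such a controller would still have to add the pod to each claim's reservedFor list before kubelet accepts the pod.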
/retitle Scheduling Pods with dynamic resources and using the Pod .spec.nodeName field fails
Should this be a beta blocker? Or do we just want to mark this as unsupported behavior long-term?
My approach: provide a sample / “contrib” implementation of a restriction that prohibits specifying .spec.nodeName for pods that reference resource claims. If ValidatingAdmissionPolicy gains a way to restrict this, we can provide a contrib ValidatingAdmissionPolicy as well, and recommend that tools for managing cluster lifecycle put that in place. That'd be enough to move to beta, IMO.
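For illustration, a rough sketch of what such a contrib policy might look like, assuming ValidatingAdmissionPolicy (CEL) is available in the cluster; the API version, policy name and exact expression here are my assumptions, not a tested policy:

```yaml
# Sketch: reject pods that hard-code spec.nodeName while also requesting
# dynamic resources via spec.resourceClaims. Untested; the API version
# varies by cluster release (v1alpha1/v1beta1/v1).
apiVersion: admissionregistration.k8s.io/v1beta1
kind: ValidatingAdmissionPolicy
metadata:
  name: deny-nodename-with-resource-claims
spec:
  failurePolicy: Fail
  matchConstraints:
    resourceRules:
    - apiGroups: [""]
      apiVersions: ["v1"]
      operations: ["CREATE", "UPDATE"]
      resources: ["pods"]
  validations:
  - expression: >-
      !(has(object.spec.nodeName) && has(object.spec.resourceClaims) &&
      size(object.spec.resourceClaims) > 0)
    message: "Pods that use resource claims must not set spec.nodeName directly."
```

A matching ValidatingAdmissionPolicyBinding would still be needed to actually enforce it.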
I'm not sure whether it needs to be a blocker, but my intention is to work on this soonish - definitely before beta.
I intend to support this instead of prohibiting it.
It sounds like it could be a non-trivial change to the design. Once you have something, please send a PR to the KEP.
Code PR: #118209
What happened?
The combination of using the Pod spec nodeName field with the DynamicResources scheduler plugin (currently feature-gated) doesn't work. No surprise really, as the nodeName field effectively bypasses the DynamicResources scheduler plugin, and something rather clever would need to be invented to make things work. But this is still a bit of a regression compared to the old device plugin model, which doesn't have a similar limitation. @pohly, as the author of the dynamic resources feature, can explain what could maybe be created to fix this. Granted, forcing scheduling even with a device plugin resource may or may not succeed at the node level, depending on the availability of said plugin resources on the node, but with dynamic resources the failure currently seems to be guaranteed.
The use case is one where the Kubernetes scheduler is effectively preceded by some other scheduler which runs earlier (say, some HPC framework). That scheduler assigns the node names to the Pods and somehow knows that there will be enough resources; the workloads are then expected to land on the specified node, with the container-related work handled normally from there on. So basically Kubernetes wouldn't do much in terms of scheduling, but takes over from there.
What did you expect to happen?
TBH I expected it to fail and it did. The resource claims won't work, since the scheduler plugin isn't getting a chance to do its thing.
How can we reproduce it (as minimally and precisely as possible)?
Just use DRA together with the Pod spec nodeName field; a DRA driver is obviously needed. A minimal sketch is below.
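For example, something along these lines (the resource API version, class name, driver and image are placeholders/assumptions; the exact fields for referencing claims from a Pod have changed across the alpha API versions):

```yaml
# Hypothetical reproduction: a ResourceClaim plus a Pod that both references
# the claim and hard-codes spec.nodeName, so the DynamicResources scheduler
# plugin never runs and the claim is never allocated/reserved.
apiVersion: resource.k8s.io/v1alpha2
kind: ResourceClaim
metadata:
  name: example-claim
spec:
  resourceClassName: example-class   # assumes a class served by an installed DRA driver
---
apiVersion: v1
kind: Pod
metadata:
  name: example-pod
spec:
  nodeName: worker-1                 # direct placement, bypasses kube-scheduler
  resourceClaims:
  - name: claim
    source:
      resourceClaimName: example-claim
  containers:
  - name: app
    image: registry.k8s.io/pause:3.9
    resources:
      claims:
      - name: claim                  # must match the pod-level claim name
```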
Anything else we need to know?
No response
Kubernetes version
Cloud provider
OS version
Install tools
Container runtime (CRI) and version (if applicable)
Related plugins (CNI, CSI, ...) and versions (if applicable)