DRA: handle non graceful node shutdowns #4260
@@ -98,6 +98,7 @@ SIG Architecture for cross-cutting KEPs).
- [Coordinating resource allocation through the scheduler](#coordinating-resource-allocation-through-the-scheduler)
- [Resource allocation and usage flow](#resource-allocation-and-usage-flow)
- [Scheduled pods with unallocated or unreserved claims](#scheduled-pods-with-unallocated-or-unreserved-claims)
- [Handling non graceful node shutdowns](#handling-non-graceful-node-shutdowns)
- [API](#api)
- [resource.k8s.io](#resourcek8sio)
- [core](#core)

@@ -1162,6 +1163,20 @@ Once all of those steps are complete, kubelet will notice that the claims are
ready and run the pod. Until then it will keep checking periodically, just as
it does for other reasons that prevent a pod from running.

### Handling non graceful node shutdowns

When a node is shut down unexpectedly and is tainted with an `out-of-service`
taint with the `NoExecute` effect, as explained in the
[Non graceful node shutdown KEP](https://github.com/kubernetes/enhancements/tree/master/keps/sig-storage/2268-non-graceful-shutdown),
all running pods on the node will be deleted by the GC controller and the
resources used by the pods will be deallocated. However, they will not be
unprepared, as the node is down and kubelet is not running on it.
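For reference, an administrator applies that taint with
`kubectl taint nodes <node-name> node.kubernetes.io/out-of-service=nodeshutdown:NoExecute`.
Below is a minimal client-go sketch doing the same thing programmatically; the
node name and kubeconfig path are placeholders:

```go
// Illustrative sketch only: applies the out-of-service taint described above
// to a node via client-go. Node name and kubeconfig path are placeholders.
package main

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", "/path/to/kubeconfig")
	if err != nil {
		panic(err)
	}
	clientset := kubernetes.NewForConfigOrDie(config)

	node, err := clientset.CoreV1().Nodes().Get(context.TODO(), "node-1", metav1.GetOptions{})
	if err != nil {
		panic(err)
	}

	// Equivalent of:
	// kubectl taint nodes node-1 node.kubernetes.io/out-of-service=nodeshutdown:NoExecute
	node.Spec.Taints = append(node.Spec.Taints, corev1.Taint{
		Key:    "node.kubernetes.io/out-of-service",
		Value:  "nodeshutdown",
		Effect: corev1.TaintEffectNoExecute,
	})

	if _, err := clientset.CoreV1().Nodes().Update(context.TODO(), node, metav1.UpdateOptions{}); err != nil {
		panic(err)
	}
	fmt.Println("tainted node-1 as out-of-service")
}
```
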
Resource drivers should be able to handle this situation correctly and
should not expect `UnprepareNodeResources` to always be called.
If resources have not been unprepared when `Deallocate` is called,
`Deallocate` might need to perform additional actions to deallocate them
correctly.

Review discussion:

> **Reviewer:** If non-blocking: are there any recommendations that can be
> shared here on best practices for implementing this? If there is no
> guarantee, what logic should be implemented in […]
>
> **Author:** Currently this is the only case I'm aware of. It depends on the
> resource type. For local resources, not much can be done if the node is
> powered off, but something can be done if it's just a kubelet crash. For
> network-attached resources, the DRA controller can theoretically detach them
> from the node. However, all these cases are not generic enough to give
> recommendations. Plugin authors should know how to handle this kind of case,
> and in most cases it depends on the particular hardware setup.
>
> **Reviewer:** The reason I am asking is to understand whether we need to
> guarantee that in the normal case all callbacks will be called. And if so,
> are there any guarantees on timing? Can they be called in quick succession,
> one after another, and how much synchronization is needed there? Can they
> somehow be called in the opposite order? (Sorry, I haven't looked at the
> implementation, so my questions may be completely out of context.)
>
> **Author:** It's already guaranteed that […]
>
> **Author:** Let me provide a bit more info on this. In most cases it means
> that […]
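As a concrete illustration of the kind of defensive logic discussed above,
here is a minimal Go sketch of a `Deallocate` that does not assume the
unprepare step ever ran. All names here (`claimState`, `forceDetach`,
`releaseDevice`) are hypothetical stand-ins, not part of any published DRA
driver API:

```go
// Hypothetical sketch: claimState, forceDetach and releaseDevice are
// illustrative stand-ins, not part of any real DRA driver API.
package main

import (
	"context"
	"fmt"
)

// claimState is this hypothetical driver's bookkeeping for one claim.
type claimState struct {
	NodeName string
	DeviceID string
	Prepared bool // set when the node prepared the claim, cleared on unprepare
}

// deallocate releases the resources behind a claim. Because the node may have
// shut down non-gracefully, it must not assume UnprepareNodeResources ran.
func deallocate(ctx context.Context, state *claimState) error {
	if state.Prepared {
		// Kubelet never unprepared the claim (e.g. the node was powered
		// off). For a network-attached device the controller can still
		// detach it on the fabric side; for a node-local device there may
		// be nothing to do until the node comes back.
		if err := forceDetach(ctx, state.DeviceID, state.NodeName); err != nil {
			return fmt.Errorf("force-detach %s from %s: %w", state.DeviceID, state.NodeName, err)
		}
		state.Prepared = false
	}
	// Return the device to the driver's free pool.
	return releaseDevice(ctx, state.DeviceID)
}

func forceDetach(ctx context.Context, deviceID, nodeName string) error {
	return nil // hardware- or fabric-specific; stubbed out here
}

func releaseDevice(ctx context.Context, deviceID string) error {
	return nil // driver-specific bookkeeping; stubbed out here
}

func main() {
	err := deallocate(context.Background(), &claimState{
		NodeName: "node-1", DeviceID: "gpu-0", Prepared: true,
	})
	fmt.Println("deallocate returned:", err)
}
```

Whether something like `forceDetach` is possible at all depends on the
hardware; as the discussion above notes, there is no generic recommendation.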
### API

The PodSpec gets extended. To minimize the changes in core/v1, all new types
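For orientation, a rough Go sketch of the shape of that PodSpec extension,
based on the v1alpha2-era proposal; this is abbreviated and the field names
may differ from the final API:

```go
// Hedged, abbreviated sketch of the core/v1 additions described above;
// field names may not match the final API exactly.
package core

// PodSpec gains a list of resource claims the pod's containers may reference.
type PodSpec struct {
	// ...existing fields elided...

	ResourceClaims []PodResourceClaim
}

// PodResourceClaim names a claim and says where it comes from.
type PodResourceClaim struct {
	// Name is used by containers to reference this claim.
	Name string
	// Source points at an existing ResourceClaim or at a template from
	// which one is created per pod.
	Source ClaimSource
}

// ClaimSource selects exactly one of the two references.
type ClaimSource struct {
	ResourceClaimName         *string
	ResourceClaimTemplateName *string
}
```
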
Review discussion on the spelling of "non graceful":

> **Reviewer:** [suggested "non-graceful" instead of "non graceful"]
>
> **Author:** "non graceful" is used in the KEP, that's why I decided to use it
> here and in the e2e test code.
>
> **Reviewer:** The KEP issue uses "non-graceful", as does the KEP README in
> one place; it looks like the original authors weren't sure about the right
> spelling. "Non-graceful" feels more right to me, but I'm not a native speaker
> and English is creative... Let's leave it in this PR as you have it now
> ("non graceful").