DRA: handle non graceful node shutdowns #4260
Conversation
/assign @pohly
Force-pushed from 6d43d94 to 51670a6
Looks good to me, some minor spelling suggestions.
@@ -1162,6 +1163,18 @@ Once all of those steps are complete, kubelet will notice that the claims are
ready and run the pod. Until then it will keep checking periodically, just as
it does for other reasons that prevent a pod from running.

### Handling non graceful node shutdowns
Suggested change:
- ### Handling non graceful node shutdowns
+ ### Handling non-graceful node shutdowns
'non graceful' is used in the KEP; that's why I decided to use it here and in the e2e test code.
The KEP issue uses "non-graceful", as does the KEP README in one place; it looks like the original authors weren't sure about the right spelling.
"non-graceful" feels more right to me, but I'm not a native speaker and English is creative... Let's leave it in this PR as you have it now ("non graceful").
@@ -1162,6 +1163,18 @@ Once all of those steps are complete, kubelet will notice that the claims are
ready and run the pod. Until then it will keep checking periodically, just as
it does for other reasons that prevent a pod from running.

### Handling non graceful node shutdowns

When a node is shutdown unexpectedly and is tained with an `out-of-service`
Suggested change:
- When a node is shutdown unexpectedly and is tained with an `out-of-service`
+ When a node is shut down unexpectedly and is tainted with an `out-of-service`
done, thank you for the review!
### Handling non graceful node shutdowns

When a node is shutdown unexpectedly and is tained with an `out-of-service`
taint with NoExecute effect as explained in the [Non graceful node shutdown KEP](https://github.com/kubernetes/enhancements/tree/master/keps/sig-storage/2268-non-graceful-shutdown),
Suggested change:
- taint with NoExecute effect as explained in the [Non graceful node shutdown KEP](https://github.com/kubernetes/enhancements/tree/master/keps/sig-storage/2268-non-graceful-shutdown),
+ taint with NoExecute effect as explained in the [Non-graceful node shutdown KEP](https://github.com/kubernetes/enhancements/tree/master/keps/sig-storage/2268-non-graceful-shutdown),
Same thing here.
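For readers following along: the taint discussed in the quoted section comes from that KEP and is normally applied with `kubectl taint nodes <node> node.kubernetes.io/out-of-service=nodeshutdown:NoExecute`. Below is a minimal client-go sketch of the same operation; the function name is made up for illustration and the clientset is assumed to be configured elsewhere.

```go
package taint

import (
	"context"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// taintOutOfService marks a node that is known to be shut down so that its
// pods are evicted and their DRA claims can be deallocated. Illustrative only.
func taintOutOfService(ctx context.Context, cs kubernetes.Interface, nodeName string) error {
	node, err := cs.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
	if err != nil {
		return err
	}
	// Taint key, value, and effect as described in the non graceful
	// node shutdown KEP.
	node.Spec.Taints = append(node.Spec.Taints, v1.Taint{
		Key:    "node.kubernetes.io/out-of-service",
		Value:  "nodeshutdown",
		Effect: v1.TaintEffectNoExecute,
	})
	_, err = cs.CoreV1().Nodes().Update(ctx, node, metav1.UpdateOptions{})
	return err
}
```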
resources used by the pods will be deallocated. However, they will not be
un-prepared as the node is down and Kubelet is not running on it.

Resource drivers should be able to handle this situation correctly and
If `Deallocate` is called without `UnprepareNodeResources`, does it automatically mean a broken node? Or are there other cases like this?

non-blocking: are there any recommendations that can be shared here on best practices for implementing those? If there is no guarantee, what logic should be implemented in `UnprepareNodeResources`?
> If `Deallocate` is called without `UnprepareNodeResources`, does it automatically mean a broken node? Or are there other cases like this?

Currently this is the only case I'm aware of.

> are there any recommendations that can be shared here on best practices for implementing those? If there is no guarantee, what logic should be implemented in `UnprepareNodeResources`?

It depends on the resource type. For local resources, not much can be done if the node is powered off, but something can be done if it's just a kubelet crash. For network-attached resources, the DRA controller can theoretically detach them from the node. However, all these cases are not generic enough to give recommendations. Plugin authors should know how to handle this type of case, and in most cases it depends on the particular hardware setup, I think.
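To make that expectation concrete, here is a rough sketch of the shape this tolerance can take in a driver's control-plane code. Everything named here (`driver`, `stillPrepared`, `detachFromNode`, `releaseAllocation`) is hypothetical rather than part of the DRA API; it only illustrates the "don't assume unprepare ran" rule discussed above.

```go
package driver

import (
	"context"
	"fmt"
)

type driver struct {
	// driver-specific state (allocation records, attachment tracking, ...)
}

// Hypothetical helpers; a real driver tracks prepared claims itself.
func (d *driver) stillPrepared(claimUID string) bool                           { return false }
func (d *driver) detachFromNode(ctx context.Context, claim, node string) error { return nil }
func (d *driver) releaseAllocation(claimUID string) error                      { return nil }

// Deallocate must not assume that UnprepareNodeResources ran first: after a
// non graceful shutdown the kubelet never had a chance to call it.
func (d *driver) Deallocate(ctx context.Context, claimUID, nodeName string) error {
	if d.stillPrepared(claimUID) {
		// Best effort: network-attached resources can be detached
		// centrally; node-local state is usually cleaned up by the
		// reboot itself (e.g. CDI files under /var/run/cdi).
		if err := d.detachFromNode(ctx, claimUID, nodeName); err != nil {
			return fmt.Errorf("detach claim %s from node %s: %w", claimUID, nodeName, err)
		}
	}
	return d.releaseAllocation(claimUID)
}
```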
The reason I am asking is to understand whether we need to guarantee that in the normal case all callbacks will be called. And if so, are there any guarantees on timing? Can they be called super fast one after another, and how much synchronization is needed there? Can they somehow be called in the opposite order? (Sorry, I haven't looked at the implementation, so my questions may be completely out of context.)
It's already guaranteed that `PrepareResources` is called when a container is created and `UnprepareResources` when a container is terminated. That's part of the initial DRA implementation. The description of the non graceful node shutdown case is just a way to explain how DRA plugins are expected to behave when `UnprepareResources` is not called due to an unexpected node shutdown.
Let me provide a bit more info on this. `NodePrepareResources` and `NodeUnprepareResources` calls are expected to do as little work as possible, as specified in the KEP:

> This operation SHALL do as little work as possible as it's called after a pod is scheduled to a node. All potentially failing operations SHALL be done during allocation phase.

In most cases this means that `NodePrepareResources` only creates a CDI file and returns its fully qualified device name to the kubelet, and `NodeUnprepareResources` removes the file. When the node is rebooted, CDI files are removed as they're usually placed in /var/run/cdi/. In this particular case it means that even if `NodeUnprepareResources` is not called because of an unexpected node shutdown, the file will be removed anyway on node reboot.
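A small sketch of that pattern, assuming a hypothetical vendor/class `example.com/claim` and hand-written JSON (real drivers typically use the CDI helper libraries instead):

```go
package cdi

import (
	"encoding/json"
	"fmt"
	"os"
	"path/filepath"
)

// cdiDir lives on a tmpfs on most distros, so its contents vanish on reboot;
// that is what makes the "file is removed anyway" argument above work.
const cdiDir = "/var/run/cdi"

// writeCDISpec is roughly what NodePrepareResources does: write a CDI spec
// for the claim and return the fully qualified device name for the kubelet.
func writeCDISpec(claimUID, devicePath string) (string, error) {
	spec := map[string]any{
		"cdiVersion": "0.5.0",
		"kind":       "example.com/claim", // hypothetical vendor/class
		"devices": []map[string]any{{
			"name": claimUID,
			"containerEdits": map[string]any{
				"deviceNodes": []map[string]any{{"path": devicePath}},
			},
		}},
	}
	data, err := json.Marshal(spec)
	if err != nil {
		return "", err
	}
	file := filepath.Join(cdiDir, "example.com-claim-"+claimUID+".json")
	if err := os.WriteFile(file, data, 0o644); err != nil {
		return "", err
	}
	return fmt.Sprintf("example.com/claim=%s", claimUID), nil
}

// removeCDISpec is roughly what NodeUnprepareResources does; if the node
// reboots instead, the file under /var/run/cdi is gone anyway.
func removeCDISpec(claimUID string) error {
	return os.RemoveAll(filepath.Join(cdiDir, "example.com-claim-"+claimUID+".json"))
}
```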
Force-pushed from 51670a6 to 4240c3b
Resource drivers should be able to handle this situation correctly and
should not expect `UnprepareNodeResources` to be always called before
`Deallocate`.
I think this sentence is saying that `UnprepareNodeResources` may be omitted, but it can be read as "they can be called in the opposite order". Please make it more explicit what the sentence means.
Thank you for pointing this out. Rephrased, PTAL.
Added a section about handling non graceful node shutdowns (see https://github.com/kubernetes/enhancements/tree/master/keps/sig-storage/2268-non-graceful-shutdown for more details)
Force-pushed from 4240c3b to aa42236
@SergeyKanzhelev any other questions/concerns? if not, can you lgtm/approve?
/lgtm
/assign @mrunalp @dchen1107
@mrunalp @dchen1107 @derekwaynecarr Can you approve this? It's a minor change and it doesn't require any code changes.
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: bart0sh, mrunalp, SergeyKanzhelev. The full list of commands accepted by this bot can be found here; the pull request process is described here. Approvers can indicate their approval by writing `/approve` in a comment.
Issue #120421: interaction with unexpected node shutdown KEP
One-line PR description: Added a section about handling non graceful node shutdowns (see https://github.com/kubernetes/enhancements/tree/master/keps/sig-storage/2268-non-graceful-shutdown for more details)
Other comments: see https://github.com/kubernetes/enhancements/tree/master/keps/sig-storage/2268-non-graceful-shutdown for more details