
DRA: handle non graceful node shutdowns #4260

Merged
keps/sig-node/3063-dynamic-resource-allocation/README.md: 15 additions & 0 deletions

@@ -98,6 +98,7 @@ SIG Architecture for cross-cutting KEPs).
- [Coordinating resource allocation through the scheduler](#coordinating-resource-allocation-through-the-scheduler)
- [Resource allocation and usage flow](#resource-allocation-and-usage-flow)
- [Scheduled pods with unallocated or unreserved claims](#scheduled-pods-with-unallocated-or-unreserved-claims)
- [Handling non graceful node shutdowns](#handling-non-graceful-node-shutdowns)
- [API](#api)
- [resource.k8s.io](#resourcek8sio)
- [core](#core)
@@ -1162,6 +1163,20 @@
Once all of those steps are complete, kubelet will notice that the claims are
ready and run the pod. Until then it will keep checking periodically, just as
it does for other reasons that prevent a pod from running.

### Handling non graceful node shutdowns

Contributor (suggested change):

- ### Handling non graceful node shutdowns
+ ### Handling non-graceful node shutdowns

@bart0sh (Contributor Author), Oct 2, 2023:

"Non graceful" is used in the KEP; that's why I decided to use it here and in the e2e test code.

Contributor:

The KEP issue uses "non-graceful", as does the KEP README in one place; it looks like the original authors weren't sure about the right spelling.

"non-graceful" feels more right to me, but I'm not a native speaker and English is creative... Let's leave it in this PR as you have it now ("non graceful").


When a node is shut down unexpectedly and is tainted with an `out-of-service`
taint with NoExecute effect as explained in the [Non graceful node shutdown KEP](https://github.com/kubernetes/enhancements/tree/master/keps/sig-storage/2268-non-graceful-shutdown),

Contributor (suggested change):

- taint with NoExecute effect as explained in the [Non graceful node shutdown KEP](https://github.com/kubernetes/enhancements/tree/master/keps/sig-storage/2268-non-graceful-shutdown),
+ taint with NoExecute effect as explained in the [Non-graceful node shutdown KEP](https://github.com/kubernetes/enhancements/tree/master/keps/sig-storage/2268-non-graceful-shutdown),

@bart0sh (Contributor Author):

Same thing here.

all running pods on the node will be deleted by the GC controller and the
resources used by the pods will be deallocated. However, they will not be
un-prepared as the node is down and Kubelet is not running on it.
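
As a point of reference, here is a minimal sketch of the taint in question, expressed with the k8s.io/api/core/v1 types. The key and NoExecute effect are the ones defined by the non-graceful node shutdown feature; the `nodeshutdown` value is the conventional one from its documentation:

```go
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
)

func main() {
	// The out-of-service taint described above. Applying it with NoExecute
	// lets the GC controller delete pods that are still bound to the dead
	// node even though kubelet can no longer confirm their termination.
	taint := v1.Taint{
		Key:    "node.kubernetes.io/out-of-service",
		Value:  "nodeshutdown",
		Effect: v1.TaintEffectNoExecute,
	}
	fmt.Printf("%s=%s:%s\n", taint.Key, taint.Value, taint.Effect)
}
```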

Resource drivers should be able to handle this situation correctly and

Member:

If Deallocate is called without UnprepareNodeResources, does that automatically mean a broken node? Or are there other cases like this?

Non-blocking: are there any recommendations that can be shared here on best practices for implementing these? If there is no guarantee, what logic should be implemented in UnprepareNodeResources?

@bart0sh (Contributor Author), Oct 2, 2023:

> If Deallocate is called without UnprepareNodeResources, does that automatically mean a broken node? Or are there other cases like this?

Currently this is the only case I'm aware of.

> Are there any recommendations that can be shared here on best practices for implementing these? If there is no guarantee, what logic should be implemented in UnprepareNodeResources?

It depends on the resource type. For local resources, not much can be done if the node is powered off, but something can be done if it's just a kubelet crash. For network-attached resources, the DRA controller can theoretically detach them from the node. However, all these cases are not generic enough to give recommendations. Plugin authors should know how to handle this type of case, and in most cases it depends on the particular hardware setup, I think.

Member:

The reason I am asking is to understand whether we need to guarantee that in the normal case all callbacks will be called. And if so, are there any guarantees on timing? Can they be called super fast one after another, and how much synchronization is needed there? Can they somehow be called in the opposite order? (Sorry, I haven't looked at the implementation, so my questions may be completely out of context.)

@bart0sh (Contributor Author):

It's already guaranteed that PrepareResources is called when the container is created and UnprepareResources when the container is terminated. That's part of the initial DRA implementation. The description of the non graceful node shutdown case is just a way to explain how DRA plugins are expected to behave when UnprepareResources is not called due to an unexpected node shutdown.

@bart0sh (Contributor Author):

Let me provide a bit more info on this.
NodePrepareResources and NodeUnprepareResources calls are expected to do as little work as possible, as specified in the KEP:

> This operation SHALL do as little work as possible as it's called after a pod is scheduled to a node. All potentially failing operations SHALL be done during allocation phase.

In most cases this means that NodePrepareResources only creates a CDI file and returns its fully-qualified name to the kubelet, and NodeUnprepareResources removes the file. When the node is rebooted, CDI files are removed anyway, as they're usually placed in /var/run/cdi/. In this particular case that means that even if NodeUnprepareResources is not called because of an unexpected node shutdown, the file will still be removed on node reboot.
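
To make that concrete, here is a hypothetical sketch of that minimal prepare/unprepare pattern. The example.com/gpu vendor name, the file naming, and the containerEdits payload are invented; only the /var/run/cdi location and the vendor.com/class=device name format come from the CDI spec:

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

const cdiDir = "/var/run/cdi" // typically on a tmpfs, so emptied on reboot

// nodePrepare writes a CDI spec file for a claim and returns the
// fully-qualified CDI device name that kubelet hands to the runtime.
func nodePrepare(claimUID string) (string, error) {
	spec := fmt.Sprintf(`{
  "cdiVersion": "0.5.0",
  "kind": "example.com/gpu",
  "devices": [{"name": %q, "containerEdits": {"env": ["CLAIM_UID=%s"]}}]
}`, claimUID, claimUID)
	path := filepath.Join(cdiDir, "example.com-gpu-"+claimUID+".json")
	if err := os.WriteFile(path, []byte(spec), 0o600); err != nil {
		return "", err
	}
	return "example.com/gpu=" + claimUID, nil
}

// nodeUnprepare removes the spec file. If the node dies before this runs,
// the reboot wipes /var/run/cdi and achieves the same cleanup.
func nodeUnprepare(claimUID string) error {
	return os.Remove(filepath.Join(cdiDir, "example.com-gpu-"+claimUID+".json"))
}

func main() {
	name, err := nodePrepare("claim-1234")
	if err != nil {
		panic(err) // needs write access to /var/run/cdi
	}
	fmt.Println("prepared:", name)
	if err := nodeUnprepare("claim-1234"); err != nil {
		panic(err)
	}
}
```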

should not expect `UnprepareNodeResources` to always be called.
If resources have not been unprepared when `Deallocate` is called,
`Deallocate` might need to perform additional actions to correctly
deallocate them.
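
As a hypothetical illustration of that last point, here is a control-plane Deallocate that does not trust the unprepare step to have run, and detaches a network-attached resource itself, along the lines discussed in the review thread above. The driver type, its bookkeeping, and detachFromNode are invented; only the prepare/unprepare/deallocate roles come from the KEP:

```go
package main

import (
	"context"
	"fmt"
)

// allocation is hypothetical driver-side bookkeeping for one claim.
type allocation struct {
	node     string // node the resource was prepared on
	prepared bool   // prepare ran on the node, unprepare never did
}

type driver struct {
	allocations map[string]*allocation // keyed by claim UID
}

// Deallocate must not assume kubelet ever unprepared the resource: after a
// non graceful shutdown the GC controller deletes the pods and deallocation
// happens while the node is still down.
func (d *driver) Deallocate(ctx context.Context, claimUID string) error {
	a, ok := d.allocations[claimUID]
	if !ok {
		return nil // nothing allocated, nothing to do
	}
	if a.prepared {
		// The additional action for the never-unprepared case: a
		// network-attached resource can be detached from the dead node.
		if err := d.detachFromNode(ctx, a.node); err != nil {
			return fmt.Errorf("detaching claim %s from node %s: %w", claimUID, a.node, err)
		}
	}
	delete(d.allocations, claimUID)
	return nil
}

// detachFromNode stands in for driver- and hardware-specific cleanup.
func (d *driver) detachFromNode(ctx context.Context, node string) error {
	fmt.Printf("detached resource from %s\n", node)
	return nil
}

func main() {
	d := &driver{allocations: map[string]*allocation{
		"claim-1234": {node: "node-1", prepared: true}, // node died after prepare
	}}
	if err := d.Deallocate(context.Background(), "claim-1234"); err != nil {
		panic(err)
	}
}
```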

### API

The PodSpec gets extended. To minimize the changes in core/v1, all new types