dra: scheduler queueing hints #119078
Conversation
This issue is currently awaiting triage. If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/assign @sanposhiho
I haven't had enough time to review it; I will do the full review later.
// TODO (#113702): can we change this so that such an event does not trigger *all* pods?
// Yes: https://github.com/kubernetes/kubernetes/blob/abcbaed0784baf5ed2382aae9705a8918f2daa18/pkg/scheduler/eventhandlers.go#L70
{Event: framework.ClusterEvent{Resource: framework.PodSchedulingContext, ActionType: framework.Add | framework.Update}},
{Event: framework.ClusterEvent{Resource: framework.PodSchedulingContext, ActionType: framework.Add | framework.Update}, QueueingHintFn: pl.isSchedulableAfterPodSchedulingContextChange},
// A resource might depend on node labels for topology filtering.
// A new or updated node may make pods schedulable.
{Event: framework.ClusterEvent{Resource: framework.Node, ActionType: framework.Add | framework.UpdateNodeLabel}},
Can't we add a QueueingHint for the Node add/update events as well? (Or do you want to do that in another PR?)
It should be doable, but definitely in a separate PR. Getting the checks wrong is a potential source of bugs, so I don't want to do too much at once.
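For illustration, such a follow-up could register a Node hint roughly as in the sketch below. This is hypothetical code, not part of this PR: isSchedulableAfterNodeChange does not exist, and the ClusterEventWithHint/QueueingHintFn type names are assumed to match the registration entries quoted above.

package sketch

import "k8s.io/kubernetes/pkg/scheduler/framework"

// nodeEventWithHint shows how a hint for Node add/update events could be
// registered, mirroring the PodSchedulingContext entry above. The hint
// function itself would still have to be written.
func nodeEventWithHint(isSchedulableAfterNodeChange framework.QueueingHintFn) framework.ClusterEventWithHint {
	return framework.ClusterEventWithHint{
		Event:          framework.ClusterEvent{Resource: framework.Node, ActionType: framework.Add | framework.UpdateNodeLabel},
		QueueingHintFn: isSchedulableAfterNodeChange, // hypothetical
	}
}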
}

// isSchedulableAfterClaimChange is invoked whenever a claim changed. It checks whether
// that change made a previously unschedulable pod schedulable. It errs on the side of
// that change made a previously unschedulable pod schedulable. It errs on the side of
// that change made a previous unschedulable pod schedulable. It errs on the side of
"Previously" refers to "unschedulable", not "pod". I believe "previously" is correct here - but I am not a native speaker.
	return framework.QueueSkip
}

modifiedClaim, ok := newObj.(*resourcev1alpha2.ResourceClaim)
Can we use the As function to convert oldObj and newObj? https://github.com/kubernetes/kubernetes/blob/master/pkg/scheduler/util/utils.go#L167
Looking... As doesn't handle cache.DeletedFinalStateUnknown. That is okay here (it can't occur for newObj), but for oldObj in isSchedulableAfterPodSchedulingContextChange it is relevant. Shall I add support for it in this PR?
I've added it, will be in the next push.
Thanks!
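For readers following along, here is a minimal sketch of what such a DeletedFinalStateUnknown-aware conversion helper can look like. This is not the exact As code from pkg/scheduler/util/utils.go; asTyped and its error messages are illustrative.

package sketch

import (
	"fmt"

	"k8s.io/client-go/tools/cache"
)

// asTyped converts oldObj and newObj to T. For oldObj, a
// cache.DeletedFinalStateUnknown tombstone is unwrapped first, because an
// informer delivers that wrapper when a delete event was missed and only the
// stale cached object is known.
func asTyped[T any](oldObj, newObj interface{}) (T, T, error) {
	var oldTyped, newTyped T
	if oldObj != nil {
		if tombstone, ok := oldObj.(cache.DeletedFinalStateUnknown); ok {
			oldObj = tombstone.Obj
		}
		var ok bool
		if oldTyped, ok = oldObj.(T); !ok {
			return oldTyped, newTyped, fmt.Errorf("expected %T, got %T", oldTyped, oldObj)
		}
	}
	if newObj != nil {
		var ok bool
		if newTyped, ok = newObj.(T); !ok {
			return oldTyped, newTyped, fmt.Errorf("expected %T, got %T", newTyped, newObj)
		}
	}
	return oldTyped, newTyped, nil
}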
	return nil
}

// isSchedulableAfterClaimChange is invoked whenever a claim changed. It checks whether
nit
// isSchedulableAfterClaimChange is invoked whenever a claim changed. It checks whether
// isSchedulableAfterClaimChange is invoked whenever a claim added or changed. It checks whether
... or deleted. Instead of listing all possible reasons, I changed this to "is invoked for all claim events reported by an informer".
This also explains why handling cache.DeletedFinalStateUnknown is required.
// Deleted? That can happen because we ourselves delete the PodSchedulingContext while
// working on the pod. This can be ignored.
if oldObj != nil && newObj == nil {
	pl.logger.V(4).Info("PodSchedulingContext for pod got deleted", "pod", klog.KObj(pod))
	return framework.QueueSkip
}
This deleted event handling can be done around L372 (before converting type).
The log message then shouldn't mention the pod. It's simpler, so I moved the check.
// here is that in a situation where the cluster really is
// full, backoff won't be used because the scheduler keeps
// trying different nodes. This should not happen when it has
// full knowledge about resource availability (=
// PodSchedulingContext.*.UnsuitableNodes is complete) but may happen
// when it doesn't (= PodSchedulingContext.*.UnsuitableNodes had to be
// truncated).
if podScheduling.Spec.SelectedNode != "" {
I'd like to understand this part correctly.
Let's say Node1-4 are in the cluster and Node2-4 don't have enough capacity, so the scheduler sets podScheduling.Spec.SelectedNode=Node1. But later the device driver says claimStatus.UnsuitableNodes = Node1, we reach here, and the Pod will get QueueImmediately.
In this case, in the next scheduling cycle, Node2-4 are still rejected by the resource fit plugin or whatever, and Node1 is rejected by the DRA plugin. Meaning, this Pod won't be schedulable until Node2-4 get free space.
But, the Pod could keep getting rescheduled by QueueImmediately.
You mentioned such risk in the above comment, right?
How can we avoid this situation? Could you elaborate on "This should not happen when it has full knowledge about resource availability"?
But, the Pod could keep getting rescheduled by QueueImmediately.
There will only be one scheduling attempt, the one triggered by this cluster event. After that, there won't be any further updates of the PodSchedulingContext unless a driver reports a change in resource availability, so there will be no further "immediate" queueing.
You mentioned such risk in the above comment, right?
That risk is slightly different: UnsuitableNodes has a size limit. If a driver cannot report all unsuitable nodes, then the scheduler might alternate between different "selected nodes" that are all unsuitable.
This is an issue that we need to keep an eye on, but I think it can be ignored for now in this PR (alpha stage, not limited to scheduler events).
There will only be one scheduling attempt, the one triggered by this cluster event.
Ah, true.
UnsuitableNodes has a size limit. If a driver cannot report all unsuitable nodes, then the scheduler might alternate between different "selected nodes" that are all unsuitable.
OK, that's what you meant by "truncated". Understood.
Can we have a comment to describe the risk is acceptable for now?
Added. I also added a few more comments in other places.
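To make the scenario above concrete, here is an illustrative helper (not the PR's code) for the condition being discussed: the node that the scheduler selected shows up in a driver's UnsuitableNodes, which is exactly the case where re-queueing without backoff lets the scheduler pick a different node.

package sketch

import resourcev1alpha2 "k8s.io/api/resource/v1alpha2"

// selectedNodeBecameUnsuitable reports whether the node stored in
// spec.selectedNode is now listed as unsuitable by any claim status. The real
// plugin performs this check as part of a larger hint function.
func selectedNodeBecameUnsuitable(podScheduling *resourcev1alpha2.PodSchedulingContext) bool {
	selected := podScheduling.Spec.SelectedNode
	if selected == "" {
		return false
	}
	for _, claimStatus := range podScheduling.Status.ResourceClaims {
		for _, node := range claimStatus.UnsuitableNodes {
			if node == selected {
				return true
			}
		}
	}
	return false
}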
	}
}

if oldPodScheduling != nil &&
If we reach here, oldPodScheduling != nil is always true due to L442.
True, changed.
Maybe you have forgotten to fix this?
No, oldPodScheduling is nil for a "create" event.
Nah, OK never mind. 🙏
if loggerV := pl.logger.V(6); loggerV.Enabled() {
	loggerV.Info("PodSchedulingContext for pod with no relevant changes, maybe schedule", "pod", klog.KObj(pod), "podSchedulingDiff", cmp.Diff(oldPodScheduling, podScheduling))
} else {
	pl.logger.V(5).Info("PodSchedulingContext for pod with no relevant changes, maybe schedule", "pod", klog.KObj(pod))
}
return framework.QueueAfterBackoff
In which case could we reach here? Can't we return QueueSkip if we know it's a non-relevant change?
We don't know for sure that the change can be ignored. When in doubt, "backoff" (and thus try later) IMHO is the better choice and matches what the scheduler would do without scheduling hints.
I'll change the wording to "with unknown changes" instead of "no relevant changes"; that's more accurate for this situation. The diff is included at V(6) precisely because a developer might want to know more.
That makes sense
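The decision logic discussed here follows a simple pattern: skip what is provably irrelevant, queue immediately when a change clearly unblocks the pod, and fall back to backoff for anything unknown. A self-contained sketch with local stand-ins for the framework's hint values:

package main

import "fmt"

// Local stand-ins for the framework's queueing hints, just for this sketch.
type queueingHint string

const (
	queueSkip         queueingHint = "QueueSkip"
	queueAfterBackoff queueingHint = "QueueAfterBackoff"
	queueImmediately  queueingHint = "QueueImmediately"
)

// decide mirrors the reasoning above: skip what is provably irrelevant, queue
// immediately when the change clearly unblocks the pod, and fall back to
// backoff whenever the change is unknown or ambiguous.
func decide(provablyIrrelevant, clearlyUnblocks bool) queueingHint {
	switch {
	case provablyIrrelevant:
		return queueSkip
	case clearlyUnblocks:
		return queueImmediately
	default:
		// Behave like a scheduler without hints: retry after the normal backoff.
		return queueAfterBackoff
	}
}

func main() {
	fmt.Println(decide(false, false)) // unknown change -> QueueAfterBackoff
}

Defaulting to backoff means a missed special case degrades gracefully instead of blocking the pod.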
09496db to 28d3ae4 (compare)
@sanposhiho: I pushed an update, please take another look.
Only nits
loggerV.Info("PodSchedulingContext for pod with unknown changes, maybe schedule", "pod", klog.KObj(pod), "podSchedulingDiff", cmp.Diff(oldPodScheduling, podScheduling)) | ||
} else { | ||
pl.logger.V(5).Info("PodSchedulingContext for pod with unknown changes, maybe schedule", "pod", klog.KObj(pod)) | ||
} |
Can we have a comment describing that if we reach here, something unknown happened and that's why we return QueueAfterBackoff?
Done.
	},
}

for name, tc := range testcases {
nit: do you want to make this test parallel as well? (The other UTs in this test do that.)
Done.
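The requested change boils down to the standard Go pattern for parallel table-driven subtests; a minimal sketch with placeholder fields, not the PR's actual test:

package dynamicresources_test

import "testing"

// TestQueueingHintsSketch shows the pattern: capture the loop variable and
// call t.Parallel() inside each subtest, matching the other unit tests.
func TestQueueingHintsSketch(t *testing.T) {
	testcases := map[string]struct {
		input, want string
	}{
		"example": {input: "claim updated", want: "QueueImmediately"},
	}
	for name, tc := range testcases {
		tc := tc // capture the loop variable for use in the parallel subtest
		t.Run(name, func(t *testing.T) {
			t.Parallel()
			if tc.want == "" {
				t.Errorf("missing expectation for input %q", tc.input)
			}
		})
	}
}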
28d3ae4 to e76fcab (compare)
@sanposhiho: updated once more.
Everything is clear to me. Thanks!
/lgtm
LGTM label has been added. Git tree hash: d9fa124cb687dd64b7c920d41e754a031f5a256a
@kubernetes/sig-scheduling-leads @kerthcet
// TODO: the runtime should set up logging for each plugin, including
// adding a name for each one (same as in kube-controller-manager).
return NewWithLogger(klog.TODO(), plArgs, fh, fts)
Do we have an issue for this TODO by the way?
Not specifically for this one, but there are PRs pending for the conversion to contextual logging. Once those are complete, we can search for klog.TODO and fix it if not done yet.
It seems better to pass in a logger here rather than a context, like this: https://github.com/kubernetes/kubernetes/pull/116884/files#diff-b5d92dd0897babb368df8c1aad9e95c957eca378bd46481d6d8501011f722715R77-R81
Whatever gets done here must get coordinated when converting kube-scheduler to contextual logging. Let's not open that rathole right now.
pkg/scheduler/util/utils.go
Outdated
@@ -164,6 +165,8 @@ func IsScalarResourceName(name v1.ResourceName) bool {
// As converts two objects to the given type.
// Both objects must be of the same type. If not, an error is returned.
// nil objects are allowed and will be converted to nil.
// For oldObj, cache.DeletedFinalStateUnknown is handled and the
// pobject stored in it will be converted instead.
// pobject stored in it will be converted instead.
// pod object stored in it will be converted instead.
I will take a look tomorrow; I still have jobs at hand right now.
func (pl *dynamicResources) PreEnqueue(ctx context.Context, pod *v1.Pod) (status *framework.Status) {
	logger := klog.FromContext(ctx)
	defer func() {
		logger.V(5).Info("PreEnqueue", "pod", klog.KObj(pod), "status", status)
I would remove this because we already log it at the calling side.
Done.
if !usesClaim {
	// This was not the claim the pod was waiting for.
	pl.logger.V(6).Info("unrelated claim got modified", "pod", klog.KObj(pod), "claim", klog.KObj(modifiedClaim))
Why not keep V(4)? I think it's similar to the other logs.
Because this is going to be much more common than the other ones and not as interesting as those. If a developer wants to know everything, they should use -v6.
Some other examples:
pkg/kubelet/pleg/evented.go: if klog.V(6).Enabled() {
pkg/kubelet/pleg/evented.go- klog.ErrorS(err, "Evented PLEG: error generating pod status from the received event", "podUID", podID, "podStatus", status)
--
pkg/kubelet/pleg/evented.go: if klogV := klog.V(6); klogV.Enabled() {
pkg/kubelet/pleg/evented.go- klogV.InfoS("Evented PLEG: Generated pod status from the received event", "podUID", podID, "podStatus", status)
--
pkg/kubelet/pleg/generic.go: if klog.V(6).Enabled() {
pkg/kubelet/pleg/generic.go- klog.ErrorS(err, "PLEG: Write status", "pod", klog.KRef(pod.Namespace, pod.Name), "podStatus", status)
--
pkg/kubelet/pleg/generic.go: if klogV := klog.V(6); klogV.Enabled() {
pkg/kubelet/pleg/generic.go- klogV.InfoS("PLEG: Write status", "pod", klog.KRef(pod.Namespace, pod.Name), "podStatus", status)
--
staging/src/k8s.io/apiserver/pkg/util/flowcontrol/fairqueuing/queueset/queueset.go: if klogV := klog.V(6); klogV.Enabled() {
staging/src/k8s.io/apiserver/pkg/util/flowcontrol/fairqueuing/queueset/queueset.go- chosenQueue := qs.queues[bestQueueIdx]
}

pl.logger.V(4).Info("status of claim for pod got updated", "pod", klog.KObj(pod), "claim", klog.KObj(modifiedClaim))
return framework.QueueImmediately
I think we should be strict about QueueImmediately. I didn't see any special reason here; do we have one?
It is needed to make the scheduler respond without backoff delay when a claim got allocated.
Also note that this was discussed in one of the previous PRs which had this code and the conclusion there was that queuing immediately is okay - I can try to find that discussion again if you want.
QueueImmediately in this case should be OK.
Previous discussion: #117561 (comment)
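For context, the condition that justifies QueueImmediately in the hunk above is essentially "the claim the pod was waiting for just got allocated". An illustrative helper, not the PR's code; the real plugin performs additional checks, e.g. whether the pod actually uses the claim:

package sketch

import resourcev1alpha2 "k8s.io/api/resource/v1alpha2"

// claimGotAllocated reports whether the claim transitioned from unallocated
// to allocated between the old and new object, which is the change that
// warrants a retry without backoff.
func claimGotAllocated(oldClaim, newClaim *resourcev1alpha2.ResourceClaim) bool {
	return oldClaim != nil && newClaim != nil &&
		oldClaim.Status.Allocation == nil &&
		newClaim.Status.Allocation != nil
}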
oldPodScheduling, newPodScheduling, err := schedutil.As[*resourcev1alpha2.PodSchedulingContext](oldObj, newObj)
if err != nil {
	// Shouldn't happen.
	pl.logger.Error(nil, "isSchedulableAfterPodSchedulingChange")
Let's keep this consistent with isSchedulableAfterClaimChange, i.e. isSchedulableAfterPodSchedulingContextAddOrUpdate?
Ah, the function name needs to be updated - fixed.
podScheduling := newPodScheduling // Never nil because deletes are handled above.

if podScheduling.Name != pod.Name || podScheduling.Namespace != pod.Namespace {
	pl.logger.V(7).Info("PodSchedulingContext for unrelated pod got modified", "pod", klog.KObj(pod), "podScheduling", klog.KObj(podScheduling))
It seems we never used V(7) in the scheduler before. I think it's OK, but why do V(5) or V(10) not work for you? I generally use V(4) for debugging and V(5) for tracing, as suggested by https://github.com/kubernetes/community/blob/master/contributors/devel/sig-instrumentation/logging.md#what-method-to-use
Nothing >5 is standardized. Each additional increment just adds more information. Therefore V(6) seems fine to me. V(10) tends to be extremely noisy, which would make debugging this code at -v10 not very practical.
See also #119078 (comment) for other examples.
	return framework.QueueSkip
}

if oldPodScheduling == nil /* create */ ||
One question about AllocationMode: is it immutable?
Yes.
e76fcab to fd164e3 (compare)
/retest
This makes some complex values a bit more readable.
Informer callbacks must be prepared to get cache.DeletedFinalStateUnknown as the deleted object. They can use that as hint that some information may have been missed, but typically they just retrieve the stored object inside it.
This is a combination of two related enhancements:
- By implementing a PreEnqueue check, the initial pod scheduling attempt for a pod with a claim template gets avoided when the claim does not exist yet.
- By implementing cluster event checks, only those pods get scheduled for which something changed, and they get scheduled immediately without delay.
fd164e3 to 6f1a295 (compare)
/retest
/lgtm
/approve
LGTM label has been added. Git tree hash: bea34d1676accc5534cbf00a91996c9f1de486f6
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: pohly, sanposhiho. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
@pohly: The following test failed, say /retest to rerun all failed tests:
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
/retest
What type of PR is this?
/kind feature
What this PR does / why we need it:
Scheduling with less overhead and without delays when ResourceClaims are involved.
Which issue(s) this PR fixes:
Related-to: #118893 (comment)
Special notes for your reviewer:
This is a combination of two related enhancements (see the sketch after this list):
- By implementing a PreEnqueue check, the initial pod scheduling attempt for a pod with a claim template gets avoided when the claim does not exist yet.
- By implementing cluster event checks, only those pods get scheduled for which something changed, and they get scheduled immediately without delay.
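As a rough illustration of the first enhancement, a PreEnqueue check of this shape rejects enqueueing while a generated claim is still missing. This is a sketch only: getClaimName and getClaim stand in for the plugin's real name resolution and informer/lister lookup.

package sketch

import (
	"context"

	v1 "k8s.io/api/core/v1"
	"k8s.io/kubernetes/pkg/scheduler/framework"
)

// preEnqueueSketch refuses to enqueue a pod whose referenced claims (for
// example one generated from a claim template) cannot be found yet, so that
// no pointless scheduling attempt is made.
func preEnqueueSketch(ctx context.Context, pod *v1.Pod,
	getClaimName func(pod *v1.Pod, podClaim *v1.PodResourceClaim) string,
	getClaim func(namespace, name string) error,
) *framework.Status {
	for i := range pod.Spec.ResourceClaims {
		name := getClaimName(pod, &pod.Spec.ResourceClaims[i])
		if err := getClaim(pod.Namespace, name); err != nil {
			// The claim does not exist yet; the pod stays unschedulable
			// until a claim event re-triggers enqueueing.
			return framework.NewStatus(framework.UnschedulableAndUnresolvable, err.Error())
		}
	}
	return nil
}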
Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: