Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

automatically inject default queue if not provided #598

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

KPostOffice
Copy link
Collaborator

Issue link

https://issues.redhat.com/browse/RHOAIENG-6467

What changes have been made

If no LocalQueue is specified by RayCluster and a default LocalQueue is specified, then the default LocalQueue will be added to the RayCluster spec

Verification steps

  1. Install Kueue, Ray, and this branch of the CFO
  2. Create a default queue using the label {"kueue.x-k8s.io/default-queue": 'true'}
  3. create a RayCluster with no LocalQueue specified
    1. see that the default queue is added to the RayCluster's labels
  4. create a RayCluster with a LocalQueue specified
    1. see that the specified LocalQueue remains

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • Testing is not required for this change

@openshift-ci openshift-ci bot requested review from dimakis and Fiona-Waters July 22, 2024 19:49
Copy link

openshift-ci bot commented Jul 22, 2024

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from kpostoffice. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@KPostOffice KPostOffice force-pushed the rhoaieng-6467-localqueue2operator branch from 46358db to c5e9a48 Compare July 22, 2024 19:50
@sutaakar
Copy link
Contributor

@KPostOffice can you add unit tests in raycluster_webhook_test.go?

@@ -77,6 +81,30 @@ func (w *rayClusterWebhook) Default(ctx context.Context, obj runtime.Object) err
rayCluster.Spec.HeadGroupSpec.Template.Spec.ServiceAccountName = rayCluster.Name + "-oauth-proxy"
}

// add default queue label if not present
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the RayCluster is owned by a Kueue-managed object (eg an AppWrapepr) then injecting the queue-name label on a RayCluster that does not have one is not correct. This functionality at least needs to be guarded by configuration on the operator that allows it to be disabled. To actually be robust it should be looking to see if the RayCluster is not owned by an AppWrapper.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this might actually warrant a webhook on the Workload CRD and checking if spec.queue is set. I don't understand why this approach is incorrect however, using a mutating webhook to inject defaults is a normal use case for mutating webhooks and shouldn't cause any issues since the value is changed prior to being applied to etcd meaning it shouldn't cause issues with re-reconciliation loops. Does this have something to do with AppWrapper functionality that I'm missing?

Copy link
Collaborator

@dgrove-oss dgrove-oss Jul 23, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My comment comes from Kueue's logic for supporting parent/child relationships for workloads. It has code in its defaulting webhooks to detect that the object being created is a child of a Kueue managed parent object (encoded by a controller reference) and to not set the default value for suspend in a child job.
https://github.com/kubernetes-sigs/kueue/blob/release-0.8/pkg/controller/jobframework/defaults.go#L23-L34

I'd suggest adding a similar check here to avoid adding a label which is going to have no impact at runtime and might be confusing to people. If the ray cluster is controlled by an appwrapper, then it's the queue-name that is on the appwrapper that matters. The queue-name (or lack of one) on the ray cluster does nothing.

You can't easily call IsOwnerManagedByKueue(owner), but you could either add a check for the appwrapper GVK or just check for whether the ray cluster is controlled by anything at all (I think for standalone clusters, this should be no).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think @dgrove-oss isn't questioning the use of the webhook, but raising the point the logic should only apply to the top-level resource managed by Kueue in the ownership graph.

Copy link
Contributor

@astefanutti astefanutti Jul 23, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if the functionality that this PR adds for RayClusters could be generalised and proposed upstream?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An upstream generalization would be nice. For AppWrappers, we went with adding an optional defaultQueueName to the AppWrapper operator configuration. It's less customizable (uses a single default queue name for all namespaces), but it covered the use case we had for MLBatch. If an AppWrapper doesn't have a queue-name and the operator is configured with a default then the webhook injects it. But it only works for AppWrappers.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahh, I see thanks @dgrove-oss

}
err := w.Client.List(ctx, &kueuev1beta1.LocalQueueList{})
if err != nil {
rayclusterlog.Error(err, "Failed to list LocalQueues, Kueue CRD might not be installed")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Kueue is optional, so the logic should pass-through the error.

_, ok := rayCluster.Labels["kueue.x-k8s.io/queue-name"]
if !ok {
// check if CRD Kueue LocalQueue exists
localQueues := &kueuev1beta1.LocalQueueList{}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could metav1.PartialObjectMetadata be used there, so it uses the partial metadata client / cache?

err := w.Client.List(ctx, localQueues)
if err == nil {
for _, localQueue := range localQueues.Items {
is_default, ok := localQueue.Labels["kueue.x-k8s.io/default-queue"]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should use something like kueue.openshift.ai/default-queue instead. We can also explore the idea upstream in parallel.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This label is already being used in CF-sdk I'm not sure when it was added, but changing it could break users

@KPostOffice KPostOffice force-pushed the rhoaieng-6467-localqueue2operator branch from c5e9a48 to 1433b66 Compare July 24, 2024 17:14
@KPostOffice KPostOffice force-pushed the rhoaieng-6467-localqueue2operator branch from 1433b66 to 9927510 Compare July 24, 2024 17:15
@openshift-merge-robot
Copy link
Collaborator

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants