Fix launcher job scheduling directives when unsuspending #772

Merged
google-oss-prow[bot] merged 1 commit into kubeflow:master from GonzaloSaez:fix_kueue_launcher_suspended on Feb 17, 2026

Conversation

@GonzaloSaez (Contributor) commented Feb 15, 2026

This should address #770.

If an MPIJob is suspended and then unsuspended (e.g. as Kueue does during workload creation or when preemption occurs), the launcher Job does not end up with the correct scheduling directives after it is unsuspended. We need to perform the same operations as JobSet does: https://github.com/kubernetes-sigs/jobset/blob/f1bbaaef64b2a56c4721843b1d83750d21227948/pkg/controllers/jobset_controller.go#L537
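
As context for readers who don't want to follow the JobSet link, the idea is roughly the sketch below (not the code added in this PR; the function names and the exact set of fields are assumptions based on KEP-2926): before resuming the suspended launcher Job, copy the mutable scheduling directives from the desired MPIJob launcher template onto the Job's pod template.

package controller // illustrative package name

import (
	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
)

// syncDirectivesSketch copies the scheduling fields that KEP-2926 allows to be
// mutated on a suspended Job from the desired launcher template onto the Job.
func syncDirectivesSketch(launcher *batchv1.Job, desired *corev1.PodTemplateSpec) {
	launcher.Spec.Template.Labels = mergeStringMaps(launcher.Spec.Template.Labels, desired.Labels)
	launcher.Spec.Template.Annotations = mergeStringMaps(launcher.Spec.Template.Annotations, desired.Annotations)
	launcher.Spec.Template.Spec.NodeSelector = mergeStringMaps(launcher.Spec.Template.Spec.NodeSelector, desired.Spec.NodeSelector)
	launcher.Spec.Template.Spec.Tolerations = desired.Spec.Tolerations
	launcher.Spec.Template.Spec.SchedulingGates = desired.Spec.SchedulingGates
	launcher.Spec.Template.Spec.Affinity = desired.Spec.Affinity
}

// mergeStringMaps overlays b on top of a without mutating either input.
func mergeStringMaps(a, b map[string]string) map[string]string {
	merged := make(map[string]string, len(a)+len(b))
	for k, v := range a {
		merged[k] = v
	}
	for k, v := range b {
		merged[k] = v
	}
	return merged
}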

@tenzen-y (Member)

@GonzaloSaez could you sign DCO?

@tenzen-y (Member)

Avoid creating the launcher job if the MPIJob starts suspended. It adds load to the apiserver for not much value.

@GonzaloSaez Could you keep the current mechanism (creating a batch/v1 Job even when the MPIJob is suspended)?
This semantic change could potentially be a breaking change that cannot be released as part of the same major version.
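
For illustration only, keeping that semantic could look roughly like the sketch below; the helper name is hypothetical, and the field paths follow the v2beta1 MPIJob API as far as I know.

package controller // illustrative package name

import (
	batchv1 "k8s.io/api/batch/v1"
	"k8s.io/utils/ptr"

	kubeflow "github.com/kubeflow/mpi-operator/pkg/apis/kubeflow/v2beta1"
)

// mirrorSuspend keeps the existing behaviour: the launcher Job is always created,
// and its spec.suspend simply mirrors the MPIJob's runPolicy.suspend.
func mirrorSuspend(mpiJob *kubeflow.MPIJob, launcher *batchv1.Job) {
	suspended := mpiJob.Spec.RunPolicy.Suspend != nil && *mpiJob.Spec.RunPolicy.Suspend
	launcher.Spec.Suspend = ptr.To(suspended)
}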

@tenzen-y (Member)

@GonzaloSaez could you sign DCO?

You can follow the steps at https://github.com/kubeflow/mpi-operator/pull/772/checks?check_run_id=63645778871 to sign the DCO.

@GonzaloSaez force-pushed the fix_kueue_launcher_suspended branch from 880261d to fe8d324 on February 15, 2026 17:39
@tenzen-y (Member) left a comment

@GonzaloSaez Thank you for working on this problem.
Basically, LGTM.

Additionally, could you add an integration test case to https://github.com/kubeflow/mpi-operator/blob/master/test/integration/mpi_job_controller_test.go?

	// so we must clear it first via a status sub-resource update (consistent with JobSet).
	if launcher.Status.StartTime != nil {
		launcher.Status.StartTime = nil
		if _, err := c.kubeClient.BatchV1().Jobs(namespace).UpdateStatus(context.TODO(), launcher, metav1.UpdateOptions{}); err != nil {
Member:

Suggested change
-		if _, err := c.kubeClient.BatchV1().Jobs(namespace).UpdateStatus(context.TODO(), launcher, metav1.UpdateOptions{}); err != nil {
+		var err error
+		if launcher, err = c.kubeClient.BatchV1().Jobs(namespace).UpdateStatus(context.TODO(), launcher, metav1.UpdateOptions{}); err != nil {

Could you update launcher after the startTime update to avoid a conflict during the scheduling directive update?
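
The underlying concern: the status sub-resource update bumps the Job's resourceVersion, so a follow-up spec update issued with the stale in-memory object can fail with a 409 Conflict. A minimal sketch of the suggested pattern (the helper name is illustrative, not code from this PR):

package controller // illustrative package name

import (
	"context"

	batchv1 "k8s.io/api/batch/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// clearStartTime clears status.startTime and returns the updated Job so the caller
// can continue with the fresh resourceVersion when it mutates the spec afterwards.
func clearStartTime(ctx context.Context, kubeClient kubernetes.Interface, launcher *batchv1.Job) (*batchv1.Job, error) {
	if launcher.Status.StartTime == nil {
		return launcher, nil
	}
	launcher = launcher.DeepCopy()
	launcher.Status.StartTime = nil
	return kubeClient.BatchV1().Jobs(launcher.Namespace).UpdateStatus(ctx, launcher, metav1.UpdateOptions{})
}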

// syncLauncherSchedulingDirectives updates the mutable scheduling directives (as per KEP-2926) on
// the launcher Job's pod template to match the desired template.
func syncLauncherSchedulingDirectives(launcher *batchv1.Job, desired *corev1.PodTemplateSpec) {
	if launcher.Spec.Template.Labels == nil {
Member:

Suggested change
-	if launcher.Spec.Template.Labels == nil {
+	if desired.Labels != nil && launcher.Spec.Template.Labels == nil {

Optimizing initialization would be better.

Contributor Author:

I went ahead and reused some of the JobSet code; let me know what you think, please.

Member:

The idea sounds reasonable.
I left a comment for improvement: #772 (comment)

// the launcher Job's pod template to match the desired template.
func syncLauncherSchedulingDirectives(launcher *batchv1.Job, desired *corev1.PodTemplateSpec) {
	if launcher.Spec.Template.Labels == nil {
		launcher.Spec.Template.Labels = make(map[string]string)
Member:

Suggested change
-		launcher.Spec.Template.Labels = make(map[string]string)
+		launcher.Spec.Template.Labels = make(map[string]string, len(desired.Labels))

Comment on lines 1655 to 1662

	if desired.Annotations != nil {
		if launcher.Spec.Template.Annotations == nil {
			launcher.Spec.Template.Annotations = make(map[string]string)
		}
		for k, v := range desired.Annotations {
			launcher.Spec.Template.Annotations[k] = v
		}
	}
Member:

Suggested change
-	if desired.Annotations != nil {
-		if launcher.Spec.Template.Annotations == nil {
-			launcher.Spec.Template.Annotations = make(map[string]string)
-		}
-		for k, v := range desired.Annotations {
-			launcher.Spec.Template.Annotations[k] = v
-		}
-	}
+	if desired.Annotations != nil && launcher.Spec.Template.Annotations == nil {
+		launcher.Spec.Template.Annotations = make(map[string]string)
+	}
+	for k, v := range desired.Annotations {
+		launcher.Spec.Template.Annotations[k] = v
+	}

The range loop will only execute when desired.Annotations is not nil (ranging over a nil map is a no-op), so the outer nil check can be dropped.
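
For reference, ranging over a nil map is a no-op in Go, which is what makes the simplified form above safe; a standalone example:

package main

import "fmt"

func main() {
	var annotations map[string]string // nil map
	for k, v := range annotations {   // loop body never runs for a nil map
		fmt.Println(k, v)
	}
	fmt.Println("done") // prints "done"; no panic
}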

Comment on lines 1650 to 1655

	mergeMaps := func(old, new map[string]string) map[string]string {
		merged := make(map[string]string, max(len(old), len(new)))
		maps.Copy(merged, old)
		maps.Copy(merged, new)
		return merged
	}
@tenzen-y (Member) commented Feb 16, 2026:

Could you implement mergeMaps[K comparable, V any](a, b map[K]V) map[K]V separately instead of as an anonymous function?

func mergeMaps[K comparable, V any](a, b map[K]V) map[K]V {
	merged := make(map[K]V, max(len(a), len(b)))
	maps.Copy(merged, a)
	maps.Copy(merged, b)
	return merged
}
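
A standalone usage sketch of such a generic helper (the values are made up; this is not the test code from this PR):

package main

import (
	"fmt"
	"maps"
)

func mergeMaps[K comparable, V any](a, b map[K]V) map[K]V {
	merged := make(map[K]V, max(len(a), len(b)))
	maps.Copy(merged, a)
	maps.Copy(merged, b)
	return merged
}

func main() {
	current := map[string]string{"app": "launcher"}
	desired := map[string]string{"app": "launcher", "kueue.x-k8s.io/workload": "my-workload"}
	merged := mergeMaps(current, desired)
	fmt.Println(merged) // map[app:launcher kueue.x-k8s.io/workload:my-workload]
}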

@tenzen-y (Member)

@GonzaloSaez, some of the CI jobs failed. Please take a look.

"kueue.x-k8s.io/workload": "my-workload",
}
launcherTemplate.Spec.NodeSelector = map[string]string{
"cloud.google.com/gke-accelerator": "nvidia-tesla-t4",
@tenzen-y (Member) commented Feb 16, 2026:

Suggested change
-		"cloud.google.com/gke-accelerator": "nvidia-tesla-t4",
+		"example.com/accelerator": "example-model",

Could you avoid the vendor-specific one?

	// launcher Job gets the updated scheduling directives on second resume.
	mpiJobLauncherTemplate := &mpiJob.Spec.MPIReplicaSpecs[kubeflow.MPIReplicaTypeLauncher].Template
	mpiJobLauncherTemplate.ObjectMeta.Labels["foo"] = "baz"
	mpiJobLauncherTemplate.Spec.NodeSelector["cloud.google.com/gke-accelerator"] = "nvidia-tesla-t4-v2"
Member:

Suggested change
-	mpiJobLauncherTemplate.Spec.NodeSelector["cloud.google.com/gke-accelerator"] = "nvidia-tesla-t4-v2"
+	mpiJobLauncherTemplate.Spec.NodeSelector["example.com/accelerator"] = "example-model"

ditto

Signed-off-by: GonzaloSaez <11050889+GonzaloSaez@users.noreply.github.com>
@GonzaloSaez force-pushed the fix_kueue_launcher_suspended branch from 9228f9b to 3e448c5 on February 16, 2026 19:30
@GonzaloSaez (Contributor Author)

I tried running the Kueue e2e test from kubernetes-sigs/kueue#9253 but it still fails. However, I see the nodeSelector, scheduling gates, etc. being propagated to the launcher job, so I think it may be related to the job configuration or that we are missing something else with respect to the separation between launcher and worker pods in Kueue. I can also take a look at it if needed.

@tenzen-y (Member)

> I tried running the Kueue e2e test from kubernetes-sigs/kueue#9253 but it still fails. However, I see the nodeSelector, scheduling gates, etc. being propagated to the launcher job, so I think it may be related to the job configuration or that we are missing something else with respect to the separation between launcher and worker pods in Kueue. I can also take a look at it if needed.

Thank you for verifying the TAS test. Yes, ideally we would like to confirm that test case, but let us try that separately from this enhancement.

I will also check whether anything is missing.

}

func mergeMaps[K comparable, V any](a, b map[K]V) map[K]V {
	merged := make(map[K]V, max(len(a), len(b)))
Member:

Suggested change
-	merged := make(map[K]V, max(len(a), len(b)))
+	merged := make(map[K]V, len(a)+len(b))

Sorry for the confusion. Looking at this code again, shouldn't this be the sum of len(a) and len(b)?

Contributor Author:

It depends: if a and b have the same or very similar keys, then we'd be over-allocating. Let me know what you prefer.

Member:

I believe the callers of mergeMaps should not have to consider its internal implementation, which means the case where the lengths of a and b are quite different should also be accounted for.

Member:

Admittedly, in the worst case (a and b are mostly the same and both very large), it would allocate a lot of redundant memory.

@tenzen-y (Member) commented Feb 17, 2026:

Alright, both approaches (max(len(a), len(b)) and len(a)+len(b)) have different problems, and I don't want to waste time on a trivial discussion. So I will approve your current approach.

Member:

/lgtm
/approve

@tenzen-y (Member) left a comment

@GonzaloSaez Thank you for addressing the comments.
Otherwise LGTM.

@google-oss-prow[bot]

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: tenzen-y

The full list of commands accepted by this bot can be found here.

The pull request process is described here.

Needs approval from an approver in each of these files.

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

google-oss-prow[bot] merged commit c72caac into kubeflow:master on Feb 17, 2026
10 checks passed
@tenzen-y (Member)

> I tried running the Kueue e2e test from kubernetes-sigs/kueue#9253 but it still fails. However, I see the nodeSelector, scheduling gates, etc. being propagated to the launcher job, so I think it may be related to the job configuration or that we are missing something else with respect to the separation between launcher and worker pods in Kueue. I can also take a look at it if needed.

I manually checked the E2E case, and the verifications succeeded. One thing: the currently expected result is not correct. The correct one is the following:

			//	 wantAssignment := map[string]string{
			//		 "launcher/0": "kind-worker",
			//		 "worker/1":   "kind-worker",
			//		 "worker/2":   "kind-worker2",
			//		 "worker/3":   "kind-worker3",
			//	 }

This is expected because the MPIJob resource requirements are mixed across roles.
I will open a Kueue PR to update the MPI Operator, fix the case, and enable it.
