Fix launcher job scheduling directives when unsuspending #772

Merged
google-oss-prow[bot] merged 1 commit into kubeflow:master from GonzaloSaez:fix_kueue_launcher_suspended on Feb 17, 2026

Conversation

@GonzaloSaez (Contributor) commented Feb 15, 2026

This should address #770.

If an MPIJob is suspended and then unsuspended (e.g. as Kueue does during workload creation or when preemption occurs), the launcher Job does not end up with the correct scheduling directives after it is unsuspended. We need to perform the same operations as JobSet does: https://github.com/kubernetes-sigs/jobset/blob/f1bbaaef64b2a56c4721843b1d83750d21227948/pkg/controllers/jobset_controller.go#L537
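
As context for readers who don't want to follow the JobSet link, the idea is roughly the sketch below (not the code added in this PR; the function names and the exact set of fields are assumptions based on KEP-2926): before resuming the suspended launcher Job, copy the mutable scheduling directives from the desired MPIJob launcher template onto the Job's pod template.

package controller // illustrative package name

import (
	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
)

// syncDirectivesSketch copies the scheduling fields that KEP-2926 allows to be
// mutated on a suspended Job from the desired launcher template onto the Job.
func syncDirectivesSketch(launcher *batchv1.Job, desired *corev1.PodTemplateSpec) {
	launcher.Spec.Template.Labels = mergeStringMaps(launcher.Spec.Template.Labels, desired.Labels)
	launcher.Spec.Template.Annotations = mergeStringMaps(launcher.Spec.Template.Annotations, desired.Annotations)
	launcher.Spec.Template.Spec.NodeSelector = mergeStringMaps(launcher.Spec.Template.Spec.NodeSelector, desired.Spec.NodeSelector)
	launcher.Spec.Template.Spec.Tolerations = desired.Spec.Tolerations
	launcher.Spec.Template.Spec.SchedulingGates = desired.Spec.SchedulingGates
	launcher.Spec.Template.Spec.Affinity = desired.Spec.Affinity
}

// mergeStringMaps overlays b on top of a without mutating either input.
func mergeStringMaps(a, b map[string]string) map[string]string {
	merged := make(map[string]string, len(a)+len(b))
	for k, v := range a {
		merged[k] = v
	}
	for k, v := range b {
		merged[k] = v
	}
	return merged
}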

@tenzen-y (Member)

@GonzaloSaez could you sign DCO?

@tenzen-y (Member)

Avoid creating the launcher job if the MPIJob starts suspended. It adds load to the apiserver for not much value.

@GonzaloSaez Could you keep the current mechanism (creating a batch/v1 Job even when the MPIJob is suspended)?
This semantic change could potentially be a breaking change that cannot be released as part of the same major version.
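
For illustration only, keeping that semantic could look roughly like the sketch below; the helper name is hypothetical, and the field paths follow the v2beta1 MPIJob API as far as I know.

package controller // illustrative package name

import (
	batchv1 "k8s.io/api/batch/v1"
	"k8s.io/utils/ptr"

	kubeflow "github.com/kubeflow/mpi-operator/pkg/apis/kubeflow/v2beta1"
)

// mirrorSuspend keeps the existing behaviour: the launcher Job is always created,
// and its spec.suspend simply mirrors the MPIJob's runPolicy.suspend.
func mirrorSuspend(mpiJob *kubeflow.MPIJob, launcher *batchv1.Job) {
	suspended := mpiJob.Spec.RunPolicy.Suspend != nil && *mpiJob.Spec.RunPolicy.Suspend
	launcher.Spec.Suspend = ptr.To(suspended)
}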

@tenzen-y (Member)

@GonzaloSaez could you sign DCO?

You can follow the steps at https://github.com/kubeflow/mpi-operator/pull/772/checks?check_run_id=63645778871 to sign the DCO.

@GonzaloSaez force-pushed the fix_kueue_launcher_suspended branch from 880261d to fe8d324 on February 15, 2026 17:39
@tenzen-y (Member) left a comment

@GonzaloSaez Thank you for working on this problem.
Basically, LGTM.

Additionally, could you add an integration test case to https://github.com/kubeflow/mpi-operator/blob/master/test/integration/mpi_job_controller_test.go?

	// so we must clear it first via a status sub-resource update (consistent with JobSet).
	if launcher.Status.StartTime != nil {
		launcher.Status.StartTime = nil
		if _, err := c.kubeClient.BatchV1().Jobs(namespace).UpdateStatus(context.TODO(), launcher, metav1.UpdateOptions{}); err != nil {
Member:

Suggested change
-		if _, err := c.kubeClient.BatchV1().Jobs(namespace).UpdateStatus(context.TODO(), launcher, metav1.UpdateOptions{}); err != nil {
+		var err error
+		if launcher, err = c.kubeClient.BatchV1().Jobs(namespace).UpdateStatus(context.TODO(), launcher, metav1.UpdateOptions{}); err != nil {

Could you update launcher after the startTime update to avoid a conflict during the scheduling directive update?
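
The underlying concern: the status sub-resource update bumps the Job's resourceVersion, so a follow-up spec update issued with the stale in-memory object can fail with a 409 Conflict. A minimal sketch of the suggested pattern (the helper name is illustrative, not code from this PR):

package controller // illustrative package name

import (
	"context"

	batchv1 "k8s.io/api/batch/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// clearStartTime clears status.startTime and returns the updated Job so the caller
// can continue with the fresh resourceVersion when it mutates the spec afterwards.
func clearStartTime(ctx context.Context, kubeClient kubernetes.Interface, launcher *batchv1.Job) (*batchv1.Job, error) {
	if launcher.Status.StartTime == nil {
		return launcher, nil
	}
	launcher = launcher.DeepCopy()
	launcher.Status.StartTime = nil
	return kubeClient.BatchV1().Jobs(launcher.Namespace).UpdateStatus(ctx, launcher, metav1.UpdateOptions{})
}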

// syncLauncherSchedulingDirectives updates the mutable scheduling directives (as per KEP-2926) on
// the launcher Job's pod template to match the desired template.
func syncLauncherSchedulingDirectives(launcher *batchv1.Job, desired *corev1.PodTemplateSpec) {
	if launcher.Spec.Template.Labels == nil {
Member:

Suggested change
-	if launcher.Spec.Template.Labels == nil {
+	if desired.Labels != nil && launcher.Spec.Template.Labels == nil {

Optimizing initialization would be better.

Contributor Author:

I went ahead and reused some of the JobSet code; let me know what you think, please.

Member:

The idea sounds reasonable.
I left a comment for improvement: #772 (comment)

// the launcher Job's pod template to match the desired template.
func syncLauncherSchedulingDirectives(launcher *batchv1.Job, desired *corev1.PodTemplateSpec) {
	if launcher.Spec.Template.Labels == nil {
		launcher.Spec.Template.Labels = make(map[string]string)
Member:

Suggested change
-		launcher.Spec.Template.Labels = make(map[string]string)
+		launcher.Spec.Template.Labels = make(map[string]string, len(desired.Labels))

Comment on lines 1655 to 1662

	if desired.Annotations != nil {
		if launcher.Spec.Template.Annotations == nil {
			launcher.Spec.Template.Annotations = make(map[string]string)
		}
		for k, v := range desired.Annotations {
			launcher.Spec.Template.Annotations[k] = v
		}
	}
Member:

Suggested change
-	if desired.Annotations != nil {
-		if launcher.Spec.Template.Annotations == nil {
-			launcher.Spec.Template.Annotations = make(map[string]string)
-		}
-		for k, v := range desired.Annotations {
-			launcher.Spec.Template.Annotations[k] = v
-		}
-	}
+	if desired.Annotations != nil && launcher.Spec.Template.Annotations == nil {
+		launcher.Spec.Template.Annotations = make(map[string]string)
+	}
+	for k, v := range desired.Annotations {
+		launcher.Spec.Template.Annotations[k] = v
+	}

The range loop will only execute when desired.Annotations is not nil (ranging over a nil map is a no-op), so the outer nil check can be dropped.
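
For reference, ranging over a nil map is a no-op in Go, which is what makes the simplified form above safe; a standalone example:

package main

import "fmt"

func main() {
	var annotations map[string]string // nil map
	for k, v := range annotations {   // loop body never runs for a nil map
		fmt.Println(k, v)
	}
	fmt.Println("done") // prints "done"; no panic
}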

Comment on lines 1650 to 1655

	mergeMaps := func(old, new map[string]string) map[string]string {
		merged := make(map[string]string, max(len(old), len(new)))
		maps.Copy(merged, old)
		maps.Copy(merged, new)
		return merged
	}
@tenzen-y (Member) commented Feb 16, 2026:

Could you implement mergeMaps[K comparable, V any](a, b map[K]V) map[K]V separately instead of as an anonymous function?

func mergeMaps[K comparable, V any](a, b map[K]V) map[K]V {
	merged := make(map[K]V, max(len(a), len(b)))
	maps.Copy(merged, a)
	maps.Copy(merged, b)
	return merged
}
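
A standalone usage sketch of such a generic helper (the values are made up; this is not the test code from this PR):

package main

import (
	"fmt"
	"maps"
)

func mergeMaps[K comparable, V any](a, b map[K]V) map[K]V {
	merged := make(map[K]V, max(len(a), len(b)))
	maps.Copy(merged, a)
	maps.Copy(merged, b)
	return merged
}

func main() {
	current := map[string]string{"app": "launcher"}
	desired := map[string]string{"app": "launcher", "kueue.x-k8s.io/workload": "my-workload"}
	merged := mergeMaps(current, desired)
	fmt.Println(merged) // map[app:launcher kueue.x-k8s.io/workload:my-workload]
}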

@tenzen-y (Member)

@GonzaloSaez, some of the CI jobs failed. Please take a look.

"kueue.x-k8s.io/workload": "my-workload",
}
launcherTemplate.Spec.NodeSelector = map[string]string{
"cloud.google.com/gke-accelerator": "nvidia-tesla-t4",
@tenzen-y (Member) commented Feb 16, 2026:

Suggested change
-		"cloud.google.com/gke-accelerator": "nvidia-tesla-t4",
+		"example.com/accelerator": "example-model",

Could you avoid the vendor-specific one?

	// launcher Job gets the updated scheduling directives on second resume.
	mpiJobLauncherTemplate := &mpiJob.Spec.MPIReplicaSpecs[kubeflow.MPIReplicaTypeLauncher].Template
	mpiJobLauncherTemplate.ObjectMeta.Labels["foo"] = "baz"
	mpiJobLauncherTemplate.Spec.NodeSelector["cloud.google.com/gke-accelerator"] = "nvidia-tesla-t4-v2"
Member:

Suggested change
-	mpiJobLauncherTemplate.Spec.NodeSelector["cloud.google.com/gke-accelerator"] = "nvidia-tesla-t4-v2"
+	mpiJobLauncherTemplate.Spec.NodeSelector["example.com/accelerator"] = "example-model"

ditto

Signed-off-by: GonzaloSaez <11050889+GonzaloSaez@users.noreply.github.com>
@GonzaloSaez force-pushed the fix_kueue_launcher_suspended branch from 9228f9b to 3e448c5 on February 16, 2026 19:30
@GonzaloSaez (Contributor Author)

I tried running the Kueue e2e test from kubernetes-sigs/kueue#9253 but it still fails. However, I see the nodeSelector, scheduling gates, etc. being propagated to the launcher job, so I think it may be related to the job configuration or that we are missing something else with respect to the separation between launcher and worker pods in Kueue. I can also take a look at it if needed.

@tenzen-y (Member)

> I tried running the Kueue e2e test from kubernetes-sigs/kueue#9253 but it still fails. However, I see the nodeSelector, scheduling gates, etc. being propagated to the launcher job, so I think it may be related to the job configuration or that we are missing something else with respect to the separation between launcher and worker pods in Kueue. I can also take a look at it if needed.

Thank you for verifying the TAS test. Yes, ideally we would like to confirm that test case, but let us try that separately from this enhancement.

I will also check whether anything is missing.

}

func mergeMaps[K comparable, V any](a, b map[K]V) map[K]V {
	merged := make(map[K]V, max(len(a), len(b)))
Member:

Suggested change
-	merged := make(map[K]V, max(len(a), len(b)))
+	merged := make(map[K]V, len(a)+len(b))

Sorry for the confusion. Looking at this code again, shouldn't this be the sum of len(a) and len(b)?

Contributor Author:

It depends: if a and b have the same or very similar keys, then we'd be over-allocating. Let me know what you prefer.

Member:

I believe the callers of mergeMaps should not have to consider its internal implementation, which means the case where the lengths of a and b are quite different should also be accounted for.

Member:

Admittedly, in the worst case (a and b are mostly the same and both very large), it would allocate a lot of redundant memory.

@tenzen-y (Member) commented Feb 17, 2026:

Alright, both approaches (max(len(a), len(b)) and len(a)+len(b)) have different problems, and I don't want to waste time on a trivial discussion. So I will approve your current approach.

Member:

/lgtm
/approve

@tenzen-y (Member) left a comment

@GonzaloSaez Thank you for addressing the comments.
Otherwise LGTM.

@google-oss-prow[bot]

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: tenzen-y

The full list of commands accepted by this bot can be found here.

The pull request process is described here.

Needs approval from an approver in each of these files.

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

google-oss-prow[bot] merged commit c72caac into kubeflow:master on Feb 17, 2026
10 checks passed
@tenzen-y (Member)

> I tried running the Kueue e2e test from kubernetes-sigs/kueue#9253 but it still fails. However, I see the nodeSelector, scheduling gates, etc. being propagated to the launcher job, so I think it may be related to the job configuration or that we are missing something else with respect to the separation between launcher and worker pods in Kueue. I can also take a look at it if needed.

I manually checked the E2E case, and the verifications succeeded. One thing: the currently expected result is not correct. The correct one is the following:

			//	 wantAssignment := map[string]string{
			//		 "launcher/0": "kind-worker",
			//		 "worker/1":   "kind-worker",
			//		 "worker/2":   "kind-worker2",
			//		 "worker/3":   "kind-worker3",
			//	 }

This is expected because the MPIJob resource requirements are mixed across roles.
I will open a Kueue PR to update the MPI Operator, fix the case, and enable it.
