Do not create the launcher job if the job starts suspended#670

Open
GonzaloSaez wants to merge 1 commit intokubeflow:masterfrom
GonzaloSaez:g/fix_job_launch_wait_for_pods_ready

Conversation

@GonzaloSaez
Contributor

@GonzaloSaez GonzaloSaez commented Oct 31, 2024

When the MPIJob starts suspended, we were creating the launcher job regardless of the initial suspended state. This causes issues with kueue, since it will suspend the MPIJob but a launcher job will still be created with the wrong NodeSelector coming from the kueue flavour. I think avoiding creating the launcher in this scenario is the right thing to do, but I'm not sure if others have different thoughts.
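The intent of the change can be sketched as a small guard in the reconcile path. This is only an illustrative sketch with simplified stand-in types (the real controller uses the generated kubeflow.org/v2beta1 API types), and `shouldCreateLauncher` is a hypothetical helper, not code from this PR:

```go
package main

import "fmt"

// Simplified stand-ins for the real MPIJob types; the names loosely mirror
// the controller, but this is only an illustrative sketch.
type RunPolicy struct {
	Suspend *bool
}

type MPIJob struct {
	RunPolicy RunPolicy
}

func isMPIJobSuspended(mpiJob *MPIJob) bool {
	return mpiJob.RunPolicy.Suspend != nil && *mpiJob.RunPolicy.Suspend
}

// shouldCreateLauncher captures the guard this PR adds: skip launcher
// creation while the MPIJob is suspended, so the launcher Job never exists
// with a stale NodeSelector.
func shouldCreateLauncher(launcherExists bool, mpiJob *MPIJob) bool {
	return !launcherExists && !isMPIJobSuspended(mpiJob)
}

func main() {
	suspended := true
	job := &MPIJob{RunPolicy: RunPolicy{Suspend: &suspended}}
	fmt.Println(shouldCreateLauncher(false, job)) // suspended: do not create
	*job.RunPolicy.Suspend = false
	fmt.Println(shouldCreateLauncher(false, job)) // resumed: create launcher
}
```

With a guard like this, a suspended MPIJob produces no launcher Job, so kueue can inject the right NodeSelector into the pod template before the Job is ever created.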

@google-oss-prow

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign alculquicondor for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Comment on lines +661 to +663
// If the job is suspended, the list of worker pods will be incorrect. We also do
// not want to start the launcher job if the MPIJob starts suspended.
if launcher == nil && !isMPIJobSuspended(mpiJob) {
Contributor Author


This raises the question of what should be done when unsuspending the launcher job in case kueue has decided to change the NodeSelector. Should we instead recreate the job, since NodeSelector is immutable?

Contributor


Yeah, I think this question does not have a straightforward answer. I can see at least two possible approaches:

  1. as in JobSet: update the NodeSelector field in the pod template when resuming the Job
  2. recreate the launcher/worker Jobs; this can probably be achieved easily by deleting the Jobs when the MPIJob is suspended

I'm ok with whichever of those is simpler to implement. Any opinion @tenzen-y?
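For illustration, approach 2 could behave roughly like the toy reconcile loop below. This is a hypothetical sketch against an in-memory map, not the real controller (which deletes and creates Jobs through the Kubernetes API); `cluster` and `syncSuspend` are invented names:

```go
package main

import "fmt"

// Hypothetical, simplified model of approach 2: on suspend we delete the
// launcher/worker Jobs so that, on resume, they are recreated from the pod
// template and pick up any NodeSelector kueue injected in the meantime.
type Job struct {
	Name         string
	NodeSelector map[string]string
}

type cluster struct {
	jobs map[string]*Job
}

func (c *cluster) createJob(name string, sel map[string]string) {
	c.jobs[name] = &Job{Name: name, NodeSelector: sel}
}

// syncSuspend mimics one reconcile step: suspending removes the Jobs,
// resuming recreates any missing Job with the current (possibly updated)
// selector, sidestepping NodeSelector immutability on existing Jobs.
func (c *cluster) syncSuspend(suspended bool, sel map[string]string) {
	if suspended {
		delete(c.jobs, "launcher")
		delete(c.jobs, "worker")
		return
	}
	for _, name := range []string{"launcher", "worker"} {
		if _, ok := c.jobs[name]; !ok {
			c.createJob(name, sel)
		}
	}
}

func main() {
	c := &cluster{jobs: map[string]*Job{}}
	c.syncSuspend(false, nil)                                  // initial create, no selector
	c.syncSuspend(true, nil)                                   // suspend: Jobs deleted
	c.syncSuspend(false, map[string]string{"flavor": "gpu-a"}) // resume with kueue's selector
	fmt.Println(c.jobs["launcher"].NodeSelector["flavor"])
}
```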

Contributor


In any case I think it is safe to decouple the fixes.

Member


Sorry for the delayed response. IMO, I would select option 1 (as in JobSet: update the NodeSelector field in the pod template when resuming the Job) instead of the current solution, because I want to align the behavior with JobSet, since JobSet creates its Jobs even while suspended.

Contributor Author

@GonzaloSaez GonzaloSaez Feb 5, 2025


That's already being done in theory by kueue with RunWithPodSetsInfo, right? I think the issue here is that we create the job in a non-suspended mode even if the MPIJob is suspended. This results in pods being scheduled onto nodes and then removed, because the controller suspends the job later on.


@ttakahashi21 ttakahashi21 Mar 14, 2025


@tenzen-y

> If you want to guarantee to schedule the Workers first, we can use launcherCreationPolicy:

Thank you for your comment. I've tested with launcherCreationPolicy set to WaitForWorkersReady, but it doesn't work as expected when kueue is enabled and nominalQuota is greater than the actual allocatable GPUs.

| # | kueue | Condition of GPUs | launcherCreationPolicy (MPIJob) | Behavior |
| --- | --- | --- | --- | --- |
| 1 | disabled | - | WaitForWorkersReady | works as expected |
| 2 | enabled | NominalQuota == Allocatable | WaitForWorkersReady | works as expected |
| 3 | enabled | NominalQuota > Allocatable | WaitForWorkersReady | does not work as expected |

Note that the result I previously shared was tested under the conditions below, as you pointed out.

| # | kueue | Condition of GPUs | launcherCreationPolicy (MPIJob) | Behavior |
| --- | --- | --- | --- | --- |
| 0 | enabled | NominalQuota > Allocatable | LauncherCreationPolicyAtStartup (default) | works as expected |

case1

When an MPIJob is executed exceeding the nominalQuota of the GPU resource, the GPU resource is insufficient, so the workers stay pending. Therefore, if WaitForWorkersReady is working for the MPIJob, the launcher is expected not to run. In this case, the launcher does not run.

  • The WaitForWorkersReady feature works fine when Kueue is not used.
# kubectl get mpijob,workload,pod
NAME                                             AGE
mpijob.kubeflow.org/tensorflow-benchmarks-job1   12s
mpijob.kubeflow.org/tensorflow-benchmarks-job2   12s

NAME                                            READY   STATUS    RESTARTS   AGE
pod/tensorflow-benchmarks-job1-launcher-k9r6m   1/1     Running   0          10s
pod/tensorflow-benchmarks-job1-worker-0         1/1     Running   0          12s
pod/tensorflow-benchmarks-job1-worker-1         1/1     Running   0          12s
pod/tensorflow-benchmarks-job2-worker-0         0/1     Pending   0          12s
pod/tensorflow-benchmarks-job2-worker-1         0/1     Pending   0          12s
Details
# kubectl describe mpijob.kubeflow.org/tensorflow-benchmarks-job1
Name:         tensorflow-benchmarks-job1
Namespace:    default
Labels:       <none>
Annotations:  <none>
API Version:  kubeflow.org/v2beta1
Kind:         MPIJob
Metadata:
  Creation Timestamp:  2025-03-12T20:54:48Z
  Generation:          1
  Resource Version:    2126
  UID:                 54b7c0a2-2bb7-40b4-9cb3-92991bf15c5a
Spec:
  Launcher Creation Policy:  WaitForWorkersReady
  Mpi Implementation:        OpenMPI
  Mpi Replica Specs:
    Launcher:
      Replicas:  1
      Template:
        Metadata:
        Spec:
          Containers:
            Command:
              mpirun
              --allow-run-as-root
              -np
              2
              -bind-to
              none
              -map-by
              slot
              -x
              NCCL_DEBUG=INFO
              -x
              LD_LIBRARY_PATH
              -x
              PATH
              -mca
              pml
              ob1
              -mca
              btl
              ^openib
              python
              scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py
              --model=resnet101
              --batch_size=64
              --variable_update=horovod
            Image:  mpioperator/tensorflow-benchmarks:latest
            Name:   tensorflow-benchmarks-job1
            Resources:
    Worker:
      Replicas:  2
      Template:
        Metadata:
        Spec:
          Containers:
            Image:  mpioperator/tensorflow-benchmarks:latest
            Name:   tensorflow-benchmarks-job1
            Resources:
              Limits:
                nvidia.com/gpu:  1
  Run Launcher As Worker:        false
  Run Policy:
    Clean Pod Policy:   Running
    Suspend:            false
  Slots Per Worker:     1
  Ssh Auth Mount Path:  /root/.ssh
Status:
  Conditions:
    Last Transition Time:  2025-03-12T20:54:48Z
    Last Update Time:      2025-03-12T20:54:48Z
    Message:               MPIJob default/tensorflow-benchmarks-job1 is created.
    Reason:                MPIJobCreated
    Status:                True
    Type:                  Created
    Last Transition Time:  2025-03-12T20:54:52Z
    Last Update Time:      2025-03-12T20:54:52Z
    Message:               MPIJob default/tensorflow-benchmarks-job1 is running.
    Reason:                MPIJobRunning
    Status:                True
    Type:                  Running
  Replica Statuses:
    Launcher:
      Active:  1
    Worker:
      Active:  2
  Start Time:  2025-03-12T20:54:48Z
Events:
  Type    Reason         Age                From                Message
  ----    ------         ----               ----                -------
  Normal  MPIJobCreated  41s                mpi-job-controller  MPIJob default/tensorflow-benchmarks-job1 is created.
  Normal  MPIJobRunning  36s (x3 over 37s)  mpi-job-controller  MPIJob default/tensorflow-benchmarks-job1 is running
Details
# kubectl describe mpijob.kubeflow.org/tensorflow-benchmarks-job2
Name:         tensorflow-benchmarks-job2
Namespace:    default
Labels:       <none>
Annotations:  <none>
API Version:  kubeflow.org/v2beta1
Kind:         MPIJob
Metadata:
  Creation Timestamp:  2025-03-12T20:54:48Z
  Generation:          1
  Resource Version:    2079
  UID:                 0e11058e-32db-42c3-bdc5-60743a9087a0
Spec:
  Launcher Creation Policy:  WaitForWorkersReady
  Mpi Implementation:        OpenMPI
  Mpi Replica Specs:
    Launcher:
      Replicas:  1
      Template:
        Metadata:
        Spec:
          Containers:
            Command:
              mpirun
              --allow-run-as-root
              -np
              2
              -bind-to
              none
              -map-by
              slot
              -x
              NCCL_DEBUG=INFO
              -x
              LD_LIBRARY_PATH
              -x
              PATH
              -mca
              pml
              ob1
              -mca
              btl
              ^openib
              python
              scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py
              --model=resnet101
              --batch_size=64
              --variable_update=horovod
            Image:  mpioperator/tensorflow-benchmarks:latest
            Name:   tensorflow-benchmarks-job2
            Resources:
    Worker:
      Replicas:  2
      Template:
        Metadata:
        Spec:
          Containers:
            Image:  mpioperator/tensorflow-benchmarks:latest
            Name:   tensorflow-benchmarks-job2
            Resources:
              Limits:
                nvidia.com/gpu:  1
  Run Launcher As Worker:        false
  Run Policy:
    Clean Pod Policy:   Running
    Suspend:            false
  Slots Per Worker:     1
  Ssh Auth Mount Path:  /root/.ssh
Status:
  Conditions:
    Last Transition Time:  2025-03-12T20:54:48Z
    Last Update Time:      2025-03-12T20:54:48Z
    Message:               MPIJob default/tensorflow-benchmarks-job2 is created.
    Reason:                MPIJobCreated
    Status:                True
    Type:                  Created
  Replica Statuses:
    Worker:
  Start Time:  2025-03-12T20:54:48Z
Events:
  Type    Reason         Age   From                Message
  ----    ------         ----  ----                -------
  Normal  MPIJobCreated  109s  mpi-job-controller  MPIJob default/tensorflow-benchmarks-job2 is created.

case2

When an MPIJob is executed exceeding the nominalQuota of the GPU resource, the spec.runPolicy.suspend parameter of the MPIJob changes from false to true. In this case, neither the launcher nor the workers of the MPIJob run.


case3

When an MPIJob is executed exceeding the nominalQuota of the GPU resource, the spec.runPolicy.suspend parameter of the MPIJob does not change from false to true.
In this case, the MPIJob tries to run, but the GPU resource is insufficient, so the workers stay pending. Therefore, if WaitForWorkersReady is working for the MPIJob, the launcher is expected not to run. However, the launcher is running.

# kubectl get mpijob,workload,pod
NAME                                             AGE
mpijob.kubeflow.org/tensorflow-benchmarks-job1   14s
mpijob.kubeflow.org/tensorflow-benchmarks-job2   14s

NAME                                                              QUEUE        RESERVED IN     ADMITTED   FINISHED   AGE
workload.kueue.x-k8s.io/mpijob-tensorflow-benchmarks-job1-11dc3   user-queue   cluster-queue   True                  14s
workload.kueue.x-k8s.io/mpijob-tensorflow-benchmarks-job2-64330   user-queue   cluster-queue   True                  14s

NAME                                            READY   STATUS    RESTARTS   AGE
pod/tensorflow-benchmarks-job1-launcher-mfj4b   1/1     Running   0          13s
pod/tensorflow-benchmarks-job1-worker-0         1/1     Running   0          14s
pod/tensorflow-benchmarks-job1-worker-1         1/1     Running   0          14s
pod/tensorflow-benchmarks-job2-launcher-6b5p6   1/1     Running   0          11s # launcher is running
pod/tensorflow-benchmarks-job2-worker-0         0/1     Pending   0          11s
pod/tensorflow-benchmarks-job2-worker-1         0/1     Pending   0          11s
Details
# kubectl describe mpijob.kubeflow.org/tensorflow-benchmarks-job1
Name:         tensorflow-benchmarks-job1
Namespace:    default
Labels:       kueue.x-k8s.io/queue-name=user-queue
Annotations:  <none>
API Version:  kubeflow.org/v2beta1
Kind:         MPIJob
Metadata:
  Creation Timestamp:  2025-03-12T20:58:36Z
  Generation:          2
  Resource Version:    2768
  UID:                 a6621976-3f98-40d4-8cff-98478fbe603b
Spec:
  Launcher Creation Policy:  WaitForWorkersReady
  Mpi Implementation:        OpenMPI
  Mpi Replica Specs:
    Launcher:
      Replicas:  1
      Template:
        Metadata:
        Spec:
          Containers:
            Command:
              mpirun
              --allow-run-as-root
              -np
              2
              -bind-to
              none
              -map-by
              slot
              -x
              NCCL_DEBUG=INFO
              -x
              LD_LIBRARY_PATH
              -x
              PATH
              -mca
              pml
              ob1
              -mca
              btl
              ^openib
              python
              scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py
              --model=resnet101
              --batch_size=64
              --variable_update=horovod
            Image:  mpioperator/tensorflow-benchmarks:latest
            Name:   tensorflow-benchmarks-job1
            Resources:
    Worker:
      Replicas:  2
      Template:
        Metadata:
        Spec:
          Containers:
            Image:  mpioperator/tensorflow-benchmarks:latest
            Name:   tensorflow-benchmarks-job1
            Resources:
              Limits:
                nvidia.com/gpu:  1
  Run Launcher As Worker:        false
  Run Policy:
    Clean Pod Policy:   Running
    Suspend:            false
  Slots Per Worker:     1
  Ssh Auth Mount Path:  /root/.ssh
Status:
  Conditions:
    Last Transition Time:  2025-03-12T20:58:36Z
    Last Update Time:      2025-03-12T20:58:36Z
    Message:               MPIJob default/tensorflow-benchmarks-job1 is created.
    Reason:                MPIJobCreated
    Status:                True
    Type:                  Created
    Last Transition Time:  2025-03-12T20:58:37Z
    Last Update Time:      2025-03-12T20:58:37Z
    Message:               MPIJob resumed
    Reason:                MPIJobResumed
    Status:                False
    Type:                  Suspended
    Last Transition Time:  2025-03-12T20:58:39Z
    Last Update Time:      2025-03-12T20:58:39Z
    Message:               MPIJob default/tensorflow-benchmarks-job1 is running.
    Reason:                MPIJobRunning
    Status:                True
    Type:                  Running
  Replica Statuses:
    Launcher:
      Active:  1
    Worker:
      Active:  2
  Start Time:  2025-03-12T20:58:37Z
Events:
  Type    Reason           Age                From                                  Message
  ----    ------           ----               ----                                  -------
  Normal  MPIJobCreated    40s (x2 over 40s)  mpi-job-controller                    MPIJob default/tensorflow-benchmarks-job1 is created.
  Normal  MPIJobSuspended  40s (x2 over 40s)  mpi-job-controller                    MPIJob suspended
  Normal  CreatedWorkload  40s                kubeflow.org/mpijob-kueue-controller  Created Workload: default/mpijob-tensorflow-benchmarks-job1-11dc3
  Normal  Started          40s                kubeflow.org/mpijob-kueue-controller  Admitted by clusterQueue cluster-queue
  Normal  MPIJobResumed    39s (x2 over 39s)  mpi-job-controller                    MPIJob resumed
  Normal  MPIJobRunning    36s (x3 over 37s)  mpi-job-controller                    MPIJob default/tensorflow-benchmarks-job1 is running
Details
# kubectl describe mpijob.kubeflow.org/tensorflow-benchmarks-job2
Name:         tensorflow-benchmarks-job2
Namespace:    default
Labels:       kueue.x-k8s.io/queue-name=user-queue
Annotations:  <none>
API Version:  kubeflow.org/v2beta1
Kind:         MPIJob
Metadata:
  Creation Timestamp:  2025-03-12T20:58:36Z
  Generation:          2
  Resource Version:    2810
  UID:                 58b33273-19b3-4e86-9152-fac3415526f7
Spec:
  Launcher Creation Policy:  WaitForWorkersReady
  Mpi Implementation:        OpenMPI
  Mpi Replica Specs:
    Launcher:
      Replicas:  1
      Template:
        Metadata:
        Spec:
          Containers:
            Command:
              mpirun
              --allow-run-as-root
              -np
              2
              -bind-to
              none
              -map-by
              slot
              -x
              NCCL_DEBUG=INFO
              -x
              LD_LIBRARY_PATH
              -x
              PATH
              -mca
              pml
              ob1
              -mca
              btl
              ^openib
              python
              scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py
              --model=resnet101
              --batch_size=64
              --variable_update=horovod
            Image:  mpioperator/tensorflow-benchmarks:latest
            Name:   tensorflow-benchmarks-job2
            Resources:
    Worker:
      Replicas:  2
      Template:
        Metadata:
        Spec:
          Containers:
            Image:  mpioperator/tensorflow-benchmarks:latest
            Name:   tensorflow-benchmarks-job2
            Resources:
              Limits:
                nvidia.com/gpu:  1
  Run Launcher As Worker:        false
  Run Policy:
    Clean Pod Policy:   Running
    Suspend:            false
  Slots Per Worker:     1
  Ssh Auth Mount Path:  /root/.ssh
Status:
  Conditions:
    Last Transition Time:  2025-03-12T20:58:36Z
    Last Update Time:      2025-03-12T20:58:36Z
    Message:               MPIJob default/tensorflow-benchmarks-job2 is created.
    Reason:                MPIJobCreated
    Status:                True
    Type:                  Created
    Last Transition Time:  2025-03-12T20:58:37Z
    Last Update Time:      2025-03-12T20:58:37Z
    Message:               MPIJob default/tensorflow-benchmarks-job2 is suspended.
    Reason:                MPIJobSuspended
    Status:                False
    Type:                  Running
    Last Transition Time:  2025-03-12T20:58:39Z
    Last Update Time:      2025-03-12T20:58:39Z
    Message:               MPIJob resumed
    Reason:                MPIJobResumed
    Status:                False
    Type:                  Suspended
  Replica Statuses:
    Launcher:
      Active:  1
    Worker:
  Start Time:  2025-03-12T20:58:39Z
Events:
  Type    Reason           Age                From                                  Message
  ----    ------           ----               ----                                  -------
  Normal  CreatedWorkload  75s                kubeflow.org/mpijob-kueue-controller  Created Workload: default/mpijob-tensorflow-benchmarks-job2-64330
  Normal  MPIJobCreated    74s (x2 over 75s)  mpi-job-controller                    MPIJob default/tensorflow-benchmarks-job2 is created.
  Normal  MPIJobSuspended  74s (x2 over 74s)  mpi-job-controller                    MPIJob suspended
  Normal  Started          72s                kubeflow.org/mpijob-kueue-controller  Admitted by clusterQueue cluster-queue
  Normal  MPIJobResumed    72s (x2 over 72s)  mpi-job-controller                    MPIJob resumed
Details
# kubectl describe workload.kueue.x-k8s.io/mpijob-tensorflow-benchmarks-job1-11dc3
Name:         mpijob-tensorflow-benchmarks-job1-11dc3
Namespace:    default
Labels:       kueue.x-k8s.io/job-uid=a6621976-3f98-40d4-8cff-98478fbe603b
Annotations:  <none>
API Version:  kueue.x-k8s.io/v1beta1
Kind:         Workload
Metadata:
  Creation Timestamp:  2025-03-12T20:58:36Z
  Finalizers:
    kueue.x-k8s.io/resource-in-use
  Generation:  1
  Owner References:
    API Version:           kubeflow.org/v2beta1
    Block Owner Deletion:  true
    Controller:            true
    Kind:                  MPIJob
    Name:                  tensorflow-benchmarks-job1
    UID:                   a6621976-3f98-40d4-8cff-98478fbe603b
  Resource Version:        2770
  UID:                     cdb1c18a-144d-456f-be18-3073b6bd59b4
Spec:
  Active:  true
  Pod Sets:
    Count:  1
    Name:   launcher
    Template:
      Metadata:
      Spec:
        Containers:
          Command:
            mpirun
            --allow-run-as-root
            -np
            2
            -bind-to
            none
            -map-by
            slot
            -x
            NCCL_DEBUG=INFO
            -x
            LD_LIBRARY_PATH
            -x
            PATH
            -mca
            pml
            ob1
            -mca
            btl
            ^openib
            python
            scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py
            --model=resnet101
            --batch_size=64
            --variable_update=horovod
          Image:  mpioperator/tensorflow-benchmarks:latest
          Name:   tensorflow-benchmarks-job1
          Resources:
    Count:  2
    Name:   worker
    Template:
      Metadata:
      Spec:
        Containers:
          Image:  mpioperator/tensorflow-benchmarks:latest
          Name:   tensorflow-benchmarks-job1
          Resources:
            Limits:
              nvidia.com/gpu:  1
  Priority:                    0
  Priority Class Source:       
  Queue Name:                  user-queue
Status:
  Admission:
    Cluster Queue:  cluster-queue
    Pod Set Assignments:
      Count:  1
      Name:   launcher
      Count:  2
      Flavors:
        nvidia.com/gpu:  default-flavor
      Name:              worker
      Resource Usage:
        nvidia.com/gpu:  2
  Conditions:
    Last Transition Time:  2025-03-12T20:58:36Z
    Message:               Quota reserved in ClusterQueue cluster-queue
    Observed Generation:   1
    Reason:                QuotaReserved
    Status:                True
    Type:                  QuotaReserved
    Last Transition Time:  2025-03-12T20:58:36Z
    Message:               The workload is admitted
    Observed Generation:   1
    Reason:                Admitted
    Status:                True
    Type:                  Admitted
    Last Transition Time:  2025-03-12T20:58:39Z
    Message:               All pods were ready or succeeded since the workload admission
    Observed Generation:   1
    Reason:                PodsReady
    Status:                True
    Type:                  PodsReady
Events:
  Type    Reason         Age   From             Message
  ----    ------         ----  ----             -------
  Normal  QuotaReserved  2m7s  kueue-admission  Quota reserved in ClusterQueue cluster-queue, wait time since queued was 0s
  Normal  Admitted       2m7s  kueue-admission  Admitted by ClusterQueue cluster-queue, wait time since reservation was 0s
Details
# kubectl describe workload.kueue.x-k8s.io/mpijob-tensorflow-benchmarks-job2-64330
Name:         mpijob-tensorflow-benchmarks-job2-64330
Namespace:    default
Labels:       kueue.x-k8s.io/job-uid=58b33273-19b3-4e86-9152-fac3415526f7
Annotations:  <none>
API Version:  kueue.x-k8s.io/v1beta1
Kind:         Workload
Metadata:
  Creation Timestamp:  2025-03-12T20:58:36Z
  Finalizers:
    kueue.x-k8s.io/resource-in-use
  Generation:  1
  Owner References:
    API Version:           kubeflow.org/v2beta1
    Block Owner Deletion:  true
    Controller:            true
    Kind:                  MPIJob
    Name:                  tensorflow-benchmarks-job2
    UID:                   58b33273-19b3-4e86-9152-fac3415526f7
  Resource Version:        2772
  UID:                     8fb74084-927a-4abc-90c8-16efb5c1515d
Spec:
  Active:  true
  Pod Sets:
    Count:  1
    Name:   launcher
    Template:
      Metadata:
      Spec:
        Containers:
          Command:
            mpirun
            --allow-run-as-root
            -np
            2
            -bind-to
            none
            -map-by
            slot
            -x
            NCCL_DEBUG=INFO
            -x
            LD_LIBRARY_PATH
            -x
            PATH
            -mca
            pml
            ob1
            -mca
            btl
            ^openib
            python
            scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py
            --model=resnet101
            --batch_size=64
            --variable_update=horovod
          Image:  mpioperator/tensorflow-benchmarks:latest
          Name:   tensorflow-benchmarks-job2
          Resources:
    Count:  2
    Name:   worker
    Template:
      Metadata:
      Spec:
        Containers:
          Image:  mpioperator/tensorflow-benchmarks:latest
          Name:   tensorflow-benchmarks-job2
          Resources:
            Limits:
              nvidia.com/gpu:  1
  Priority:                    0
  Priority Class Source:       
  Queue Name:                  user-queue
Status:
  Admission:
    Cluster Queue:  cluster-queue
    Pod Set Assignments:
      Count:  1
      Name:   launcher
      Count:  2
      Flavors:
        nvidia.com/gpu:  default-flavor
      Name:              worker
      Resource Usage:
        nvidia.com/gpu:  2
  Conditions:
    Last Transition Time:  2025-03-12T20:58:39Z
    Message:               Quota reserved in ClusterQueue cluster-queue
    Observed Generation:   1
    Reason:                QuotaReserved
    Status:                True
    Type:                  QuotaReserved
    Last Transition Time:  2025-03-12T20:58:36Z
    Message:               Not all pods are ready or succeeded
    Observed Generation:   1
    Reason:                PodsReady
    Status:                False
    Type:                  PodsReady
    Last Transition Time:  2025-03-12T20:58:39Z
    Message:               The workload is admitted
    Observed Generation:   1
    Reason:                Admitted
    Status:                True
    Type:                  Admitted
Events:
  Type    Reason         Age    From             Message
  ----    ------         ----   ----             -------
  Normal  QuotaReserved  2m47s  kueue-admission  Quota reserved in ClusterQueue cluster-queue, wait time since queued was 4s
  Normal  Admitted       2m47s  kueue-admission  Admitted by ClusterQueue cluster-queue, wait time since reservation was 0s


@tenzen-y
Additional information

https://github.com/ttakahashi21/mpi-operator/blob/dev-takahashi/pkg/controller/mpi_job_controller.go#L652-L686
I wrote the debugging code linked above and confirmed the behavior.
When using Kueue, is it expected that the initial value of mpiJob.Spec.RunPolicy.Suspend will be true?

case1

  • The default value of mpiJob.Spec.RunPolicy.Suspend for Job1 and Job2 is false, so `!isMPIJobSuspended(mpiJob)` is true for both jobs, and the workers are created or fetched. Since the launcher has not been created yet, the condition `launcher == nil` is also true for both jobs. For job1, the condition `c.countReadyWorkerPods(worker) == len(worker)` becomes true, so the launcher for job1 is created. For job2, neither `LauncherCreationPolicy == kubeflow.LauncherCreationPolicyAtStartup` nor `c.countReadyWorkerPods(worker) == len(worker)` is true, so the launcher for job2 is not created.
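The decision traced here can be condensed into a single predicate. The sketch below is illustrative only, with plain parameters in place of the controller's real types; `createLauncher` is an invented name, not the controller's function:

```go
package main

import "fmt"

type policy string

const (
	atStartup           policy = "AtStartup"
	waitForWorkersReady policy = "WaitForWorkersReady"
)

// createLauncher reflects the condition walked through above: the launcher
// is created only if it does not exist yet, the MPIJob is not suspended,
// and either the policy is AtStartup or every worker pod is ready.
// Simplified sketch; the names only loosely mirror the controller.
func createLauncher(launcherExists, suspended bool, p policy, readyWorkers, workers int) bool {
	if launcherExists || suspended {
		return false
	}
	return p == atStartup || readyWorkers == workers
}

func main() {
	// job1: both workers ready -> launcher created
	fmt.Println(createLauncher(false, false, waitForWorkersReady, 2, 2))
	// job2: workers pending (0/2 ready) -> launcher not created
	fmt.Println(createLauncher(false, false, waitForWorkersReady, 0, 2))
}
```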

The details of the debug log are as follows:

Details
mpiJob.Name: tensorflow-benchmarks-job1 - mpiJob.Spec.RunPolicy.Suspend : false - takahashi 
mpiJob.Name: tensorflow-benchmarks-job1 - check-!isMPIJobSuspended(mpiJob) : true - takahashi 
mpiJob.Name: tensorflow-benchmarks-job1 - !isMPIJobSuspended(mpiJob): true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - check-launcher - takahashi 
mpiJob.Name: tensorflow-benchmarks-job1 - launcher == nil: true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - mpiJob.Spec.LauncherCreationPolicy : WaitForWorkersReady - takahashi 
mpiJob.Name: tensorflow-benchmarks-job1 - c.countReadyWorkerPods(worker) : 0 - takahashi 
mpiJob.Name: tensorflow-benchmarks-job1 - len(worker) : 2 - takahashi 
mpiJob.Name: tensorflow-benchmarks-job1 - mpiJob.Spec.RunPolicy.Suspend : false - takahashi 
mpiJob.Name: tensorflow-benchmarks-job1 - check-!isMPIJobSuspended(mpiJob) : true - takahashi 
mpiJob.Name: tensorflow-benchmarks-job1 - !isMPIJobSuspended(mpiJob): true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - check-launcher - takahashi 
mpiJob.Name: tensorflow-benchmarks-job1 - launcher == nil: true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - mpiJob.Spec.LauncherCreationPolicy : WaitForWorkersReady - takahashi 
mpiJob.Name: tensorflow-benchmarks-job1 - c.countReadyWorkerPods(worker) : 0 - takahashi 
mpiJob.Name: tensorflow-benchmarks-job1 - len(worker) : 2 - takahashi 
mpiJob.Name: tensorflow-benchmarks-job1 - mpiJob.Spec.RunPolicy.Suspend : false - takahashi 
mpiJob.Name: tensorflow-benchmarks-job1 - check-!isMPIJobSuspended(mpiJob) : true - takahashi 
mpiJob.Name: tensorflow-benchmarks-job1 - !isMPIJobSuspended(mpiJob): true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - check-launcher - takahashi 
mpiJob.Name: tensorflow-benchmarks-job1 - launcher == nil: true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - mpiJob.Spec.LauncherCreationPolicy : WaitForWorkersReady - takahashi 
mpiJob.Name: tensorflow-benchmarks-job1 - c.countReadyWorkerPods(worker) : 0 - takahashi 
mpiJob.Name: tensorflow-benchmarks-job1 - len(worker) : 2 - takahashi 
mpiJob.Name: tensorflow-benchmarks-job1 - mpiJob.Spec.RunPolicy.Suspend : false - takahashi 
mpiJob.Name: tensorflow-benchmarks-job1 - check-!isMPIJobSuspended(mpiJob) : true - takahashi 
mpiJob.Name: tensorflow-benchmarks-job1 - !isMPIJobSuspended(mpiJob): true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - check-launcher - takahashi 
mpiJob.Name: tensorflow-benchmarks-job1 - launcher == nil: true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - mpiJob.Spec.LauncherCreationPolicy : WaitForWorkersReady - takahashi 
mpiJob.Name: tensorflow-benchmarks-job1 - c.countReadyWorkerPods(worker) : 0 - takahashi 
mpiJob.Name: tensorflow-benchmarks-job1 - len(worker) : 2 - takahashi 
mpiJob.Name: tensorflow-benchmarks-job2 - mpiJob.Spec.RunPolicy.Suspend : false - takahashi 
mpiJob.Name: tensorflow-benchmarks-job2 - check-!isMPIJobSuspended(mpiJob) : true - takahashi 
mpiJob.Name: tensorflow-benchmarks-job2 - !isMPIJobSuspended(mpiJob): true - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - check-launcher - takahashi 
mpiJob.Name: tensorflow-benchmarks-job2 - launcher == nil: true - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - mpiJob.Spec.LauncherCreationPolicy : WaitForWorkersReady - takahashi 
mpiJob.Name: tensorflow-benchmarks-job2 - c.countReadyWorkerPods(worker) : 0 - takahashi 
mpiJob.Name: tensorflow-benchmarks-job2 - len(worker) : 2 - takahashi 
mpiJob.Name: tensorflow-benchmarks-job2 - mpiJob.Spec.RunPolicy.Suspend : false - takahashi 
mpiJob.Name: tensorflow-benchmarks-job2 - check-!isMPIJobSuspended(mpiJob) : true - takahashi 
mpiJob.Name: tensorflow-benchmarks-job2 - !isMPIJobSuspended(mpiJob): true - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - check-launcher - takahashi 
mpiJob.Name: tensorflow-benchmarks-job2 - launcher == nil: true - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - mpiJob.Spec.LauncherCreationPolicy : WaitForWorkersReady - takahashi 
mpiJob.Name: tensorflow-benchmarks-job2 - c.countReadyWorkerPods(worker) : 0 - takahashi 
mpiJob.Name: tensorflow-benchmarks-job2 - len(worker) : 2 - takahashi 
mpiJob.Name: tensorflow-benchmarks-job2 - mpiJob.Spec.RunPolicy.Suspend : false - takahashi 
mpiJob.Name: tensorflow-benchmarks-job2 - check-!isMPIJobSuspended(mpiJob) : true - takahashi 
mpiJob.Name: tensorflow-benchmarks-job2 - !isMPIJobSuspended(mpiJob): true - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - check-launcher - takahashi 
mpiJob.Name: tensorflow-benchmarks-job2 - launcher == nil: true - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - mpiJob.Spec.LauncherCreationPolicy : WaitForWorkersReady - takahashi 
mpiJob.Name: tensorflow-benchmarks-job2 - c.countReadyWorkerPods(worker) : 0 - takahashi 
mpiJob.Name: tensorflow-benchmarks-job2 - len(worker) : 2 - takahashi 
mpiJob.Name: tensorflow-benchmarks-job2 - mpiJob.Spec.RunPolicy.Suspend : false - takahashi 
mpiJob.Name: tensorflow-benchmarks-job2 - check-!isMPIJobSuspended(mpiJob) : true - takahashi 
mpiJob.Name: tensorflow-benchmarks-job2 - !isMPIJobSuspended(mpiJob): true - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - check-launcher - takahashi 
mpiJob.Name: tensorflow-benchmarks-job2 - launcher == nil: true - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - mpiJob.Spec.LauncherCreationPolicy : WaitForWorkersReady - takahashi 
mpiJob.Name: tensorflow-benchmarks-job2 - c.countReadyWorkerPods(worker) : 0 - takahashi 
mpiJob.Name: tensorflow-benchmarks-job2 - len(worker) : 2 - takahashi 
mpiJob.Name: tensorflow-benchmarks-job1 - mpiJob.Spec.RunPolicy.Suspend : false - takahashi 
mpiJob.Name: tensorflow-benchmarks-job1 - check-!isMPIJobSuspended(mpiJob) : true - takahashi 
mpiJob.Name: tensorflow-benchmarks-job1 - !isMPIJobSuspended(mpiJob): true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - check-launcher - takahashi 
mpiJob.Name: tensorflow-benchmarks-job1 - launcher == nil: true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - mpiJob.Spec.LauncherCreationPolicy : WaitForWorkersReady - takahashi 
mpiJob.Name: tensorflow-benchmarks-job1 - c.countReadyWorkerPods(worker) : 1 - takahashi 
mpiJob.Name: tensorflow-benchmarks-job1 - len(worker) : 2 - takahashi 
mpiJob.Name: tensorflow-benchmarks-job1 - mpiJob.Spec.RunPolicy.Suspend : false - takahashi 
mpiJob.Name: tensorflow-benchmarks-job1 - check-!isMPIJobSuspended(mpiJob) : true - takahashi 
mpiJob.Name: tensorflow-benchmarks-job1 - !isMPIJobSuspended(mpiJob): true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - check-launcher - takahashi 
mpiJob.Name: tensorflow-benchmarks-job1 - launcher == nil: true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - mpiJob.Spec.LauncherCreationPolicy : WaitForWorkersReady - takahashi 
mpiJob.Name: tensorflow-benchmarks-job1 - c.countReadyWorkerPods(worker) : 1 - takahashi 
mpiJob.Name: tensorflow-benchmarks-job1 - len(worker) : 2 - takahashi 
mpiJob.Name: tensorflow-benchmarks-job1 - mpiJob.Spec.RunPolicy.Suspend : false - takahashi 
mpiJob.Name: tensorflow-benchmarks-job1 - check-!isMPIJobSuspended(mpiJob) : true - takahashi 
mpiJob.Name: tensorflow-benchmarks-job1 - !isMPIJobSuspended(mpiJob): true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - check-launcher - takahashi 
mpiJob.Name: tensorflow-benchmarks-job1 - launcher == nil: true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - mpiJob.Spec.LauncherCreationPolicy : WaitForWorkersReady - takahashi 
mpiJob.Name: tensorflow-benchmarks-job1 - c.countReadyWorkerPods(worker) : 2 - takahashi 
mpiJob.Name: tensorflow-benchmarks-job1 - len(worker) : 2 - takahashi 
mpiJob.Name: tensorflow-benchmarks-job1 - mpiJob.Spec.LauncherCreationPolicy == kubeflow.LauncherCreationPolicyAtStartup || c.countReadyWorkerPods(checkworker) == len(checkworker): true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - deploy launcher - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - mpiJob.Spec.RunPolicy.Suspend : false - takahashi 
mpiJob.Name: tensorflow-benchmarks-job1 - check-!isMPIJobSuspended(mpiJob) : true - takahashi 
mpiJob.Name: tensorflow-benchmarks-job1 - !isMPIJobSuspended(mpiJob): true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - check-launcher - takahashi 
mpiJob.Name: tensorflow-benchmarks-job1 - mpiJob.Spec.RunPolicy.Suspend : false - takahashi 
mpiJob.Name: tensorflow-benchmarks-job1 - check-!isMPIJobSuspended(mpiJob) : true - takahashi 
mpiJob.Name: tensorflow-benchmarks-job1 - !isMPIJobSuspended(mpiJob): true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - check-launcher - takahashi 
mpiJob.Name: tensorflow-benchmarks-job1 - mpiJob.Spec.RunPolicy.Suspend : false - takahashi 
mpiJob.Name: tensorflow-benchmarks-job1 - check-!isMPIJobSuspended(mpiJob) : true - takahashi 
mpiJob.Name: tensorflow-benchmarks-job1 - !isMPIJobSuspended(mpiJob): true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - check-launcher - takahashi 
mpiJob.Name: tensorflow-benchmarks-job1 - mpiJob.Spec.RunPolicy.Suspend : false - takahashi 
mpiJob.Name: tensorflow-benchmarks-job1 - check-!isMPIJobSuspended(mpiJob) : true - takahashi 
mpiJob.Name: tensorflow-benchmarks-job1 - !isMPIJobSuspended(mpiJob): true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - check-launcher - takahashi 
mpiJob.Name: tensorflow-benchmarks-job1 - mpiJob.Spec.RunPolicy.Suspend : false - takahashi 
mpiJob.Name: tensorflow-benchmarks-job1 - check-!isMPIJobSuspended(mpiJob) : true - takahashi 
mpiJob.Name: tensorflow-benchmarks-job1 - !isMPIJobSuspended(mpiJob): true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - check-launcher - takahashi 
mpiJob.Name: tensorflow-benchmarks-job1 - mpiJob.Spec.RunPolicy.Suspend : false - takahashi 
mpiJob.Name: tensorflow-benchmarks-job1 - check-!isMPIJobSuspended(mpiJob) : true - takahashi 
mpiJob.Name: tensorflow-benchmarks-job1 - !isMPIJobSuspended(mpiJob): true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - check-launcher - takahashi 
mpiJob.Name: tensorflow-benchmarks-job1 - mpiJob.Spec.RunPolicy.Suspend : false - takahashi 
mpiJob.Name: tensorflow-benchmarks-job1 - check-!isMPIJobSuspended(mpiJob) : true - takahashi 
mpiJob.Name: tensorflow-benchmarks-job1 - !isMPIJobSuspended(mpiJob): true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - check-launcher - takahashi 

case3

When Job1 and Job2 are created, mpiJob.Spec.RunPolicy.Suspend is initially true, so !isMPIJobSuspended(mpiJob) is false for both jobs and the controller neither creates nor fetches their worker pods. Since the launcher has not been created yet, launcher == nil is true for Job1 and Job2. However, because the workers were never fetched, c.countReadyWorkerPods(worker) and len(worker) are both 0, so the WaitForWorkersReady condition c.countReadyWorkerPods(worker) == len(worker) evaluates to true (0 == 0) and the launcher is created for both jobs even though they are still suspended.

The details of the debug log are as follows:

Details
mpiJob.Name: tensorflow-benchmarks-job1 - mpiJob.Spec.RunPolicy.Suspend : true - takahashi 
mpiJob.Name: tensorflow-benchmarks-job1 - check-!isMPIJobSuspended(mpiJob) : false - takahashi 
mpiJob.Name: tensorflow-benchmarks-job1 - check-launcher - takahashi 
mpiJob.Name: tensorflow-benchmarks-job1 - launcher == nil: true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - mpiJob.Spec.LauncherCreationPolicy : WaitForWorkersReady - takahashi 
mpiJob.Name: tensorflow-benchmarks-job1 - c.countReadyWorkerPods(worker) : 0 - takahashi 
mpiJob.Name: tensorflow-benchmarks-job1 - len(worker) : 0 - takahashi 
mpiJob.Name: tensorflow-benchmarks-job1 - mpiJob.Spec.LauncherCreationPolicy == kubeflow.LauncherCreationPolicyAtStartup || c.countReadyWorkerPods(checkworker) == len(checkworker): true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - deploy launcher - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - mpiJob.Spec.RunPolicy.Suspend : false - takahashi 
mpiJob.Name: tensorflow-benchmarks-job1 - check-!isMPIJobSuspended(mpiJob) : true - takahashi 
mpiJob.Name: tensorflow-benchmarks-job1 - !isMPIJobSuspended(mpiJob): true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - check-launcher - takahashi 
mpiJob.Name: tensorflow-benchmarks-job1 - mpiJob.Spec.RunPolicy.Suspend : false - takahashi 
mpiJob.Name: tensorflow-benchmarks-job1 - check-!isMPIJobSuspended(mpiJob) : true - takahashi 
mpiJob.Name: tensorflow-benchmarks-job1 - !isMPIJobSuspended(mpiJob): true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - check-launcher - takahashi 
mpiJob.Name: tensorflow-benchmarks-job1 - mpiJob.Spec.RunPolicy.Suspend : false - takahashi 
mpiJob.Name: tensorflow-benchmarks-job1 - check-!isMPIJobSuspended(mpiJob) : true - takahashi 
mpiJob.Name: tensorflow-benchmarks-job1 - !isMPIJobSuspended(mpiJob): true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - check-launcher - takahashi 
mpiJob.Name: tensorflow-benchmarks-job1 - mpiJob.Spec.RunPolicy.Suspend : false - takahashi 
mpiJob.Name: tensorflow-benchmarks-job1 - check-!isMPIJobSuspended(mpiJob) : true - takahashi 
mpiJob.Name: tensorflow-benchmarks-job1 - !isMPIJobSuspended(mpiJob): true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - check-launcher - takahashi 
mpiJob.Name: tensorflow-benchmarks-job2 - mpiJob.Spec.RunPolicy.Suspend : true - takahashi 
mpiJob.Name: tensorflow-benchmarks-job2 - check-!isMPIJobSuspended(mpiJob) : false - takahashi 
mpiJob.Name: tensorflow-benchmarks-job2 - check-launcher - takahashi 
mpiJob.Name: tensorflow-benchmarks-job2 - launcher == nil: true - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - mpiJob.Spec.LauncherCreationPolicy : WaitForWorkersReady - takahashi 
mpiJob.Name: tensorflow-benchmarks-job2 - c.countReadyWorkerPods(worker) : 0 - takahashi 
mpiJob.Name: tensorflow-benchmarks-job2 - len(worker) : 0 - takahashi 
mpiJob.Name: tensorflow-benchmarks-job2 - mpiJob.Spec.LauncherCreationPolicy == kubeflow.LauncherCreationPolicyAtStartup || c.countReadyWorkerPods(checkworker) == len(checkworker): true - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - deploy launcher - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - mpiJob.Spec.RunPolicy.Suspend : true - takahashi 
mpiJob.Name: tensorflow-benchmarks-job2 - check-!isMPIJobSuspended(mpiJob) : false - takahashi 
mpiJob.Name: tensorflow-benchmarks-job2 - check-launcher - takahashi 
mpiJob.Name: tensorflow-benchmarks-job2 - mpiJob.Spec.RunPolicy.Suspend : true - takahashi 
mpiJob.Name: tensorflow-benchmarks-job2 - check-!isMPIJobSuspended(mpiJob) : false - takahashi 
mpiJob.Name: tensorflow-benchmarks-job2 - check-launcher - takahashi 
mpiJob.Name: tensorflow-benchmarks-job2 - mpiJob.Spec.RunPolicy.Suspend : true - takahashi 
mpiJob.Name: tensorflow-benchmarks-job2 - check-!isMPIJobSuspended(mpiJob) : false - takahashi 
mpiJob.Name: tensorflow-benchmarks-job2 - check-launcher - takahashi 
mpiJob.Name: tensorflow-benchmarks-job1 - mpiJob.Spec.RunPolicy.Suspend : false - takahashi 
mpiJob.Name: tensorflow-benchmarks-job1 - check-!isMPIJobSuspended(mpiJob) : true - takahashi 
mpiJob.Name: tensorflow-benchmarks-job1 - !isMPIJobSuspended(mpiJob): true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - check-launcher - takahashi 
mpiJob.Name: tensorflow-benchmarks-job1 - mpiJob.Spec.RunPolicy.Suspend : false - takahashi 
mpiJob.Name: tensorflow-benchmarks-job1 - check-!isMPIJobSuspended(mpiJob) : true - takahashi 
mpiJob.Name: tensorflow-benchmarks-job1 - !isMPIJobSuspended(mpiJob): true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - check-launcher - takahashi 
mpiJob.Name: tensorflow-benchmarks-job1 - mpiJob.Spec.RunPolicy.Suspend : false - takahashi 
mpiJob.Name: tensorflow-benchmarks-job1 - check-!isMPIJobSuspended(mpiJob) : true - takahashi 
mpiJob.Name: tensorflow-benchmarks-job1 - !isMPIJobSuspended(mpiJob): true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - check-launcher - takahashi 
mpiJob.Name: tensorflow-benchmarks-job1 - mpiJob.Spec.RunPolicy.Suspend : false - takahashi 
mpiJob.Name: tensorflow-benchmarks-job1 - check-!isMPIJobSuspended(mpiJob) : true - takahashi 
mpiJob.Name: tensorflow-benchmarks-job1 - !isMPIJobSuspended(mpiJob): true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - check-launcher - takahashi 
mpiJob.Name: tensorflow-benchmarks-job1 - mpiJob.Spec.RunPolicy.Suspend : false - takahashi 
mpiJob.Name: tensorflow-benchmarks-job1 - check-!isMPIJobSuspended(mpiJob) : true - takahashi 
mpiJob.Name: tensorflow-benchmarks-job1 - !isMPIJobSuspended(mpiJob): true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - check-launcher - takahashi 
mpiJob.Name: tensorflow-benchmarks-job2 - mpiJob.Spec.RunPolicy.Suspend : false - takahashi 
mpiJob.Name: tensorflow-benchmarks-job2 - check-!isMPIJobSuspended(mpiJob) : true - takahashi 
mpiJob.Name: tensorflow-benchmarks-job2 - !isMPIJobSuspended(mpiJob): true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - mpiJob.Spec.RunPolicy.Suspend : false - takahashi 
mpiJob.Name: tensorflow-benchmarks-job1 - check-!isMPIJobSuspended(mpiJob) : true - takahashi 
mpiJob.Name: tensorflow-benchmarks-job1 - !isMPIJobSuspended(mpiJob): true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - check-launcher - takahashi 
mpiJob.Name: tensorflow-benchmarks-job1 - mpiJob.Spec.RunPolicy.Suspend : false - takahashi 
mpiJob.Name: tensorflow-benchmarks-job1 - check-!isMPIJobSuspended(mpiJob) : true - takahashi 
mpiJob.Name: tensorflow-benchmarks-job1 - !isMPIJobSuspended(mpiJob): true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - check-launcher - takahashi 
mpiJob.Name: tensorflow-benchmarks-job2 - check-launcher - takahashi 
mpiJob.Name: tensorflow-benchmarks-job1 - mpiJob.Spec.RunPolicy.Suspend : false - takahashi 
mpiJob.Name: tensorflow-benchmarks-job1 - check-!isMPIJobSuspended(mpiJob) : true - takahashi 
mpiJob.Name: tensorflow-benchmarks-job1 - !isMPIJobSuspended(mpiJob): true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - check-launcher - takahashi 
mpiJob.Name: tensorflow-benchmarks-job2 - mpiJob.Spec.RunPolicy.Suspend : false - takahashi 
mpiJob.Name: tensorflow-benchmarks-job2 - check-!isMPIJobSuspended(mpiJob) : true - takahashi 
mpiJob.Name: tensorflow-benchmarks-job2 - !isMPIJobSuspended(mpiJob): true - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - check-launcher - takahashi 
mpiJob.Name: tensorflow-benchmarks-job2 - mpiJob.Spec.RunPolicy.Suspend : false - takahashi 
mpiJob.Name: tensorflow-benchmarks-job2 - check-!isMPIJobSuspended(mpiJob) : true - takahashi 
mpiJob.Name: tensorflow-benchmarks-job2 - !isMPIJobSuspended(mpiJob): true - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - check-launcher - takahashi 
mpiJob.Name: tensorflow-benchmarks-job2 - mpiJob.Spec.RunPolicy.Suspend : false - takahashi 
mpiJob.Name: tensorflow-benchmarks-job2 - check-!isMPIJobSuspended(mpiJob) : true - takahashi 
mpiJob.Name: tensorflow-benchmarks-job2 - !isMPIJobSuspended(mpiJob): true - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - check-launcher - takahashi 
mpiJob.Name: tensorflow-benchmarks-job2 - mpiJob.Spec.RunPolicy.Suspend : false - takahashi 
mpiJob.Name: tensorflow-benchmarks-job2 - check-!isMPIJobSuspended(mpiJob) : true - takahashi 
mpiJob.Name: tensorflow-benchmarks-job2 - !isMPIJobSuspended(mpiJob): true - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - check-launcher - takahashi 
mpiJob.Name: tensorflow-benchmarks-job2 - mpiJob.Spec.RunPolicy.Suspend : false - takahashi 
mpiJob.Name: tensorflow-benchmarks-job2 - check-!isMPIJobSuspended(mpiJob) : true - takahashi 
mpiJob.Name: tensorflow-benchmarks-job2 - !isMPIJobSuspended(mpiJob): true - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - check-launcher - takahashi 
mpiJob.Name: tensorflow-benchmarks-job2 - mpiJob.Spec.RunPolicy.Suspend : false - takahashi 
mpiJob.Name: tensorflow-benchmarks-job2 - check-!isMPIJobSuspended(mpiJob) : true - takahashi 
mpiJob.Name: tensorflow-benchmarks-job2 - !isMPIJobSuspended(mpiJob): true - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - check-launcher - takahashi 
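The failure mode shown in the logs above, the WaitForWorkersReady check passing vacuously when the job starts suspended, can be sketched as follows. This is a simplified illustration, not the actual mpi-operator code: Pod, countReadyWorkerPods, and shouldCreateLauncher are hypothetical stand-ins for the controller's real types and helpers, and the suspended guard mirrors the check this PR proposes.

```go
package main

import "fmt"

// Pod is a stand-in for corev1.Pod; Ready marks a worker whose
// containers have all passed their readiness checks.
type Pod struct{ Ready bool }

// countReadyWorkerPods mimics the controller helper of the same name:
// it counts the workers that are currently Ready.
func countReadyWorkerPods(workers []Pod) int {
	n := 0
	for _, p := range workers {
		if p.Ready {
			n++
		}
	}
	return n
}

// shouldCreateLauncher is a hypothetical reduction of the check around
// lines 661-663: with LauncherCreationPolicy=WaitForWorkersReady the
// launcher is created once every worker is ready. When the MPIJob
// starts suspended, the worker slice was never populated, so the
// comparison degenerates to 0 == 0 and passes vacuously.
func shouldCreateLauncher(suspended bool, workers []Pod) bool {
	// Guard proposed by this PR: never create the launcher while suspended.
	if suspended {
		return false
	}
	return countReadyWorkerPods(workers) == len(workers)
}

func main() {
	// Without the guard, an empty worker list satisfies the readiness check.
	var none []Pod
	fmt.Println(countReadyWorkerPods(none) == len(none)) // true: vacuous 0 == 0
	// With the guard, a suspended job no longer creates the launcher.
	fmt.Println(shouldCreateLauncher(true, none)) // false
}
```

This is why both jobs in case3 deploy the launcher despite Suspend being true: the suspended branch skips worker creation, so the ready-count comparison is satisfied trivially.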

In my opinion, #617 modification is necessary.


We fixed the propagation problem in #772.


mimowo commented Nov 4, 2024

@GonzaloSaez please fix the DCO

@GonzaloSaez GonzaloSaez force-pushed the g/fix_job_launch_wait_for_pods_ready branch from 802bd49 to 804cfd8 Compare November 5, 2024 06:52

mimowo commented Nov 8, 2024

For context, linking it back to the related Kueue issue: kubernetes-sigs/kueue#3400 and the slack discussion https://kubernetes.slack.com/archives/C032ZE66A2X/p1730369507818399

@GonzaloSaez GonzaloSaez force-pushed the g/fix_job_launch_wait_for_pods_ready branch from 804cfd8 to c1ea13d Compare January 10, 2025 22:58
Signed-off-by: GonzaloSaez <11050889+GonzaloSaez@users.noreply.github.com>