Do not create the launcher job if the job starts suspended #670
GonzaloSaez wants to merge 1 commit into kubeflow:master
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files. Approvers can indicate their approval by writing `/approve` in a comment.
// If the job is suspended, the list of worker pods will be incorrect. We also do
// not want to start the launcher job if the MPIJob starts suspended.
if launcher == nil && !isMPIJobSuspended(mpiJob) {
This opens the question of what should be done when unsuspending the launcher job in case Kueue has decided to change the NodeSelector. Should we instead recreate the job, since NodeSelector is immutable?
Yeah, I think this question does not have a straightforward answer. I can see at least two possible approaches:
- as in JobSet: update the NodeSelector field in the pod template when resuming the Job
- recreate the launcher/worker Jobs; this can probably be achieved easily by deleting the jobs when the MPIJob is suspended

I'm OK with whichever of those is simpler to implement (see the sketches below). Any opinion @tenzen-y?
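For illustration only (not code from this PR), here are minimal sketches of both options, assuming a standard client-go setup; the function names and the `nodeSelector` argument are hypothetical:

```go
package main

import (
	"context"

	batchv1 "k8s.io/api/batch/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/utils/ptr"
)

// Option (a), JobSet-style: while the launcher Job is still suspended,
// batch/v1 allows mutating the pod template's node scheduling fields,
// so we can apply the NodeSelector injected by Kueue and only then flip
// Suspend back to false.
func resumeWithNodeSelector(ctx context.Context, client kubernetes.Interface, launcher *batchv1.Job, nodeSelector map[string]string) error {
	launcher.Spec.Template.Spec.NodeSelector = nodeSelector
	launcher.Spec.Suspend = ptr.To(false)
	_, err := client.BatchV1().Jobs(launcher.Namespace).Update(ctx, launcher, metav1.UpdateOptions{})
	return err
}

// Option (b): delete the launcher Job when the MPIJob is suspended, so
// the next reconcile after resume recreates it from scratch with the
// up-to-date NodeSelector.
func deleteLauncherOnSuspend(ctx context.Context, client kubernetes.Interface, launcher *batchv1.Job) error {
	policy := metav1.DeletePropagationBackground // also clean up the launcher pods
	return client.BatchV1().Jobs(launcher.Namespace).Delete(ctx, launcher.Name, metav1.DeleteOptions{PropagationPolicy: &policy})
}
```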
In any case I think it is safe to decouple the fixes.
Sorry for the delayed response. IMO, I would select the JobSet approach (update the NodeSelector field in the pod template when resuming the Job) instead of the current solution, because I want to align with JobSet's behavior: a suspended JobSet still creates its Jobs, in the suspended state.
That's already being done in theory by Kueue with RunWithPodSetsInfo, right? I think the issue here is that we create the job in non-suspended mode even if the MPIJob is suspended. This results in pods being scheduled on nodes and then removed because the controller suspends the job later on.
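A minimal sketch of that JobSet-like alternative, for illustration (the function name is hypothetical and `launcher` stands for the Job object the controller would otherwise build): mirror the MPIJob's suspend state when creating the launcher Job, so it never runs ahead of the MPIJob:

```go
package main

import (
	"context"

	batchv1 "k8s.io/api/batch/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/utils/ptr"
)

// createLauncherMirroringSuspend creates the launcher Job with Suspend
// copied from the MPIJob's spec.runPolicy.suspend, so a suspended MPIJob
// never produces a schedulable launcher pod that must be evicted later.
func createLauncherMirroringSuspend(ctx context.Context, client kubernetes.Interface, launcher *batchv1.Job, mpiJobSuspended bool) (*batchv1.Job, error) {
	launcher.Spec.Suspend = ptr.To(mpiJobSuspended)
	return client.BatchV1().Jobs(launcher.Namespace).Create(ctx, launcher, metav1.CreateOptions{})
}
```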
If you want to guarantee that the Workers are scheduled first, we can use launcherCreationPolicy:
Thank you for your comment. I've tested with launcherCreationPolicy set to WaitForWorkersReady, but it doesn't work as expected when Kueue is enabled and the nominalQuota is greater than the actually allocatable GPUs (case 3 below).
| # | Kueue | Condition of GPUs | launcherCreationPolicy (MPIJob) | Behavior |
|---|---|---|---|---|
| 1 | disabled | - | WaitForWorkersReady | Works as expected |
| 2 | enabled | NominalQuota == Allocatable | WaitForWorkersReady | Works as expected |
| 3 | enabled | NominalQuota > Allocatable | WaitForWorkersReady | Does not work as expected |
Note that the result I previously shared was tested with the conditions below, as you pointed out.

| # | Kueue | Condition of GPUs | launcherCreationPolicy (MPIJob) | Behavior |
|---|---|---|---|---|
| 0 | enabled | NominalQuota > Allocatable | LauncherCreationPolicyAtStartup (default) | Works as expected |
case1
When MPIJobs are executed exceeding the allocatable GPU resources, the GPU resources are insufficient, so the workers stay Pending. Therefore, if WaitForWorkersReady is working, the launcher is expected not to run. In this case, the launcher does not run.
- The WaitForWorkersReady feature works fine when Kueue is not used.
# kubectl get mpijob,workload,pod
NAME AGE
mpijob.kubeflow.org/tensorflow-benchmarks-job1 12s
mpijob.kubeflow.org/tensorflow-benchmarks-job2 12s
NAME READY STATUS RESTARTS AGE
pod/tensorflow-benchmarks-job1-launcher-k9r6m 1/1 Running 0 10s
pod/tensorflow-benchmarks-job1-worker-0 1/1 Running 0 12s
pod/tensorflow-benchmarks-job1-worker-1 1/1 Running 0 12s
pod/tensorflow-benchmarks-job2-worker-0 0/1 Pending 0 12s
pod/tensorflow-benchmarks-job2-worker-1 0/1 Pending 0 12s
# kubectl describe mpijob.kubeflow.org/tensorflow-benchmarks-job1
Name: tensorflow-benchmarks-job1
Namespace: default
Labels: <none>
Annotations: <none>
API Version: kubeflow.org/v2beta1
Kind: MPIJob
Metadata:
Creation Timestamp: 2025-03-12T20:54:48Z
Generation: 1
Resource Version: 2126
UID: 54b7c0a2-2bb7-40b4-9cb3-92991bf15c5a
Spec:
Launcher Creation Policy: WaitForWorkersReady
Mpi Implementation: OpenMPI
Mpi Replica Specs:
Launcher:
Replicas: 1
Template:
Metadata:
Spec:
Containers:
Command:
mpirun
--allow-run-as-root
-np
2
-bind-to
none
-map-by
slot
-x
NCCL_DEBUG=INFO
-x
LD_LIBRARY_PATH
-x
PATH
-mca
pml
ob1
-mca
btl
^openib
python
scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py
--model=resnet101
--batch_size=64
--variable_update=horovod
Image: mpioperator/tensorflow-benchmarks:latest
Name: tensorflow-benchmarks-job1
Resources:
Worker:
Replicas: 2
Template:
Metadata:
Spec:
Containers:
Image: mpioperator/tensorflow-benchmarks:latest
Name: tensorflow-benchmarks-job1
Resources:
Limits:
nvidia.com/gpu: 1
Run Launcher As Worker: false
Run Policy:
Clean Pod Policy: Running
Suspend: false
Slots Per Worker: 1
Ssh Auth Mount Path: /root/.ssh
Status:
Conditions:
Last Transition Time: 2025-03-12T20:54:48Z
Last Update Time: 2025-03-12T20:54:48Z
Message: MPIJob default/tensorflow-benchmarks-job1 is created.
Reason: MPIJobCreated
Status: True
Type: Created
Last Transition Time: 2025-03-12T20:54:52Z
Last Update Time: 2025-03-12T20:54:52Z
Message: MPIJob default/tensorflow-benchmarks-job1 is running.
Reason: MPIJobRunning
Status: True
Type: Running
Replica Statuses:
Launcher:
Active: 1
Worker:
Active: 2
Start Time: 2025-03-12T20:54:48Z
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal MPIJobCreated 41s mpi-job-controller MPIJob default/tensorflow-benchmarks-job1 is created.
Normal MPIJobRunning 36s (x3 over 37s) mpi-job-controller MPIJob default/tensorflow-benchmarks-job1 is running
# kubectl describe mpijob.kubeflow.org/tensorflow-benchmarks-job2
Name: tensorflow-benchmarks-job2
Namespace: default
Labels: <none>
Annotations: <none>
API Version: kubeflow.org/v2beta1
Kind: MPIJob
Metadata:
Creation Timestamp: 2025-03-12T20:54:48Z
Generation: 1
Resource Version: 2079
UID: 0e11058e-32db-42c3-bdc5-60743a9087a0
Spec:
Launcher Creation Policy: WaitForWorkersReady
Mpi Implementation: OpenMPI
Mpi Replica Specs:
Launcher:
Replicas: 1
Template:
Metadata:
Spec:
Containers:
Command:
mpirun
--allow-run-as-root
-np
2
-bind-to
none
-map-by
slot
-x
NCCL_DEBUG=INFO
-x
LD_LIBRARY_PATH
-x
PATH
-mca
pml
ob1
-mca
btl
^openib
python
scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py
--model=resnet101
--batch_size=64
--variable_update=horovod
Image: mpioperator/tensorflow-benchmarks:latest
Name: tensorflow-benchmarks-job2
Resources:
Worker:
Replicas: 2
Template:
Metadata:
Spec:
Containers:
Image: mpioperator/tensorflow-benchmarks:latest
Name: tensorflow-benchmarks-job2
Resources:
Limits:
nvidia.com/gpu: 1
Run Launcher As Worker: false
Run Policy:
Clean Pod Policy: Running
Suspend: false
Slots Per Worker: 1
Ssh Auth Mount Path: /root/.ssh
Status:
Conditions:
Last Transition Time: 2025-03-12T20:54:48Z
Last Update Time: 2025-03-12T20:54:48Z
Message: MPIJob default/tensorflow-benchmarks-job2 is created.
Reason: MPIJobCreated
Status: True
Type: Created
Replica Statuses:
Worker:
Start Time: 2025-03-12T20:54:48Z
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal MPIJobCreated 109s mpi-job-controller MPIJob default/tensorflow-benchmarks-job2 is created.
case2
When an MPIJob is executed exceeding the nominalQuota of the GPU resource, the spec.runPolicy.suspend parameter of the MPIJob changes from false to true. In this case, neither the launcher nor the workers of the MPIJob run.
case3
When an MPIJob is executed exceeding the nominalQuota of the GPU resource, the spec.runPolicy.suspend parameter of the MPIJob does not change from false to true.
In this case, the MPIJob tries to run, but the GPU resources are insufficient, so the workers stay Pending. Therefore, if WaitForWorkersReady is working, the launcher is expected not to run. However, the launcher is running.
# kubectl get mpijob,workload,pod
NAME AGE
mpijob.kubeflow.org/tensorflow-benchmarks-job1 14s
mpijob.kubeflow.org/tensorflow-benchmarks-job2 14s
NAME QUEUE RESERVED IN ADMITTED FINISHED AGE
workload.kueue.x-k8s.io/mpijob-tensorflow-benchmarks-job1-11dc3 user-queue cluster-queue True 14s
workload.kueue.x-k8s.io/mpijob-tensorflow-benchmarks-job2-64330 user-queue cluster-queue True 14s
NAME READY STATUS RESTARTS AGE
pod/tensorflow-benchmarks-job1-launcher-mfj4b 1/1 Running 0 13s
pod/tensorflow-benchmarks-job1-worker-0 1/1 Running 0 14s
pod/tensorflow-benchmarks-job1-worker-1 1/1 Running 0 14s
pod/tensorflow-benchmarks-job2-launcher-6b5p6 1/1 Running 0 11s   # launcher is running
pod/tensorflow-benchmarks-job2-worker-0 0/1 Pending 0 11s
pod/tensorflow-benchmarks-job2-worker-1 0/1 Pending 0 11s
# kubectl describe mpijob.kubeflow.org/tensorflow-benchmarks-job1
Name: tensorflow-benchmarks-job1
Namespace: default
Labels: kueue.x-k8s.io/queue-name=user-queue
Annotations: <none>
API Version: kubeflow.org/v2beta1
Kind: MPIJob
Metadata:
Creation Timestamp: 2025-03-12T20:58:36Z
Generation: 2
Resource Version: 2768
UID: a6621976-3f98-40d4-8cff-98478fbe603b
Spec:
Launcher Creation Policy: WaitForWorkersReady
Mpi Implementation: OpenMPI
Mpi Replica Specs:
Launcher:
Replicas: 1
Template:
Metadata:
Spec:
Containers:
Command:
mpirun
--allow-run-as-root
-np
2
-bind-to
none
-map-by
slot
-x
NCCL_DEBUG=INFO
-x
LD_LIBRARY_PATH
-x
PATH
-mca
pml
ob1
-mca
btl
^openib
python
scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py
--model=resnet101
--batch_size=64
--variable_update=horovod
Image: mpioperator/tensorflow-benchmarks:latest
Name: tensorflow-benchmarks-job1
Resources:
Worker:
Replicas: 2
Template:
Metadata:
Spec:
Containers:
Image: mpioperator/tensorflow-benchmarks:latest
Name: tensorflow-benchmarks-job1
Resources:
Limits:
nvidia.com/gpu: 1
Run Launcher As Worker: false
Run Policy:
Clean Pod Policy: Running
Suspend: false
Slots Per Worker: 1
Ssh Auth Mount Path: /root/.ssh
Status:
Conditions:
Last Transition Time: 2025-03-12T20:58:36Z
Last Update Time: 2025-03-12T20:58:36Z
Message: MPIJob default/tensorflow-benchmarks-job1 is created.
Reason: MPIJobCreated
Status: True
Type: Created
Last Transition Time: 2025-03-12T20:58:37Z
Last Update Time: 2025-03-12T20:58:37Z
Message: MPIJob resumed
Reason: MPIJobResumed
Status: False
Type: Suspended
Last Transition Time: 2025-03-12T20:58:39Z
Last Update Time: 2025-03-12T20:58:39Z
Message: MPIJob default/tensorflow-benchmarks-job1 is running.
Reason: MPIJobRunning
Status: True
Type: Running
Replica Statuses:
Launcher:
Active: 1
Worker:
Active: 2
Start Time: 2025-03-12T20:58:37Z
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal MPIJobCreated 40s (x2 over 40s) mpi-job-controller MPIJob default/tensorflow-benchmarks-job1 is created.
Normal MPIJobSuspended 40s (x2 over 40s) mpi-job-controller MPIJob suspended
Normal CreatedWorkload 40s kubeflow.org/mpijob-kueue-controller Created Workload: default/mpijob-tensorflow-benchmarks-job1-11dc3
Normal Started 40s kubeflow.org/mpijob-kueue-controller Admitted by clusterQueue cluster-queue
Normal MPIJobResumed 39s (x2 over 39s) mpi-job-controller MPIJob resumed
Normal MPIJobRunning 36s (x3 over 37s) mpi-job-controller MPIJob default/tensorflow-benchmarks-job1 is running
# kubectl describe mpijob.kubeflow.org/tensorflow-benchmarks-job2
Name: tensorflow-benchmarks-job2
Namespace: default
Labels: kueue.x-k8s.io/queue-name=user-queue
Annotations: <none>
API Version: kubeflow.org/v2beta1
Kind: MPIJob
Metadata:
Creation Timestamp: 2025-03-12T20:58:36Z
Generation: 2
Resource Version: 2810
UID: 58b33273-19b3-4e86-9152-fac3415526f7
Spec:
Launcher Creation Policy: WaitForWorkersReady
Mpi Implementation: OpenMPI
Mpi Replica Specs:
Launcher:
Replicas: 1
Template:
Metadata:
Spec:
Containers:
Command:
mpirun
--allow-run-as-root
-np
2
-bind-to
none
-map-by
slot
-x
NCCL_DEBUG=INFO
-x
LD_LIBRARY_PATH
-x
PATH
-mca
pml
ob1
-mca
btl
^openib
python
scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py
--model=resnet101
--batch_size=64
--variable_update=horovod
Image: mpioperator/tensorflow-benchmarks:latest
Name: tensorflow-benchmarks-job2
Resources:
Worker:
Replicas: 2
Template:
Metadata:
Spec:
Containers:
Image: mpioperator/tensorflow-benchmarks:latest
Name: tensorflow-benchmarks-job2
Resources:
Limits:
nvidia.com/gpu: 1
Run Launcher As Worker: false
Run Policy:
Clean Pod Policy: Running
Suspend: false
Slots Per Worker: 1
Ssh Auth Mount Path: /root/.ssh
Status:
Conditions:
Last Transition Time: 2025-03-12T20:58:36Z
Last Update Time: 2025-03-12T20:58:36Z
Message: MPIJob default/tensorflow-benchmarks-job2 is created.
Reason: MPIJobCreated
Status: True
Type: Created
Last Transition Time: 2025-03-12T20:58:37Z
Last Update Time: 2025-03-12T20:58:37Z
Message: MPIJob default/tensorflow-benchmarks-job2 is suspended.
Reason: MPIJobSuspended
Status: False
Type: Running
Last Transition Time: 2025-03-12T20:58:39Z
Last Update Time: 2025-03-12T20:58:39Z
Message: MPIJob resumed
Reason: MPIJobResumed
Status: False
Type: Suspended
Replica Statuses:
Launcher:
Active: 1
Worker:
Start Time: 2025-03-12T20:58:39Z
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal CreatedWorkload 75s kubeflow.org/mpijob-kueue-controller Created Workload: default/mpijob-tensorflow-benchmarks-job2-64330
Normal MPIJobCreated 74s (x2 over 75s) mpi-job-controller MPIJob default/tensorflow-benchmarks-job2 is created.
Normal MPIJobSuspended 74s (x2 over 74s) mpi-job-controller MPIJob suspended
Normal Started 72s kubeflow.org/mpijob-kueue-controller Admitted by clusterQueue cluster-queue
Normal MPIJobResumed 72s (x2 over 72s) mpi-job-controller MPIJob resumed
# kubectl describe workload.kueue.x-k8s.io/mpijob-tensorflow-benchmarks-job1-11dc3
Name: mpijob-tensorflow-benchmarks-job1-11dc3
Namespace: default
Labels: kueue.x-k8s.io/job-uid=a6621976-3f98-40d4-8cff-98478fbe603b
Annotations: <none>
API Version: kueue.x-k8s.io/v1beta1
Kind: Workload
Metadata:
Creation Timestamp: 2025-03-12T20:58:36Z
Finalizers:
kueue.x-k8s.io/resource-in-use
Generation: 1
Owner References:
API Version: kubeflow.org/v2beta1
Block Owner Deletion: true
Controller: true
Kind: MPIJob
Name: tensorflow-benchmarks-job1
UID: a6621976-3f98-40d4-8cff-98478fbe603b
Resource Version: 2770
UID: cdb1c18a-144d-456f-be18-3073b6bd59b4
Spec:
Active: true
Pod Sets:
Count: 1
Name: launcher
Template:
Metadata:
Spec:
Containers:
Command:
mpirun
--allow-run-as-root
-np
2
-bind-to
none
-map-by
slot
-x
NCCL_DEBUG=INFO
-x
LD_LIBRARY_PATH
-x
PATH
-mca
pml
ob1
-mca
btl
^openib
python
scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py
--model=resnet101
--batch_size=64
--variable_update=horovod
Image: mpioperator/tensorflow-benchmarks:latest
Name: tensorflow-benchmarks-job1
Resources:
Count: 2
Name: worker
Template:
Metadata:
Spec:
Containers:
Image: mpioperator/tensorflow-benchmarks:latest
Name: tensorflow-benchmarks-job1
Resources:
Limits:
nvidia.com/gpu: 1
Priority: 0
Priority Class Source:
Queue Name: user-queue
Status:
Admission:
Cluster Queue: cluster-queue
Pod Set Assignments:
Count: 1
Name: launcher
Count: 2
Flavors:
nvidia.com/gpu: default-flavor
Name: worker
Resource Usage:
nvidia.com/gpu: 2
Conditions:
Last Transition Time: 2025-03-12T20:58:36Z
Message: Quota reserved in ClusterQueue cluster-queue
Observed Generation: 1
Reason: QuotaReserved
Status: True
Type: QuotaReserved
Last Transition Time: 2025-03-12T20:58:36Z
Message: The workload is admitted
Observed Generation: 1
Reason: Admitted
Status: True
Type: Admitted
Last Transition Time: 2025-03-12T20:58:39Z
Message: All pods were ready or succeeded since the workload admission
Observed Generation: 1
Reason: PodsReady
Status: True
Type: PodsReady
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal QuotaReserved 2m7s kueue-admission Quota reserved in ClusterQueue cluster-queue, wait time since queued was 0s
Normal Admitted 2m7s kueue-admission Admitted by ClusterQueue cluster-queue, wait time since reservation was 0s
# kubectl describe workload.kueue.x-k8s.io/mpijob-tensorflow-benchmarks-job2-64330
Name: mpijob-tensorflow-benchmarks-job2-64330
Namespace: default
Labels: kueue.x-k8s.io/job-uid=58b33273-19b3-4e86-9152-fac3415526f7
Annotations: <none>
API Version: kueue.x-k8s.io/v1beta1
Kind: Workload
Metadata:
Creation Timestamp: 2025-03-12T20:58:36Z
Finalizers:
kueue.x-k8s.io/resource-in-use
Generation: 1
Owner References:
API Version: kubeflow.org/v2beta1
Block Owner Deletion: true
Controller: true
Kind: MPIJob
Name: tensorflow-benchmarks-job2
UID: 58b33273-19b3-4e86-9152-fac3415526f7
Resource Version: 2772
UID: 8fb74084-927a-4abc-90c8-16efb5c1515d
Spec:
Active: true
Pod Sets:
Count: 1
Name: launcher
Template:
Metadata:
Spec:
Containers:
Command:
mpirun
--allow-run-as-root
-np
2
-bind-to
none
-map-by
slot
-x
NCCL_DEBUG=INFO
-x
LD_LIBRARY_PATH
-x
PATH
-mca
pml
ob1
-mca
btl
^openib
python
scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py
--model=resnet101
--batch_size=64
--variable_update=horovod
Image: mpioperator/tensorflow-benchmarks:latest
Name: tensorflow-benchmarks-job2
Resources:
Count: 2
Name: worker
Template:
Metadata:
Spec:
Containers:
Image: mpioperator/tensorflow-benchmarks:latest
Name: tensorflow-benchmarks-job2
Resources:
Limits:
nvidia.com/gpu: 1
Priority: 0
Priority Class Source:
Queue Name: user-queue
Status:
Admission:
Cluster Queue: cluster-queue
Pod Set Assignments:
Count: 1
Name: launcher
Count: 2
Flavors:
nvidia.com/gpu: default-flavor
Name: worker
Resource Usage:
nvidia.com/gpu: 2
Conditions:
Last Transition Time: 2025-03-12T20:58:39Z
Message: Quota reserved in ClusterQueue cluster-queue
Observed Generation: 1
Reason: QuotaReserved
Status: True
Type: QuotaReserved
Last Transition Time: 2025-03-12T20:58:36Z
Message: Not all pods are ready or succeeded
Observed Generation: 1
Reason: PodsReady
Status: False
Type: PodsReady
Last Transition Time: 2025-03-12T20:58:39Z
Message: The workload is admitted
Observed Generation: 1
Reason: Admitted
Status: True
Type: Admitted
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal QuotaReserved 2m47s kueue-admission Quota reserved in ClusterQueue cluster-queue, wait time since queued was 4s
Normal Admitted 2m47s kueue-admission Admitted by ClusterQueue cluster-queue, wait time since reservation was 0s
@tenzen-y
Additional information
https://github.com/ttakahashi21/mpi-operator/blob/dev-takahashi/pkg/controller/mpi_job_controller.go#L652-L686
I wrote the debugging code linked above and confirmed the following.
When using Kueue, is it expected that the initial value of mpiJob.Spec.RunPolicy.Suspend will be true?
case1
- The default value of mpiJob.Spec.RunPolicy.Suspend for Job1 and Job2 is false, so "!isMPIJobSuspended(mpiJob)" is true for both, and the workers are created (or fetched). Since the launcher has not yet been created, the condition "launcher == nil" is also true for both. For Job1, the condition "c.countReadyWorkerPods(worker) == len(worker)" eventually becomes true, so the launcher for Job1 is created. For Job2, neither "LauncherCreationPolicy == kubeflow.LauncherCreationPolicyAtStartup" nor "c.countReadyWorkerPods(worker) == len(worker)" is true, so the launcher for Job2 is not created.
The details of the debug log are as follows:
mpiJob.Name: tensorflow-benchmarks-job1 - mpiJob.Spec.RunPolicy.Suspend : false - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - check-!isMPIJobSuspended(mpiJob) : true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - !isMPIJobSuspended(mpiJob): true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - check-launcher - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - launcher == nil: true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - mpiJob.Spec.LauncherCreationPolicy : WaitForWorkersReady - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - c.countReadyWorkerPods(worker) : 0 - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - len(worker) : 2 - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - mpiJob.Spec.RunPolicy.Suspend : false - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - check-!isMPIJobSuspended(mpiJob) : true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - !isMPIJobSuspended(mpiJob): true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - check-launcher - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - launcher == nil: true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - mpiJob.Spec.LauncherCreationPolicy : WaitForWorkersReady - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - c.countReadyWorkerPods(worker) : 0 - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - len(worker) : 2 - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - mpiJob.Spec.RunPolicy.Suspend : false - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - check-!isMPIJobSuspended(mpiJob) : true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - !isMPIJobSuspended(mpiJob): true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - check-launcher - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - launcher == nil: true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - mpiJob.Spec.LauncherCreationPolicy : WaitForWorkersReady - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - c.countReadyWorkerPods(worker) : 0 - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - len(worker) : 2 - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - mpiJob.Spec.RunPolicy.Suspend : false - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - check-!isMPIJobSuspended(mpiJob) : true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - !isMPIJobSuspended(mpiJob): true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - check-launcher - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - launcher == nil: true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - mpiJob.Spec.LauncherCreationPolicy : WaitForWorkersReady - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - c.countReadyWorkerPods(worker) : 0 - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - len(worker) : 2 - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - mpiJob.Spec.RunPolicy.Suspend : false - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - check-!isMPIJobSuspended(mpiJob) : true - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - !isMPIJobSuspended(mpiJob): true - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - check-launcher - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - launcher == nil: true - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - mpiJob.Spec.LauncherCreationPolicy : WaitForWorkersReady - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - c.countReadyWorkerPods(worker) : 0 - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - len(worker) : 2 - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - mpiJob.Spec.RunPolicy.Suspend : false - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - check-!isMPIJobSuspended(mpiJob) : true - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - !isMPIJobSuspended(mpiJob): true - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - check-launcher - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - launcher == nil: true - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - mpiJob.Spec.LauncherCreationPolicy : WaitForWorkersReady - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - c.countReadyWorkerPods(worker) : 0 - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - len(worker) : 2 - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - mpiJob.Spec.RunPolicy.Suspend : false - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - check-!isMPIJobSuspended(mpiJob) : true - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - !isMPIJobSuspended(mpiJob): true - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - check-launcher - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - launcher == nil: true - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - mpiJob.Spec.LauncherCreationPolicy : WaitForWorkersReady - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - c.countReadyWorkerPods(worker) : 0 - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - len(worker) : 2 - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - mpiJob.Spec.RunPolicy.Suspend : false - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - check-!isMPIJobSuspended(mpiJob) : true - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - !isMPIJobSuspended(mpiJob): true - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - check-launcher - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - launcher == nil: true - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - mpiJob.Spec.LauncherCreationPolicy : WaitForWorkersReady - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - c.countReadyWorkerPods(worker) : 0 - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - len(worker) : 2 - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - mpiJob.Spec.RunPolicy.Suspend : false - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - check-!isMPIJobSuspended(mpiJob) : true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - !isMPIJobSuspended(mpiJob): true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - check-launcher - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - launcher == nil: true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - mpiJob.Spec.LauncherCreationPolicy : WaitForWorkersReady - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - c.countReadyWorkerPods(worker) : 1 - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - len(worker) : 2 - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - mpiJob.Spec.RunPolicy.Suspend : false - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - check-!isMPIJobSuspended(mpiJob) : true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - !isMPIJobSuspended(mpiJob): true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - check-launcher - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - launcher == nil: true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - mpiJob.Spec.LauncherCreationPolicy : WaitForWorkersReady - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - c.countReadyWorkerPods(worker) : 1 - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - len(worker) : 2 - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - mpiJob.Spec.RunPolicy.Suspend : false - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - check-!isMPIJobSuspended(mpiJob) : true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - !isMPIJobSuspended(mpiJob): true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - check-launcher - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - launcher == nil: true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - mpiJob.Spec.LauncherCreationPolicy : WaitForWorkersReady - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - c.countReadyWorkerPods(worker) : 2 - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - len(worker) : 2 - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - mpiJob.Spec.LauncherCreationPolicy == kubeflow.LauncherCreationPolicyAtStartup || c.countReadyWorkerPods(checkworker) == len(checkworker): true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - deploy launcher - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - mpiJob.Spec.RunPolicy.Suspend : false - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - check-!isMPIJobSuspended(mpiJob) : true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - !isMPIJobSuspended(mpiJob): true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - check-launcher - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - mpiJob.Spec.RunPolicy.Suspend : false - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - check-!isMPIJobSuspended(mpiJob) : true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - !isMPIJobSuspended(mpiJob): true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - check-launcher - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - mpiJob.Spec.RunPolicy.Suspend : false - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - check-!isMPIJobSuspended(mpiJob) : true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - !isMPIJobSuspended(mpiJob): true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - check-launcher - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - mpiJob.Spec.RunPolicy.Suspend : false - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - check-!isMPIJobSuspended(mpiJob) : true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - !isMPIJobSuspended(mpiJob): true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - check-launcher - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - mpiJob.Spec.RunPolicy.Suspend : false - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - check-!isMPIJobSuspended(mpiJob) : true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - !isMPIJobSuspended(mpiJob): true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - check-launcher - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - mpiJob.Spec.RunPolicy.Suspend : false - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - check-!isMPIJobSuspended(mpiJob) : true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - !isMPIJobSuspended(mpiJob): true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - check-launcher - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - mpiJob.Spec.RunPolicy.Suspend : false - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - check-!isMPIJobSuspended(mpiJob) : true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - !isMPIJobSuspended(mpiJob): true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - check-launcher - takahashi
case3
The value of mpiJob.Spec.RunPolicy.Suspend for Job1 and Job2 is set to true when the MPIJob is first created, so "!isMPIJobSuspended(mpiJob)" is false for both, and the workers are neither created nor fetched. Since the launcher has not yet been created, the condition "launcher == nil" is true for both. However, the condition "c.countReadyWorkerPods(worker) == len(worker)" still becomes true for Job1 and Job2, and both launchers are created: because the worker pods were never created, "c.countReadyWorkerPods(worker)" and "len(worker)" are both 0, so the comparison passes vacuously. A possible fix is sketched below.
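For illustration, a hedged sketch of one way to close this hole: compare ready worker pods against the expected replica count from the MPIJob spec rather than against len(worker). The function name and signature are hypothetical, not the actual controller code:

```go
package main

import corev1 "k8s.io/api/core/v1"

// workersReady is a stricter variant of the WaitForWorkersReady gate.
// An empty worker list (as when the MPIJob starts suspended and no pods
// exist yet) no longer passes vacuously as 0 == 0, because the expected
// replica count comes from the spec, not from the listed pods.
func workersReady(workers []*corev1.Pod, expectedReplicas int) bool {
	if len(workers) < expectedReplicas {
		return false // not all worker pods have been created yet
	}
	ready := 0
	for _, p := range workers {
		for _, cond := range p.Status.Conditions {
			if cond.Type == corev1.PodReady && cond.Status == corev1.ConditionTrue {
				ready++
				break
			}
		}
	}
	return ready >= expectedReplicas
}
```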
The details of the debug log are as follows:
mpiJob.Name: tensorflow-benchmarks-job1 - mpiJob.Spec.RunPolicy.Suspend : true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - check-!isMPIJobSuspended(mpiJob) : false - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - check-launcher - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - launcher == nil: true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - mpiJob.Spec.LauncherCreationPolicy : WaitForWorkersReady - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - c.countReadyWorkerPods(worker) : 0 - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - len(worker) : 0 - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - mpiJob.Spec.LauncherCreationPolicy == kubeflow.LauncherCreationPolicyAtStartup || c.countReadyWorkerPods(checkworker) == len(checkworker): true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - deploy launcher - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - mpiJob.Spec.RunPolicy.Suspend : false - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - check-!isMPIJobSuspended(mpiJob) : true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - !isMPIJobSuspended(mpiJob): true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - check-launcher - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - mpiJob.Spec.RunPolicy.Suspend : false - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - check-!isMPIJobSuspended(mpiJob) : true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - !isMPIJobSuspended(mpiJob): true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - check-launcher - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - mpiJob.Spec.RunPolicy.Suspend : false - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - check-!isMPIJobSuspended(mpiJob) : true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - !isMPIJobSuspended(mpiJob): true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - check-launcher - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - mpiJob.Spec.RunPolicy.Suspend : false - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - check-!isMPIJobSuspended(mpiJob) : true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - !isMPIJobSuspended(mpiJob): true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - check-launcher - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - mpiJob.Spec.RunPolicy.Suspend : true - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - check-!isMPIJobSuspended(mpiJob) : false - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - check-launcher - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - launcher == nil: true - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - mpiJob.Spec.LauncherCreationPolicy : WaitForWorkersReady - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - c.countReadyWorkerPods(worker) : 0 - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - len(worker) : 0 - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - mpiJob.Spec.LauncherCreationPolicy == kubeflow.LauncherCreationPolicyAtStartup || c.countReadyWorkerPods(checkworker) == len(checkworker): true - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - deploy launcher - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - mpiJob.Spec.RunPolicy.Suspend : true - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - check-!isMPIJobSuspended(mpiJob) : false - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - check-launcher - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - mpiJob.Spec.RunPolicy.Suspend : true - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - check-!isMPIJobSuspended(mpiJob) : false - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - check-launcher - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - mpiJob.Spec.RunPolicy.Suspend : true - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - check-!isMPIJobSuspended(mpiJob) : false - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - check-launcher - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - mpiJob.Spec.RunPolicy.Suspend : false - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - check-!isMPIJobSuspended(mpiJob) : true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - !isMPIJobSuspended(mpiJob): true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - check-launcher - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - mpiJob.Spec.RunPolicy.Suspend : false - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - check-!isMPIJobSuspended(mpiJob) : true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - !isMPIJobSuspended(mpiJob): true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - check-launcher - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - mpiJob.Spec.RunPolicy.Suspend : false - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - check-!isMPIJobSuspended(mpiJob) : true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - !isMPIJobSuspended(mpiJob): true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - check-launcher - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - mpiJob.Spec.RunPolicy.Suspend : false - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - check-!isMPIJobSuspended(mpiJob) : true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - !isMPIJobSuspended(mpiJob): true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - check-launcher - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - mpiJob.Spec.RunPolicy.Suspend : false - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - check-!isMPIJobSuspended(mpiJob) : true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - !isMPIJobSuspended(mpiJob): true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - check-launcher - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - mpiJob.Spec.RunPolicy.Suspend : false - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - check-!isMPIJobSuspended(mpiJob) : true - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - !isMPIJobSuspended(mpiJob): true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - mpiJob.Spec.RunPolicy.Suspend : false - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - check-!isMPIJobSuspended(mpiJob) : true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - !isMPIJobSuspended(mpiJob): true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - check-launcher - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - mpiJob.Spec.RunPolicy.Suspend : false - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - check-!isMPIJobSuspended(mpiJob) : true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - !isMPIJobSuspended(mpiJob): true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - check-launcher - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - check-launcher - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - mpiJob.Spec.RunPolicy.Suspend : false - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - check-!isMPIJobSuspended(mpiJob) : true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - !isMPIJobSuspended(mpiJob): true - takahashi
mpiJob.Name: tensorflow-benchmarks-job1 - check-launcher - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - mpiJob.Spec.RunPolicy.Suspend : false - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - check-!isMPIJobSuspended(mpiJob) : true - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - !isMPIJobSuspended(mpiJob): true - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - check-launcher - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - mpiJob.Spec.RunPolicy.Suspend : false - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - check-!isMPIJobSuspended(mpiJob) : true - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - !isMPIJobSuspended(mpiJob): true - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - check-launcher - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - mpiJob.Spec.RunPolicy.Suspend : false - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - check-!isMPIJobSuspended(mpiJob) : true - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - !isMPIJobSuspended(mpiJob): true - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - check-launcher - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - mpiJob.Spec.RunPolicy.Suspend : false - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - check-!isMPIJobSuspended(mpiJob) : true - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - !isMPIJobSuspended(mpiJob): true - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - check-launcher - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - mpiJob.Spec.RunPolicy.Suspend : false - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - check-!isMPIJobSuspended(mpiJob) : true - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - !isMPIJobSuspended(mpiJob): true - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - check-launcher - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - mpiJob.Spec.RunPolicy.Suspend : false - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - check-!isMPIJobSuspended(mpiJob) : true - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - !isMPIJobSuspended(mpiJob): true - takahashi
mpiJob.Name: tensorflow-benchmarks-job2 - check-launcher - takahashi
In my opinion, the modification from #617 is necessary.
Hi all @GonzaloSaez @ttakahashi21 @mimowo, thank you for your patience here.
This issue is currently on my radar.
Please see my analysis if you are still interested in this problem:
- Launcher PodSpec updates are not propagated to the underlying batch/v1 Job on MPIJob resume #770
- POD NodeSelector is not always consistent with their MPIJob node selector kubernetes-sigs/kueue#3400 (comment)
- POD NodeSelector is not always consistent with their MPIJob node selector kubernetes-sigs/kueue#3400 (comment)
@GonzaloSaez please fix the DCO
Force-pushed from 802bd49 to 804cfd8
For context, linking it back to the related Kueue issue: kubernetes-sigs/kueue#3400 and the Slack discussion https://kubernetes.slack.com/archives/C032ZE66A2X/p1730369507818399
Force-pushed from 804cfd8 to c1ea13d
Signed-off-by: GonzaloSaez <11050889+GonzaloSaez@users.noreply.github.com>
Force-pushed from c1ea13d to 50abbdf
When the MPIJob starts suspended, we were creating the launcher job regardless of the initial suspended state. This causes issues with Kueue: it will suspend the MPIJob, but a launcher job will already have been created with the wrong NodeSelector from the Kueue flavor. I think avoiding creating the launcher in this scenario is the right thing to do, but I'm not sure if others have different thoughts.
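For reference, a condensed sketch of the guard this PR adds, with isMPIJobSuspended reduced to the spec check it effectively performs (a simplification for illustration, not the exact controller code):

```go
package main

import (
	batchv1 "k8s.io/api/batch/v1"

	kubeflow "github.com/kubeflow/mpi-operator/pkg/apis/kubeflow/v2beta1"
)

// isMPIJobSuspended mirrors what the controller helper effectively checks:
// an MPIJob is suspended when spec.runPolicy.suspend is set to true.
func isMPIJobSuspended(mpiJob *kubeflow.MPIJob) bool {
	return mpiJob.Spec.RunPolicy.Suspend != nil && *mpiJob.Spec.RunPolicy.Suspend
}

// needLauncher captures the guard from the quoted diff: the launcher Job
// is only created once no launcher exists and the MPIJob is not suspended,
// so by the time it is created, Kueue has already injected the admitted
// flavor's NodeSelector via RunWithPodSetsInfo.
func needLauncher(launcher *batchv1.Job, mpiJob *kubeflow.MPIJob) bool {
	return launcher == nil && !isMPIJobSuspended(mpiJob)
}
```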