DaskCluster stuck in Pending state - Possible race condition? #968

@daniel-jimenezgarcia-ow

Describe the issue:

I have noticed several instances of clusters stuck in the "Pending" state even though the scheduler and workers were up and running. This happens infrequently, but it happened often enough within a single day for me to notice.

I was able to reproduce by repeatedly creating DaskCluster objects via kubectl, using the sample DaskCluster from https://kubernetes.dask.org/en/latest/operator_resources.html#daskcluster

The section below contains the operator logs from both a successful cluster and one that got stuck. As noted there, there may be a race condition in the controller code between the service event and the cluster creation.

Minimal Complete Verifiable Example:

I am afraid it happens infrequently, so it might be necessary to create/destroy many clusters like the one in https://kubernetes.dask.org/en/latest/operator_resources.html#daskcluster

In my experience, it seems to happen a bit more frequently with the real cluster I use in my application (using our own container image, as well as specific resource blocks for worker/scheduler) but I was able to reproduce at least once with the sample cluster from the docs linked above.
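The create/destroy loop described above could be scripted roughly as follows. This is only a sketch under assumptions: the manifest file name `cluster.yaml` is hypothetical, the cluster name `simple` and namespace `my-namespace` are taken from the logs in this issue, and the 120 second timeout is an arbitrary cut-off, so a slow cluster could be misreported as stuck.

```python
import subprocess
import time


def kubectl(*args):
    """Run kubectl with the given arguments and return its stripped stdout."""
    result = subprocess.run(
        ["kubectl", *args], check=True, capture_output=True, text=True
    )
    return result.stdout.strip()


def wait_for_phase(name, namespace, timeout, interval, run, sleep):
    """Poll .status.phase until it is "Running" or the polls run out."""
    phase = ""
    for _ in range(max(1, int(timeout / interval))):
        phase = run(
            "get", "daskclusters", name, "-n", namespace,
            "-o", "jsonpath={.status.phase}",
        )
        if phase == "Running":
            break
        sleep(interval)
    return phase


def reproduce(manifest, name, namespace, attempts=20, timeout=120.0,
              interval=2.0, run=kubectl, sleep=time.sleep):
    """Create and delete the cluster repeatedly; return the stuck attempts."""
    stuck = []
    for attempt in range(attempts):
        run("apply", "-f", manifest, "-n", namespace)
        phase = wait_for_phase(name, namespace, timeout, interval, run, sleep)
        if phase != "Running":
            stuck.append(attempt)
        # The cluster is deleted even when it is stuck, so grab the
        # operator logs first if you want to inspect that attempt.
        run("delete", "daskclusters", name, "-n", namespace)
    return stuck


if __name__ == "__main__":
    # Hypothetical file name; save the sample DaskCluster from the docs there.
    print(reproduce("cluster.yaml", "simple", "my-namespace"))
```

The `run` and `sleep` parameters exist only so the loop can be exercised without a real cluster; with the defaults it shells out to `kubectl` in the current context.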

Anything else we need to know?:

These are the dask-operator logs when the DaskCluster is successfully created and set as "Running":

[2026-02-06 09:35:04,907] kopf.objects         [INFO    ] [my-namespace/simple] DaskCluster simple created in my-namespace.
[2026-02-06 09:35:04,907] kopf.objects         [INFO    ] [my-namespace/simple] Handler 'daskcluster_create' succeeded.
[2026-02-06 09:35:04,940] kopf.objects         [WARNING ] [my-namespace/simple] Patching failed with inconsistencies: (('remove', ('status', 'kopf'), {'progress': {'daskcluster_create': {'started': '2026-02-06T09:35:04.907402+00:00', 'stopped': '2026-02-06T09:35:04.908074+00:00', 'delayed': None, 'purpose': 'create', 'retries': 1, 'success': True, 'failure': False, 'message': None, 'subrefs': None}, 'daskcluster_default_worker_group_replica_update/spec.worker.replicas': {'started': '2026-02-06T09:35:04.907417+00:00', 'stopped': None, 'delayed': None, 'purpose': 'create', 'retries': 0, 'success': False, 'failure': False, 'message': None, 'subrefs': None}}}, None),)
[2026-02-06 09:35:05,044] kopf.objects         [INFO    ] [my-namespace/simple] Creating Dask cluster components.
[2026-02-06 09:35:05,139] httpx                [INFO    ] HTTP Request: GET https://10.0.0.1/apis/apps/v1/namespaces/my-namespace/deployments/simple-scheduler "HTTP/1.1 401 Unauthorized"
[2026-02-06 09:35:05,276] httpx                [INFO    ] HTTP Request: POST https://10.0.0.1/apis/apps/v1/namespaces/my-namespace/deployments "HTTP/1.1 401 Unauthorized"
[2026-02-06 09:35:05,339] httpx                [INFO    ] HTTP Request: POST https://10.0.0.1/apis/apps/v1/namespaces/my-namespace/deployments "HTTP/1.1 201 Created"
[2026-02-06 09:35:05,342] kopf.objects         [INFO    ] [my-namespace/simple] Scheduler deployment simple-scheduler created in my-namespace.
[2026-02-06 09:35:05,375] httpx                [INFO    ] HTTP Request: GET https://10.0.0.1/api/v1/namespaces/my-namespace/services/simple-scheduler "HTTP/1.1 404 Not Found"
[2026-02-06 09:35:05,436] httpx                [INFO    ] HTTP Request: POST https://10.0.0.1/api/v1/namespaces/my-namespace/services "HTTP/1.1 201 Created"
[2026-02-06 09:35:05,437] kopf.objects         [INFO    ] [my-namespace/simple] Scheduler service simple-scheduler created in my-namespace.
[2026-02-06 09:35:05,463] httpx                [INFO    ] HTTP Request: GET https://10.0.0.1/apis/kubernetes.dask.org/v1/namespaces/my-namespace/daskworkergroups/simple-default "HTTP/1.1 404 Not Found"
[2026-02-06 09:35:05,478] httpx                [INFO    ] HTTP Request: POST https://10.0.0.1/apis/kubernetes.dask.org/v1/namespaces/my-namespace/daskworkergroups "HTTP/1.1 201 Created"
[2026-02-06 09:35:05,479] kopf.objects         [INFO    ] [my-namespace/simple] Worker group simple-default created in my-namespace.
[2026-02-06 09:35:05,479] kopf.objects         [INFO    ] [my-namespace/simple] Handler 'daskcluster_create_components/status.phase' succeeded.
[2026-02-06 09:35:05,503] kopf.objects         [WARNING ] [my-namespace/simple] Patching failed with inconsistencies: (('remove', ('status', 'kopf'), {'progress': {'daskcluster_create_components/status.phase': {'started': '2026-02-06T09:35:05.044019+00:00', 'stopped': '2026-02-06T09:35:05.479805+00:00', 'delayed': None, 'purpose': 'create', 'retries': 1, 'success': True, 'failure': False, 'message': None, 'subrefs': None}}}, None),)
[2026-02-06 09:35:05,545] httpx                [INFO    ] HTTP Request: GET https://10.0.0.1/apis/kubernetes.dask.org/v1/namespaces/my-namespace/daskclusters?limit=100&fieldSelector=metadata.name%3Dsimple%2Cmetadata.name%3Dsimple "HTTP/1.1 200 OK"
[2026-02-06 09:35:05,562] httpx                [INFO    ] HTTP Request: PATCH https://10.0.0.1/apis/kubernetes.dask.org/v1/namespaces/my-namespace/daskclusters/simple "HTTP/1.1 200 OK"
[2026-02-06 09:35:05,564] kopf.objects         [INFO    ] [my-namespace/simple-scheduler] Handler 'handle_scheduler_service_status/status' succeeded.
[2026-02-06 09:35:05,565] kopf.objects         [INFO    ] [my-namespace/simple-scheduler] Creation is processed: 1 succeeded; 0 failed.
[2026-02-06 09:35:05,591] httpx                [INFO    ] HTTP Request: GET https://10.0.0.1/apis/kubernetes.dask.org/v1/namespaces/my-namespace/daskclusters?limit=100&fieldSelector=metadata.name%3Dsimple%2Cmetadata.name%3Dsimple "HTTP/1.1 200 OK"
[2026-02-06 09:35:05,620] httpx                [INFO    ] HTTP Request: PATCH https://10.0.0.1/apis/kubernetes.dask.org/v1/namespaces/my-namespace/daskworkergroups/simple-default "HTTP/1.1 200 OK"
[2026-02-06 09:35:05,622] kopf.objects         [INFO    ] [my-namespace/simple-default] Successfully adopted by simple
[2026-02-06 09:35:05,632] httpx                [INFO    ] HTTP Request: GET https://10.0.0.1/apis/kubernetes.dask.org/v1/namespaces/my-namespace/daskclusters?limit=100&fieldSelector=metadata.name%3Dsimple%2Cmetadata.name%3Dsimple "HTTP/1.1 200 OK"
[2026-02-06 09:35:05,667] kopf.objects         [INFO    ] [my-namespace/simple] Handler 'daskcluster_default_worker_group_replica_update/spec.worker.replicas' succeeded.
[2026-02-06 09:35:05,667] kopf.objects         [INFO    ] [my-namespace/simple] Creation is processed: 3 succeeded; 0 failed.
[2026-02-06 09:35:05,948] httpx                [INFO    ] HTTP Request: GET https://10.0.0.1/apis/apps/v1/namespaces/my-namespace/deployments?limit=100&labelSelector=dask.org%2Fworkergroup-name%3Dsimple-default "HTTP/1.1 200 OK"
[2026-02-06 09:35:05,959] httpx                [INFO    ] HTTP Request: POST https://10.0.0.1/apis/apps/v1/namespaces/my-namespace/deployments "HTTP/1.1 201 Created"
[2026-02-06 09:35:05,962] kopf.objects         [INFO    ] [my-namespace/simple-default] Scaled worker group simple-default up to 1 workers.
[2026-02-06 09:35:05,962] kopf.objects         [INFO    ] [my-namespace/simple-default] Handler 'daskworkergroup_create' succeeded.
[2026-02-06 09:35:05,979] kopf.objects         [WARNING ] [my-namespace/simple-default] Patching failed with inconsistencies: (('remove', ('status', 'kopf'), {'progress': {'daskworkergroup_create': {'started': '2026-02-06T09:35:05.581734+00:00', 'stopped': '2026-02-06T09:35:05.962577+00:00', 'delayed': None, 'purpose': 'create', 'retries': 1, 'success': True, 'failure': False, 'message': None, 'subrefs': None}, 'daskworkergroup_replica_update/spec.worker.replicas': {'started': '2026-02-06T09:35:05.581745+00:00', 'stopped': None, 'delayed': None, 'purpose': 'create', 'retries': 0, 'success': False, 'failure': False, 'message': None, 'subrefs': None}}}, None),)
[2026-02-06 09:35:06,091] httpx                [INFO    ] HTTP Request: GET https://10.0.0.1/apis/kubernetes.dask.org/v1/namespaces/my-namespace/daskclusters?limit=100&fieldSelector=metadata.name%3Dsimple%2Cmetadata.name%3Dsimple "HTTP/1.1 200 OK"
[2026-02-06 09:35:06,101] httpx                [INFO    ] HTTP Request: GET https://10.0.0.1/apis/apps/v1/namespaces/my-namespace/deployments?limit=100&labelSelector=dask.org%2Fworkergroup-name%3Dsimple-default "HTTP/1.1 200 OK"
[2026-02-06 09:35:06,104] kopf.objects         [INFO    ] [my-namespace/simple-default] Scaled worker group simple-default up to 1 workers.
[2026-02-06 09:35:06,105] kopf.objects         [INFO    ] [my-namespace/simple-default] Handler 'daskworkergroup_replica_update/spec.worker.replicas' succeeded.
[2026-02-06 09:35:06,105] kopf.objects         [INFO    ] [my-namespace/simple-default] Creation is processed: 2 succeeded; 0 failed.

And these are the same logs, but this time when the DaskCluster got stuck in "Pending" state:

[2026-02-06 09:36:05,304] kopf.objects         [INFO    ] [my-namespace/simple] DaskCluster simple created in my-namespace.
[2026-02-06 09:36:05,305] kopf.objects         [INFO    ] [my-namespace/simple] Handler 'daskcluster_create' succeeded.
[2026-02-06 09:36:05,344] kopf.objects         [WARNING ] [my-namespace/simple] Patching failed with inconsistencies: (('remove', ('status', 'kopf'), {'progress': {'daskcluster_create': {'started': '2026-02-06T09:36:05.304732+00:00', 'stopped': '2026-02-06T09:36:05.305222+00:00', 'delayed': None, 'purpose': 'create', 'retries': 1, 'success': True, 'failure': False, 'message': None, 'subrefs': None}, 'daskcluster_default_worker_group_replica_update/spec.worker.replicas': {'started': '2026-02-06T09:36:05.304743+00:00', 'stopped': None, 'delayed': None, 'purpose': 'create', 'retries': 0, 'success': False, 'failure': False, 'message': None, 'subrefs': None}}}, None),)
[2026-02-06 09:36:05,447] kopf.objects         [INFO    ] [my-namespace/simple] Creating Dask cluster components.
[2026-02-06 09:36:05,493] httpx                [INFO    ] HTTP Request: GET https://10.0.0.1/apis/apps/v1/namespaces/my-namespace/deployments/simple-scheduler "HTTP/1.1 404 Not Found"
[2026-02-06 09:36:05,533] httpx                [INFO    ] HTTP Request: POST https://10.0.0.1/apis/apps/v1/namespaces/my-namespace/deployments "HTTP/1.1 201 Created"
[2026-02-06 09:36:05,536] kopf.objects         [INFO    ] [my-namespace/simple] Scheduler deployment simple-scheduler created in my-namespace.
[2026-02-06 09:36:05,579] httpx                [INFO    ] HTTP Request: GET https://10.0.0.1/api/v1/namespaces/my-namespace/services/simple-scheduler "HTTP/1.1 404 Not Found"
[2026-02-06 09:36:05,650] httpx                [INFO    ] HTTP Request: POST https://10.0.0.1/api/v1/namespaces/my-namespace/services "HTTP/1.1 201 Created"
[2026-02-06 09:36:05,652] kopf.objects         [INFO    ] [my-namespace/simple] Scheduler service simple-scheduler created in my-namespace.
[2026-02-06 09:36:05,693] httpx                [INFO    ] HTTP Request: GET https://10.0.0.1/apis/kubernetes.dask.org/v1/namespaces/my-namespace/daskworkergroups/simple-default "HTTP/1.1 404 Not Found"
[2026-02-06 09:36:05,733] httpx                [INFO    ] HTTP Request: POST https://10.0.0.1/apis/kubernetes.dask.org/v1/namespaces/my-namespace/daskworkergroups "HTTP/1.1 201 Created"
[2026-02-06 09:36:05,734] kopf.objects         [INFO    ] [my-namespace/simple] Worker group simple-default created in my-namespace.
[2026-02-06 09:36:05,734] kopf.objects         [INFO    ] [my-namespace/simple] Handler 'daskcluster_create_components/status.phase' succeeded.
[2026-02-06 09:36:05,748] httpx                [INFO    ] HTTP Request: GET https://10.0.0.1/apis/kubernetes.dask.org/v1/namespaces/my-namespace/daskclusters?limit=100&fieldSelector=metadata.name%3Dsimple%2Cmetadata.name%3Dsimple "HTTP/1.1 200 OK"
[2026-02-06 09:36:05,767] httpx                [INFO    ] HTTP Request: PATCH https://10.0.0.1/apis/kubernetes.dask.org/v1/namespaces/my-namespace/daskclusters/simple "HTTP/1.1 200 OK"
[2026-02-06 09:36:05,769] kopf.objects         [INFO    ] [my-namespace/simple-scheduler] Handler 'handle_scheduler_service_status/status' succeeded.
[2026-02-06 09:36:05,770] kopf.objects         [INFO    ] [my-namespace/simple-scheduler] Creation is processed: 1 succeeded; 0 failed.
[2026-02-06 09:36:05,783] kopf.objects         [WARNING ] [my-namespace/simple] Patching failed with inconsistencies: (('remove', ('status', 'kopf'), {'progress': {'daskcluster_create_components/status.phase': {'started': '2026-02-06T09:36:05.447688+00:00', 'stopped': '2026-02-06T09:36:05.734928+00:00', 'delayed': None, 'purpose': 'create', 'retries': 1, 'success': True, 'failure': False, 'message': None, 'subrefs': None}}}, None),)
[2026-02-06 09:36:05,860] httpx                [INFO    ] HTTP Request: GET https://10.0.0.1/apis/kubernetes.dask.org/v1/namespaces/my-namespace/daskclusters?limit=100&fieldSelector=metadata.name%3Dsimple%2Cmetadata.name%3Dsimple "HTTP/1.1 200 OK"
[2026-02-06 09:36:05,874] httpx                [INFO    ] HTTP Request: PATCH https://10.0.0.1/apis/kubernetes.dask.org/v1/namespaces/my-namespace/daskworkergroups/simple-default "HTTP/1.1 200 OK"
[2026-02-06 09:36:05,875] kopf.objects         [INFO    ] [my-namespace/simple-default] Successfully adopted by simple
[2026-02-06 09:36:05,884] httpx                [INFO    ] HTTP Request: GET https://10.0.0.1/apis/kubernetes.dask.org/v1/namespaces/my-namespace/daskclusters?limit=100&fieldSelector=metadata.name%3Dsimple%2Cmetadata.name%3Dsimple "HTTP/1.1 200 OK"
[2026-02-06 09:36:05,886] kopf.objects         [INFO    ] [my-namespace/simple] Handler 'daskcluster_default_worker_group_replica_update/spec.worker.replicas' succeeded.
[2026-02-06 09:36:05,886] kopf.objects         [INFO    ] [my-namespace/simple] Creation is processed: 3 succeeded; 0 failed.
[2026-02-06 09:36:05,896] httpx                [INFO    ] HTTP Request: GET https://10.0.0.1/apis/apps/v1/namespaces/my-namespace/deployments?limit=100&labelSelector=dask.org%2Fworkergroup-name%3Dsimple-default "HTTP/1.1 200 OK"
[2026-02-06 09:36:05,910] httpx                [INFO    ] HTTP Request: POST https://10.0.0.1/apis/apps/v1/namespaces/my-namespace/deployments "HTTP/1.1 201 Created"
[2026-02-06 09:36:05,912] kopf.objects         [INFO    ] [my-namespace/simple-default] Scaled worker group simple-default up to 1 workers.
[2026-02-06 09:36:05,912] kopf.objects         [INFO    ] [my-namespace/simple-default] Handler 'daskworkergroup_create' succeeded.
[2026-02-06 09:36:05,940] kopf.objects         [WARNING ] [my-namespace/simple-default] Patching failed with inconsistencies: (('remove', ('status', 'kopf'), {'progress': {'daskworkergroup_create': {'started': '2026-02-06T09:36:05.833158+00:00', 'stopped': '2026-02-06T09:36:05.912838+00:00', 'delayed': None, 'purpose': 'create', 'retries': 1, 'success': True, 'failure': False, 'message': None, 'subrefs': None}, 'daskworkergroup_replica_update/spec.worker.replicas': {'started': '2026-02-06T09:36:05.833168+00:00', 'stopped': None, 'delayed': None, 'purpose': 'create', 'retries': 0, 'success': False, 'failure': False, 'message': None, 'subrefs': None}}}, None),)
[2026-02-06 09:36:06,054] httpx                [INFO    ] HTTP Request: GET https://10.0.0.1/apis/kubernetes.dask.org/v1/namespaces/my-namespace/daskclusters?limit=100&fieldSelector=metadata.name%3Dsimple%2Cmetadata.name%3Dsimple "HTTP/1.1 200 OK"
[2026-02-06 09:36:06,064] httpx                [INFO    ] HTTP Request: GET https://10.0.0.1/apis/apps/v1/namespaces/my-namespace/deployments?limit=100&labelSelector=dask.org%2Fworkergroup-name%3Dsimple-default "HTTP/1.1 200 OK"
[2026-02-06 09:36:06,067] kopf.objects         [INFO    ] [my-namespace/simple-default] Scaled worker group simple-default up to 1 workers.
[2026-02-06 09:36:06,068] kopf.objects         [INFO    ] [my-namespace/simple-default] Handler 'daskworkergroup_replica_update/spec.worker.replicas' succeeded.
[2026-02-06 09:36:06,068] kopf.objects         [INFO    ] [my-namespace/simple-default] Creation is processed: 2 succeeded; 0 failed.
[2026-02-06 09:36:07,356] kopf.activities.prob [INFO    ] Activity 'now' succeeded.

Notice that the sequence of log lines around the status patch from the daskcluster_create_components handler and the one from the handle_scheduler_service_status handler differs between the two runs:

  • in the successful cluster
[my-namespace/simple] Handler 'daskcluster_create_components/status.phase' succeeded.
[my-namespace/simple] Patching failed with inconsistencies: (('remove', ('status', 'kopf'), {'progress': {'daskcluster_create_components/status.phase': {'started': '2026-02-06T09:35:05.044019+00:00', 'stopped': '2026-02-06T09:35:05.479805+00:00', 'delayed': None, 'purpose': 'create', 'retries': 1, 'success': True, 'failure': False, 'message': None, 'subrefs': None}}}, None),)
HTTP Request: GET https://10.0.0.1/apis/kubernetes.dask.org/v1/namespaces/my-namespace/daskclusters?limit=100&fieldSelector=metadata.name%3Dsimple%2Cmetadata.name%3Dsimple "HTTP/1.1 200 OK"
HTTP Request: PATCH https://10.0.0.1/apis/kubernetes.dask.org/v1/namespaces/my-namespace/daskclusters/simple "HTTP/1.1 200 OK"
[my-namespace/simple-scheduler] Handler 'handle_scheduler_service_status/status' succeeded.
[my-namespace/simple-scheduler] Creation is processed: 1 succeeded; 0 failed.
HTTP Request: GET https://10.0.0.1/apis/kubernetes.dask.org/v1/namespaces/my-namespace/daskclusters?limit=100&fieldSelector=metadata.name%3Dsimple%2Cmetadata.name%3Dsimple "HTTP/1.1 200 OK"
HTTP Request: PATCH https://10.0.0.1/apis/kubernetes.dask.org/v1/namespaces/my-namespace/daskworkergroups/simple-default "HTTP/1.1 200 OK"
  • in the cluster stuck as Pending
[my-namespace/simple] Handler 'daskcluster_create_components/status.phase' succeeded.
HTTP Request: GET https://10.0.0.1/apis/kubernetes.dask.org/v1/namespaces/my-namespace/daskclusters?limit=100&fieldSelector=metadata.name%3Dsimple%2Cmetadata.name%3Dsimple "HTTP/1.1 200 OK"
HTTP Request: PATCH https://10.0.0.1/apis/kubernetes.dask.org/v1/namespaces/my-namespace/daskclusters/simple "HTTP/1.1 200 OK"
[my-namespace/simple-scheduler] Handler 'handle_scheduler_service_status/status' succeeded.
[my-namespace/simple-scheduler] Creation is processed: 1 succeeded; 0 failed.
[my-namespace/simple] Patching failed with inconsistencies: (('remove', ('status', 'kopf'), {'progress': {'daskcluster_create_components/status.phase': {'started': '2026-02-06T09:36:05.447688+00:00', 'stopped': '2026-02-06T09:36:05.734928+00:00', 'delayed': None, 'purpose': 'create', 'retries': 1, 'success': True, 'failure': False, 'message': None, 'subrefs': None}}}, None),)
HTTP Request: GET https://10.0.0.1/apis/kubernetes.dask.org/v1/namespaces/my-namespace/daskclusters?limit=100&fieldSelector=metadata.name%3Dsimple%2Cmetadata.name%3Dsimple "HTTP/1.1 200 OK"
HTTP Request: PATCH https://10.0.0.1/apis/kubernetes.dask.org/v1/namespaces/my-namespace/daskworkergroups/simple-default "HTTP/1.1 200 OK"

What caught my eye is that, in the case where the cluster got stuck, the line [my-namespace/simple] Patching failed with inconsistencies appears after the lines annotated with [my-namespace/simple-scheduler]. This pattern repeats when reviewing the controller logs for other clusters that got stuck and comparing them to successful clusters, i.e., I see the same difference in the sequencing of the log lines.
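For anyone who wants to check their own operator logs for the same ordering, here is a small sketch that scans the log lines of a single cluster. It relies only on the message strings visible in the logs above; the function name is made up for illustration, and it does not separate interleaved logs from several clusters.

```python
SERVICE_DONE = "Handler 'handle_scheduler_service_status/status' succeeded."
PATCH_WARNING = "Patching failed with inconsistencies"
COMPONENTS_KEY = "daskcluster_create_components/status.phase"


def warning_follows_service_handler(lines):
    """Report whether the create-components patch warning is logged late.

    Returns True if the warning appears after the service handler's success
    line (the ordering seen for stuck clusters in this issue), False if it
    appears before, and None if either line is missing.
    """
    service_idx = None
    warning_idx = None
    for idx, line in enumerate(lines):
        if service_idx is None and SERVICE_DONE in line:
            service_idx = idx
        elif (warning_idx is None and PATCH_WARNING in line
              and COMPONENTS_KEY in line):
            warning_idx = idx
    if service_idx is None or warning_idx is None:
        return None
    return warning_idx > service_idx
```

Feeding it the output of the operator pod's logs, split into lines and filtered to one cluster name, should return True for the stuck pattern shown above and False for the successful one.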

This is making me wonder if there is a potential race condition triggered after the service gets created by daskcluster_create_components in https://github.com/dask/dask-kubernetes/blob/main/dask_kubernetes/operator/controller/controller.py#L383.

Assuming this analysis is correct, perhaps daskcluster_create_components could be enhanced so that it does not update the status if it has already been set to Running.
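To make the suspected ordering problem and the proposed guard concrete, here is a toy model. It is not the operator's code: it assumes, without having confirmed it in the controller, that daskcluster_create_components ends by writing "Pending" and that handle_scheduler_service_status writes "Running". Both function names below are hypothetical.

```python
def phase_after_create_components(current_phase):
    """Phase the create-components step would write with the guard applied.

    Keeps "Running" if something else already set it; otherwise "Pending".
    """
    return "Running" if current_phase == "Running" else "Pending"


def simulate(events, guarded):
    """Apply the two status writes in the given order; return the final phase.

    `events` holds "components" (assumed write of "Pending") and "service"
    (assumed write of "Running") in the order the API server receives them.
    """
    phase = None
    for event in events:
        if event == "service":
            phase = "Running"
        elif event == "components":
            if guarded:
                phase = phase_after_create_components(phase)
            else:
                phase = "Pending"
        else:
            raise ValueError(f"unknown event: {event!r}")
    return phase


# Ordering that matches the successful run: components first, service second.
print(simulate(["components", "service"], guarded=False))
# Ordering suspected for the stuck run: the components write lands last.
print(simulate(["service", "components"], guarded=False))
# Same ordering with the guard in place.
print(simulate(["service", "components"], guarded=True))
```

In the real controller the guard would have to look at the live object's phase at the time the patch is applied, not the phase the handler saw when it started, otherwise the same window remains open.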

Environment:

  • Dask version: dask-kubernetes version 2025.7.0
  • Python version: N/A
  • Operating System: Azure AKS 1.32.5
  • Install method (conda, pip, source): operator installed via chart in https://helm.dask.org
