Description
Describe the issue:
I have noticed several instances of clusters stuck in "Pending" while the scheduler and workers were up and running. This happens infrequently, but it happened often enough within a single day for me to notice.
I was able to reproduce by repeatedly creating DaskCluster objects via kubectl, using the sample DaskCluster from https://kubernetes.dask.org/en/latest/operator_resources.html#daskcluster
The section below contains the operator logs from both a successful cluster and one that got stuck. As noted there, there may be a race condition in the controller code between the service event and the cluster creation.
Minimal Complete Verifiable Example:
I am afraid it happens infrequently, so it might be necessary to create/destroy many clusters like the one in https://kubernetes.dask.org/en/latest/operator_resources.html#daskcluster
In my experience, it seems to happen a bit more frequently with the real cluster I use in my application (using our own container image, as well as specific resource blocks for worker/scheduler) but I was able to reproduce at least once with the sample cluster from the docs linked above.
Anything else we need to know?:
These are the dask-operator logs when the DaskCluster is successfully created and set as "Running"
[2026-02-06 09:35:04,907] kopf.objects [INFO ] [my-namespace/simple] DaskCluster simple created in my-namespace.
[2026-02-06 09:35:04,907] kopf.objects [INFO ] [my-namespace/simple] Handler 'daskcluster_create' succeeded.
[2026-02-06 09:35:04,940] kopf.objects [WARNING ] [my-namespace/simple] Patching failed with inconsistencies: (('remove', ('status', 'kopf'), {'progress': {'daskcluster_create': {'started': '2026-02-06T09:35:04.907402+00:00', 'stopped': '2026-02-06T09:35:04.908074+00:00', 'delayed': None, 'purpose': 'create', 'retries': 1, 'success': True, 'failure': False, 'message': None, 'subrefs': None}, 'daskcluster_default_worker_group_replica_update/spec.worker.replicas': {'started': '2026-02-06T09:35:04.907417+00:00', 'stopped': None, 'delayed': None, 'purpose': 'create', 'retries': 0, 'success': False, 'failure': False, 'message': None, 'subrefs': None}}}, None),)
[2026-02-06 09:35:05,044] kopf.objects [INFO ] [my-namespace/simple] Creating Dask cluster components.
[2026-02-06 09:35:05,139] httpx [INFO ] HTTP Request: GET https://10.0.0.1/apis/apps/v1/namespaces/my-namespace/deployments/simple-scheduler "HTTP/1.1 401 Unauthorized"
[2026-02-06 09:35:05,276] httpx [INFO ] HTTP Request: POST https://10.0.0.1/apis/apps/v1/namespaces/my-namespace/deployments "HTTP/1.1 401 Unauthorized"
[2026-02-06 09:35:05,339] httpx [INFO ] HTTP Request: POST https://10.0.0.1/apis/apps/v1/namespaces/my-namespace/deployments "HTTP/1.1 201 Created"
[2026-02-06 09:35:05,342] kopf.objects [INFO ] [my-namespace/simple] Scheduler deployment simple-scheduler created in my-namespace.
[2026-02-06 09:35:05,375] httpx [INFO ] HTTP Request: GET https://10.0.0.1/api/v1/namespaces/my-namespace/services/simple-scheduler "HTTP/1.1 404 Not Found"
[2026-02-06 09:35:05,436] httpx [INFO ] HTTP Request: POST https://10.0.0.1/api/v1/namespaces/my-namespace/services "HTTP/1.1 201 Created"
[2026-02-06 09:35:05,437] kopf.objects [INFO ] [my-namespace/simple] Scheduler service simple-scheduler created in my-namespace.
[2026-02-06 09:35:05,463] httpx [INFO ] HTTP Request: GET https://10.0.0.1/apis/kubernetes.dask.org/v1/namespaces/my-namespace/daskworkergroups/simple-default "HTTP/1.1 404 Not Found"
[2026-02-06 09:35:05,478] httpx [INFO ] HTTP Request: POST https://10.0.0.1/apis/kubernetes.dask.org/v1/namespaces/my-namespace/daskworkergroups "HTTP/1.1 201 Created"
[2026-02-06 09:35:05,479] kopf.objects [INFO ] [my-namespace/simple] Worker group simple-default created in my-namespace.
[2026-02-06 09:35:05,479] kopf.objects [INFO ] [my-namespace/simple] Handler 'daskcluster_create_components/status.phase' succeeded.
[2026-02-06 09:35:05,503] kopf.objects [WARNING ] [my-namespace/simple] Patching failed with inconsistencies: (('remove', ('status', 'kopf'), {'progress': {'daskcluster_create_components/status.phase': {'started': '2026-02-06T09:35:05.044019+00:00', 'stopped': '2026-02-06T09:35:05.479805+00:00', 'delayed': None, 'purpose': 'create', 'retries': 1, 'success': True, 'failure': False, 'message': None, 'subrefs': None}}}, None),)
[2026-02-06 09:35:05,545] httpx [INFO ] HTTP Request: GET https://10.0.0.1/apis/kubernetes.dask.org/v1/namespaces/my-namespace/daskclusters?limit=100&fieldSelector=metadata.name%3Dsimple%2Cmetadata.name%3Dsimple "HTTP/1.1 200 OK"
[2026-02-06 09:35:05,562] httpx [INFO ] HTTP Request: PATCH https://10.0.0.1/apis/kubernetes.dask.org/v1/namespaces/my-namespace/daskclusters/simple "HTTP/1.1 200 OK"
[2026-02-06 09:35:05,564] kopf.objects [INFO ] [my-namespace/simple-scheduler] Handler 'handle_scheduler_service_status/status' succeeded.
[2026-02-06 09:35:05,565] kopf.objects [INFO ] [my-namespace/simple-scheduler] Creation is processed: 1 succeeded; 0 failed.
[2026-02-06 09:35:05,591] httpx [INFO ] HTTP Request: GET https://10.0.0.1/apis/kubernetes.dask.org/v1/namespaces/my-namespace/daskclusters?limit=100&fieldSelector=metadata.name%3Dsimple%2Cmetadata.name%3Dsimple "HTTP/1.1 200 OK"
[2026-02-06 09:35:05,620] httpx [INFO ] HTTP Request: PATCH https://10.0.0.1/apis/kubernetes.dask.org/v1/namespaces/my-namespace/daskworkergroups/simple-default "HTTP/1.1 200 OK"
[2026-02-06 09:35:05,622] kopf.objects [INFO ] [my-namespace/simple-default] Successfully adopted by simple
[2026-02-06 09:35:05,632] httpx [INFO ] HTTP Request: GET https://10.0.0.1/apis/kubernetes.dask.org/v1/namespaces/my-namespace/daskclusters?limit=100&fieldSelector=metadata.name%3Dsimple%2Cmetadata.name%3Dsimple "HTTP/1.1 200 OK"
[2026-02-06 09:35:05,667] kopf.objects [INFO ] [my-namespace/simple] Handler 'daskcluster_default_worker_group_replica_update/spec.worker.replicas' succeeded.
[2026-02-06 09:35:05,667] kopf.objects [INFO ] [my-namespace/simple] Creation is processed: 3 succeeded; 0 failed.
[2026-02-06 09:35:05,948] httpx [INFO ] HTTP Request: GET https://10.0.0.1/apis/apps/v1/namespaces/my-namespace/deployments?limit=100&labelSelector=dask.org%2Fworkergroup-name%3Dsimple-default "HTTP/1.1 200 OK"
[2026-02-06 09:35:05,959] httpx [INFO ] HTTP Request: POST https://10.0.0.1/apis/apps/v1/namespaces/my-namespace/deployments "HTTP/1.1 201 Created"
[2026-02-06 09:35:05,962] kopf.objects [INFO ] [my-namespace/simple-default] Scaled worker group simple-default up to 1 workers.
[2026-02-06 09:35:05,962] kopf.objects [INFO ] [my-namespace/simple-default] Handler 'daskworkergroup_create' succeeded.
[2026-02-06 09:35:05,979] kopf.objects [WARNING ] [my-namespace/simple-default] Patching failed with inconsistencies: (('remove', ('status', 'kopf'), {'progress': {'daskworkergroup_create': {'started': '2026-02-06T09:35:05.581734+00:00', 'stopped': '2026-02-06T09:35:05.962577+00:00', 'delayed': None, 'purpose': 'create', 'retries': 1, 'success': True, 'failure': False, 'message': None, 'subrefs': None}, 'daskworkergroup_replica_update/spec.worker.replicas': {'started': '2026-02-06T09:35:05.581745+00:00', 'stopped': None, 'delayed': None, 'purpose': 'create', 'retries': 0, 'success': False, 'failure': False, 'message': None, 'subrefs': None}}}, None),)
[2026-02-06 09:35:06,091] httpx [INFO ] HTTP Request: GET https://10.0.0.1/apis/kubernetes.dask.org/v1/namespaces/my-namespace/daskclusters?limit=100&fieldSelector=metadata.name%3Dsimple%2Cmetadata.name%3Dsimple "HTTP/1.1 200 OK"
[2026-02-06 09:35:06,101] httpx [INFO ] HTTP Request: GET https://10.0.0.1/apis/apps/v1/namespaces/my-namespace/deployments?limit=100&labelSelector=dask.org%2Fworkergroup-name%3Dsimple-default "HTTP/1.1 200 OK"
[2026-02-06 09:35:06,104] kopf.objects [INFO ] [my-namespace/simple-default] Scaled worker group simple-default up to 1 workers.
[2026-02-06 09:35:06,105] kopf.objects [INFO ] [my-namespace/simple-default] Handler 'daskworkergroup_replica_update/spec.worker.replicas' succeeded.
[2026-02-06 09:35:06,105] kopf.objects [INFO ] [my-namespace/simple-default] Creation is processed: 2 succeeded; 0 failed.
And these are the same logs, but this time when the DaskCluster got stuck in "Pending" state:
[2026-02-06 09:36:05,304] kopf.objects [INFO ] [my-namespace/simple] DaskCluster simple created in my-namespace.
[2026-02-06 09:36:05,305] kopf.objects [INFO ] [my-namespace/simple] Handler 'daskcluster_create' succeeded.
[2026-02-06 09:36:05,344] kopf.objects [WARNING ] [my-namespace/simple] Patching failed with inconsistencies: (('remove', ('status', 'kopf'), {'progress': {'daskcluster_create': {'started': '2026-02-06T09:36:05.304732+00:00', 'stopped': '2026-02-06T09:36:05.305222+00:00', 'delayed': None, 'purpose': 'create', 'retries': 1, 'success': True, 'failure': False, 'message': None, 'subrefs': None}, 'daskcluster_default_worker_group_replica_update/spec.worker.replicas': {'started': '2026-02-06T09:36:05.304743+00:00', 'stopped': None, 'delayed': None, 'purpose': 'create', 'retries': 0, 'success': False, 'failure': False, 'message': None, 'subrefs': None}}}, None),)
[2026-02-06 09:36:05,447] kopf.objects [INFO ] [my-namespace/simple] Creating Dask cluster components.
[2026-02-06 09:36:05,493] httpx [INFO ] HTTP Request: GET https://10.0.0.1/apis/apps/v1/namespaces/my-namespace/deployments/simple-scheduler "HTTP/1.1 404 Not Found"
[2026-02-06 09:36:05,533] httpx [INFO ] HTTP Request: POST https://10.0.0.1/apis/apps/v1/namespaces/my-namespace/deployments "HTTP/1.1 201 Created"
[2026-02-06 09:36:05,536] kopf.objects [INFO ] [my-namespace/simple] Scheduler deployment simple-scheduler created in my-namespace.
[2026-02-06 09:36:05,579] httpx [INFO ] HTTP Request: GET https://10.0.0.1/api/v1/namespaces/my-namespace/services/simple-scheduler "HTTP/1.1 404 Not Found"
[2026-02-06 09:36:05,650] httpx [INFO ] HTTP Request: POST https://10.0.0.1/api/v1/namespaces/my-namespace/services "HTTP/1.1 201 Created"
[2026-02-06 09:36:05,652] kopf.objects [INFO ] [my-namespace/simple] Scheduler service simple-scheduler created in my-namespace.
[2026-02-06 09:36:05,693] httpx [INFO ] HTTP Request: GET https://10.0.0.1/apis/kubernetes.dask.org/v1/namespaces/my-namespace/daskworkergroups/simple-default "HTTP/1.1 404 Not Found"
[2026-02-06 09:36:05,733] httpx [INFO ] HTTP Request: POST https://10.0.0.1/apis/kubernetes.dask.org/v1/namespaces/my-namespace/daskworkergroups "HTTP/1.1 201 Created"
[2026-02-06 09:36:05,734] kopf.objects [INFO ] [my-namespace/simple] Worker group simple-default created in my-namespace.
[2026-02-06 09:36:05,734] kopf.objects [INFO ] [my-namespace/simple] Handler 'daskcluster_create_components/status.phase' succeeded.
[2026-02-06 09:36:05,748] httpx [INFO ] HTTP Request: GET https://10.0.0.1/apis/kubernetes.dask.org/v1/namespaces/my-namespace/daskclusters?limit=100&fieldSelector=metadata.name%3Dsimple%2Cmetadata.name%3Dsimple "HTTP/1.1 200 OK"
[2026-02-06 09:36:05,767] httpx [INFO ] HTTP Request: PATCH https://10.0.0.1/apis/kubernetes.dask.org/v1/namespaces/my-namespace/daskclusters/simple "HTTP/1.1 200 OK"
[2026-02-06 09:36:05,769] kopf.objects [INFO ] [my-namespace/simple-scheduler] Handler 'handle_scheduler_service_status/status' succeeded.
[2026-02-06 09:36:05,770] kopf.objects [INFO ] [my-namespace/simple-scheduler] Creation is processed: 1 succeeded; 0 failed.
[2026-02-06 09:36:05,783] kopf.objects [WARNING ] [my-namespace/simple] Patching failed with inconsistencies: (('remove', ('status', 'kopf'), {'progress': {'daskcluster_create_components/status.phase': {'started': '2026-02-06T09:36:05.447688+00:00', 'stopped': '2026-02-06T09:36:05.734928+00:00', 'delayed': None, 'purpose': 'create', 'retries': 1, 'success': True, 'failure': False, 'message': None, 'subrefs': None}}}, None),)
[2026-02-06 09:36:05,860] httpx [INFO ] HTTP Request: GET https://10.0.0.1/apis/kubernetes.dask.org/v1/namespaces/my-namespace/daskclusters?limit=100&fieldSelector=metadata.name%3Dsimple%2Cmetadata.name%3Dsimple "HTTP/1.1 200 OK"
[2026-02-06 09:36:05,874] httpx [INFO ] HTTP Request: PATCH https://10.0.0.1/apis/kubernetes.dask.org/v1/namespaces/my-namespace/daskworkergroups/simple-default "HTTP/1.1 200 OK"
[2026-02-06 09:36:05,875] kopf.objects [INFO ] [my-namespace/simple-default] Successfully adopted by simple
[2026-02-06 09:36:05,884] httpx [INFO ] HTTP Request: GET https://10.0.0.1/apis/kubernetes.dask.org/v1/namespaces/my-namespace/daskclusters?limit=100&fieldSelector=metadata.name%3Dsimple%2Cmetadata.name%3Dsimple "HTTP/1.1 200 OK"
[2026-02-06 09:36:05,886] kopf.objects [INFO ] [my-namespace/simple] Handler 'daskcluster_default_worker_group_replica_update/spec.worker.replicas' succeeded.
[2026-02-06 09:36:05,886] kopf.objects [INFO ] [my-namespace/simple] Creation is processed: 3 succeeded; 0 failed.
[2026-02-06 09:36:05,896] httpx [INFO ] HTTP Request: GET https://10.0.0.1/apis/apps/v1/namespaces/my-namespace/deployments?limit=100&labelSelector=dask.org%2Fworkergroup-name%3Dsimple-default "HTTP/1.1 200 OK"
[2026-02-06 09:36:05,910] httpx [INFO ] HTTP Request: POST https://10.0.0.1/apis/apps/v1/namespaces/my-namespace/deployments "HTTP/1.1 201 Created"
[2026-02-06 09:36:05,912] kopf.objects [INFO ] [my-namespace/simple-default] Scaled worker group simple-default up to 1 workers.
[2026-02-06 09:36:05,912] kopf.objects [INFO ] [my-namespace/simple-default] Handler 'daskworkergroup_create' succeeded.
[2026-02-06 09:36:05,940] kopf.objects [WARNING ] [my-namespace/simple-default] Patching failed with inconsistencies: (('remove', ('status', 'kopf'), {'progress': {'daskworkergroup_create': {'started': '2026-02-06T09:36:05.833158+00:00', 'stopped': '2026-02-06T09:36:05.912838+00:00', 'delayed': None, 'purpose': 'create', 'retries': 1, 'success': True, 'failure': False, 'message': None, 'subrefs': None}, 'daskworkergroup_replica_update/spec.worker.replicas': {'started': '2026-02-06T09:36:05.833168+00:00', 'stopped': None, 'delayed': None, 'purpose': 'create', 'retries': 0, 'success': False, 'failure': False, 'message': None, 'subrefs': None}}}, None),)
[2026-02-06 09:36:06,054] httpx [INFO ] HTTP Request: GET https://10.0.0.1/apis/kubernetes.dask.org/v1/namespaces/my-namespace/daskclusters?limit=100&fieldSelector=metadata.name%3Dsimple%2Cmetadata.name%3Dsimple "HTTP/1.1 200 OK"
[2026-02-06 09:36:06,064] httpx [INFO ] HTTP Request: GET https://10.0.0.1/apis/apps/v1/namespaces/my-namespace/deployments?limit=100&labelSelector=dask.org%2Fworkergroup-name%3Dsimple-default "HTTP/1.1 200 OK"
[2026-02-06 09:36:06,067] kopf.objects [INFO ] [my-namespace/simple-default] Scaled worker group simple-default up to 1 workers.
[2026-02-06 09:36:06,068] kopf.objects [INFO ] [my-namespace/simple-default] Handler 'daskworkergroup_replica_update/spec.worker.replicas' succeeded.
[2026-02-06 09:36:06,068] kopf.objects [INFO ] [my-namespace/simple-default] Creation is processed: 2 succeeded; 0 failed.
[2026-02-06 09:36:07,356] kopf.activities.prob [INFO ] Activity 'now' succeeded.
Notice that the order of the log lines around the status patch from daskcluster_create_components and the one from the handle_scheduler_service_status event differs between the two runs:
- in the successful cluster
[my-namespace/simple] Handler 'daskcluster_create_components/status.phase' succeeded.
[my-namespace/simple] Patching failed with inconsistencies: (('remove', ('status', 'kopf'), {'progress': {'daskcluster_create_components/status.phase': {'started': '2026-02-06T09:35:05.044019+00:00', 'stopped': '2026-02-06T09:35:05.479805+00:00', 'delayed': None, 'purpose': 'create', 'retries': 1, 'success': True, 'failure': False, 'message': None, 'subrefs': None}}}, None),)
HTTP Request: GET https://10.0.0.1/apis/kubernetes.dask.org/v1/namespaces/my-namespace/daskclusters?limit=100&fieldSelector=metadata.name%3Dsimple%2Cmetadata.name%3Dsimple "HTTP/1.1 200 OK"
HTTP Request: PATCH https://10.0.0.1/apis/kubernetes.dask.org/v1/namespaces/my-namespace/daskclusters/simple "HTTP/1.1 200 OK"
[my-namespace/simple-scheduler] Handler 'handle_scheduler_service_status/status' succeeded.
[my-namespace/simple-scheduler] Creation is processed: 1 succeeded; 0 failed.
HTTP Request: GET https://10.0.0.1/apis/kubernetes.dask.org/v1/namespaces/my-namespace/daskclusters?limit=100&fieldSelector=metadata.name%3Dsimple%2Cmetadata.name%3Dsimple "HTTP/1.1 200 OK"
HTTP Request: PATCH https://10.0.0.1/apis/kubernetes.dask.org/v1/namespaces/my-namespace/daskworkergroups/simple-default "HTTP/1.1 200 OK"
- in the cluster stuck as Pending
[my-namespace/simple] Handler 'daskcluster_create_components/status.phase' succeeded.
HTTP Request: GET https://10.0.0.1/apis/kubernetes.dask.org/v1/namespaces/my-namespace/daskclusters?limit=100&fieldSelector=metadata.name%3Dsimple%2Cmetadata.name%3Dsimple "HTTP/1.1 200 OK"
HTTP Request: PATCH https://10.0.0.1/apis/kubernetes.dask.org/v1/namespaces/my-namespace/daskclusters/simple "HTTP/1.1 200 OK"
[my-namespace/simple-scheduler] Handler 'handle_scheduler_service_status/status' succeeded.
[my-namespace/simple-scheduler] Creation is processed: 1 succeeded; 0 failed.
[my-namespace/simple] Patching failed with inconsistencies: (('remove', ('status', 'kopf'), {'progress': {'daskcluster_create_components/status.phase': {'started': '2026-02-06T09:36:05.447688+00:00', 'stopped': '2026-02-06T09:36:05.734928+00:00', 'delayed': None, 'purpose': 'create', 'retries': 1, 'success': True, 'failure': False, 'message': None, 'subrefs': None}}}, None),)
HTTP Request: GET https://10.0.0.1/apis/kubernetes.dask.org/v1/namespaces/my-namespace/daskclusters?limit=100&fieldSelector=metadata.name%3Dsimple%2Cmetadata.name%3Dsimple "HTTP/1.1 200 OK"
HTTP Request: PATCH https://10.0.0.1/apis/kubernetes.dask.org/v1/namespaces/my-namespace/daskworkergroups/simple-default "HTTP/1.1 200 OK"
What caught my eye is that, in the case where the cluster got stuck, the line [my-namespace/simple] Patching failed with inconsistencies appears after the lines annotated with [my-namespace/simple-scheduler]. This pattern repeats when I review the controller logs for other clusters that got stuck and compare them to successful clusters, i.e. I see the same difference in the sequencing of the log lines.
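The ordering difference described above can be checked mechanically when scanning many operator logs. The following is a minimal sketch, not part of dask-kubernetes: the function name `patch_warning_after_service_handler` and its `cluster` parameter are hypothetical, and it only assumes the log line shapes shown in the excerpts above.

```python
import re

# Matches the kopf object prefix, e.g. "[my-namespace/simple]".
# Timestamp and level brackets contain no "/", so they are skipped.
OBJECT_RE = re.compile(r"\[(?P<namespace>[^/\]\s]+)/(?P<name>[^\]\s]+)\]")


def patch_warning_after_service_handler(lines, cluster="simple"):
    """Return True if the 'Patching failed' warning that mentions
    daskcluster_create_components is logged after the scheduler service
    handler succeeded (the ordering seen for the stuck cluster).

    Hypothetical helper: it only inspects log text, not cluster state.
    """
    service_handler_idx = None
    patch_warning_idx = None
    for idx, line in enumerate(lines):
        match = OBJECT_RE.search(line)
        if match is None:
            continue
        name = match.group("name")
        if (
            name == f"{cluster}-scheduler"
            and "handle_scheduler_service_status" in line
            and "succeeded" in line
        ):
            service_handler_idx = idx
        elif (
            name == cluster
            and "Patching failed with inconsistencies" in line
            and "daskcluster_create_components" in line
        ):
            patch_warning_idx = idx
    if service_handler_idx is None or patch_warning_idx is None:
        return False
    return patch_warning_idx > service_handler_idx
```

Feeding it the lines of one cluster's creation sequence returns True only for the ordering observed in the stuck case. A True result is a hint worth investigating, not proof of the race.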
This ordering makes me wonder whether there is a race condition triggered after the service gets created by daskcluster_create_components in https://github.com/dask/dask-kubernetes/blob/main/dask_kubernetes/operator/controller/controller.py#L383.
- Normally, the DaskCluster is first set to Pending within daskcluster_create_components due to https://github.com/dask/dask-kubernetes/blob/main/dask_kubernetes/operator/controller/controller.py#L400, and then the service status event is processed by handle_scheduler_service_status, which updates the status to Running due to https://github.com/dask/dask-kubernetes/blob/main/dask_kubernetes/operator/controller/controller.py#L423.
- Sometimes, the service status event is processed by handle_scheduler_service_status, setting the status to Running before daskcluster_create_components has completed. When daskcluster_create_components eventually completes, it updates the status back to Pending.
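The two orderings above can be illustrated with a toy model in which both handlers write the phase unconditionally, so the last write wins. This is only a sketch of the suspected behaviour, not the controller code: `write_phase`, `simulate`, `run`, and the delays are hypothetical stand-ins for the two handlers and their timing.

```python
import asyncio


async def write_phase(status, phase, delay):
    """Stand-in for a handler that patches status.phase after some work."""
    await asyncio.sleep(delay)
    status["phase"] = phase  # unconditional write: last writer wins


async def simulate(components_delay, service_delay):
    """Run both hypothetical handlers concurrently and return the final phase."""
    status = {}
    await asyncio.gather(
        # Models daskcluster_create_components setting Pending.
        write_phase(status, "Pending", components_delay),
        # Models handle_scheduler_service_status setting Running.
        write_phase(status, "Running", service_delay),
    )
    return status["phase"]


def run(components_delay, service_delay):
    return asyncio.run(simulate(components_delay, service_delay))
```

With `run(0.0, 0.05)` the Pending write lands first and the cluster ends up Running, and with `run(0.05, 0.0)` the Pending write lands last and the cluster stays Pending, which matches the stuck case described above.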
Assuming this analysis is correct, perhaps daskcluster_create_components could be enhanced so that it does not update the status if it has already been set to Running.
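A minimal sketch of that guard, under the assumption that the handler can see the current phase before writing. The function names `guarded_pending` and `apply_writes` are hypothetical and do not exist in dask-kubernetes. They only model the decision, not the Kubernetes API calls.

```python
def guarded_pending(current_phase):
    """Phase the create-components step would write under the proposed guard.

    Hypothetical helper: keep Running if it is already set, else Pending.
    """
    if current_phase == "Running":
        return "Running"
    return "Pending"


def apply_writes(order):
    """Apply the two handlers' status writes in the given order.

    `order` is a sequence of handler names. The names mirror the handlers
    discussed above, but the logic here is only an illustration.
    """
    phase = None
    for handler in order:
        if handler == "handle_scheduler_service_status":
            phase = "Running"
        elif handler == "daskcluster_create_components":
            phase = guarded_pending(phase)
        else:
            raise ValueError(f"unknown handler: {handler}")
    return phase
```

With the guard, both orderings of the two handlers end in Running. In a real controller the read and the write are separate API calls, so a small window may remain unless the update is made conditional on the state that was read.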
Environment:
- Dask version: dask-kubernetes version 2025.7.0
- Python version: N/A
- Operating System: Azure AKS 1.32.5
- Install method (conda, pip, source): operator installed via chart in https://helm.dask.org