fix: mark all devices unhealthy if NVML health check failed to start #1641

Open
devincd wants to merge 2 commits into NVIDIA:main from devincd:fix/mark-all-unhealthy-on-checkhealth-failure

devincd commented Mar 3, 2026

PR Description

Overview

This PR ensures that if the NVML health check fails to initialize during the plugin startup phase (e.g., nvml.Init() returns ERROR_UNKNOWN), all managed devices are immediately marked as Unhealthy.

The Problem

Currently, when CheckHealth fails to start, the plugin logs an error but continues to serve. This creates a "Zombie Node" state where:

  1. The plugin is registered, and the node reports healthy GPU capacity.
  2. The K8s scheduler continues to place Pods on the node.
  3. Kubelet's GetPreferredAllocation calls fail because NVML is unstable.
  4. Pods enter an UnexpectedAdmissionError loop, which can lead to kube-apiserver and etcd OOM due to high-frequency object churn.

The Fix

By marking all devices as Unhealthy when the health check fails to initialize:

  • The plugin will report 0 healthy GPUs to the Kubelet.
  • Kubelet will update the Node status, and the Scheduler will stop scheduling new GPU workloads to this node.
  • This effectively fences off the broken node and protects the cluster control plane from cascading failures.
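
The idea behind the fix can be illustrated with a minimal, self-contained Go sketch. It is not the code from this PR: `Device`, `runHealthCheck`, `healthyCount`, and `simulateFailedStart` are hypothetical names, and the sketch assumes the plugin reports unhealthy devices over a channel. Only the `Healthy` and `Unhealthy` strings mirror the values used by the Kubernetes device plugin API.

```go
package main

import (
	"errors"
	"fmt"
)

// Health values matching the strings used by the Kubernetes device plugin API.
const (
	Healthy   = "Healthy"
	Unhealthy = "Unhealthy"
)

// Device is a hypothetical stand-in for one plugin-managed GPU.
type Device struct {
	ID     string
	Health string
}

// runHealthCheck is a hypothetical wrapper around health check start-up.
// If start returns an error, every device is pushed onto the unhealthy
// channel instead of the error only being logged.
func runHealthCheck(devices []*Device, unhealthy chan<- *Device, start func() error) error {
	err := start()
	if err == nil {
		return nil
	}
	for _, d := range devices {
		unhealthy <- d
	}
	return fmt.Errorf("health check failed to start: %w", err)
}

// healthyCount returns how many devices are still marked Healthy.
func healthyCount(devices []*Device) int {
	n := 0
	for _, d := range devices {
		if d.Health == Healthy {
			n++
		}
	}
	return n
}

// simulateFailedStart wires the pieces together for two GPUs whose health
// check cannot start, and returns the resulting healthy count.
func simulateFailedStart() int {
	devices := []*Device{
		{ID: "GPU-0", Health: Healthy},
		{ID: "GPU-1", Health: Healthy},
	}
	// Buffered so the sends in runHealthCheck do not block in this sketch.
	unhealthy := make(chan *Device, len(devices))

	failingStart := func() error { return errors.New("nvml init: ERROR_UNKNOWN") }
	if err := runHealthCheck(devices, unhealthy, failingStart); err != nil {
		fmt.Println(err)
	}
	close(unhealthy)

	// Consumer side: mark each reported device Unhealthy.
	for d := range unhealthy {
		d.Health = Unhealthy
	}
	return healthyCount(devices)
}

func main() {
	fmt.Println("healthy GPUs:", simulateFailedStart())
}
```

In this sketch the start-up error is still returned to the caller, so it can be logged as before; the difference is that the device states change as well.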

Verification Results
[✓] Verified that if nvml.Init() fails in the health check goroutine, ListAndWatch sends an unhealthy list to Kubelet.
[✓] Verified that the Node's allocatable GPU count drops to 0 in the Kubernetes API.
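
The two checks above can be reproduced in miniature without a cluster. The following Go sketch is a simplified model, not the plugin's actual ListAndWatch implementation: `watchHealth`, `snapshot`, `allocatable`, and `advertisedCounts` are hypothetical names, `send` stands in for the gRPC stream to the Kubelet, and the sketch assumes that the allocatable count equals the number of devices reported as Healthy.

```go
package main

import "fmt"

// Health values matching the strings used by the Kubernetes device plugin API.
const (
	Healthy   = "Healthy"
	Unhealthy = "Unhealthy"
)

// Device is a hypothetical stand-in for one plugin-managed GPU.
type Device struct {
	ID     string
	Health string
}

// snapshot copies the current device states so that later updates do not
// alter lists that were already sent.
func snapshot(devices []*Device) []Device {
	out := make([]Device, 0, len(devices))
	for _, d := range devices {
		out = append(out, *d)
	}
	return out
}

// watchHealth is a simplified model of a ListAndWatch loop. It sends the
// initial list, then re-sends the full list each time a device arrives on
// the unhealthy channel.
func watchHealth(devices []*Device, unhealthy <-chan *Device, send func([]Device)) {
	send(snapshot(devices))
	for d := range unhealthy {
		d.Health = Unhealthy
		send(snapshot(devices))
	}
}

// allocatable counts the Healthy devices in one advertised list
// (assumption: this is the count the node would expose as allocatable).
func allocatable(list []Device) int {
	n := 0
	for _, d := range list {
		if d.Health == Healthy {
			n++
		}
	}
	return n
}

// advertisedCounts simulates a failed health check start for two GPUs and
// returns the allocatable count in the first and in the last list sent.
func advertisedCounts() (first, last int) {
	devices := []*Device{
		{ID: "GPU-0", Health: Healthy},
		{ID: "GPU-1", Health: Healthy},
	}
	unhealthy := make(chan *Device, len(devices))
	for _, d := range devices {
		unhealthy <- d // every device is reported after the failed start
	}
	close(unhealthy)

	var sent [][]Device
	watchHealth(devices, unhealthy, func(l []Device) { sent = append(sent, l) })
	return allocatable(sent[0]), allocatable(sent[len(sent)-1])
}

func main() {
	first, last := advertisedCounts()
	fmt.Printf("allocatable GPUs: first list %d, last list %d\n", first, last)
}
```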

Fixes #1640

copy-pr-bot bot commented Mar 3, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

devincd added 2 commits March 3, 2026 15:47
…eckhealth failure.

Update health check error logging to mark devices as unhealthy.

Signed-off-by: devincd <505259926@qq.com>
Signed-off-by: devincd <505259926@qq.com>
devincd force-pushed the fix/mark-all-unhealthy-on-checkhealth-failure branch from b0f7c75 to 20df1c4 on March 3, 2026 07:49

Development

Successfully merging this pull request may close these issues.

[Bug]: Inconsistent NVML initialization state leads to "Zombie Node" and Control Plane OOM