fix: mark all devices unhealthy if NVML health check failed to start by devincd · Pull Request #1641 · NVIDIA/k8s-device-plugin

devincd · 2026-03-03T07:16:43Z

PR Description

Overview

This PR ensures that if the NVML health check fails to initialize during the plugin startup phase (e.g., nvml.Init() returns ERROR_UNKNOWN), all managed devices are immediately marked as Unhealthy.

The Problem

Currently, when CheckHealth fails to start, the plugin logs an error but continues to serve. This creates a "Zombie Node" state where:

The plugin is registered, and the node reports healthy GPU capacity.
The K8s scheduler continues to place Pods on the node.
Kubelet's GetPreferredAllocation calls fail because NVML is unstable.
Pods enter an UnexpectedAdmissionError loop, which can lead to kube-apiserver and etcd OOM due to high-frequency object churn.

The Fix

By marking all devices as Unhealthy when the health check fails to initialize:

The plugin will report 0 healthy GPUs to the Kubelet.
Kubelet will update the Node status, and the Scheduler will stop scheduling new GPU workloads to this node.
This effectively fences the broken node and protects the cluster control plane from cascading failures.
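
For reviewers who want the idea without reading the diff, here is a minimal, self-contained sketch of the behavior described above. It is not the code in this PR: `Device`, `startHealthCheck`, `markAllUnhealthy`, `healthyCount`, and `errNVMLInit` are hypothetical names, and the `initNVML` callback merely stands in for NVML initialization. Only the "Healthy" / "Unhealthy" strings are taken from the Kubernetes device plugin API.

```go
package main

import (
	"errors"
	"fmt"
)

// Health strings as used by the Kubernetes device plugin API.
const (
	Healthy   = "Healthy"
	Unhealthy = "Unhealthy"
)

// Device is a simplified, hypothetical stand-in for the plugin's device type.
type Device struct {
	ID     string
	Health string
}

// errNVMLInit simulates an NVML initialization failure (assumption for the demo).
var errNVMLInit = errors.New("nvml init: ERROR_UNKNOWN")

// markAllUnhealthy flips every managed device to Unhealthy.
func markAllUnhealthy(devices []*Device) {
	for _, d := range devices {
		d.Health = Unhealthy
	}
}

// healthyCount returns how many devices would be advertised as healthy.
func healthyCount(devices []*Device) int {
	n := 0
	for _, d := range devices {
		if d.Health == Healthy {
			n++
		}
	}
	return n
}

// startHealthCheck runs initNVML (a stand-in for NVML initialization).
// If initialization fails, it marks all devices Unhealthy before returning
// the error, instead of only logging and leaving the devices Healthy.
func startHealthCheck(initNVML func() error, devices []*Device) error {
	if err := initNVML(); err != nil {
		markAllUnhealthy(devices)
		return fmt.Errorf("health check failed to start: %w", err)
	}
	return nil
}

func main() {
	devices := []*Device{
		{ID: "GPU-0", Health: Healthy},
		{ID: "GPU-1", Health: Healthy},
	}
	err := startHealthCheck(func() error { return errNVMLInit }, devices)
	fmt.Println("error:", err)
	fmt.Println("healthy GPUs:", healthyCount(devices))
}
```

In this sketch the failure path mutates device health before the error is surfaced, so whatever list is next reported to the Kubelet already contains zero healthy devices.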

Verification Results
[✓] Verified that if nvml.Init() fails in the health check goroutine, ListAndWatch sends an unhealthy list to Kubelet.
[✓] Verified that the Node's allocatable GPU count drops to 0 in the Kubernetes API.

Fixes #1640

copy-pr-bot · 2026-03-03T07:16:46Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

devincd added 2 commits March 3, 2026 15:47

Improve health check error handling; Mark all unhealthy when start ch…

ccc8fc6

…eckhealth failure. Update health check error logging to mark devices as unhealthy. Signed-off-by: devincd <505259926@qq.com>

update log

20df1c4

Signed-off-by: devincd <505259926@qq.com>

devincd force-pushed the fix/mark-all-unhealthy-on-checkhealth-failure branch from b0f7c75 to 20df1c4 Compare March 3, 2026 07:49

devincd wants to merge 2 commits into NVIDIA:main from devincd:fix/mark-all-unhealthy-on-checkhealth-failure
