
[Bug]: Inconsistent NVML initialization state leads to "Zombie Node" and Control Plane OOM #1640

@devincd

Description


1. Quick Debug Information

  • OS/Version(e.g. RHEL8.6, Ubuntu22.04): Ubuntu22.04
  • Kernel Version: 5.15.0-122-generic
  • Container Runtime Type/Version(e.g. Containerd, CRI-O, Docker): Containerd
  • K8s Flavor/Version(e.g. K8s, OCP, Rancher, GKE, EKS): v1.27.7

2. Issue or feature description

Summary

A critical cascading failure was observed in which the nvidia-device-plugin (v0.18.2) enters a "zombie" state after a partial NVML initialization failure. The plugin registers successfully with the Kubelet, but subsequent internal NVML calls fail. The result is a high-frequency Pod creation loop that eventually causes kube-apiserver and etcd to crash due to OOM.

Expected Behavior

If NVML fails to initialize during any stage of the plugin lifecycle (especially during Start() or GetPreferredAllocation), the plugin should:

  • Mark the node's GPU resources as unhealthy (capacity = 0) OR
  • Terminate itself to trigger a Container restart, preventing the scheduler from sending more Pods to a broken node.
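The second option above can be sketched as a fail-fast guard around NVML initialization. This is a minimal illustration, not the plugin's actual code: `initNVML` is a hypothetical stub standing in for `nvml.Init()` from go-nvml, wired to fail with the same error seen in the logs, and the exit function is injected so the sketch can run to completion.

```go
package main

import (
	"errors"
	"fmt"
	"os"
)

// initNVML stands in for nvml.Init() from go-nvml; stubbed to fail with
// the same error seen in the logs (hypothetical, for illustration only).
var initNVML = func() error {
	return errors.New("failed to initialize NVML: ERROR_UNKNOWN")
}

// ensureNVML sketches the requested fail-fast behavior: any NVML init
// failure terminates the plugin via the injected exit function, so the
// container restarts visibly instead of serving a healthy-looking
// resource with health checks silently disabled.
func ensureNVML(exit func(code int)) error {
	if err := initNVML(); err != nil {
		fmt.Fprintf(os.Stderr, "%v; terminating plugin\n", err)
		exit(1)
		return err
	}
	return nil
}

func main() {
	// In the real plugin exit would be os.Exit; here we only report the
	// outcome so the sketch runs to completion.
	if err := ensureNVML(func(code int) {}); err != nil {
		fmt.Println("plugin would have exited:", err)
	}
}
```

A crash-looping container is visible in `kubectl get pods` and stops the Kubelet from admitting new GPU Pods, unlike the current silent degradation.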

Current Behavior (The Bug)

The plugin exhibits inconsistent NVML initialization states:

  1. GetPlugins() succeeds: The plugin registers GPUs to the node, making the node appear "Healthy" to the K8s scheduler.
  2. p.Start() (Health Check) fails: Logs failed to initialize NVML: ERROR_UNKNOWN, but the plugin continues running with health checks disabled.
  3. GetPreferredAllocation fails: When a Pod is scheduled, Kubelet calls this RPC. The plugin calls alignedAlloc -> gpuallocator.NewDevices() -> nvml.Init(). This fails with ERROR_UNKNOWN, returning an error to Kubelet.
  4. Cascading Failure: The Pod enters UnexpectedAdmissionError (Failed phase). The Controller-Manager immediately creates a replacement Pod, which is scheduled back to the same node, creating a tight loop that overwhelms the K8s control plane.
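Step 3 can be illustrated with a stripped-down RPC handler. The types below are hypothetical, simplified shapes of the device-plugin API (the real ones live in `k8s.io/kubelet/pkg/apis/deviceplugin/v1beta1`), and `initNVML` is a stub for the `nvml.Init()` call made inside `gpuallocator.NewDevices()`:

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical, simplified shapes of the device-plugin RPC messages.
type PreferredAllocationRequest struct{ AvailableDeviceIDs []string }
type PreferredAllocationResponse struct{ DeviceIDs []string }

// initNVML stands in for the nvml.Init() call made inside
// gpuallocator.NewDevices(); stubbed to fail as in step 3 (hypothetical).
var initNVML = func() error { return errors.New("ERROR_UNKNOWN") }

type plugin struct{}

// GetPreferredAllocation mirrors the failure path described above: the
// per-call NVML init fails, the RPC returns an error to the Kubelet,
// and the Pod is rejected with UnexpectedAdmissionError.
func (p *plugin) GetPreferredAllocation(req *PreferredAllocationRequest) (*PreferredAllocationResponse, error) {
	if err := initNVML(); err != nil {
		return nil, fmt.Errorf("failed to initialize NVML: %w", err)
	}
	return &PreferredAllocationResponse{DeviceIDs: req.AvailableDeviceIDs}, nil
}

func main() {
	p := &plugin{}
	_, err := p.GetPreferredAllocation(&PreferredAllocationRequest{AvailableDeviceIDs: []string{"GPU-0"}})
	fmt.Println("RPC error returned to kubelet:", err)
}
```

Because the error surfaces only at allocation time, the scheduler has no signal to avoid the node, which is what closes the loop in step 4.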

Root Cause Analysis (Source Code)

In cmd/nvidia-device-plugin/main.go and internal/plugin/server.go:

  • Redundant Inits: NVML is initialized multiple times across different code paths. In this case, the first init in GetPlugins() worked, but subsequent inits in the CheckHealth() goroutine and alignedAlloc() failed.

  • Lack of Fail-Fast: When server.go:154 logs Failed to start health check, it does not stop the GRPC server or update the device status, leaving the node in a "false-positive" healthy state.
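One possible shape of a fix for the redundant-init problem is to initialize NVML exactly once and share the result across `GetPlugins()`, the health-check goroutine, and `alignedAlloc()`, so every code path agrees on NVML state. This is a sketch under that assumption, with `initNVMLImpl` as a hypothetical stub and a counter showing how often the underlying init actually runs:

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// initNVMLImpl stands in for nvml.Init(); the counter shows how often
// the underlying init actually runs (hypothetical stub).
var (
	initCalls    int
	initNVMLImpl = func() error {
		initCalls++
		return errors.New("ERROR_UNKNOWN")
	}
)

// nvmlOnce guards a single shared initialization whose result (success
// or failure) is reused by every caller, eliminating the inconsistent
// "first init worked, later inits failed" state.
var (
	nvmlOnce sync.Once
	nvmlErr  error
)

func nvmlInitOnce() error {
	nvmlOnce.Do(func() { nvmlErr = initNVMLImpl() })
	return nvmlErr
}

func main() {
	for i := 0; i < 3; i++ {
		fmt.Println(nvmlInitOnce()) // same result every call
	}
	fmt.Println("underlying init ran", initCalls, "time(s)")
}
```

With a single shared init, a failure observed in `GetPlugins()` would also be seen by the health check and allocation paths, making the fail-fast behavior above straightforward to apply consistently.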

3. Information to attach (optional if deemed irrelevant)

K8s-device-plugin logs

I0213 09:51:50.965469       1 main.go:369] Retrieving plugins.
I0213 09:51:55.562661       1 server.go:197] Starting GRPC server for 'nvidia.com/gpu'
I0213 09:51:55.564144       1 server.go:141] Starting to serve 'nvidia.com/gpu' on /var/lib/kubelet/device-plugins/nvidia-gpu.sock
I0213 09:51:55.565671       1 server.go:148] Registered device plugin for 'nvidia.com/gpu' with Kubelet
E0213 09:51:55.600000       1 server.go:154] Failed to start health check: failed to initialize NVML: ERROR_UNKNOWN; continuing with health checks disabled

Kubelet logs (Event showing Admission Error)

(screenshot: Kubelet event showing UnexpectedAdmissionError)

Pod Count Surge

(screenshot: graph of Pod count surging on the affected node)

