Description
The template below is mostly useful for bug reports. Feel free to remove anything which doesn't apply to you and add more information where it makes sense.
Important Note: NVIDIA AI Enterprise customers can get support from NVIDIA Enterprise support. Please open a case here.
1. Quick Debug Information
- OS/Version (e.g. RHEL8.6, Ubuntu22.04): Ubuntu22.04
- Kernel Version: 5.15.0-122-generic
- Container Runtime Type/Version (e.g. Containerd, CRI-O, Docker): Containerd
- K8s Flavor/Version (e.g. K8s, OCP, Rancher, GKE, EKS): v1.27.7
2. Issue or feature description
Summary
A critical cascading failure was observed where the nvidia-device-plugin (v0.18.2) enters a "zombie" state after a partial NVML initialization failure. The plugin successfully registers with the Kubelet, but subsequent internal NVML calls fail. This leads to a high-frequency Pod creation loop that eventually crashes the kube-apiserver and etcd due to OOM.
Expected Behavior
If NVML fails to initialize during any stage of the plugin lifecycle (especially during Start() or GetPreferredAllocation), the plugin should:
- Mark the node's GPU resources as unhealthy (capacity = 0) OR
- Terminate itself to trigger a Container restart, preventing the scheduler from sending more Pods to a broken node.
Current Behavior (The Bug)
The plugin exhibits inconsistent NVML initialization states:
1. `GetPlugins()` succeeds: The plugin registers GPUs with the node, making the node appear "Healthy" to the K8s scheduler.
2. `p.Start()` (health check) fails: Logs `failed to initialize NVML: ERROR_UNKNOWN`, but the plugin continues running with health checks disabled.
3. `GetPreferredAllocation` fails: When a Pod is scheduled, Kubelet calls this RPC. The plugin calls `alignedAlloc` -> `gpuallocator.NewDevices()` -> `nvml.Init()`. This fails with `ERROR_UNKNOWN`, returning an error to Kubelet.
4. Cascading failure: The Pod enters `UnexpectedAdmissionError` (Failed phase). The Controller-Manager immediately creates a replacement Pod, which is scheduled back to the same node, creating a tight loop that overwhelms the K8s control plane.
Root Cause Analysis (Source Code)
In cmd/nvidia-device-plugin/main.go and internal/plugin/server.go:
- Redundant inits: NVML is initialized multiple times across different code paths. In this case, the first init in `GetPlugins()` succeeded, but subsequent inits in the `CheckHealth()` goroutine and `alignedAlloc()` failed.
- Lack of fail-fast: When `server.go:154` logs `Failed to start health check`, it does not stop the gRPC server or update the device status, leaving the node in a "false-positive" healthy state.
3. Information to attach (optional if deemed irrelevant)
K8s-device-plugin logs
I0213 09:51:50.965469 1 main.go:369] Retrieving plugins.
I0213 09:51:55.562661 1 server.go:197] Starting GRPC server for 'nvidia.com/gpu'
I0213 09:51:55.564144 1 server.go:141] Starting to serve 'nvidia.com/gpu' on /var/lib/kubelet/device-plugins/nvidia-gpu.sock
I0213 09:51:55.565671 1 server.go:148] Registered device plugin for 'nvidia.com/gpu' with Kubelet
E0213 09:51:55.600000 1 server.go:154] Failed to start health check: failed to initialize NVML: ERROR_UNKNOWN; continuing with health checks disabled
Kubelet logs (Event showing Admission Error)
Pod Count Surge
Additional information that might help better understand your environment and reproduce the bug:
- Docker version from `docker version`
- Docker command, image and tag used
- Kernel version from `uname -a`
- Any relevant kernel output lines from `dmesg`
- NVIDIA packages version from `dpkg -l '*nvidia*'` or `rpm -qa '*nvidia*'`
- NVIDIA Container Toolkit version from `nvidia-ctk --version`