
NVIDIA Device Plugin Fails to Detect GPU Despite Proper Environment Configuration: "Incompatible strategy detected auto" #1574

@UrmsOne

Description


What I tried

On Ubuntu 24.04, I have installed Kubernetes v1.34.3 and containerd v1.7.28. The GPU is an NVIDIA GeForce RTX 5090. The NVIDIA driver, nvidia-container-toolkit, and containerd are all installed and configured.
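For reference, "configured" here means roughly the standard steps from the NVIDIA Container Toolkit documentation (a sketch; the exact commands run on this node may have differed):

# Register the nvidia runtime in /etc/containerd/config.toml and restart containerd
nvidia-ctk runtime configure --runtime=containerd
systemctl restart containerd
# (The toolkit also offers a --set-as-default option to make nvidia the default runtime;
#  whether that was used here is not shown.)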

Env

nvidia driver

root@gpu-node-5090-1:~/proxy# nvidia-smi
Sun Dec 21 17:39:34 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.195.03             Driver Version: 570.195.03     CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 5090        Off |   00000000:02:00.0 Off |                  N/A |
|  0%   35C    P8             11W /  575W |      15MiB /  32607MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A            2725      G   /usr/lib/xorg/Xorg                        4MiB |
+-----------------------------------------------------------------------------------------+
root@gpu-node-5090-1:~/proxy#

nvidia-container-toolkit version

root@gpu-node-5090-1:~/proxy# nvidia-container-cli --version
cli-version: 1.18.1
lib-version: 1.18.1
build date: 2025-11-24T14:45+00:00
build revision: 889a3bb5408c195ed7897ba2cb8341c7d249672f
build compiler: x86_64-linux-gnu-gcc-7 7.5.0
build platform: x86_64

containerd

root@gpu-node-5090-1:~/proxy# containerd version
INFO[2025-12-21T17:40:26.724003653+08:00] starting containerd                           revision= version=1.7.28

k8s

root@gpu-node-5090-1:~/proxy# kubectl version
Client Version: v1.34.3
Kustomize Version: v5.7.1
Server Version: v1.34.3

nvidia-device-plugin 0.17.1
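(The plugin runs as a DaemonSet in kube-system; the manifest is not reproduced here. For illustration only, a typical deployment of this version uses the static manifest from the project README, along the lines of:)

# Illustrative deployment command; I may have used the Helm chart instead
kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.17.1/deployments/static/nvidia-device-plugin.yml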

What I observed

Containers launched directly with the ctr command can run nvidia-smi inside without problems.

root@gpu-node-5090-1:~/k8s-test# ctr -n k8s.io run  --rm    --runtime io.containerd.runc.v2     --gpus 0   docker.io/nvidia/cuda:12.8.1-devel-ubuntu24.04     cuda-test     nvidia-smi
Fri Dec 19 15:17:23 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.195.03             Driver Version: 570.195.03     CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 5090        Off |   00000000:02:00.0 Off |                  N/A |
|  0%   35C    P8             11W /  575W |      15MiB /  32607MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

In Kubernetes, however, the nvidia-device-plugin pod runs normally, but its logs indicate that it fails to detect the GPU.

NVIDIA Device Plugin pod is running

root@gpu-node-5090-1:~/proxy# kubectl -n kube-system get po  |grep nvidia
nvidia-device-plugin-daemonset-4ckwp      1/1     Running   0          42h

But the plugin logs show:

I1219 12:04:10.062602       1 main.go:235] "Starting NVIDIA Device Plugin" version=<
	3c378193
	commit: 3c378193fcebf6e955f0d65bd6f2aeed099ad8ea
 >
I1219 12:04:10.062626       1 main.go:238] Starting FS watcher for /var/lib/kubelet/device-plugins
I1219 12:04:10.062648       1 main.go:245] Starting OS watcher.
I1219 12:04:10.063447       1 main.go:260] Starting Plugins.
I1219 12:04:10.063460       1 main.go:317] Loading configuration.
I1219 12:04:10.063674       1 main.go:342] Updating config with default resource matching patterns.
I1219 12:04:10.063800       1 main.go:353]
Running with config:
{
  "version": "v1",
  "flags": {
    "migStrategy": "none",
    "failOnInitError": false,
    "mpsRoot": "",
    "nvidiaDriverRoot": "/",
    "nvidiaDevRoot": "/",
    "gdsEnabled": false,
    "mofedEnabled": false,
    "useNodeFeatureAPI": null,
    "deviceDiscoveryStrategy": "auto",
    "plugin": {
      "passDeviceSpecs": false,
      "deviceListStrategy": [
        "envvar"
      ],
      "deviceIDStrategy": "uuid",
      "cdiAnnotationPrefix": "cdi.k8s.io/",
      "nvidiaCTKPath": "/usr/bin/nvidia-ctk",
      "containerDriverRoot": "/driver-root"
    }
  },
  "resources": {
    "gpus": [
      {
        "pattern": "*",
        "name": "nvidia.com/gpu"
      }
    ]
  },
  "sharing": {
    "timeSlicing": {}
  },
  "imex": {}
}
I1219 12:04:10.063806       1 main.go:356] Retrieving plugins.
E1219 12:04:10.063872       1 factory.go:112] Incompatible strategy detected auto
E1219 12:04:10.063875       1 factory.go:113] If this is a GPU node, did you configure the NVIDIA Container Toolkit?
E1219 12:04:10.063877       1 factory.go:114] You can check the prerequisites at: https://github.com/NVIDIA/k8s-device-plugin#prerequisites
E1219 12:04:10.063880       1 factory.go:115] You can learn how to set the runtime at: https://github.com/NVIDIA/k8s-device-plugin#quick-start
E1219 12:04:10.063882       1 factory.go:116] If this is not a GPU node, you should set up a toleration or nodeSelector to only deploy this plugin on GPU nodes
I1219 12:04:10.063885       1 main.go:381] No devices found. Waiting indefinitely.

How should I resolve this issue?
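From the error message, the plugin itself apparently cannot load the NVIDIA libraries, which commonly indicates that the plugin pod was not started with the nvidia runtime. Two quick checks, as a sketch (the expected values are my assumption based on the toolkit documentation, not something I have confirmed on this node):

# Is the nvidia runtime the containerd default? The plugin pod needs it,
# either via default_runtime_name = "nvidia" or a runtimeClassName on the DaemonSet.
grep -n 'default_runtime_name' /etc/containerd/config.toml

# Does the node advertise GPUs to the scheduler? Expected: nvidia.com/gpu: 1
kubectl describe node gpu-node-5090-1 | grep -A6 Capacity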
