Bare-metal K3s v1.33.3+k3s1 on kernel 6.15.11-2-MANJARO.
Not a new install; this setup had been stable for many months. After rebooting the node that has the GPU, the pod now crash-loops with this message:
Error: failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running prestart hook #0: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy' nvidia-container-cli: ldcache error: process /sbin/ldconfig terminated with signal 9
I'm confused by the OCI, 'legacy', and ldcache references.
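From what I can gather, 'legacy' just means the toolkit injected the classic nvidia-container-cli prestart hook rather than using CDI, and the 'ldcache' step is that hook running the host's ldconfig against the container's rootfs so the driver libraries get registered. Signal 9 is SIGKILL, so something on the host appears to be killing that ldconfig child. The checks I'm planning next (the config path, and the '@' prefix meaning "execute the host's ldconfig", are my understanding of the toolkit defaults, not something I've confirmed):
$ journalctl -k --since "-1h" | grep -i -E 'ldconfig|oom|seccomp'
$ grep ldconfig /etc/nvidia-container-runtime/config.toml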
Chart reference in ArgoCD:
- repoURL: https://nvidia.github.io/k8s-device-plugin
  chart: nvidia-device-plugin
  targetRevision: 0.17.3
Helm Values File:
---
# yaml-language-server: $schema=https://json.schemastore.org/helmfile
nodeSelector:
  nvidia.feature.node.kubernetes.io/gpu.3060: "true"
runtimeClassName: nvidia
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: nvidia.feature.node.kubernetes.io/gpu.3060
              operator: In
              values:
                - "true"
config:
  map:
    default: |-
      version: v1
      flags:
        migStrategy: none
      sharing:
        timeSlicing:
          renameByDefault: false
          failRequestsGreaterThanOne: false
          resources:
            - name: nvidia.com/gpu
              replicas: 6

# Subcharts
nfd: {}
gfd:
  enabled: false
(NFD is already installed via its own Helm chart.)
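For context, my expectation is that the time-slicing block above (replicas: 6) should advertise six nvidia.com/gpu slots on the node once the plugin is healthy; a quick way to confirm, with the node name as a placeholder:
$ kubectl describe node <gpu-node> | grep -A 2 'nvidia.com/gpu'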
Current versions on host:
$ pacman -Q libnvidia-container
libnvidia-container 1.17.8-1
$ pacman -Q nvidia-container-toolkit
nvidia-container-toolkit 1.17.8-1
$ pacman -Q nvidia-utils
nvidia-utils 575.64.05-1
$ nvidia-smi
Tue Sep  2 15:33:10 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 575.64.05              Driver Version: 575.64.05      CUDA Version: 12.9     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3060        Off |   00000000:09:00.0  On |                  N/A |
|  0%   54C    P3             30W / 170W  |    1665MiB / 12288MiB  |     37%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
$ nvidia-container-cli info
NVRM version: 575.64.05
CUDA version: 12.9
Device Index: 0
Device Minor: 0
Model: NVIDIA GeForce RTX 3060
Brand: GeForce
GPU UUID: GPU-ace6a26d-6a78-9562-4fbc-69984c397347
Bus Location: 00000000:09:00.0
Architecture: 8.6
$ nvidia-container-cli list
/dev/nvidiactl
/dev/nvidia-uvm
/dev/nvidia-uvm-tools
/dev/nvidia-modeset
/dev/nvidia0
/usr/bin/nvidia-smi
/usr/bin/nvidia-debugdump
/usr/bin/nvidia-persistenced
/usr/bin/nvidia-cuda-mps-control
/usr/bin/nvidia-cuda-mps-server
/usr/lib/libnvidia-ml.so.575.64.05
/usr/lib/libnvidia-cfg.so.575.64.05
/usr/lib/libcuda.so.575.64.05
/usr/lib/libcudadebugger.so.575.64.05
/usr/lib/libnvidia-gpucomp.so.575.64.05
/usr/lib/libnvidia-ptxjitcompiler.so.575.64.05
/usr/lib/libnvidia-allocator.so.575.64.05
/usr/lib/libnvidia-pkcs11.so.575.64.05
/usr/lib/libnvidia-pkcs11-openssl3.so.575.64.05
/usr/lib/libnvidia-nvvm.so.575.64.05
/usr/lib/libnvidia-ngx.so.575.64.05
/usr/lib/libnvidia-encode.so.575.64.05
/usr/lib/libnvidia-opticalflow.so.575.64.05
/usr/lib/libnvcuvid.so.575.64.05
/usr/lib/libnvidia-eglcore.so.575.64.05
/usr/lib/libnvidia-glcore.so.575.64.05
/usr/lib/libnvidia-tls.so.575.64.05
/usr/lib/libnvidia-glsi.so.575.64.05
/usr/lib/libnvidia-fbc.so.575.64.05
/usr/lib/libnvidia-rtcore.so.575.64.05
/usr/lib/libnvoptix.so.575.64.05
/usr/lib/libGLX_nvidia.so.575.64.05
/usr/lib/libEGL_nvidia.so.575.64.05
/usr/lib/libGLESv2_nvidia.so.575.64.05
/usr/lib/libGLESv1_CM_nvidia.so.575.64.05
/usr/lib/libnvidia-glvkspirv.so.575.64.05
/usr/lib32/libnvidia-ml.so.575.64.05
/usr/lib32/libcuda.so.575.64.05
/usr/lib32/libnvidia-gpucomp.so.575.64.05
/usr/lib32/libnvidia-ptxjitcompiler.so.575.64.05
/usr/lib32/libnvidia-allocator.so.575.64.05
/usr/lib32/libnvidia-encode.so.575.64.05
/usr/lib32/libnvidia-opticalflow.so.575.64.05
/usr/lib32/libnvcuvid.so.575.64.05
/usr/lib32/libnvidia-eglcore.so.575.64.05
/usr/lib32/libnvidia-glcore.so.575.64.05
/usr/lib32/libnvidia-tls.so.575.64.05
/usr/lib32/libnvidia-glsi.so.575.64.05
/usr/lib32/libnvidia-fbc.so.575.64.05
/usr/lib32/libGLX_nvidia.so.575.64.05
/usr/lib32/libEGL_nvidia.so.575.64.05
/usr/lib32/libGLESv2_nvidia.so.575.64.05
/usr/lib32/libGLESv1_CM_nvidia.so.575.64.05
/usr/lib32/libnvidia-glvkspirv.so.575.64.05
/lib/firmware/nvidia/575.64.05/gsp_ga10x.bin
/lib/firmware/nvidia/575.64.05/gsp_tu10x.bin
From the K3s containerd config:
[plugins.'io.containerd.cri.v1.runtime'.containerd.runtimes.'nvidia']
  runtime_type = "io.containerd.runc.v2"
[plugins.'io.containerd.cri.v1.runtime'.containerd.runtimes.'nvidia'.options]
  BinaryName = "/usr/bin/nvidia-container-runtime"
  SystemdCgroup = true
[plugins.'io.containerd.cri.v1.runtime'.containerd.runtimes.'nvidia-cdi']
  runtime_type = "io.containerd.runc.v2"
[plugins.'io.containerd.cri.v1.runtime'.containerd.runtimes.'nvidia-cdi'.options]
  BinaryName = "/usr/bin/nvidia-container-runtime.cdi"
  SystemdCgroup = true
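Since the config already defines an nvidia-cdi runtime, my fallback plan is to switch to CDI mode, which as I understand it bypasses the legacy prestart hook (and with it the ldconfig call that is being killed). A sketch of what I'd try, assuming nvidia-ctk is on the PATH and /etc/cdi is where the runtime looks for specs:
$ sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml
$ sudo nvidia-ctk cdi list
followed by a RuntimeClass whose handler is nvidia-cdi and runtimeClassName: nvidia-cdi in the Helm values.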
$ k get all -n nvidia
NAME                             READY   STATUS                  RESTARTS       AGE
pod/nvidia-device-plugin-268bb   0/2     Init:CrashLoopBackOff   16 (50s ago)   60m

NAME                                                      DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                                                                 AGE
daemonset.apps/nvidia-device-plugin                       1         1         0       1            0           nvidia.feature.node.kubernetes.io/gpu.3060=true                               60m
daemonset.apps/nvidia-device-plugin-mps-control-daemon    0         0         0       0            0           nvidia.com/mps.capable=true,nvidia.feature.node.kubernetes.io/gpu.3060=true   60m
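I can attach more detail from the crash-looping init container if useful; these are the commands I'd pull it with:
$ kubectl -n nvidia describe pod nvidia-device-plugin-268bb
$ kubectl -n nvidia logs nvidia-device-plugin-268bb --all-containers --prefix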
NVIDIA packages have not been updated recently on the host:
$ ls -ltR /var/cache/pacman/pkg/nvidia*.zst
.rw-r--r-- root root 80 KB Mon Aug 4 11:36:53 2025 /var/cache/pacman/pkg/nvidia-driver-assistant-0.22.65.06-1-any.pkg.tar.zst
.rw-r--r-- root root 334 MB Tue Jul 22 13:48:56 2025 /var/cache/pacman/pkg/nvidia-utils-575.64.05-1-x86_64.pkg.tar.zst
.rw-r--r-- root root 334 MB Tue Jul 1 17:02:35 2025 /var/cache/pacman/pkg/nvidia-utils-575.64.03-1-x86_64.pkg.tar.zst
.rw-r--r-- root root 334 MB Tue Jun 17 14:26:14 2025 /var/cache/pacman/pkg/nvidia-utils-575.64-1-x86_64.pkg.tar.zst
.rw-r--r-- root root 79 KB Tue Jun 3 21:00:32 2025 /var/cache/pacman/pkg/nvidia-driver-assistant-0.21.57.08-1-any.pkg.tar.zst
.rw-r--r-- root root 4.3 MB Sun Jun 1 11:33:12 2025 /var/cache/pacman/pkg/nvidia-container-toolkit-1.17.8-1-x86_64.pkg.tar.zst
.rw-r--r-- root root 79 KB Thu May 1 23:38:55 2025 /var/cache/pacman/pkg/nvidia-driver-assistant-0.21.51.03-1-any.pkg.tar.zst
.rw-r--r-- root root 4.3 MB Sat Apr 26 11:27:21 2025 /var/cache/pacman/pkg/nvidia-container-toolkit-1.17.6-1-x86_64.pkg.tar.zst
.rw-r--r-- root root 4.2 MB Thu Mar 13 11:17:51 2025 /var/cache/pacman/pkg/nvidia-container-toolkit-1.17.5-1-x86_64.pkg.tar.zst
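One thing that did change with the reboot is presumably the running kernel, even though the NVIDIA packages didn't. Comparing the running kernel with what pacman has installed (assuming Manjaro's linux615 package name for the 6.15 series):
$ uname -r
6.15.11-2-MANJARO
$ pacman -Q linux615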