
Fix mps-control-daemon chroot shell execution error when nvidiaDriverRoot is set #889

Open
kasia-kujawa wants to merge 1 commit into NVIDIA:main from kasia-kujawa:kkujawa_fix_mps

Conversation

@kasia-kujawa

Fixes #469

Fixes the mps-control-daemon error chroot: can't execute 'sh': No such file or directory, observed on GKE clusters.

On GKE, nvidiaDriverRoot must be set to /home/kubernetes/bin/nvidia/ because GPU drivers are installed into this directory by the GKE-provided DaemonSet.

The issue occurs because the original command attempts to execute sh from inside the chrooted driver-root filesystem, which contains only GPU driver libraries and binaries and no shell utilities.
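The failure mode can be reproduced without a GPU at all. A minimal sketch, assuming only what the listing below shows: the driver root holds driver binaries and no sh (the fake directory and file names here are illustrative, not taken from the actual chart):

```shell
# Build a fake driver root that mimics the GKE layout: binaries present, no shell.
root=$(mktemp -d)
mkdir -p "$root/bin"
touch "$root/bin/nvidia-smi"   # stand-in for the real driver binaries

# chroot "$root" sh -c '...' would need an sh inside "$root", which does not
# exist, so chroot fails with:
#   chroot: can't execute 'sh': No such file or directory
if [ -x "$root/bin/sh" ]; then
  echo "shell present"
else
  echo "no shell in driver root"
fi
```

Any shell logic therefore has to run outside the chroot; only the driver binary itself can be resolved inside the driver root.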

GKE driver root:

/home/kubernetes/bin/nvidia/bin # ls
nvidia-bug-report.sh     nvidia-cuda-mps-server  nvidia-installer  nvidia-ngx-updater  nvidia-persistenced  nvidia-settings  nvidia-smi        nvidia-xconfig
nvidia-cuda-mps-control  nvidia-debugdump        nvidia-modprobe   nvidia-pcc          nvidia-powerd        nvidia-sleep.sh  nvidia-uninstall

The proposed fix was tested on:

  • GKE cluster with nvidiaDriverRoot: /home/kubernetes/bin/nvidia/
  • EKS cluster where nvidiaDriverRoot uses the default value (/)

Tested workload

---
apiVersion: v1
kind: Namespace
metadata:
  name: gpu-test-mps
---
apiVersion: resource.k8s.io/v1
kind: ResourceClaimTemplate
metadata:
  namespace: gpu-test-mps
  name: shared-gpu
spec:
  spec:
    devices:
      requests:
      - name: mps-gpu
        exactly:
          deviceClassName: gpu.nvidia.com
      config:
      - requests: ["mps-gpu"]
        opaque:
          driver: gpu.nvidia.com
          parameters:
            apiVersion: resource.nvidia.com/v1beta1
            kind: GpuConfig
            sharing:
              strategy: MPS
              mpsConfig:
                defaultActiveThreadPercentage: 50
                defaultPinnedDeviceMemoryLimit: 10Gi
---
apiVersion: apps/v1
kind: Deployment
metadata:
  namespace: gpu-test-mps
  name: test-deployment
  labels:
    app: gpu-test-mps
spec:
  replicas: 1
  selector:
    matchLabels:
      app: gpu-test-mps
  template:
    metadata:
      labels:
        app: gpu-test-mps
    spec:
      containers:
      - name: mps-ctr0
        image: nvidia/samples:nbody
        command: [ "/bin/sh", "-c" ]
        args: [ "while true; do /tmp/nbody -benchmark -i=500000000; done" ]
        resources:
          claims:
          - name: shared-gpu
            request: mps-gpu
      - name: mps-ctr1
        image: nvidia/samples:nbody
        command: [ "/bin/sh", "-c" ]
        args: [ "while true; do /tmp/nbody -benchmark -i=500000000; done" ]
        resources:
          claims:
          - name: shared-gpu
            request: mps-gpu
      resourceClaims:
      - name: shared-gpu
        resourceClaimTemplateName: shared-gpu
      tolerations:
      - key: "nvidia.com/gpu"
        operator: "Exists"
        effect: "NoSchedule"

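The mpsConfig values in the manifest above can be matched against the daemon logs below: defaultActiveThreadPercentage: 50 is echoed back as 50.0, and defaultPinnedDeviceMemoryLimit: 10Gi appears as 10240M, since 10 Gi is 10 x 1024 MiB:

```shell
# 10Gi from the manifest expressed in MiB, matching the 10240M
# seen in the set_default_device_pinned_mem_limit log lines:
echo "$((10 * 1024))M"
```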
GKE tests

Logs from the NVIDIA mps-control-daemon:

kubectl logs -n nvidia mps-control-daemon-33af05cb-8c29-4b5a-9910-c1e781107afc-e74fhhl 
50.0
[2026-02-20 11:02:48.033 Control  6436] Starting control daemon using socket /driver-root/tmp/nvidia-mps/control
[2026-02-20 11:02:48.034 Control  6436] To connect CUDA applications to this daemon, set env CUDA_MPS_PIPE_DIRECTORY=/driver-root/tmp/nvidia-mps
[2026-02-20 11:02:48.034 Control  6436] CUDA MPS Control binary version: 13000
[2026-02-20 11:02:48.041 Control  6436] Accepting connection...
[2026-02-20 11:02:48.041 Control  6436] NEW UI
[2026-02-20 11:02:48.041 Control  6436] Cmd:set_default_active_thread_percentage 50
[2026-02-20 11:02:48.041 Control  6436] 50.0
[2026-02-20 11:02:48.042 Control  6436] UI closed
[2026-02-20 11:02:48.055 Control  6436] Accepting connection...
[2026-02-20 11:02:48.055 Control  6436] NEW UI
[2026-02-20 11:02:48.056 Control  6436] Cmd:set_default_device_pinned_mem_limit GPU-035c9eac-4f4d-85f9-6ebc-78235d9db128 10240M
[2026-02-20 11:02:48.057 Control  6436] set_default_device_pinned_mem_limit GPU-035c9eac-4f4d-85f9-6ebc-78235d9db128 10240M.
[2026-02-20 11:02:48.057 Control  6436] UI closed
[2026-02-20 11:03:02.410 Control  6436] Accepting connection...
[2026-02-20 11:03:02.411 Control  6436] User did not send valid credentials
[2026-02-20 11:03:02.411 Control  6436] Accepting connection...
[2026-02-20 11:03:02.411 Control  6436] NEW CLIENT 6976 from user 0: Server is not ready, push client to pending list
[2026-02-20 11:03:02.412 Control  6980] Starting new server 6980 for user 0
[2026-02-20 11:03:02.420 Control  6436] Accepting connection...
[2026-02-20 11:03:02.861 Control  6436] NEW SERVER 6980: Ready
[2026-02-20 11:03:02.985 Control  6436] Accepting connection...
[2026-02-20 11:03:02.985 Control  6436] NEW CLIENT 6976 from user 0: Server already exists
[2026-02-20 11:03:04.871 Control  6436] Accepting connection...
[2026-02-20 11:03:04.871 Control  6436] User did not send valid credentials
[2026-02-20 11:03:04.871 Control  6436] Accepting connection...
[2026-02-20 11:03:04.871 Control  6436] NEW CLIENT 7056 from user 0: Server already exists
[2026-02-20 11:03:05.050 Control  6436] Accepting connection...
[2026-02-20 11:03:05.050 Control  6436] NEW CLIENT 7056 from user 0: Server already exists

on the node:

# ./nvidia-smi
Fri Feb 20 11:08:29 2026       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.105.08             Driver Version: 580.105.08     CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  Tesla T4                       Off |   00000000:00:04.0 Off |                    0 |
| N/A   77C    P0             67W /   70W |     167MiB /  15360MiB |    100%   E. Process |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A            6976    M+C   /tmp/nbody                               68MiB |
|    0   N/A  N/A            6980      C   nvidia-cuda-mps-server                   28MiB |
|    0   N/A  N/A            7056    M+C   /tmp/nbody                               68MiB |
+-----------------------------------------------------------------------------------------+

EKS tests

Logs from the NVIDIA mps-control-daemon:

kubectl logs -n nvidia mps-control-daemon-4de818fe-078a-455e-a099-6fd1d0957fcb-654t4hs
50.0
[2026-02-20 11:02:54.435 Control  8950] Starting control daemon using socket /tmp/nvidia-mps/control
[2026-02-20 11:02:54.435 Control  8950] To connect CUDA applications to this daemon, set env CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps
[2026-02-20 11:02:54.435 Control  8950] CUDA MPS Control binary version: 13000
[2026-02-20 11:02:54.439 Control  8950] Accepting connection...
[2026-02-20 11:02:54.439 Control  8950] NEW UI
[2026-02-20 11:02:54.439 Control  8950] Cmd:set_default_active_thread_percentage 50
[2026-02-20 11:02:54.439 Control  8950] 50.0
[2026-02-20 11:02:54.439 Control  8950] UI closed
[2026-02-20 11:02:54.442 Control  8950] Accepting connection...
[2026-02-20 11:02:54.442 Control  8950] NEW UI
[2026-02-20 11:02:54.442 Control  8950] Cmd:set_default_device_pinned_mem_limit GPU-c446c56f-1ae6-1ac4-d1a1-d8faf6817d9e 10240M
[2026-02-20 11:02:54.442 Control  8950] set_default_device_pinned_mem_limit GPU-c446c56f-1ae6-1ac4-d1a1-d8faf6817d9e 10240M.
[2026-02-20 11:02:54.442 Control  8950] UI closed
[2026-02-20 11:03:01.947 Control  8950] Accepting connection...
[2026-02-20 11:03:01.947 Control  8950] User did not send valid credentials
[2026-02-20 11:03:01.947 Control  8950] Accepting connection...
[2026-02-20 11:03:01.947 Control  8950] NEW CLIENT 9232 from user 0: Server is not ready, push client to pending list
[2026-02-20 11:03:01.947 Control  9234] Starting new server 9234 for user 0
[2026-02-20 11:03:01.952 Control  8950] Accepting connection...
[2026-02-20 11:03:02.131 Control  8950] NEW SERVER 9234: Ready
[2026-02-20 11:03:02.135 Control  8950] Accepting connection...
[2026-02-20 11:03:02.135 Control  8950] NEW CLIENT 9232 from user 0: Server already exists
[2026-02-20 11:03:02.194 Control  8950] Accepting connection...
[2026-02-20 11:03:02.194 Control  8950] User did not send valid credentials
[2026-02-20 11:03:02.194 Control  8950] Accepting connection...
[2026-02-20 11:03:02.194 Control  8950] NEW CLIENT 9330 from user 0: Server already exists
[2026-02-20 11:03:02.199 Control  8950] Accepting connection...
[2026-02-20 11:03:02.199 Control  8950] NEW CLIENT 9330 from user 0: Server already exists

on the node:

# nvidia-smi
Fri Feb 20 11:12:36 2026       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.126.09             Driver Version: 580.126.09     CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  Tesla T4                       On  |   00000000:00:1E.0 Off |                    0 |
| N/A   52C    P0             67W /   70W |     167MiB /  15360MiB |    100%   E. Process |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A            9232    M+C   /tmp/nbody                               68MiB |
|    0   N/A  N/A            9234      C   nvidia-cuda-mps-server                   28MiB |
|    0   N/A  N/A            9330    M+C   /tmp/nbody                               68MiB |
+-----------------------------------------------------------------------------------------+

@copy-pr-bot

copy-pr-bot bot commented Feb 20, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@shivamerla shivamerla added the bug Issue/PR to expose/discuss/fix a bug label Feb 23, 2026
@shivamerla shivamerla added this to the v26.4.0 milestone Feb 23, 2026
@shivamerla
Contributor

@kasia-kujawa please sign your commits. You can refer to the contribution guide here.

…Root is set

Signed-off-by: Katarzyna Kujawa <katarzyna@cast.ai>


Projects

Status: Backlog

Development

Successfully merging this pull request may close these issues.

mps-control-daemon is restarting with GPU drivers installed through GKE daemonset
