
[Bug]: nvidia-cuda-mps-server exited with status 1 #1614

@Pavelt132

Description


1. Quick Debug Information

  • OS/Version: Talos 12.2, bare-metal
  • Kernel Version: 6.18.5-talos
  • Container Runtime Type/Version: containerd://2.1.6
  • K8s Version: v1.32.2
  • k8s-device-plugin: 18.2

2. Issue or feature description

nvidia-cuda-mps-server can't start.
When CUDA (Triton) clients connect, I see the following in the mps-control-daemon-ctr logs:

ERROR: init 250 result=11
I0206 11:43:15.960950     262 main.go:80] "NVIDIA MPS Control Daemon" version=<
	fb1242ad
	commit: fb1242add205d8563f9c604b8ba239607b4083e6
 >
I0206 11:43:15.961226     262 main.go:109] Starting OS watcher.
I0206 11:43:15.961569     262 main.go:123] Starting Daemons.
I0206 11:43:15.961631     262 main.go:166] Loading configuration.
I0206 11:43:15.963717     262 main.go:181] Updating config with default resource matching patterns.
I0206 11:43:15.963841     262 main.go:192]
Running with config:
{
  "version": "v1",
  "flags": {
    "migStrategy": "none",
    "failOnInitError": null,
    "gdrcopyEnabled": null,
    "gdsEnabled": null,
    "mofedEnabled": null,
    "useNodeFeatureAPI": null,
    "deviceDiscoveryStrategy": null,
    "plugin": {
      "passDeviceSpecs": null,
      "deviceListStrategy": null,
      "deviceIDStrategy": null,
      "cdiAnnotationPrefix": null,
      "nvidiaCTKPath": null,
      "containerDriverRoot": null
    }
  },
  "resources": {
    "gpus": [
      {
        "pattern": "*",
        "name": "nvidia.com/gpu"
      }
    ]
  },
  "sharing": {
    "timeSlicing": {},
    "mps": {
      "failRequestsGreaterThanOne": true,
      "resources": [
        {
          "name": "nvidia.com/gpu",
          "devices": "all",
          "replicas": 2
        }
      ]
    }
  },
  "imex": {}
}
I0206 11:43:15.963855     262 main.go:196] Retrieving MPS daemons.
I0206 11:43:16.023714     262 daemon.go:97] "Staring MPS daemon" resource="nvidia.com/gpu"
I0206 11:43:16.024410     262 daemon.go:156] "SELinux enabled, setting context" path="/mps/nvidia.com/gpu/pipe" context="system_u:object_r:container_file_t:s0"
I0206 11:43:16.051427     262 daemon.go:139] "Starting log tailer" resource="nvidia.com/gpu"
[2026-02-06 11:43:16.044 Control   349] Starting control daemon using socket /mps/nvidia.com/gpu/pipe/control
[2026-02-06 11:43:16.044 Control   349] To connect CUDA applications to this daemon, set env CUDA_MPS_PIPE_DIRECTORY=/mps/nvidia.com/gpu/pipe
[2026-02-06 11:43:16.045 Control   349] CUDA MPS Control binary version: 13000
[2026-02-06 11:43:16.048 Control   349] Accepting connection...
[2026-02-06 11:43:16.048 Control   349] NEW UI
[2026-02-06 11:43:16.048 Control   349] Cmd:set_default_device_pinned_mem_limit 0 12282M
[2026-02-06 11:43:16.048 Control   349] set_default_device_pinned_mem_limit 0 12282M.
[2026-02-06 11:43:16.048 Control   349] UI closed
[2026-02-06 11:43:16.050 Control   349] Accepting connection...
[2026-02-06 11:43:16.050 Control   349] NEW UI
[2026-02-06 11:43:16.050 Control   349] Cmd:set_default_active_thread_percentage 50
[2026-02-06 11:43:16.050 Control   349] 50.0
[2026-02-06 11:43:16.050 Control   349] UI closed
[2026-02-06 11:44:39.413 Control   349] Accepting connection...
[2026-02-06 11:44:39.413 Control   349] User did not send valid credentials
[2026-02-06 11:44:39.413 Control   349] Accepting connection...
[2026-02-06 11:44:39.413 Control   349] NEW CLIENT 0 from user 0: Server is not ready, push client to pending list
[2026-02-06 11:44:39.415 Control   368] Starting new server 368 for user 0
[2026-02-06 11:44:39.415 Control   349] Server 368 exited with status 1
[2026-02-06 11:44:39.416 Control   369] Starting new server 369 for user 0
[2026-02-06 11:44:39.416 Control   349] Server 369 exited with status 1
[2026-02-06 11:44:39.417 Control   370] Starting new server 370 for user 0
[2026-02-06 11:44:39.417 Control   349] Server 370 exited with status 1
[2026-02-06 11:44:39.418 Control   371] Starting new server 371 for user 0
[2026-02-06 11:44:39.418 Control   349] Server 371 exited with status 1
[2026-02-06 11:44:39.419 Control   372] Starting new server 372 for user 0
[2026-02-06 11:44:39.419 Control   349] Server 372 exited with status 1
[2026-02-06 11:44:39.419 Control   373] Starting new server 373 for user 0
[2026-02-06 11:44:39.420 Control   349] Server 373 exited with status 1

Triton logs:
UNAVAILABLE: Internal: unable to get number of CUDA devices: MPS client failed to connect to the MPS control daemon or the MPS server

I made it work by changing the container's command as follows:
#command: ["mps-control-daemon"]
command: ["/bin/sh"]
args: ["-c", "nvidia-cuda-mps-control -f"]
env:
  - name: CUDA_MPS_PIPE_DIRECTORY
    value: /mps/nvidia.com/gpu/pipe
  - name: CUDA_MPS_LOG_DIRECTORY
    value: /mps/nvidia.com/gpu/log
With these changes, Triton connects successfully.
mps-control-daemon-ctr logs with this launch:

[2026-02-06 09:34:17.770 Control   261] Starting control daemon using socket /mps/nvidia.com/gpu/pipe/control
[2026-02-06 09:34:17.770 Control   261] To connect CUDA applications to this daemon, set env CUDA_MPS_PIPE_DIRECTORY=/mps/nvidia.com/gpu/pipe
[2026-02-06 09:34:17.770 Control   261] CUDA MPS Control binary version: 13000
[2026-02-06 09:34:36.380 Control   261] Accepting connection...
[2026-02-06 09:34:36.380 Control   261] User did not send valid credentials
[2026-02-06 09:34:36.380 Control   261] Accepting connection...
[2026-02-06 09:34:36.380 Control   261] NEW CLIENT 0 from user 0: Server is not ready, push client to pending list
[2026-02-06 09:34:36.381 Control   337] Starting new server 337 for user 0
[2026-02-06 09:34:36.386 Other   337] Startup
[2026-02-06 09:34:36.386 Other   337] Connecting to control daemon on socket: /mps/nvidia.com/gpu/pipe/control
[2026-02-06 09:34:36.386 Control   261] Accepting connection...
[2026-02-06 09:34:36.386 Other   337] Initializing server process
[2026-02-06 09:34:36.397 Server   337] Creating server context on device 0 (NVIDIA GeForce RTX 4090)
[2026-02-06 09:34:36.476 Server   337] Created anonymous shared memory region mps.shm.0.337
[2026-02-06 09:34:36.476 Server   337] CUDA MPS Server for Linux/Unix. Binary version: 13000
[2026-02-06 09:34:36.476 Server   337] Display Driver version: 580
[2026-02-06 09:34:36.476 Control   261] NEW SERVER 337: Ready
[2026-02-06 09:34:36.476 Server   337] Active Threads Percentage set to 100.0
[2026-02-06 09:34:36.476 Server   337] Server Priority set to 0
[2026-02-06 09:34:36.476 Server   337] Server has started
[2026-02-06 09:34:36.476 Server   337] Received new client request for {PID: 1595658136, Context ID: 0}
[2026-02-06 09:34:36.476 Server   337] Client {PID: 0, Context ID: 0} connected
[2026-02-06 09:34:36.476 Server   337] Creating worker thread for client {PID: 0, Context ID: 0}
[2026-02-06 09:34:36.503 Control   261] Accepting connection...
[2026-02-06 09:34:36.503 Control   261] NEW CLIENT 0 from user 0: Server already exists
[2026-02-06 09:34:36.503 Server   337] Received new client request for {PID: 0, Context ID: 0}
[2026-02-06 09:34:36.503 Server   337] Client {PID: 0, Context ID: 1} connected
[2026-02-06 09:34:36.503 Server   337] Creating worker thread for client {PID: 0, Context ID: 1}
[2026-02-06 09:34:36.504 Server   337] Device NVIDIA GeForce RTX 4090 (uuid GPU-65759bdb-b534-ed72-443d-82c3359d3fcb) is associated
[2026-02-06 09:34:36.504 Server   337] Status of client {PID: 0, Context ID: 1} is ACTIVE
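For clarity, the workaround amounts to the following override on the MPS control-daemon container spec. This is a sketch assembled from the fragments above: the container name is taken from the logs, and where exactly it lives in your k8s-device-plugin DaemonSet manifest is an assumption; adjust to your deployment.

```yaml
# Hypothetical patch to the mps-control-daemon container in the
# k8s-device-plugin DaemonSet; only the fields below are changed.
containers:
  - name: mps-control-daemon-ctr
    # command: ["mps-control-daemon"]   # original entrypoint, bypassed below
    command: ["/bin/sh"]
    args: ["-c", "nvidia-cuda-mps-control -f"]   # run the MPS control daemon in the foreground
    env:
      - name: CUDA_MPS_PIPE_DIRECTORY      # socket dir the wrapper would otherwise configure
        value: /mps/nvidia.com/gpu/pipe
      - name: CUDA_MPS_LOG_DIRECTORY       # where control.log / server.log are written
        value: /mps/nvidia.com/gpu/log
```

Note this bypasses the `mps-control-daemon` wrapper entirely, so the replica/thread-percentage setup it normally pushes (the `set_default_device_pinned_mem_limit` / `set_default_active_thread_percentage` commands visible in the first log) is not applied.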

Questions:

  1. What can I do to see the reason for the server crash with status 1?
  2. Why do these settings help?

     command: ["/bin/sh"]
     args: ["-c", "nvidia-cuda-mps-control -f"]

mps-control-daemon-ctr

/mps/nvidia.com/gpu/log # ps aux
PID   USER     TIME  COMMAND
    1 65535     0:00 /pause
  175 0         0:00 config-manager
  262 0         0:00 mps-control-daemon
  349 0         0:00 nvidia-cuda-mps-control -d
  355 0         0:00 tail -n +1 -f /mps/nvidia.com/gpu/log/control.log
  362 0         0:00 sh
  422 0         0:00 sh
  429 0         0:00 ps aux
/mps/nvidia.com/gpu/log # cat server.log 
/mps/nvidia.com/gpu/log # 

server.log is empty.

/mps/nvidia.com/gpu/log # nvidia-smi

Fri Feb  6 11:56:30 2026       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.126.09             Driver Version: 580.126.09     CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 4090        Off |   00000000:01:00.0 Off |                  Off |
|  0%   35C    P8             19W /  450W |    2995MiB /  24564MiB |      0%   E. Process |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

Metadata

Assignees: no one assigned

Labels: needs-triage, question
