Description
1. Quick Debug Information
- OS/Version: Talos 12.2 bare-metal
- Kernel Version: 6.18.5-talos
- Container Runtime Type/Version: containerd://2.1.6
- K8s Version: v1.32.2
- k8s-device-plugin: 18.2
2. Issue or feature description
nvidia-cuda-mps-server can't start.
When CUDA (Triton) clients connect, I see the following in the logs:
mps-control-daemon-ctr
ERROR: init 250 result=11
I0206 11:43:15.960950 262 main.go:80] "NVIDIA MPS Control Daemon" version=<
fb1242ad
commit: fb1242add205d8563f9c604b8ba239607b4083e6
>
I0206 11:43:15.961226 262 main.go:109] Starting OS watcher.
I0206 11:43:15.961569 262 main.go:123] Starting Daemons.
I0206 11:43:15.961631 262 main.go:166] Loading configuration.
I0206 11:43:15.963717 262 main.go:181] Updating config with default resource matching patterns.
I0206 11:43:15.963841 262 main.go:192]
Running with config:
{
"version": "v1",
"flags": {
"migStrategy": "none",
"failOnInitError": null,
"gdrcopyEnabled": null,
"gdsEnabled": null,
"mofedEnabled": null,
"useNodeFeatureAPI": null,
"deviceDiscoveryStrategy": null,
"plugin": {
"passDeviceSpecs": null,
"deviceListStrategy": null,
"deviceIDStrategy": null,
"cdiAnnotationPrefix": null,
"nvidiaCTKPath": null,
"containerDriverRoot": null
}
},
"resources": {
"gpus": [
{
"pattern": "*",
"name": "nvidia.com/gpu"
}
]
},
"sharing": {
"timeSlicing": {},
"mps": {
"failRequestsGreaterThanOne": true,
"resources": [
{
"name": "nvidia.com/gpu",
"devices": "all",
"replicas": 2
}
]
}
},
"imex": {}
}
I0206 11:43:15.963855 262 main.go:196] Retrieving MPS daemons.
I0206 11:43:16.023714 262 daemon.go:97] "Staring MPS daemon" resource="nvidia.com/gpu"
I0206 11:43:16.024410 262 daemon.go:156] "SELinux enabled, setting context" path="/mps/nvidia.com/gpu/pipe" context="system_u:object_r:container_file_t:s0"
I0206 11:43:16.051427 262 daemon.go:139] "Starting log tailer" resource="nvidia.com/gpu"
[2026-02-06 11:43:16.044 Control 349] Starting control daemon using socket /mps/nvidia.com/gpu/pipe/control
[2026-02-06 11:43:16.044 Control 349] To connect CUDA applications to this daemon, set env CUDA_MPS_PIPE_DIRECTORY=/mps/nvidia.com/gpu/pipe
[2026-02-06 11:43:16.045 Control 349] CUDA MPS Control binary version: 13000
[2026-02-06 11:43:16.048 Control 349] Accepting connection...
[2026-02-06 11:43:16.048 Control 349] NEW UI
[2026-02-06 11:43:16.048 Control 349] Cmd:set_default_device_pinned_mem_limit 0 12282M
[2026-02-06 11:43:16.048 Control 349] set_default_device_pinned_mem_limit 0 12282M.
[2026-02-06 11:43:16.048 Control 349] UI closed
[2026-02-06 11:43:16.050 Control 349] Accepting connection...
[2026-02-06 11:43:16.050 Control 349] NEW UI
[2026-02-06 11:43:16.050 Control 349] Cmd:set_default_active_thread_percentage 50
[2026-02-06 11:43:16.050 Control 349] 50.0
[2026-02-06 11:43:16.050 Control 349] UI closed
[2026-02-06 11:44:39.413 Control 349] Accepting connection...
[2026-02-06 11:44:39.413 Control 349] User did not send valid credentials
[2026-02-06 11:44:39.413 Control 349] Accepting connection...
[2026-02-06 11:44:39.413 Control 349] NEW CLIENT 0 from user 0: Server is not ready, push client to pending list
[2026-02-06 11:44:39.415 Control 368] Starting new server 368 for user 0
[2026-02-06 11:44:39.415 Control 349] Server 368 exited with status 1
[2026-02-06 11:44:39.416 Control 369] Starting new server 369 for user 0
[2026-02-06 11:44:39.416 Control 349] Server 369 exited with status 1
[2026-02-06 11:44:39.417 Control 370] Starting new server 370 for user 0
[2026-02-06 11:44:39.417 Control 349] Server 370 exited with status 1
[2026-02-06 11:44:39.418 Control 371] Starting new server 371 for user 0
[2026-02-06 11:44:39.418 Control 349] Server 371 exited with status 1
[2026-02-06 11:44:39.419 Control 372] Starting new server 372 for user 0
[2026-02-06 11:44:39.419 Control 349] Server 372 exited with status 1
[2026-02-06 11:44:39.419 Control 373] Starting new server 373 for user 0
[2026-02-06 11:44:39.420 Control 349] Server 373 exited with status 1
triton
UNAVAILABLE: Internal: unable to get number of CUDA devices: MPS client failed to connect to the MPS control daemon or the MPS server
It works when I change the container command as follows:
#command: ["mps-control-daemon"]
command: ["/bin/sh"]
args: ["-c", "nvidia-cuda-mps-control -f"]
env:
- name: CUDA_MPS_PIPE_DIRECTORY
value: /mps/nvidia.com/gpu/pipe
- name: CUDA_MPS_LOG_DIRECTORY
value: /mps/nvidia.com/gpu/log
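For context, this is where the override sits in the mps-control-daemon container spec. A sketch only: the command/args/env values are the ones from this issue, while the surrounding fields are illustrative and may not match the upstream chart exactly.

```yaml
containers:
  - name: mps-control-daemon-ctr
    # Original entrypoint, disabled:
    # command: ["mps-control-daemon"]
    command: ["/bin/sh"]
    args: ["-c", "nvidia-cuda-mps-control -f"]
    env:
      - name: CUDA_MPS_PIPE_DIRECTORY
        value: /mps/nvidia.com/gpu/pipe
      - name: CUDA_MPS_LOG_DIRECTORY
        value: /mps/nvidia.com/gpu/log
```

Note that `-f` keeps nvidia-cuda-mps-control in the foreground, which is what a container entrypoint needs; the wrapper binary instead spawns it with `-d` (visible as PID 349 in the `ps aux` output below).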
With these changes, Triton connects successfully.
mps-control-daemon-ctr logs with this launch:
[2026-02-06 09:34:17.770 Control 261] Starting control daemon using socket /mps/nvidia.com/gpu/pipe/control
[2026-02-06 09:34:17.770 Control 261] To connect CUDA applications to this daemon, set env CUDA_MPS_PIPE_DIRECTORY=/mps/nvidia.com/gpu/pipe
[2026-02-06 09:34:17.770 Control 261] CUDA MPS Control binary version: 13000
[2026-02-06 09:34:36.380 Control 261] Accepting connection...
[2026-02-06 09:34:36.380 Control 261] User did not send valid credentials
[2026-02-06 09:34:36.380 Control 261] Accepting connection...
[2026-02-06 09:34:36.380 Control 261] NEW CLIENT 0 from user 0: Server is not ready, push client to pending list
[2026-02-06 09:34:36.381 Control 337] Starting new server 337 for user 0
[2026-02-06 09:34:36.386 Other 337] Startup
[2026-02-06 09:34:36.386 Other 337] Connecting to control daemon on socket: /mps/nvidia.com/gpu/pipe/control
[2026-02-06 09:34:36.386 Control 261] Accepting connection...
[2026-02-06 09:34:36.386 Other 337] Initializing server process
[2026-02-06 09:34:36.397 Server 337] Creating server context on device 0 (NVIDIA GeForce RTX 4090)
[2026-02-06 09:34:36.476 Server 337] Created anonymous shared memory region mps.shm.0.337
[2026-02-06 09:34:36.476 Server 337] CUDA MPS Server for Linux/Unix. Binary version: 13000
[2026-02-06 09:34:36.476 Server 337] Display Driver version: 580
[2026-02-06 09:34:36.476 Control 261] NEW SERVER 337: Ready
[2026-02-06 09:34:36.476 Server 337] Active Threads Percentage set to 100.0
[2026-02-06 09:34:36.476 Server 337] Server Priority set to 0
[2026-02-06 09:34:36.476 Server 337] Server has started
[2026-02-06 09:34:36.476 Server 337] Received new client request for {PID: 1595658136, Context ID: 0}
[2026-02-06 09:34:36.476 Server 337] Client {PID: 0, Context ID: 0} connected
[2026-02-06 09:34:36.476 Server 337] Creating worker thread for client {PID: 0, Context ID: 0}
[2026-02-06 09:34:36.503 Control 261] Accepting connection...
[2026-02-06 09:34:36.503 Control 261] NEW CLIENT 0 from user 0: Server already exists
[2026-02-06 09:34:36.503 Server 337] Received new client request for {PID: 0, Context ID: 0}
[2026-02-06 09:34:36.503 Server 337] Client {PID: 0, Context ID: 1} connected
[2026-02-06 09:34:36.503 Server 337] Creating worker thread for client {PID: 0, Context ID: 1}
[2026-02-06 09:34:36.504 Server 337] Device NVIDIA GeForce RTX 4090 (uuid GPU-65759bdb-b534-ed72-443d-82c3359d3fcb) is associated
[2026-02-06 09:34:36.504 Server 337] Status of client {PID: 0, Context ID: 1} is ACTIVE
Questions:
- How can I see the reason the MPS server exits with status 1?
- Why does this override help?
command: ["/bin/sh"]
args: ["-c", "nvidia-cuda-mps-control -f"]
mps-control-daemon-ctr
/mps/nvidia.com/gpu/log # ps aux
PID USER TIME COMMAND
1 65535 0:00 /pause
175 0 0:00 config-manager
262 0 0:00 mps-control-daemon
349 0 0:00 nvidia-cuda-mps-control -d
355 0 0:00 tail -n +1 -f /mps/nvidia.com/gpu/log/control.log
362 0 0:00 sh
422 0 0:00 sh
429 0 0:00 ps aux
/mps/nvidia.com/gpu/log # cat server.log
/mps/nvidia.com/gpu/log #
server.log is empty.
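For the first question, one approach is to dump whatever the control daemon and per-server logs contain from inside the container; MPS writes both control.log and server.log under CUDA_MPS_LOG_DIRECTORY. A minimal sketch, assuming the log path from this deployment (override it via the first argument if yours differs):

```shell
#!/bin/sh
# Hypothetical helper: print the tail of the MPS control and server logs,
# flagging files that are empty or missing (as server.log is here).
dump_mps_logs() {
    dir="${1:-/mps/nvidia.com/gpu/log}"
    for f in "$dir"/control.log "$dir"/server.log; do
        if [ -s "$f" ]; then
            echo "==== $f ===="
            tail -n 50 "$f"
        else
            echo "==== $f is empty or missing ===="
        fi
    done
}

dump_mps_logs "$@"
```

Since server.log stays empty even as servers exit with status 1, it may also be worth querying the control daemon directly, e.g. `echo get_server_list | nvidia-cuda-mps-control`, and checking dmesg on the node for messages from the failed server processes.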
/mps/nvidia.com/gpu/log # nvidia-smi
Fri Feb 6 11:56:30 2026
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.126.09 Driver Version: 580.126.09 CUDA Version: 13.0 |
+-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 4090 Off | 00000000:01:00.0 Off | Off |
| 0% 35C P8 19W / 450W | 2995MiB / 24564MiB | 0% E. Process |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+