
Handle multiple GPUs in CDI spec generation from CSV#1461

Merged
elezar merged 9 commits into NVIDIA:main from elezar:dgpu-on-nvgpu on Dec 8, 2025
Conversation

@elezar (Member) commented Nov 17, 2025

This change allows CDI specs to be generated for multiple
devices when using CSV mode. This can be used in cases where
a Tegra-based system contains both an iGPU and a dGPU.

This behavior can be disabled with the disable-multiple-csv-devices
feature flag. This can be specified by adding the

            --feature-flags=disable-multiple-csv-devices

command line option to the nvidia-ctk cdi generate command or to the
automatic CDI spec generation by adding

    NVIDIA_CTK_CDI_GENERATE_FEATURE_FLAGS=disable-multiple-csv-devices

to the /etc/nvidia-container-toolkit/nvidia-cdi-refresh.env file.
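The flag value is a comma-separated list, so it can be checked in the usual way. The following Go sketch illustrates the idea; hasFeatureFlag is a hypothetical helper for illustration, not the toolkit's actual parser:

```go
package main

import (
	"fmt"
	"strings"
)

// hasFeatureFlag reports whether name appears in a comma-separated
// feature-flag list, such as the value of the (assumed) environment
// variable NVIDIA_CTK_CDI_GENERATE_FEATURE_FLAGS. The exact parsing
// rules in the toolkit may differ.
func hasFeatureFlag(flags string, name string) bool {
	for _, f := range strings.Split(flags, ",") {
		if strings.TrimSpace(f) == name {
			return true
		}
	}
	return false
}

func main() {
	flags := "disable-multiple-csv-devices"
	fmt.Println(hasFeatureFlag(flags, "disable-multiple-csv-devices"))
}
```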

@ArangoGutierrez (Collaborator) left a comment

LGTM, just 2 non-blocking nits

@coveralls commented Dec 3, 2025

Pull Request Test Coverage Report for Build 20024248479

Details

  • 30 of 366 (8.2%) changed or added relevant lines in 9 files are covered.
  • 4 unchanged lines in 3 files lost coverage.
  • Overall coverage decreased (-0.6%) to 37.031%

Changes Missing Coverage (Covered Lines / Changed or Added Lines / %):
  internal/platform-support/tegra/tegra.go        0 / 7    0.0%
  internal/platform-support/tegra/csv.go          4 / 12   33.33%
  pkg/nvcdi/full-gpu-nvml.go                      0 / 9    0.0%
  pkg/nvcdi/common-nvml.go                        0 / 13   0.0%
  internal/platform-support/tegra/options.go      0 / 28   0.0%
  internal/platform-support/tegra/filter.go       10 / 44  22.73%
  internal/platform-support/tegra/mount_specs.go  15 / 63  23.81%
  pkg/nvcdi/lib-csv.go                            0 / 189  0.0%

Files with Coverage Reduction (New Missed Lines / %):
  internal/platform-support/tegra/tegra.go  1  0.0%
  pkg/nvcdi/lib-csv.go                      1  0.0%
  pkg/nvcdi/full-gpu-nvml.go                2  18.85%

Totals:
  Change from base Build 20024159196: -0.6%
  Covered Lines: 5197
  Relevant Lines: 14034

💛 - Coveralls

@elezar (Member, Author) commented Dec 3, 2025

I have split two of the commits originally included here into their own PRs: #1511 and #1512

ArangoGutierrez previously approved these changes Dec 4, 2025

@ArangoGutierrez (Collaborator) left a comment

LGTM - I'll now proceed to review the spin-off PRs

@elezar elezar force-pushed the dgpu-on-nvgpu branch 2 times, most recently from b416704 to a8d7b65 on December 4, 2025 13:25
@ArangoGutierrez ArangoGutierrez self-requested a review December 4, 2025 13:46
@ArangoGutierrez ArangoGutierrez dismissed their stale review December 4, 2025 13:47

Updated commits

@elezar elezar added the tegra label Dec 8, 2025
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change updates the way we construct a discoverer for tegra systems
to be more flexible in terms of how the SOURCES of the mount specs can
be specified. This allows for subsequent changes like adding (or removing)
mount specs at the point of construction.

Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change allows CDI specs to be generated for multiple
devices when using CSV mode. This can be used in cases where
a Tegra-based system contains both an iGPU and a dGPU.

This behavior can be disabled with the disable-multiple-csv-devices
feature flag. This can be specified by adding the

	--feature-flags=disable-multiple-csv-devices

command line option to the nvidia-ctk cdi generate command or to the
automatic CDI spec generation by adding

NVIDIA_CTK_CDI_GENERATE_FEATURE_FLAGS=disable-multiple-csv-devices

to the /etc/nvidia-container-toolkit/nvidia-cdi-refresh.env file.

Signed-off-by: Evan Lezar <elezar@nvidia.com>
@ArangoGutierrez (Collaborator) left a comment

LGTM

@elezar elezar merged commit 923fa9b into NVIDIA:main Dec 8, 2025
16 checks passed
@elezar elezar deleted the dgpu-on-nvgpu branch December 8, 2025 13:27
Comment on lines +295 to +298
func isIntegratedGPUID(id device.Identifier) bool {
_, err := uuid.Parse(string(id))
return err == nil
}
Contributor:

Question -- would this method not also return true for a discrete GPU identifier? The isIntegratedGPUID method name seems to indicate this is unique to integrated GPUs...

EDIT: Okay I think I see why this method is needed. Based on reading other parts of the code, I am assuming id.IsGpuUUID() returns false for integrated GPU UUIDs? Some more context would help here.

Member (Author):

This is one of those heuristics I keep mentioning. Currently, UUIDs for discrete GPUs have a GPU- prefix (MIG- for MIG devices), and on all Tegra-based systems that I have had access to the UUID is a "standard" UUID for example:

$ nvidia-smi -L
GPU 0: Orin (nvgpu) (UUID: 1833c8b5-9aa0-5382-b784-68b7e77eb185)

We have been pushing the NVML team for an "IsIntegrated" API, but have not had a commitment.

Let me update the function comment.
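The heuristic described here can be sketched in standalone Go: since discrete GPU UUIDs carry a GPU- prefix (MIG- for MIG devices), an identifier that parses as a plain RFC 4122-style UUID is assumed to belong to an integrated GPU. This is an illustrative reimplementation, not the toolkit's code, and it uses a regexp rather than the uuid package so it stays stdlib-only:

```go
package main

import (
	"fmt"
	"regexp"
)

// plainUUID matches a bare RFC 4122-style UUID with no vendor prefix.
var plainUUID = regexp.MustCompile(
	`^[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}$`)

// isIntegratedGPUID mirrors the heuristic described above: a "GPU-" or
// "MIG-" prefixed UUID fails the plain-UUID check and is treated as a
// discrete (or MIG) device; a bare UUID is assumed to be an iGPU.
func isIntegratedGPUID(id string) bool {
	return plainUUID.MatchString(id)
}

func main() {
	// Example UUID from the nvidia-smi output quoted above.
	fmt.Println(isIntegratedGPUID("1833c8b5-9aa0-5382-b784-68b7e77eb185")) // prints: true
	fmt.Println(isIntegratedGPUID("GPU-1833c8b5-9aa0-5382-b784-68b7e77eb185"))
}
```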

Member (Author):

Created #1674 to update the comment.

Contributor:

Ah okay, this makes sense now! Thanks for the additional context and raising the PR.

if pciInfo.Bus != 1 {
return false, nil
}
return pciInfo.Device == 0, nil
Contributor:

Question -- does this mean that integrated GPUs (even though they are not attached to the PCI bus) always appear to have a PCI address of 0000:01:00 (domain:bus:device)?

nit: as a reader, this may be easier to grok if rewritten as

Suggested change
return pciInfo.Device == 0, nil
if pciInfo.Domain == 0 && pciInfo.Bus == 1 && pciInfo.Device == 0 {
return true, nil
}
return false, nil

Member (Author):

At least in the case of Thor-based systems that I have had access to, this has been the case. Orin-based systems that I have had access to do not support getting PCI information. I will update the implementation for clarity.
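The combined form the reviewer suggests can be sketched as a standalone function. The pciInfo struct and function name below are stand-ins for illustration; the heuristic (iGPU at PCI address 0000:01:00, domain:bus:device) is only known to hold on the Thor-based systems discussed above:

```go
package main

import "fmt"

// pciInfo is a minimal stand-in for the NVML PCI info structure used in
// the diff; only the fields relevant to the heuristic are included.
type pciInfo struct {
	Domain uint32
	Bus    uint32
	Device uint32
}

// isIntegratedPCIAddress reports whether the device sits at PCI address
// 0000:01:00, which on the observed Thor-based systems is where the
// iGPU appears. This is a sketch, not the merged implementation.
func isIntegratedPCIAddress(p pciInfo) bool {
	return p.Domain == 0 && p.Bus == 1 && p.Device == 0
}

func main() {
	fmt.Println(isIntegratedPCIAddress(pciInfo{Domain: 0, Bus: 1, Device: 0}))
	fmt.Println(isIntegratedPCIAddress(pciInfo{Domain: 0, Bus: 1, Device: 1}))
}
```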

csvDeviceNodeDiscoverer,
},
featureFlags: l.featureFlags,
})
Contributor:

Question -- What are the differences between the device specs generated for dGPUs and iGPUs? Is the addition of control device nodes (e.g. nvidiactl, nvidia-uvm) the main difference?

Member (Author):

The device specs generated for iGPUs depend entirely on the contents of the /etc/nvidia-container-runtime/host-files-for-container.d/devices.csv file that is constructed by the platform team. For example, on an Orin-based system I have:

$ cat /etc/nvidia-container-runtime/host-files-for-container.d/devices.csv
dev, /dev/dri/card*
dev, /dev/dri/renderD*
dir, /dev/dri/by-path
dev, /dev/fb0
dev, /dev/fb1
dev, /dev/host1x-fence
dev, /dev/nvhost-as-gpu
dev, /dev/nvhost-ctrl-gpu
dev, /dev/nvhost-ctrl-nvdla0
dev, /dev/nvhost-ctrl-nvdla1
dev, /dev/nvhost-ctrl-pva0
dev, /dev/nvhost-ctxsw-gpu
dev, /dev/nvhost-dbg-gpu
dev, /dev/nvhost-gpu
dev, /dev/nvhost-nvsched-gpu
dev, /dev/nvhost-power-gpu
dev, /dev/nvhost-prof-ctx-gpu
dev, /dev/nvhost-prof-dev-gpu
dev, /dev/nvhost-prof-gpu
dev, /dev/nvhost-sched-gpu
dev, /dev/nvhost-tsg-gpu
dev, /dev/nvgpu/igpu0/as
dev, /dev/nvgpu/igpu0/channel
dev, /dev/nvgpu/igpu0/ctrl
dev, /dev/nvgpu/igpu0/ctxsw
dev, /dev/nvgpu/igpu0/dbg
dev, /dev/nvgpu/igpu0/nvsched
dev, /dev/nvgpu/igpu0/power
dev, /dev/nvgpu/igpu0/prof
dev, /dev/nvgpu/igpu0/prof-ctx
dev, /dev/nvgpu/igpu0/prof-dev
dev, /dev/nvgpu/igpu0/sched
dev, /dev/nvgpu/igpu0/tsg
dev, /dev/nvidia-modeset
dev, /dev/nvidia0
dev, /dev/nvidiactl
dev, /dev/nvmap
dev, /dev/nvsciipc
dev, /dev/v4l2-nvdec
dev, /dev/v4l2-nvenc

This file is provided by the nvidia-l4t-init package:

$ dpkg -S  /etc/nvidia-container-runtime/host-files-for-container.d/devices.csv
nvidia-l4t-init: /etc/nvidia-container-runtime/host-files-for-container.d/devices.csv

Note that this includes /dev/nvidia0 and /dev/nvidiactl for this system. In the case of Thor-systems, this would include /dev/nvidia0, /dev/nvidia1, and /dev/nvidiactl.

For the purpose of this discussion, then, the primary difference between the device nodes for the two devices is that the dGPU includes the /dev/nvidia-uvm and /dev/nvidia-uvm-tools devices that are required for actually running CUDA applications. On a Thor-based system using nvgpu, the container also needs access to the OTHER device nodes mentioned in the CSV file. We currently include all of them, but this list could probably be reduced.

Also note that on a Thor-based system that includes a dGPU, the second (rendering) device node for the iGPU is /dev/nvidia2 and NOT /dev/nvidia1.
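The devices.csv format quoted above is simple: one "type, path" pair per line. A minimal Go reader for that shape might look like the following; mountSpec and parseCSV are illustrative names, not the toolkit's actual CSV parser:

```go
package main

import (
	"fmt"
	"strings"
)

// mountSpec represents one line of a host-files-for-container.d CSV
// file: a type ("dev", "dir", "lib", ...) and a host path.
type mountSpec struct {
	Type string
	Path string
}

// parseCSV parses "type, path" lines as shown in the devices.csv
// example above, skipping blank lines and comments.
func parseCSV(contents string) []mountSpec {
	var specs []mountSpec
	for _, line := range strings.Split(contents, "\n") {
		line = strings.TrimSpace(line)
		if line == "" || strings.HasPrefix(line, "#") {
			continue
		}
		parts := strings.SplitN(line, ",", 2)
		if len(parts) != 2 {
			continue
		}
		specs = append(specs, mountSpec{
			Type: strings.TrimSpace(parts[0]),
			Path: strings.TrimSpace(parts[1]),
		})
	}
	return specs
}

func main() {
	csv := "dev, /dev/nvidia0\ndev, /dev/nvidiactl\ndir, /dev/dri/by-path\n"
	for _, s := range parseCSV(csv) {
		fmt.Printf("%s -> %s\n", s.Type, s.Path)
	}
}
```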

// device level.
additionalDiscoverers: []discover.Discover{
(*nvmllib)(l).controlDeviceNodeDiscoverer(),
csvDeviceNodeDiscoverer,
@cdesiniotis (Contributor) commented Feb 22, 2026

Question -- Conceptually speaking, why do we have to add the csvDeviceNodeDiscoverer here? I ask since the fullGPUDeviceSpecGenerator will, by default, construct and use a device node discoverer here.

Member (Author):

We add this because in addition to the "standard" dGPU device nodes that are returned by the fullGPUDiscoverer that we construct as linked, we ALSO need access to (at least some of) the device nodes defined in the CSV file. The csvDeviceNodeDiscoverer in this case should be filtering out the specific device nodes (e.g. /dev/nvidia0 and /dev/nvidia2) associated with the iGPU.
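The filtering described here amounts to subtracting the iGPU-specific nodes from the CSV-discovered set. A minimal sketch, with a hypothetical filterDeviceNodes helper and a hard-coded exclusion set where the real code derives it from the detected iGPU:

```go
package main

import "fmt"

// filterDeviceNodes returns the paths from the CSV-discovered device
// nodes that are not in the exclusion set. Illustrative only; the
// toolkit builds the exclusion set from the detected iGPU devices.
func filterDeviceNodes(paths []string, exclude map[string]bool) []string {
	var out []string
	for _, p := range paths {
		if !exclude[p] {
			out = append(out, p)
		}
	}
	return out
}

func main() {
	csvNodes := []string{"/dev/nvidia0", "/dev/nvidia2", "/dev/nvidiactl", "/dev/nvmap"}
	// On the Thor example above, /dev/nvidia0 and /dev/nvidia2 belong to the iGPU.
	iGPUNodes := map[string]bool{"/dev/nvidia0": true, "/dev/nvidia2": true}
	fmt.Println(filterDeviceNodes(csvNodes, iGPUNodes)) // prints: [/dev/nvidiactl /dev/nvmap]
}
```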

