fix: nvidia-container-toolkit 1.18.0 jit-CDI mode, add nvidia-cdi-refresh is enabled validator for all AKS GPU sku #7963

Open
sulixu wants to merge 1 commit into main from ctk-e2e

sulixu commented Feb 26, 2026

What this PR does / why we need it:

NVIDIA Container Toolkit 1.17.x -> 1.18.x breaking change

NVIDIA Container Toolkit v1.18.0 is a feature release with the following high-level changes:

  • The default mode of the NVIDIA Container Runtime has been updated to make use of a just-in-time-generated CDI specification instead of defaulting to the legacy mode.
  • Added a systemd unit to generate CDI specifications for available devices automatically. This allows native CDI support in container engines such as Docker and Podman to be used without additional steps.

https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/1.18.0/release-notes.html

However, we don't invoke nvidia-cdi-hook at all in the AKS-GPU repo: https://github.com/Azure/aks-gpu/blob/main/config.sh#L4
nvidia-cdi-hook ships inside the nvidia-container-toolkit-base package.

This PR adds a validator that checks nvidia-cdi-refresh is enabled, for all AKS GPU SKUs.

Without Azure/aks-gpu#136, all non-managed AKS GPU tests should fail.

Which issue(s) this PR fixes:

Fixes #

Copilot AI review requested due to automatic review settings February 26, 2026 17:53
@sulixu sulixu changed the title add nvidia-cdi-refresh is enabled validator for all AKS GPU sku fix: nvidia-container-toolkit 1.18.0 jit-CDI mode, add nvidia-cdi-refresh is enabled validator for all AKS GPU sku Feb 26, 2026

Copilot AI left a comment

Pull request overview

Adds an E2E validator to confirm NVIDIA’s new nvidia-cdi-refresh systemd units are enabled and the last run succeeded, in response to NVIDIA Container Toolkit’s shift toward JIT-generated CDI specs.

Changes:

  • Introduces ValidateNvidiaCdiRefreshServiceRunning in e2e/validators.go.
  • Wires the new validator into multiple GPU scenarios (Ubuntu 22.04/24.04 GRID & GPU, Azure Linux v3 GPU) and the GPU NPD scenario helper.
  • Refactors ValidateNvidiaDevicePluginServiceRunning into a dedicated function block (no behavior change intended).

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| e2e/validators.go | Adds the nvidia-cdi-refresh systemd validation and refactors the device-plugin validator block. |
| e2e/test_helpers.go | Adds the new CDI refresh validation to the GPU NPD scenario helper validator chain. |
| e2e/scenario_test.go | Adds the new CDI refresh validation to multiple GPU scenario validators. |
Comments suppressed due to low confidence (2)

e2e/validators.go:1461

  • Indentation in this newly added function uses spaces instead of the gofmt-standard tabs used throughout the file. Please run gofmt (or otherwise format this block) to keep repository formatting consistent and avoid CI/lint diffs.
func ValidateNvidiaDevicePluginServiceRunning(ctx context.Context, s *Scenario) {
    s.T.Helper()
    s.T.Logf("validating that NVIDIA device plugin systemd service is running")

    command := []string{
        "set -ex",
        "systemctl is-active nvidia-device-plugin.service",
        "systemctl is-enabled nvidia-device-plugin.service",
    }
    execScriptOnVMForScenarioValidateExitCode(ctx, s, strings.Join(command, "\n"), 0, "NVIDIA device plugin systemd service should be active and enabled")

e2e/scenario_test.go:1010

  • The PR title/description says this validator should apply to all AKS GPU SKUs, but this change only wires the new validation into scenario_test.go and the GPUNPD helper. GPU managed-experience scenarios in e2e/scenario_gpu_managed_experience_test.go (e.g. Ubuntu2404/Ubuntu2204/AzureLinux3 NvidiaDevicePluginRunning) still won't exercise ValidateNvidiaCdiRefreshServiceRunning, so coverage is incomplete unless those are updated or the scope/title is narrowed.
		Validator: func(ctx context.Context, s *Scenario) {
			// Ensure nvidia-modprobe install does not restart kubelet and temporarily cause node to be unschedulable
			ValidateNvidiaModProbeInstalled(ctx, s)
			ValidateKubeletHasNotStopped(ctx, s)
			ValidateServicesDoNotRestartKubelet(ctx, s)
			ValidateNvidiaCdiRefreshServiceRunning(ctx, s)
		},

3 participants