Skip to content

Add e2e test for NVIDIA device plugin DaemonSet deployment with Renovate auto-update#7984

Open
ganeshkumarashok wants to merge 4 commits intomainfrom
aganeshkumar/nvidia-device-plugin-daemonset-e2e-renovate
Open

Add e2e test for NVIDIA device plugin DaemonSet deployment with Renovate auto-update#7984
ganeshkumarashok wants to merge 4 commits intomainfrom
aganeshkumar/nvidia-device-plugin-daemonset-e2e-renovate

Conversation

@ganeshkumarashok
Copy link
Contributor

Summary

  • Add a new e2e test (Test_Ubuntu2204_NvidiaDevicePlugin_Daemonset) that validates the NVIDIA device plugin works when deployed as a Kubernetes DaemonSet instead of the systemd service
  • This tests the "upstream" deployment model used by customers who manage their own device plugin deployment
  • Add Renovate configuration to auto-update the container image version in e2e test files

Changes

  1. New e2e test (e2e/scenario_gpu_daemonset_test.go):

    • Tests GPU node with device plugin as DaemonSet (not systemd)
    • Validates systemd service is NOT running
    • Deploys NVIDIA device plugin DaemonSet to the test node
    • Validates GPU resources are advertised
    • Validates GPU workloads can be scheduled
    • Uses mcr.microsoft.com/oss/v2/nvidia/k8s-device-plugin:v0.18.2
  2. Renovate auto-update (.github/renovate.json):

    • Add custom manager for e2e Go files to detect // renovate: datasource=docker depName=... comments
    • Add MCR device plugin image to nvidia-device-plugin package group for consistent reviewers

Test plan

  • Run Test_Ubuntu2204_NvidiaDevicePlugin_Daemonset e2e test
  • Verify Renovate can detect the image version (check Renovate logs)

Add a new e2e test that validates the NVIDIA device plugin works when
deployed as a Kubernetes DaemonSet instead of the systemd service.
This tests the "upstream" deployment model used by customers who
manage their own device plugin deployment.

Also add Renovate configuration to auto-update the container image
version in e2e test files:
- Add custom manager regex pattern for e2e Go files
- Add package name for MCR device plugin image to nvidia-device-plugin group
@github-actions
Copy link
Contributor

github-actions bot commented Feb 27, 2026

PR Title Lint Failed ❌

Current Title: Add e2e test for NVIDIA device plugin DaemonSet deployment with Renovate auto-update

Your PR title doesn't follow the expected format. Please update your PR title to follow one of these patterns:

Conventional Commits Format:

  • feat: add new feature - for new features
  • fix: resolve bug in component - for bug fixes
  • docs: update README - for documentation changes
  • refactor: improve code structure - for refactoring
  • test: add unit tests - for test additions
  • chore: remove dead code - for maintenance tasks
  • chore(deps): update dependencies - for updating dependencies
  • ci: update build pipeline - for CI/CD changes

Guidelines:

  • Use lowercase for the type and description
  • Keep the description concise but descriptive
  • Use imperative mood (e.g., "add" not "adds" or "added")
  • Don't end with a period

Examples:

  • feat(windows): add secure TLS bootstrapping for Windows nodes
  • fix: resolve kubelet certificate rotation issue
  • docs: update installation guide
  • Added new feature
  • Fix bug.
  • Update docs

Please update your PR title and the lint check will run again automatically.

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds coverage in the e2e suite for the “upstream” NVIDIA device plugin deployment model (as a Kubernetes DaemonSet) and configures Renovate to keep the test’s container image tag up to date.

Changes:

  • Introduces a new Ubuntu 22.04 GPU e2e scenario that deploys the NVIDIA device plugin as a DaemonSet and validates GPU advertisement + scheduling.
  • Adds a Renovate custom regex manager to update docker image tags referenced in e2e Go tests via // renovate: comments.
  • Extends the existing Renovate grouping for nvidia-device-plugin to also include the MCR upstream device-plugin image.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

File Description
e2e/scenario_gpu_daemonset_test.go New e2e GPU scenario that validates DaemonSet-based NVIDIA device plugin deployment (not systemd).
.github/renovate.json Adds a custom manager to update image tags in e2e Go tests and groups the MCR image with existing NVIDIA device plugin updates.

Comment on lines +84 to +91
// Check if the service exists and is inactive
// Using "is-active" which returns non-zero if not active
result := execScriptOnVMForScenario(ctx, s, "systemctl is-active nvidia-device-plugin.service 2>/dev/null || echo 'not-running'")
output := strings.TrimSpace(result.stdout)

// The service should either not exist or be inactive
if output == "active" {
s.T.Fatalf("nvidia-device-plugin.service is unexpectedly running - this test requires the systemd service to be disabled")
Copy link

Copilot AI Feb 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The systemctl is-active ... || echo 'not-running' pattern discards both the real service state (e.g., "inactive", "failed", "activating") and the exit code, so the test can incorrectly pass even if the service is transitioning/running (e.g., "activating"). Capture the actual systemctl is-active output/exit status and fail on any running-like state (active/activating), or use execScriptOnVMForScenarioValidateExitCode with the expected non-active exit codes.

Suggested change
// Check if the service exists and is inactive
// Using "is-active" which returns non-zero if not active
result := execScriptOnVMForScenario(ctx, s, "systemctl is-active nvidia-device-plugin.service 2>/dev/null || echo 'not-running'")
output := strings.TrimSpace(result.stdout)
// The service should either not exist or be inactive
if output == "active" {
s.T.Fatalf("nvidia-device-plugin.service is unexpectedly running - this test requires the systemd service to be disabled")
// Check the current service state using "is-active".
// This will return "active", "inactive", "failed", "activating", "unknown", etc.
result := execScriptOnVMForScenario(ctx, s, "systemctl is-active nvidia-device-plugin.service 2>/dev/null")
output := strings.TrimSpace(result.stdout)
// The service should either not exist or be in a non-running state.
// Treat both "active" and "activating" as failures, since the service
// must not be running when validating the DaemonSet-based deployment.
if output == "active" || output == "activating" {
s.T.Fatalf("nvidia-device-plugin.service is unexpectedly %s - this test requires the systemd service to be disabled", output)

Copilot uses AI. Check for mistakes.
Comment on lines +100 to +104
maxLen := 63
name := prefix + nodeName
if len(name) > maxLen {
name = name[:maxLen]
}
Copy link

Copilot AI Feb 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nvidiaDevicePluginDaemonsetName truncates the name to 63 chars but doesn’t ensure the truncated result still conforms to DNS label rules (must end with an alphanumeric). If truncation cuts on a '-', Kubernetes can reject the object/label value. Mirror the existing truncatePodName behavior by trimming trailing '-' after truncation (and apply any needed sanitization).

Copilot uses AI. Check for mistakes.
Comment on lines +205 to +209
_ = s.Runtime.Cluster.Kube.Typed.AppsV1().DaemonSets(ds.Namespace).Delete(
deleteCtx,
ds.Name,
metav1.DeleteOptions{},
)
Copy link

Copilot AI Feb 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test deletes an existing DaemonSet and then recreates it without waiting for deletion to finish. If a prior run left it terminating, the create can fail with AlreadyExists / "object is being deleted" and make the test flaky. Consider polling until a GET returns NotFound (or using an update-based approach) before creating.

Copilot uses AI. Check for mistakes.
- Add nvidia k8s-device-plugin container image to GPUContainerImages
  section in components.json for Renovate auto-updates
- Add GetGPUContainerImage helper in e2e/components to read GPU
  container image versions from components.json
- Update e2e test to read version from components.json instead of
  hardcoding
- Update Renovate package rule to match the new package name pattern
@github-actions github-actions bot added the components This pull request updates cached components on Linux or Windows VHDs label Feb 27, 2026
- Add new E2EContainerImages section in components.json for container
  images used only in e2e tests (not cached on VHD)
- Add schema definition for E2EContainerImages in components.cue
- Add GetE2EContainerImage helper to read from E2EContainerImages
- Move nvidia k8s-device-plugin from GPUContainerImages to
  E2EContainerImages to prevent VHD caching while enabling Renovate
  auto-updates
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.

Comment on lines 382 to 385
"matchPackageNames": [
"nvidia-device-plugin"
"nvidia-device-plugin",
"oss/v2/nvidia/k8s-device-plugin"
],
Copy link

Copilot AI Feb 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PR description says a new Renovate custom manager was added to auto-update container image versions in e2e Go test files (via // renovate: comments), but this renovate.json change only updates package grouping. There is still no customManagers entry that targets e2e/**/*.go or parses // renovate: lines, so Renovate will not update Go test image strings as described. Either add the intended custom manager configuration or update the PR description to match the actual approach (version coming from components.json).

Copilot uses AI. Check for mistakes.
Comment on lines +174 to +177
jsonBytes, err := os.ReadFile(componentsPath)
if err != nil {
return ""
}
Copy link

Copilot AI Feb 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

GetE2EContainerImage returns an empty string if reading/parsing components.json fails or the image entry is missing. In practice this causes later Kubernetes errors like “invalid image name” and hides the real root cause. Consider returning (string, error) (or failing fast) and have callers require.NoError/require.NotEmpty so test failures point to the missing/misconfigured components.json entry.

Copilot uses AI. Check for mistakes.
Comment on lines +161 to +162
Name: "nvidia-device-plugin-ctr",
Image: getNvidiaDevicePluginImage(),
Copy link

Copilot AI Feb 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The DaemonSet uses Image: getNvidiaDevicePluginImage() without asserting it’s non-empty/valid. If components.json lookup fails, the pod will be created with an empty image and the test will fail with a confusing scheduling/pull error. Add an early require.NotEmpty (or propagate an error from GetE2EContainerImage) so failures clearly indicate a missing/misconfigured E2EContainerImages entry.

Suggested change
Name: "nvidia-device-plugin-ctr",
Image: getNvidiaDevicePluginImage(),
Name: "nvidia-device-plugin-ctr",
Image: func() string {
img := getNvidiaDevicePluginImage()
require.NotEmpty(t, img, "nvidia device plugin image must be configured in E2EContainerImages (components.json entry %q)", nvidiaDevicePluginImageName)
return img
}(),

Copilot uses AI. Check for mistakes.
Comment on lines +2078 to +2081
"version": {
"renovateTag": "registry=https://mcr.microsoft.com, name=oss/v2/nvidia/k8s-device-plugin",
"latestVersion": "v0.18.2-1"
}
Copy link

Copilot AI Feb 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PR description says the test uses mcr.microsoft.com/oss/v2/nvidia/k8s-device-plugin:v0.18.2, but components.json pins latestVersion to v0.18.2-1. Please reconcile the version/tag format (either update the description or adjust latestVersion) to avoid confusion and to ensure the referenced tag is the one actually being tested.

Copilot uses AI. Check for mistakes.
// NVIDIA device plugin deployment.
func Test_Ubuntu2204_NvidiaDevicePlugin_Daemonset(t *testing.T) {
RunScenario(t, &Scenario{
Description: "Tests that NVIDIA device plugin works when deployed as a DaemonSet (not systemd service)",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we just add the device plugin daemonset to all the exisiting unmanaged scenarios? instead of adding a new one?

…204_GPUA10

Instead of a separate e2e test, inline the NVIDIA device plugin DaemonSet
deployment and validation into two existing regular managed GPU scenarios.
This avoids spinning up an additional GPU VM while still covering the
upstream DaemonSet deployment model on two different GPU SKUs (NC6s_v3
and NV6ads_A10_v5).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

components This pull request updates cached components on Linux or Windows VHDs

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants