test: add e2e test for NVIDIA device plugin as DaemonSet #7964
ganeshkumarashok wants to merge 5 commits into main from
Conversation
Add a new e2e test that validates GPU nodes work correctly when the NVIDIA device plugin is deployed as a Kubernetes DaemonSet instead of a systemd service. This tests the upstream deployment model commonly used by customers who manage their own device plugin deployment. The test:
- Provisions a GPU node with drivers but without the systemd device plugin
- Deploys nvidia-device-plugin v0.18.2 as a DaemonSet from MCR
- Validates GPU resources are advertised and workloads can be scheduled
Pull request overview
This pull request adds a new e2e test (Test_Ubuntu2204_NvidiaDevicePlugin_Daemonset) that validates GPU nodes work correctly when the NVIDIA device plugin is deployed as a Kubernetes DaemonSet instead of a systemd service. This tests the upstream deployment model commonly used by customers who manage their own device plugin deployment.
Changes:
- Adds a new GPU e2e test that provisions a Standard_NV6ads_A10_v5 GPU node with drivers enabled but systemd device plugin disabled
- Deploys nvidia-device-plugin:v0.18.2 from MCR as a DaemonSet
- Validates GPU resource advertisement and workload scheduling with DaemonSet-based device plugin
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
e2e/scenario_gpu_daemonset_test.go
Outdated
```go
if pod.Status.Phase == corev1.PodRunning {
	// Check if all containers are ready
	for _, containerStatus := range pod.Status.ContainerStatuses {
		if !containerStatus.Ready {
			s.T.Logf("Container %s is not ready yet", containerStatus.Name)
			return false, nil
		}
	}
	return true, nil
```
The wait loop checks whether any container in the pod is not ready, but it only logs which container is not ready; it never fails immediately on terminal errors like CrashLoopBackOff or ImagePullBackOff. If the pod enters a failed state, the test will wait the full 3 minutes before timing out.
Consider checking pod.Status.ContainerStatuses[].State.Waiting.Reason for error states (ImagePullBackOff, CrashLoopBackOff, etc.) and returning an error immediately rather than continuing to poll. This would make test failures faster and provide clearer error messages.
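One way to implement this fail-fast check, sketched here with stdlib-only stand-in types so the logic is self-contained (in the real test these would be `corev1.Pod` / `corev1.ContainerStatus`, and `podReadyOrFailed` / `fatalReasons` are hypothetical names, not helpers from this PR):

```go
package main

import "fmt"

// Minimal stand-ins for the corev1 pod-status fields this check reads.
type ContainerStateWaiting struct{ Reason, Message string }
type ContainerStatus struct {
	Name    string
	Ready   bool
	Waiting *ContainerStateWaiting // corev1 nests this under State.Waiting
}
type PodStatus struct {
	Running           bool // stands in for Phase == corev1.PodRunning
	ContainerStatuses []ContainerStatus
}

// fatalReasons are waiting states that will not resolve by polling longer.
var fatalReasons = map[string]bool{
	"ImagePullBackOff": true,
	"ErrImagePull":     true,
	"CrashLoopBackOff": true,
}

// podReadyOrFailed returns (true, nil) when all containers are ready,
// (false, error) when a container is stuck in a fatal waiting state,
// and (false, nil) when the caller should simply poll again.
func podReadyOrFailed(st PodStatus) (bool, error) {
	for _, cs := range st.ContainerStatuses {
		if cs.Waiting != nil && fatalReasons[cs.Waiting.Reason] {
			return false, fmt.Errorf("container %s stuck in %s: %s",
				cs.Name, cs.Waiting.Reason, cs.Waiting.Message)
		}
		if !cs.Ready {
			return false, nil
		}
	}
	return st.Running, nil
}
```

A wait helper built on `wait.PollUntilContextTimeout` would then abort as soon as this function returns a non-nil error instead of polling until the deadline.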
e2e/scenario_gpu_daemonset_test.go
Outdated
```go
AllowPrivilegeEscalation: to.Ptr(false),
Capabilities: &corev1.Capabilities{
	Drop: []corev1.Capability{"ALL"},
},
```
The SecurityContext configuration may prevent the device plugin from functioning correctly. The NVIDIA device plugin needs to interact with GPU devices and the kubelet device plugin socket, which typically requires either:
- Privileged mode (Privileged: true), OR
- Specific capabilities like SYS_ADMIN
The current configuration only drops all capabilities without granting the necessary ones. The upstream NVIDIA device plugin DaemonSet spec typically does NOT use AllowPrivilegeEscalation: false with dropped capabilities - it usually runs in privileged mode or with specific capabilities.
Without proper privileges, the device plugin won't be able to enumerate GPUs or create device plugin sockets. This could cause the test to fail when it tries to validate GPU resources are advertised. Verify this configuration works, or update it to match the upstream privileged configuration.
Suggested change:
```go
AllowPrivilegeEscalation: to.Ptr(false),
Capabilities: &corev1.Capabilities{
	Drop: []corev1.Capability{"ALL"},
},
Privileged: to.Ptr(true),
```
e2e/scenario_gpu_daemonset_test.go
Outdated
```go
Containers: []corev1.Container{
	{
		Name:  "nvidia-device-plugin-ctr",
		Image: "mcr.microsoft.com/oss/v2/nvidia/k8s-device-plugin:v0.18.2",
```
The image tag v0.18.2 does not match the version format used in parts/common/components.json (0.18.2-ubuntu22.04u1). While MCR may support the simplified v0.18.2 tag, it's better to use the exact version that aligns with what's cached on VHDs and referenced in components.json for consistency. This ensures the test validates the same version that production nodes would use.
The components.json shows nvidia-device-plugin versions like "0.18.2-ubuntu22.04u1" for Ubuntu 22.04. Consider using the full version tag or documenting why the simplified tag is intentional.
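One way to keep the two in sync is to define the repository and tag once and build the image reference from them; this is a sketch, and the constant names and the idea of sourcing the tag from components.json are assumptions, not code from this PR:

```go
package main

import "fmt"

// Hypothetical constants pinning the device-plugin image to the exact
// version recorded in parts/common/components.json, so the test and the
// VHD-cached image cannot drift apart.
const (
	nvidiaDevicePluginRepo = "mcr.microsoft.com/oss/v2/nvidia/k8s-device-plugin"
	// Full components.json-style tag rather than the simplified "v0.18.2".
	nvidiaDevicePluginTag = "0.18.2-ubuntu22.04u1"
)

// nvidiaDevicePluginImage returns the fully qualified image reference.
func nvidiaDevicePluginImage() string {
	return fmt.Sprintf("%s:%s", nvidiaDevicePluginRepo, nvidiaDevicePluginTag)
}
```

If the simplified tag is kept intentionally, a comment on the constant is a natural place to document that choice.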
```go
ds := nvidiaDevicePluginDaemonset(s.Runtime.VM.KubeName)
err := s.Runtime.Cluster.Kube.CreateDaemonset(ctx, ds)
require.NoError(s.T, err, "failed to create NVIDIA device plugin DaemonSet")
```
The DaemonSet is created but never explicitly cleaned up. While Kubernetes may garbage collect when the test cluster is torn down, it's better practice to add explicit cleanup to prevent resource leaks if the test fails before cluster cleanup. E2E tests should be self-contained and clean up resources they create.
Consider using s.T.Cleanup() or defer to delete the DaemonSet at the end of the validator function, similar to how other e2e tests handle resource cleanup.
Suggested change:
```go
// Ensure the DaemonSet is cleaned up after the test finishes
s.T.Cleanup(func() {
	s.T.Logf("Cleaning up NVIDIA device plugin DaemonSet %s/%s...", ds.Namespace, ds.Name)
	if err := s.Runtime.Cluster.Kube.Typed.AppsV1().DaemonSets(ds.Namespace).Delete(
		context.Background(),
		ds.Name,
		metav1.DeleteOptions{},
	); err != nil {
		s.T.Logf("Failed to delete NVIDIA device plugin DaemonSet %s/%s: %v", ds.Namespace, ds.Name, err)
	}
})
```
```go
Containers: []corev1.Container{
	{
		Name:  "nvidia-device-plugin-ctr",
		Image: "mcr.microsoft.com/oss/v2/nvidia/k8s-device-plugin:v0.18.2",
		Env: []corev1.EnvVar{
			{
				Name:  "FAIL_ON_INIT_ERROR",
				Value: "false",
			},
		},
		SecurityContext: &corev1.SecurityContext{
			AllowPrivilegeEscalation: to.Ptr(false),
			Capabilities: &corev1.Capabilities{
				Drop: []corev1.Capability{"ALL"},
			},
		},
```
The DaemonSet container spec is missing several configuration options present in the upstream NVIDIA device plugin deployment (https://github.com/NVIDIA/k8s-device-plugin/blob/main/deployments/static/nvidia-device-plugin.yml). Specifically:
- Missing the `/dev/null` device volume mount which the device plugin may need
- The upstream spec typically includes additional environment variables like NVIDIA_MIG_MONITOR_DEVICES, NVIDIA_DRIVER_CAPABILITIES, etc.
- The upstream spec sets `Privileged: true` in SecurityContext rather than just dropping capabilities
While the minimal spec may work for basic testing, using a configuration closer to the upstream deployment would make this test more valuable as it validates the real-world customer scenario. Consider either aligning more closely with the upstream spec or adding a comment explaining why this simplified configuration is sufficient for the test's purpose.
```go
waitForNvidiaDevicePluginDaemonsetReady(ctx, s)

// Validate that GPU resources are advertised by the device plugin
ValidateNodeAdvertisesGPUResources(ctx, s, 1, "nvidia.com/gpu")

// Validate that GPU workloads can be scheduled
ValidateGPUWorkloadSchedulable(ctx, s, 1)
```
After waiting for the device plugin pod to be ready, consider adding validation to check the pod logs for successful GPU discovery and device plugin registration. This would catch issues where the pod starts but the device plugin fails to function correctly (e.g., driver compatibility issues, incorrect configuration).
The logs should contain messages about discovering GPUs and registering with kubelet's device plugin framework. This would make the test more comprehensive and catch edge cases where the pod runs but doesn't actually register GPU resources properly.
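A minimal sketch of such a log check, kept stdlib-only; the marker strings are assumptions about what the device plugin logs (the exact wording varies by version, so they should be verified against real pod output), and `logsShowRegistration` is a hypothetical helper, not code from this PR:

```go
package main

import "strings"

// registrationMarkers are log fragments expected after a successful
// startup: the plugin serving its gRPC socket and registering with
// kubelet. These are assumptions to confirm against actual logs.
var registrationMarkers = []string{
	"Starting to serve",
	"Registered device plugin",
}

// logsShowRegistration reports whether the pod log text contains every
// expected marker, returning the first missing one so the test failure
// message says exactly which step never happened.
func logsShowRegistration(logs string) (bool, string) {
	for _, m := range registrationMarkers {
		if !strings.Contains(logs, m) {
			return false, m
		}
	}
	return true, ""
}
```

The test would fetch the pod logs via the typed client (`Pods(ns).GetLogs(...)`) after the ready wait and assert that this check passes.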
- Use unique DaemonSet name per node to avoid collisions in shared cluster
- Add cleanup to delete DaemonSet when test finishes
- Use Privileged mode matching upstream NVIDIA device plugin spec
- Use existing WaitUntilPodRunning helper instead of custom wait loop
- Add comments explaining image version choice
- Extract image version to constant for easier updates
- Add validation that systemd device plugin is not running
- Truncate DaemonSet name to 63 chars (K8s limit)
- Add timeout contexts to cleanup operations
- Delete existing DaemonSet before create for idempotency
Summary
Test Details
The test Test_Ubuntu2204_NvidiaDevicePlugin_Daemonset:
- Deploys nvidia-device-plugin:v0.18.2 from MCR (mcr.microsoft.com/oss/v2/nvidia/k8s-device-plugin) as a DaemonSet
Test plan