test: add e2e test for NVIDIA device plugin as DaemonSet #7964

Open

ganeshkumarashok wants to merge 5 commits into main from aganeshkumar/nvidia-device-plugin-daemonset-e2e

Conversation

@ganeshkumarashok
Contributor

Summary

  • Add a new e2e test that validates GPU nodes work correctly when the NVIDIA device plugin is deployed as a Kubernetes DaemonSet instead of a systemd service
  • This tests the upstream deployment model commonly used by customers who manage their own device plugin deployment

Test Details

The test Test_Ubuntu2204_NvidiaDevicePlugin_Daemonset:

  • Provisions a GPU node (Standard_NV6ads_A10_v5) with GPU drivers enabled but systemd device plugin disabled
  • Deploys nvidia-device-plugin:v0.18.2 from MCR (mcr.microsoft.com/oss/v2/nvidia/k8s-device-plugin) as a DaemonSet
  • Validates:
    • GPU resources are advertised by the device plugin
    • GPU workloads can be scheduled on the node

Test plan

  • CI pipeline runs the new GPU e2e test
  • Verify the DaemonSet-based device plugin properly registers GPU resources
  • Verify GPU workloads can be scheduled

Copilot AI (Contributor) left a comment


Pull request overview

This pull request adds a new e2e test (Test_Ubuntu2204_NvidiaDevicePlugin_Daemonset) that validates GPU nodes work correctly when the NVIDIA device plugin is deployed as a Kubernetes DaemonSet instead of a systemd service. This tests the upstream deployment model commonly used by customers who manage their own device plugin deployment.

Changes:

  • Adds a new GPU e2e test that provisions a Standard_NV6ads_A10_v5 GPU node with drivers enabled but systemd device plugin disabled
  • Deploys nvidia-device-plugin:v0.18.2 from MCR as a DaemonSet
  • Validates GPU resource advertisement and workload scheduling with DaemonSet-based device plugin

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Copilot AI (Contributor) left a comment


Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 6 comments.

Comment on lines +181 to +189
if pod.Status.Phase == corev1.PodRunning {
    // Check if all containers are ready
    for _, containerStatus := range pod.Status.ContainerStatuses {
        if !containerStatus.Ready {
            s.T.Logf("Container %s is not ready yet", containerStatus.Name)
            return false, nil
        }
    }
    return true, nil

Copilot AI Feb 26, 2026


The wait loop checks if any containers in the pod are not ready, but it only logs which container is not ready without failing immediately on errors like CrashLoopBackOff or ImagePullBackOff. If the pod enters a failed state, the test will wait the full 3 minutes before timing out.

Consider checking pod.Status.ContainerStatuses[].State.Waiting.Reason for error states (ImagePullBackOff, CrashLoopBackOff, etc.) and returning an error immediately rather than continuing to poll. This would make test failures faster and provide clearer error messages.
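A minimal sketch of the fail-fast check this comment suggests. The local types here are simplified stand-ins for the `k8s.io/api/core/v1` fields the real loop reads, the reason strings are standard kubelet waiting reasons, and the helper name is hypothetical:

```go
package main

import "fmt"

// Simplified stand-ins for the corev1 types used by the test; in the real
// wait loop these fields come from pod.Status.ContainerStatuses.
type ContainerStateWaiting struct{ Reason string }
type ContainerState struct{ Waiting *ContainerStateWaiting }
type ContainerStatus struct {
	Name  string
	Ready bool
	State ContainerState
}

// fatalWaitReasons lists waiting states that further polling will not resolve.
var fatalWaitReasons = map[string]bool{
	"ImagePullBackOff": true,
	"ErrImagePull":     true,
	"CrashLoopBackOff": true,
}

// checkContainers returns (allReady, err); a non-nil err signals the caller
// to fail the test immediately instead of polling until the timeout.
func checkContainers(statuses []ContainerStatus) (bool, error) {
	for _, cs := range statuses {
		if w := cs.State.Waiting; w != nil && fatalWaitReasons[w.Reason] {
			return false, fmt.Errorf("container %s is in terminal state %s", cs.Name, w.Reason)
		}
		if !cs.Ready {
			return false, nil // still starting; keep polling
		}
	}
	return true, nil
}

func main() {
	ready, err := checkContainers([]ContainerStatus{{
		Name:  "nvidia-device-plugin-ctr",
		State: ContainerState{Waiting: &ContainerStateWaiting{Reason: "ImagePullBackOff"}},
	}})
	fmt.Println(ready, err)
	// prints: false container nvidia-device-plugin-ctr is in terminal state ImagePullBackOff
}
```

Returning a non-nil error from the poll callback is what makes `wait`-style helpers stop early, so a stuck pull fails in seconds rather than after the full timeout.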

Comment on lines +117 to +120
AllowPrivilegeEscalation: to.Ptr(false),
Capabilities: &corev1.Capabilities{
    Drop: []corev1.Capability{"ALL"},
},

Copilot AI Feb 26, 2026


The SecurityContext configuration may prevent the device plugin from functioning correctly. The NVIDIA device plugin needs to interact with GPU devices and the kubelet device plugin socket, which typically requires either:

  1. Privileged mode (Privileged: true), OR
  2. Specific capabilities like SYS_ADMIN

The current configuration only drops all capabilities without granting the necessary ones. The upstream NVIDIA device plugin DaemonSet spec typically does NOT use AllowPrivilegeEscalation: false with dropped capabilities - it usually runs in privileged mode or with specific capabilities.

Without proper privileges, the device plugin won't be able to enumerate GPUs or create device plugin sockets. This could cause the test to fail when it tries to validate GPU resources are advertised. Verify this configuration works, or update it to match the upstream privileged configuration.

Suggested change:

- AllowPrivilegeEscalation: to.Ptr(false),
- Capabilities: &corev1.Capabilities{
-     Drop: []corev1.Capability{"ALL"},
- },
+ Privileged: to.Ptr(true),

Containers: []corev1.Container{
    {
        Name:  "nvidia-device-plugin-ctr",
        Image: "mcr.microsoft.com/oss/v2/nvidia/k8s-device-plugin:v0.18.2",

Copilot AI Feb 26, 2026


The image tag v0.18.2 does not match the version format used in parts/common/components.json (0.18.2-ubuntu22.04u1). While MCR may support the simplified v0.18.2 tag, it's better to use the exact version that aligns with what's cached on VHDs and referenced in components.json for consistency. This ensures the test validates the same version that production nodes would use.

The components.json shows nvidia-device-plugin versions like "0.18.2-ubuntu22.04u1" for Ubuntu 22.04. Consider using the full version tag or documenting why the simplified tag is intentional.
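One way to address this comment is to pin the image reference in a single constant with an explanatory comment, so the tag choice is visibly deliberate. The repo path and tag come from the PR; the constant and helper names are hypothetical:

```go
package main

import "fmt"

const (
	nvidiaDevicePluginRepo = "mcr.microsoft.com/oss/v2/nvidia/k8s-device-plugin"
	// Upstream-style tag; intentionally simpler than the
	// "0.18.2-ubuntu22.04u1" entry in parts/common/components.json,
	// since this test exercises the customer-managed DaemonSet model
	// rather than the VHD-cached systemd deployment.
	nvidiaDevicePluginTag = "v0.18.2"
)

// nvidiaDevicePluginImage builds the full image reference in one place,
// so version bumps touch a single line.
func nvidiaDevicePluginImage() string {
	return nvidiaDevicePluginRepo + ":" + nvidiaDevicePluginTag
}

func main() {
	fmt.Println(nvidiaDevicePluginImage())
	// prints: mcr.microsoft.com/oss/v2/nvidia/k8s-device-plugin:v0.18.2
}
```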

Copilot generated this review using guidance from repository custom instructions.
ds := nvidiaDevicePluginDaemonset(s.Runtime.VM.KubeName)
err := s.Runtime.Cluster.Kube.CreateDaemonset(ctx, ds)
require.NoError(s.T, err, "failed to create NVIDIA device plugin DaemonSet")


Copilot AI Feb 26, 2026


The DaemonSet is created but never explicitly cleaned up. While Kubernetes may garbage collect when the test cluster is torn down, it's better practice to add explicit cleanup to prevent resource leaks if the test fails before cluster cleanup. E2E tests should be self-contained and clean up resources they create.

Consider using s.T.Cleanup() or defer to delete the DaemonSet at the end of the validator function, similar to how other e2e tests handle resource cleanup.

Suggested change:

// Ensure the DaemonSet is cleaned up after the test finishes
s.T.Cleanup(func() {
    s.T.Logf("Cleaning up NVIDIA device plugin DaemonSet %s/%s...", ds.Namespace, ds.Name)
    if err := s.Runtime.Cluster.Kube.Typed.AppsV1().DaemonSets(ds.Namespace).Delete(
        context.Background(),
        ds.Name,
        metav1.DeleteOptions{},
    ); err != nil {
        s.T.Logf("Failed to delete NVIDIA device plugin DaemonSet %s/%s: %v", ds.Namespace, ds.Name, err)
    }
})

Comment on lines +106 to +121
Containers: []corev1.Container{
    {
        Name:  "nvidia-device-plugin-ctr",
        Image: "mcr.microsoft.com/oss/v2/nvidia/k8s-device-plugin:v0.18.2",
        Env: []corev1.EnvVar{
            {
                Name:  "FAIL_ON_INIT_ERROR",
                Value: "false",
            },
        },
        SecurityContext: &corev1.SecurityContext{
            AllowPrivilegeEscalation: to.Ptr(false),
            Capabilities: &corev1.Capabilities{
                Drop: []corev1.Capability{"ALL"},
            },
        },

Copilot AI Feb 26, 2026


The DaemonSet container spec is missing several configuration options present in the upstream NVIDIA device plugin deployment (https://github.com/NVIDIA/k8s-device-plugin/blob/main/deployments/static/nvidia-device-plugin.yml). Specifically:

  1. Missing the /dev/null device volume mount which the device plugin may need
  2. The upstream spec typically includes additional environment variables like NVIDIA_MIG_MONITOR_DEVICES, NVIDIA_DRIVER_CAPABILITIES, etc.
  3. The upstream spec sets Privileged: true in SecurityContext rather than just dropping capabilities

While the minimal spec may work for basic testing, using a configuration closer to the upstream deployment would make this test more valuable as it validates the real-world customer scenario. Consider either aligning more closely with the upstream spec or adding a comment explaining why this simplified configuration is sufficient for the test's purpose.

Comment on lines +51 to +57
waitForNvidiaDevicePluginDaemonsetReady(ctx, s)

// Validate that GPU resources are advertised by the device plugin
ValidateNodeAdvertisesGPUResources(ctx, s, 1, "nvidia.com/gpu")

// Validate that GPU workloads can be scheduled
ValidateGPUWorkloadSchedulable(ctx, s, 1)

Copilot AI Feb 26, 2026


After waiting for the device plugin pod to be ready, consider adding validation to check the pod logs for successful GPU discovery and device plugin registration. This would catch issues where the pod starts but the device plugin fails to function correctly (e.g., driver compatibility issues, incorrect configuration).

The logs should contain messages about discovering GPUs and registering with kubelet's device plugin framework. This would make the test more comprehensive and catch edge cases where the pod runs but doesn't actually register GPU resources properly.
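One way to act on this suggestion: after the pod is Ready, fetch its logs and scan them for registration markers. A sketch of the scanning side is below; the marker string is an assumption (exact wording varies across k8s-device-plugin versions, so it should be checked against the pinned image's actual output), and the helper name is hypothetical:

```go
package main

import (
	"fmt"
	"strings"
)

// logsShowRegistration reports whether every expected marker appears in the
// captured pod logs, returning the missing markers for a useful failure
// message.
func logsShowRegistration(logs string, markers []string) (ok bool, missing []string) {
	for _, m := range markers {
		if !strings.Contains(logs, m) {
			missing = append(missing, m)
		}
	}
	return len(missing) == 0, missing
}

func main() {
	// Assumed log excerpt; real logs would come from the pod log API.
	logs := "Retrieving plugins.\nRegistered device plugin for 'nvidia.com/gpu' with Kubelet\n"
	ok, missing := logsShowRegistration(logs, []string{
		"Registered device plugin for 'nvidia.com/gpu' with Kubelet",
	})
	fmt.Println(ok, missing)
	// prints: true []
}
```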

@surajssd (Member) left a comment


nvm

- Use unique DaemonSet name per node to avoid collisions in shared cluster
- Add cleanup to delete DaemonSet when test finishes
- Use Privileged mode matching upstream NVIDIA device plugin spec
- Use existing WaitUntilPodRunning helper instead of custom wait loop
- Add comments explaining image version choice
- Extract image version to constant for easier updates
- Add validation that systemd device plugin is not running
- Truncate DaemonSet name to 63 chars (K8s limit)
- Add timeout contexts to cleanup operations
- Delete existing DaemonSet before create for idempotency