test: add e2e test for NVIDIA device plugin as DaemonSet #7964
ganeshkumarashok wants to merge 5 commits into main from
Conversation
Add a new e2e test that validates GPU nodes work correctly when the NVIDIA device plugin is deployed as a Kubernetes DaemonSet instead of a systemd service. This tests the upstream deployment model commonly used by customers who manage their own device plugin deployment. The test:
- Provisions a GPU node with drivers but without the systemd device plugin
- Deploys nvidia-device-plugin v0.18.2 as a DaemonSet from MCR
- Validates GPU resources are advertised and workloads can be scheduled
Pull request overview
This pull request adds a new e2e test (Test_Ubuntu2204_NvidiaDevicePlugin_Daemonset) that validates GPU nodes work correctly when the NVIDIA device plugin is deployed as a Kubernetes DaemonSet instead of a systemd service. This tests the upstream deployment model commonly used by customers who manage their own device plugin deployment.
Changes:
- Adds a new GPU e2e test that provisions a Standard_NV6ads_A10_v5 GPU node with drivers enabled but systemd device plugin disabled
- Deploys nvidia-device-plugin:v0.18.2 from MCR as a DaemonSet
- Validates GPU resource advertisement and workload scheduling with DaemonSet-based device plugin
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
e2e/scenario_gpu_daemonset_test.go
Outdated
```go
if pod.Status.Phase == corev1.PodRunning {
	// Check if all containers are ready
	for _, containerStatus := range pod.Status.ContainerStatuses {
		if !containerStatus.Ready {
			s.T.Logf("Container %s is not ready yet", containerStatus.Name)
			return false, nil
		}
	}
	return true, nil
```
The wait loop checks whether any container in the pod is not ready, but it only logs which container is not ready; it never fails immediately on terminal errors like CrashLoopBackOff or ImagePullBackOff. If the pod enters a failed state, the test will wait the full 3 minutes before timing out.
Consider checking pod.Status.ContainerStatuses[].State.Waiting.Reason for error states (ImagePullBackOff, CrashLoopBackOff, etc.) and returning an error immediately rather than continuing to poll. This would make test failures faster and provide clearer error messages.
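One way to implement this fail-fast check, sketched here with stdlib-only stand-in types so the logic is self-contained (in the real test these would be `corev1.Pod` / `corev1.ContainerStatus`, and `podReadyOrFailed` / `fatalReasons` are hypothetical names, not helpers from this PR):

```go
package main

import "fmt"

// Minimal stand-ins for the corev1 pod-status fields this check reads.
type ContainerStateWaiting struct{ Reason, Message string }
type ContainerStatus struct {
	Name    string
	Ready   bool
	Waiting *ContainerStateWaiting // corev1 nests this under State.Waiting
}
type PodStatus struct {
	Running           bool // stands in for Phase == corev1.PodRunning
	ContainerStatuses []ContainerStatus
}

// fatalReasons are waiting states that will not resolve by polling longer.
var fatalReasons = map[string]bool{
	"ImagePullBackOff": true,
	"ErrImagePull":     true,
	"CrashLoopBackOff": true,
}

// podReadyOrFailed returns (true, nil) when all containers are ready,
// (false, error) when a container is stuck in a fatal waiting state,
// and (false, nil) when the caller should simply poll again.
func podReadyOrFailed(st PodStatus) (bool, error) {
	for _, cs := range st.ContainerStatuses {
		if cs.Waiting != nil && fatalReasons[cs.Waiting.Reason] {
			return false, fmt.Errorf("container %s stuck in %s: %s",
				cs.Name, cs.Waiting.Reason, cs.Waiting.Message)
		}
		if !cs.Ready {
			return false, nil
		}
	}
	return st.Running, nil
}
```

A wait helper built on `wait.PollUntilContextTimeout` would then abort as soon as this function returns a non-nil error instead of polling until the deadline.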
e2e/scenario_gpu_daemonset_test.go
Outdated
```go
AllowPrivilegeEscalation: to.Ptr(false),
Capabilities: &corev1.Capabilities{
	Drop: []corev1.Capability{"ALL"},
},
```
The SecurityContext configuration may prevent the device plugin from functioning correctly. The NVIDIA device plugin needs to interact with GPU devices and the kubelet device plugin socket, which typically requires either:
- Privileged mode (Privileged: true), OR
- Specific capabilities like SYS_ADMIN
The current configuration only drops all capabilities without granting the necessary ones. The upstream NVIDIA device plugin DaemonSet spec typically does NOT use AllowPrivilegeEscalation: false with dropped capabilities - it usually runs in privileged mode or with specific capabilities.
Without proper privileges, the device plugin won't be able to enumerate GPUs or create device plugin sockets. This could cause the test to fail when it tries to validate GPU resources are advertised. Verify this configuration works, or update it to match the upstream privileged configuration.
Suggested change:
```go
AllowPrivilegeEscalation: to.Ptr(false),
Capabilities: &corev1.Capabilities{
	Drop: []corev1.Capability{"ALL"},
},
Privileged: to.Ptr(true),
```
e2e/scenario_gpu_daemonset_test.go
Outdated
```go
Containers: []corev1.Container{
	{
		Name:  "nvidia-device-plugin-ctr",
		Image: "mcr.microsoft.com/oss/v2/nvidia/k8s-device-plugin:v0.18.2",
```
The image tag v0.18.2 does not match the version format used in parts/common/components.json (0.18.2-ubuntu22.04u1). While MCR may support the simplified v0.18.2 tag, it's better to use the exact version that aligns with what's cached on VHDs and referenced in components.json for consistency. This ensures the test validates the same version that production nodes would use.
The components.json shows nvidia-device-plugin versions like "0.18.2-ubuntu22.04u1" for Ubuntu 22.04. Consider using the full version tag or documenting why the simplified tag is intentional.
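One way to keep the two in sync is to define the repository and tag once and build the image reference from them; this is a sketch, and the constant names and the idea of sourcing the tag from components.json are assumptions, not code from this PR:

```go
package main

import "fmt"

// Hypothetical constants pinning the device-plugin image to the exact
// version recorded in parts/common/components.json, so the test and the
// VHD-cached image cannot drift apart.
const (
	nvidiaDevicePluginRepo = "mcr.microsoft.com/oss/v2/nvidia/k8s-device-plugin"
	// Full components.json-style tag rather than the simplified "v0.18.2".
	nvidiaDevicePluginTag = "0.18.2-ubuntu22.04u1"
)

// nvidiaDevicePluginImage returns the fully qualified image reference.
func nvidiaDevicePluginImage() string {
	return fmt.Sprintf("%s:%s", nvidiaDevicePluginRepo, nvidiaDevicePluginTag)
}
```

If the simplified tag is kept intentionally, a comment on the constant is a natural place to document that choice.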
```go
ds := nvidiaDevicePluginDaemonset(s.Runtime.VM.KubeName)
err := s.Runtime.Cluster.Kube.CreateDaemonset(ctx, ds)
require.NoError(s.T, err, "failed to create NVIDIA device plugin DaemonSet")
```
The DaemonSet is created but never explicitly cleaned up. While Kubernetes may garbage collect when the test cluster is torn down, it's better practice to add explicit cleanup to prevent resource leaks if the test fails before cluster cleanup. E2E tests should be self-contained and clean up resources they create.
Consider using s.T.Cleanup() or defer to delete the DaemonSet at the end of the validator function, similar to how other e2e tests handle resource cleanup.
Suggested change:
```go
// Ensure the DaemonSet is cleaned up after the test finishes
s.T.Cleanup(func() {
	s.T.Logf("Cleaning up NVIDIA device plugin DaemonSet %s/%s...", ds.Namespace, ds.Name)
	if err := s.Runtime.Cluster.Kube.Typed.AppsV1().DaemonSets(ds.Namespace).Delete(
		context.Background(),
		ds.Name,
		metav1.DeleteOptions{},
	); err != nil {
		s.T.Logf("Failed to delete NVIDIA device plugin DaemonSet %s/%s: %v", ds.Namespace, ds.Name, err)
	}
})
```
```go
Containers: []corev1.Container{
	{
		Name:  "nvidia-device-plugin-ctr",
		Image: "mcr.microsoft.com/oss/v2/nvidia/k8s-device-plugin:v0.18.2",
		Env: []corev1.EnvVar{
			{
				Name:  "FAIL_ON_INIT_ERROR",
				Value: "false",
			},
		},
		SecurityContext: &corev1.SecurityContext{
			AllowPrivilegeEscalation: to.Ptr(false),
			Capabilities: &corev1.Capabilities{
				Drop: []corev1.Capability{"ALL"},
			},
		},
```
The DaemonSet container spec is missing several configuration options present in the upstream NVIDIA device plugin deployment (https://github.com/NVIDIA/k8s-device-plugin/blob/main/deployments/static/nvidia-device-plugin.yml). Specifically:
- Missing the `/dev/null` device volume mount which the device plugin may need
- The upstream spec typically includes additional environment variables like NVIDIA_MIG_MONITOR_DEVICES, NVIDIA_DRIVER_CAPABILITIES, etc.
- The upstream spec sets `Privileged: true` in SecurityContext rather than just dropping capabilities
While the minimal spec may work for basic testing, using a configuration closer to the upstream deployment would make this test more valuable as it validates the real-world customer scenario. Consider either aligning more closely with the upstream spec or adding a comment explaining why this simplified configuration is sufficient for the test's purpose.
```go
waitForNvidiaDevicePluginDaemonsetReady(ctx, s)

// Validate that GPU resources are advertised by the device plugin
ValidateNodeAdvertisesGPUResources(ctx, s, 1, "nvidia.com/gpu")

// Validate that GPU workloads can be scheduled
ValidateGPUWorkloadSchedulable(ctx, s, 1)
```
After waiting for the device plugin pod to be ready, consider adding validation to check the pod logs for successful GPU discovery and device plugin registration. This would catch issues where the pod starts but the device plugin fails to function correctly (e.g., driver compatibility issues, incorrect configuration).
The logs should contain messages about discovering GPUs and registering with kubelet's device plugin framework. This would make the test more comprehensive and catch edge cases where the pod runs but doesn't actually register GPU resources properly.
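A minimal sketch of such a log check, kept stdlib-only; the marker strings are assumptions about what the device plugin logs (the exact wording varies by version, so they should be verified against real pod output), and `logsShowRegistration` is a hypothetical helper, not code from this PR:

```go
package main

import "strings"

// registrationMarkers are log fragments expected after a successful
// startup: the plugin serving its gRPC socket and registering with
// kubelet. These are assumptions to confirm against actual logs.
var registrationMarkers = []string{
	"Starting to serve",
	"Registered device plugin",
}

// logsShowRegistration reports whether the pod log text contains every
// expected marker, returning the first missing one so the test failure
// message says exactly which step never happened.
func logsShowRegistration(logs string) (bool, string) {
	for _, m := range registrationMarkers {
		if !strings.Contains(logs, m) {
			return false, m
		}
	}
	return true, ""
}
```

The test would fetch the pod logs via the typed client (`Pods(ns).GetLogs(...)`) after the ready wait and assert that this check passes.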
- Use unique DaemonSet name per node to avoid collisions in shared cluster
- Add cleanup to delete DaemonSet when test finishes
- Use Privileged mode matching upstream NVIDIA device plugin spec
- Use existing WaitUntilPodRunning helper instead of custom wait loop
- Add comments explaining image version choice
- Extract image version to constant for easier updates
- Add validation that systemd device plugin is not running
- Truncate DaemonSet name to 63 chars (K8s limit)
- Add timeout contexts to cleanup operations
- Delete existing DaemonSet before create for idempotency
Summary
Test Details
The test Test_Ubuntu2204_NvidiaDevicePlugin_Daemonset:
- Deploys nvidia-device-plugin:v0.18.2 from MCR (mcr.microsoft.com/oss/v2/nvidia/k8s-device-plugin) as a DaemonSet
Test plan