feat: azurelinux add nvidia vgpu driver installation selection #7986

Draft
miz060 wants to merge 1 commit into main from mitchzhu/azl-grid-gpu_driver-pr

miz060 (Member) commented Feb 27, 2026

What this PR does / why we need it:
Converged GPU sizes (NVads_A10_v5, NCads_A10_v4) require NVIDIA GRID vGPU guest drivers instead of standard CUDA drivers. Until now, AzureLinux 3.0 had no GRID driver support and therefore could not support these sizes. This PR adds AzureLinux GRID driver installation logic, routing converged sizes to the GRID driver path based on NVIDIA_GPU_DRIVER_TYPE while leaving the existing cuda/cuda-open selection unchanged for all other GPU SKUs.

Which issue(s) this PR fixes:

Fixes #

Validation:

Copilot AI review requested due to automatic review settings February 27, 2026 20:59
miz060 (Member, Author) commented Feb 27, 2026

Waiting for nvidia-vgpu-guest-driver to land in PMC.

draft grid gpu driver selection logic for azurelinux

signed-off-by:  <mitchzhu@microsoft.com>
@miz060 miz060 force-pushed the mitchzhu/azl-grid-gpu_driver-pr branch from 8895c0f to f869d83 Compare February 27, 2026 21:03
Copilot AI (Contributor) left a comment

Pull request overview

Adds NVIDIA GRID (vGPU guest) driver selection for converged A10 SKUs on Azure Linux/Mariner by branching GPU driver installation based on NVIDIA_GPU_DRIVER_TYPE, and extends ShellSpec coverage for the routing logic.

Changes:

  • Add downloadGridDrivers() and route converged SKUs (NVIDIA_GPU_DRIVER_TYPE=grid) to GRID installation in downloadGPUDrivers().
  • Add ShellSpec tests validating GRID vs CUDA vs CUDA-open routing behavior.
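
The routing described above can be illustrated with a minimal sketch. This is not the PR's code: the installer functions are stubs that only print which path was taken, `downloadCudaDrivers` and `downloadCudaOpenDrivers` are hypothetical names, and treating `cuda` and `cuda-open` as values of NVIDIA_GPU_DRIVER_TYPE is an assumption inferred from the description.

```shell
# Stub installers (hypothetical bodies): each only reports the path taken.
downloadGridDrivers()     { echo "grid"; }
downloadCudaDrivers()     { echo "cuda"; }       # hypothetical name
downloadCudaOpenDrivers() { echo "cuda-open"; }  # hypothetical name

# Route on NVIDIA_GPU_DRIVER_TYPE. Only "grid" takes the GRID path;
# an unset or unrecognized value falls back to the CUDA path (assumption).
downloadGPUDrivers() {
    case "${NVIDIA_GPU_DRIVER_TYPE:-cuda}" in
        grid)      downloadGridDrivers ;;
        cuda-open) downloadCudaOpenDrivers ;;
        *)         downloadCudaDrivers ;;
    esac
}
```

Keeping the GRID branch as a separate case arm means the existing cuda/cuda-open arms do not need to change when converged sizes are added.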

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 6 comments.

| File | Description |
| --- | --- |
| parts/linux/cloud-init/artifacts/mariner/cse_install_mariner.sh | Introduces GRID driver install function and selects GRID vs CUDA driver flow based on NVIDIA_GPU_DRIVER_TYPE. |
| spec/parts/linux/cloud-init/artifacts/cse_install_mariner_spec.sh | Adds tests that validate the new selection/routing behavior without performing real downloads. |

Comment on lines +127 to +135
# TODO(mitchzhu): GRID driver RPM is not yet available on PMC (packages.microsoft.com).
# Once published, replace with:
# GRID_PACKAGE=$(dnf repoquery -y --available "nvidia-vgpu-guest-driver*" | \
# grep -E "nvidia-vgpu-guest-driver-[0-9]+.*_${KERNEL_VERSION}" | sort -V | tail -n 1)
# dnf_install 30 1 600 ${GRID_PACKAGE}
local grid_rpm="nvidia-vgpu-guest-driver-570.195.03-1_${KERNEL_VERSION}.x86_64.rpm"
local grid_url="https://github.com/miz060/AgentBaker/releases/download/grid-driver-v570.195.03/nvidia-vgpu-guest-driver-570.195.03-1_6.6.121.1.1.azl3.x86_64.rpm"
echo "Installing GRID driver: ${grid_rpm}"
dnf_install 5 10 600 "${grid_url}" || exit $ERR_GPU_DRIVERS_INSTALL_TIMEOUT

Copilot AI Feb 27, 2026

downloadGridDrivers is downloading and installing an RPM from a personal GitHub release URL. This introduces an untracked external dependency (not in components.json / not from packages.microsoft.com or packages.aks.azure.com) and bypasses repo signature/renovation controls. Please publish the GRID RPM to an approved repo and source it via the normal package/repoquery + dnf_install flow (or add it to components.json if it must be fetched as an artifact), and include integrity verification consistent with other downloads.

Suggested change
- # TODO(mitchzhu): GRID driver RPM is not yet available on PMC (packages.microsoft.com).
- # Once published, replace with:
- # GRID_PACKAGE=$(dnf repoquery -y --available "nvidia-vgpu-guest-driver*" | \
- # grep -E "nvidia-vgpu-guest-driver-[0-9]+.*_${KERNEL_VERSION}" | sort -V | tail -n 1)
- # dnf_install 30 1 600 ${GRID_PACKAGE}
- local grid_rpm="nvidia-vgpu-guest-driver-570.195.03-1_${KERNEL_VERSION}.x86_64.rpm"
- local grid_url="https://github.com/miz060/AgentBaker/releases/download/grid-driver-v570.195.03/nvidia-vgpu-guest-driver-570.195.03-1_6.6.121.1.1.azl3.x86_64.rpm"
- echo "Installing GRID driver: ${grid_rpm}"
- dnf_install 5 10 600 "${grid_url}" || exit $ERR_GPU_DRIVERS_INSTALL_TIMEOUT
+ # GRID driver RPM is published via approved repos; resolve it via repoquery
+ # to ensure we install a package that matches the current kernel version.
+ GRID_PACKAGE=$(dnf repoquery -y --available "nvidia-vgpu-guest-driver*" | \
+     grep -E "^nvidia-vgpu-guest-driver-[0-9]+.*_${KERNEL_VERSION}" | sort -V | tail -n 1)
+ if [ -z "$GRID_PACKAGE" ]; then
+     echo "No NVIDIA GRID package found for kernel ${KERNEL_VERSION}"
+     exit $ERR_MISSING_CUDA_PACKAGE
+ fi
+ echo "Installing GRID driver: ${GRID_PACKAGE}"
+ dnf_install 30 1 600 ${GRID_PACKAGE} || exit $ERR_GPU_DRIVERS_INSTALL_TIMEOUT

Comment on lines +132 to +133
local grid_rpm="nvidia-vgpu-guest-driver-570.195.03-1_${KERNEL_VERSION}.x86_64.rpm"
local grid_url="https://github.com/miz060/AgentBaker/releases/download/grid-driver-v570.195.03/nvidia-vgpu-guest-driver-570.195.03-1_6.6.121.1.1.azl3.x86_64.rpm"

Copilot AI Feb 27, 2026

grid_url hardcodes a specific kernel version (6.6.121.1.1.azl3), while grid_rpm is built from ${KERNEL_VERSION}. If the node kernel differs, the downloaded filename/content won’t match the expected package and installation will fail. Please derive the download URL/filename from ${KERNEL_VERSION} (or select the correct RPM via repoquery once available) so this works across kernel updates.

Suggested change
- local grid_rpm="nvidia-vgpu-guest-driver-570.195.03-1_${KERNEL_VERSION}.x86_64.rpm"
- local grid_url="https://github.com/miz060/AgentBaker/releases/download/grid-driver-v570.195.03/nvidia-vgpu-guest-driver-570.195.03-1_6.6.121.1.1.azl3.x86_64.rpm"
+ local kernel_version="${KERNEL_VERSION:-$(uname -r | sed 's/-/./g')}"
+ local grid_rpm="nvidia-vgpu-guest-driver-570.195.03-1_${kernel_version}.x86_64.rpm"
+ local grid_url="https://github.com/miz060/AgentBaker/releases/download/grid-driver-v570.195.03/${grid_rpm}"

# Once published, replace with:
# GRID_PACKAGE=$(dnf repoquery -y --available "nvidia-vgpu-guest-driver*" | \
# grep -E "nvidia-vgpu-guest-driver-[0-9]+.*_${KERNEL_VERSION}" | sort -V | tail -n 1)
# dnf_install 30 1 600 ${GRID_PACKAGE}

Copilot AI Feb 27, 2026

downloadGridDrivers relies on the caller to have set KERNEL_VERSION as a global variable. That hidden coupling makes the function fragile and easier to misuse. Please compute the kernel version inside downloadGridDrivers (or pass it in as a parameter) so the function is self-contained.

Suggested change
- # dnf_install 30 1 600 ${GRID_PACKAGE}
+ # dnf_install 30 1 600 ${GRID_PACKAGE}
+ local KERNEL_VERSION
+ KERNEL_VERSION=$(uname -r)
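
The other option the comment mentions, passing the kernel version in as a parameter, might look like the following minimal sketch. `gridRpmName` is a hypothetical helper that is not part of the PR; the file-name pattern is copied from the RPM name shown in the diff, and whether the output of `uname -r` needs normalizing before use is left to the caller.

```shell
# Hypothetical helper: build the GRID RPM file name from explicit arguments
# instead of reading a global KERNEL_VERSION.
#   $1 = driver version, $2 = kernel version string as used in the RPM name
gridRpmName() {
    local driver_version="${1:-}"
    local kernel_version="${2:-}"
    if [ -z "${driver_version}" ] || [ -z "${kernel_version}" ]; then
        echo "usage: gridRpmName <driver_version> <kernel_version>" >&2
        return 1
    fi
    echo "nvidia-vgpu-guest-driver-${driver_version}-1_${kernel_version}.x86_64.rpm"
}
```

Failing when the kernel version is missing makes the dependency visible at the call site rather than producing a file name with an empty kernel component.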

Comment on lines +134 to +135
echo "Installing GRID driver: ${grid_rpm}"
dnf_install 5 10 600 "${grid_url}" || exit $ERR_GPU_DRIVERS_INSTALL_TIMEOUT

Copilot AI Feb 27, 2026

This path uses wget, but installDeps for Mariner/AzureLinux doesn’t install a wget package (and the rest of this script generally uses the existing curl-based retry helpers). Unless the base image guarantees wget is present, this will fail at runtime. Please switch to the existing curl helper(s) or ensure the required downloader is installed before this runs.

Suggested change
- echo "Installing GRID driver: ${grid_rpm}"
- dnf_install 5 10 600 "${grid_url}" || exit $ERR_GPU_DRIVERS_INSTALL_TIMEOUT
+ local grid_local_path="/tmp/${grid_rpm}"
+ echo "Downloading GRID driver from ${grid_url} to ${grid_local_path}"
+ retrycmd_if_failure 5 10 600 curl -fSL "${grid_url}" -o "${grid_local_path}" || exit $ERR_GPU_DRIVERS_INSTALL_TIMEOUT
+ echo "Installing GRID driver: ${grid_local_path}"
+ dnf_install 5 10 600 "${grid_local_path}" || exit $ERR_GPU_DRIVERS_INSTALL_TIMEOUT

local grid_url="https://github.com/miz060/AgentBaker/releases/download/grid-driver-v570.195.03/nvidia-vgpu-guest-driver-570.195.03-1_6.6.121.1.1.azl3.x86_64.rpm"
echo "Installing GRID driver: ${grid_rpm}"
dnf_install 5 10 600 "${grid_url}" || exit $ERR_GPU_DRIVERS_INSTALL_TIMEOUT
}

Copilot AI Feb 27, 2026

Installing with rpm -ivh is not idempotent (it will fail if the package is already installed) and bypasses the dnf_install retry/timeout/error-handling conventions used elsewhere in this script. Consider using dnf_install from an approved repo, or make the RPM installation idempotent (e.g., upgrade semantics or a pre-check) to avoid failures on reruns.
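
The pre-check mentioned above could be done by querying the RPM database before installing. A minimal sketch, assuming a hypothetical wrapper `installIfMissing` that is not part of the PR; the stand-in `dnf_install` is defined only when the script's own helper is absent, and it does not reproduce that helper's retry or timeout behavior.

```shell
# Stand-in for the script's dnf_install helper so the sketch runs on its own.
# Assumption: the first three arguments are numeric settings, the rest are packages.
if ! command -v dnf_install >/dev/null 2>&1; then
    dnf_install() { shift 3; dnf install -y "$@"; }
fi

# Hypothetical wrapper: skip the install when the package is already present.
#   $1 = package name to query, $2 = package name, path, or URL to install
installIfMissing() {
    local package_name="$1"
    local package_source="$2"
    # rpm -q exits 0 only when the named package is installed.
    if rpm -q "${package_name}" >/dev/null 2>&1; then
        echo "${package_name} already installed, skipping"
        return 0
    fi
    dnf_install 5 10 600 "${package_source}"
}
```

With this shape a rerun of the provisioning script returns early instead of failing on an already-installed package.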

Comment on lines +122 to +137
downloadGridDrivers() {
# Converged GPU sizes (NVads_A10_v5, NCads_A10_v4) require NVIDIA GRID (vGPU guest)
# drivers instead of CUDA drivers. These sizes use a "converged" driver to support
# both CUDA and GRID workloads — installing vanilla CUDA drivers will fail.
#
# TODO(mitchzhu): GRID driver RPM is not yet available on PMC (packages.microsoft.com).
# Once published, replace with:
# GRID_PACKAGE=$(dnf repoquery -y --available "nvidia-vgpu-guest-driver*" | \
# grep -E "nvidia-vgpu-guest-driver-[0-9]+.*_${KERNEL_VERSION}" | sort -V | tail -n 1)
# dnf_install 30 1 600 ${GRID_PACKAGE}
local grid_rpm="nvidia-vgpu-guest-driver-570.195.03-1_${KERNEL_VERSION}.x86_64.rpm"
local grid_url="https://github.com/miz060/AgentBaker/releases/download/grid-driver-v570.195.03/nvidia-vgpu-guest-driver-570.195.03-1_6.6.121.1.1.azl3.x86_64.rpm"
echo "Installing GRID driver: ${grid_rpm}"
dnf_install 5 10 600 "${grid_url}" || exit $ERR_GPU_DRIVERS_INSTALL_TIMEOUT
}

Copilot AI Feb 27, 2026

downloadGridDrivers downloads and installs a kernel driver RPM directly from a personal GitHub release URL without any integrity verification or use of a trusted package repository. An attacker who compromises or controls that GitHub repository or tag could replace the RPM and gain arbitrary code execution as root on every node that runs this provisioning logic. Use a trusted package source (e.g., PMC/dnf) or at minimum verify a strong checksum or cryptographic signature of the RPM before installation, and avoid pinning to a mutable tag or user-owned repository for production driver distribution.
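
The checksum verification suggested above could be sketched as follows. `verifySha256` is a hypothetical helper, not something the PR or the script is known to contain, and the expected digest would have to come from a trusted source such as a value pinned in the repository (an assumption).

```shell
# Hypothetical helper: compare a file's SHA-256 digest with an expected value.
#   $1 = path to the downloaded file, $2 = expected hex digest
verifySha256() {
    local file_path="$1"
    local expected_sha256="$2"
    local actual_sha256
    if [ ! -f "${file_path}" ]; then
        echo "File not found: ${file_path}" >&2
        return 1
    fi
    # sha256sum prints "<digest>  <file>"; keep only the digest.
    actual_sha256=$(sha256sum "${file_path}" | awk '{print $1}')
    if [ "${actual_sha256}" != "${expected_sha256}" ]; then
        echo "SHA-256 mismatch for ${file_path}" >&2
        return 1
    fi
}
```

In a download-then-install flow, the check would run on the local file after the download and before the package is handed to the installer, so a replaced artifact fails provisioning instead of being installed.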
