feat: azurelinux add nvidia vgpu driver installation selection #7986
Waiting for nvidia-vgpu-guest-driver to land in PMC.
draft grid gpu driver selection logic for azurelinux signed-off-by: <mitchzhu@microsoft.com>
8895c0f to f869d83
Pull request overview
Adds NVIDIA GRID (vGPU guest) driver selection for converged A10 SKUs on Azure Linux/Mariner by branching GPU driver installation based on NVIDIA_GPU_DRIVER_TYPE, and extends ShellSpec coverage for the routing logic.
Changes:
- Add downloadGridDrivers() and route converged SKUs (NVIDIA_GPU_DRIVER_TYPE=grid) to GRID installation in downloadGPUDrivers().
- Add ShellSpec tests validating GRID vs CUDA vs CUDA-open routing behavior.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| parts/linux/cloud-init/artifacts/mariner/cse_install_mariner.sh | Introduces GRID driver install function and selects GRID vs CUDA driver flow based on NVIDIA_GPU_DRIVER_TYPE. |
| spec/parts/linux/cloud-init/artifacts/cse_install_mariner_spec.sh | Adds tests that validate the new selection/routing behavior without performing real downloads. |
    # TODO(mitchzhu): GRID driver RPM is not yet available on PMC (packages.microsoft.com).
    # Once published, replace with:
    #   GRID_PACKAGE=$(dnf repoquery -y --available "nvidia-vgpu-guest-driver*" | \
    #     grep -E "nvidia-vgpu-guest-driver-[0-9]+.*_${KERNEL_VERSION}" | sort -V | tail -n 1)
    #   dnf_install 30 1 600 ${GRID_PACKAGE}
    local grid_rpm="nvidia-vgpu-guest-driver-570.195.03-1_${KERNEL_VERSION}.x86_64.rpm"
    local grid_url="https://github.com/miz060/AgentBaker/releases/download/grid-driver-v570.195.03/nvidia-vgpu-guest-driver-570.195.03-1_6.6.121.1.1.azl3.x86_64.rpm"
    echo "Installing GRID driver: ${grid_rpm}"
    dnf_install 5 10 600 "${grid_url}" || exit $ERR_GPU_DRIVERS_INSTALL_TIMEOUT

downloadGridDrivers is downloading and installing an RPM from a personal GitHub release URL. This introduces an untracked external dependency (not in components.json / not from packages.microsoft.com or packages.aks.azure.com) and bypasses repo signature/renovation controls. Please publish the GRID RPM to an approved repo and source it via the normal package/repoquery + dnf_install flow (or add it to components.json if it must be fetched as an artifact), and include integrity verification consistent with other downloads.

Suggested change:

    - # TODO(mitchzhu): GRID driver RPM is not yet available on PMC (packages.microsoft.com).
    - # Once published, replace with:
    - #   GRID_PACKAGE=$(dnf repoquery -y --available "nvidia-vgpu-guest-driver*" | \
    - #     grep -E "nvidia-vgpu-guest-driver-[0-9]+.*_${KERNEL_VERSION}" | sort -V | tail -n 1)
    - #   dnf_install 30 1 600 ${GRID_PACKAGE}
    - local grid_rpm="nvidia-vgpu-guest-driver-570.195.03-1_${KERNEL_VERSION}.x86_64.rpm"
    - local grid_url="https://github.com/miz060/AgentBaker/releases/download/grid-driver-v570.195.03/nvidia-vgpu-guest-driver-570.195.03-1_6.6.121.1.1.azl3.x86_64.rpm"
    - echo "Installing GRID driver: ${grid_rpm}"
    - dnf_install 5 10 600 "${grid_url}" || exit $ERR_GPU_DRIVERS_INSTALL_TIMEOUT
    + # GRID driver RPM is published via approved repos; resolve it via repoquery
    + # to ensure we install a package that matches the current kernel version.
    + GRID_PACKAGE=$(dnf repoquery -y --available "nvidia-vgpu-guest-driver*" | \
    +   grep -E "^nvidia-vgpu-guest-driver-[0-9]+.*_${KERNEL_VERSION}" | sort -V | tail -n 1)
    + if [ -z "$GRID_PACKAGE" ]; then
    +   echo "No NVIDIA GRID package found for kernel ${KERNEL_VERSION}"
    +   exit $ERR_MISSING_CUDA_PACKAGE
    + fi
    + echo "Installing GRID driver: ${GRID_PACKAGE}"
    + dnf_install 30 1 600 ${GRID_PACKAGE} || exit $ERR_GPU_DRIVERS_INSTALL_TIMEOUT

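The grep | sort -V | tail selection in the suggestion above can be tried in isolation. This is only a sketch: select_grid_package is a hypothetical helper (not part of the script), and it additionally escapes the dots in the kernel version so they match literally, which the suggested snippet does not do.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of cse_install_mariner.sh.
# Reads candidate package names on stdin and prints the newest one built for
# the given kernel version. Prints nothing if no candidate matches.
select_grid_package() {
    local kernel_version="$1"
    # Escape dots so "6.6.121" does not also match e.g. "6x6x121".
    local escaped="${kernel_version//./\\.}"
    grep -E "^nvidia-vgpu-guest-driver-[0-9]+.*_${escaped}" | sort -V | tail -n 1
}
```

On a node, the input would presumably come from dnf repoquery as in the suggestion; the exact output format of that query is an assumption here, so the pattern may need adjusting against real output.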
    local grid_rpm="nvidia-vgpu-guest-driver-570.195.03-1_${KERNEL_VERSION}.x86_64.rpm"
    local grid_url="https://github.com/miz060/AgentBaker/releases/download/grid-driver-v570.195.03/nvidia-vgpu-guest-driver-570.195.03-1_6.6.121.1.1.azl3.x86_64.rpm"

grid_url hardcodes a specific kernel version (6.6.121.1.1.azl3), while grid_rpm is built from ${KERNEL_VERSION}. If the node kernel differs, the downloaded filename/content won’t match the expected package and installation will fail. Please derive the download URL/filename from ${KERNEL_VERSION} (or select the correct RPM via repoquery once available) so this works across kernel updates.

Suggested change:

    - local grid_rpm="nvidia-vgpu-guest-driver-570.195.03-1_${KERNEL_VERSION}.x86_64.rpm"
    - local grid_url="https://github.com/miz060/AgentBaker/releases/download/grid-driver-v570.195.03/nvidia-vgpu-guest-driver-570.195.03-1_6.6.121.1.1.azl3.x86_64.rpm"
    + local kernel_version="${KERNEL_VERSION:-$(uname -r | sed 's/-/./g')}"
    + local grid_rpm="nvidia-vgpu-guest-driver-570.195.03-1_${kernel_version}.x86_64.rpm"
    + local grid_url="https://github.com/miz060/AgentBaker/releases/download/grid-driver-v570.195.03/${grid_rpm}"

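A minimal sketch of deriving the file name from the kernel version in one place, so the name and the URL cannot drift apart. grid_rpm_name is a hypothetical helper; the uname -r normalization is copied from the suggestion above and assumes the package naming replaces dashes with dots.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of cse_install_mariner.sh.
# Builds the RPM file name from a driver version and a kernel version.
# If no kernel version is passed, falls back to the running kernel,
# normalized the same way as in the review suggestion.
grid_rpm_name() {
    local driver_version="$1"
    local kernel_version="${2:-$(uname -r | sed 's/-/./g')}"
    printf 'nvidia-vgpu-guest-driver-%s_%s.x86_64.rpm\n' \
        "${driver_version}" "${kernel_version}"
}

# Example: grid_rpm_name "570.195.03-1" "6.6.121.1.1.azl3"
# prints nvidia-vgpu-guest-driver-570.195.03-1_6.6.121.1.1.azl3.x86_64.rpm
```

Taking the kernel version as a parameter also makes the helper easy to test without depending on the host kernel.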
    # Once published, replace with:
    #   GRID_PACKAGE=$(dnf repoquery -y --available "nvidia-vgpu-guest-driver*" | \
    #     grep -E "nvidia-vgpu-guest-driver-[0-9]+.*_${KERNEL_VERSION}" | sort -V | tail -n 1)
    #   dnf_install 30 1 600 ${GRID_PACKAGE}

downloadGridDrivers relies on the caller to have set KERNEL_VERSION as a global variable. That hidden coupling makes the function fragile and easier to misuse. Please compute the kernel version inside downloadGridDrivers (or pass it in as a parameter) so the function is self-contained.

Suggested change:

    - #   dnf_install 30 1 600 ${GRID_PACKAGE}
    + #   dnf_install 30 1 600 ${GRID_PACKAGE}
    + local KERNEL_VERSION
    + KERNEL_VERSION=$(uname -r)

    echo "Installing GRID driver: ${grid_rpm}"
    dnf_install 5 10 600 "${grid_url}" || exit $ERR_GPU_DRIVERS_INSTALL_TIMEOUT

This path uses wget, but installDeps for Mariner/AzureLinux doesn’t install a wget package (and the rest of this script generally uses the existing curl-based retry helpers). Unless the base image guarantees wget is present, this will fail at runtime. Please switch to the existing curl helper(s) or ensure the required downloader is installed before this runs.

Suggested change:

    - echo "Installing GRID driver: ${grid_rpm}"
    - dnf_install 5 10 600 "${grid_url}" || exit $ERR_GPU_DRIVERS_INSTALL_TIMEOUT
    + local grid_local_path="/tmp/${grid_rpm}"
    + echo "Downloading GRID driver from ${grid_url} to ${grid_local_path}"
    + retrycmd_if_failure 5 10 600 curl -fSL "${grid_url}" -o "${grid_local_path}" || exit $ERR_GPU_DRIVERS_INSTALL_TIMEOUT
    + echo "Installing GRID driver: ${grid_local_path}"
    + dnf_install 5 10 600 "${grid_local_path}" || exit $ERR_GPU_DRIVERS_INSTALL_TIMEOUT

    local grid_url="https://github.com/miz060/AgentBaker/releases/download/grid-driver-v570.195.03/nvidia-vgpu-guest-driver-570.195.03-1_6.6.121.1.1.azl3.x86_64.rpm"
    echo "Installing GRID driver: ${grid_rpm}"
    dnf_install 5 10 600 "${grid_url}" || exit $ERR_GPU_DRIVERS_INSTALL_TIMEOUT
    }

Installing with rpm -ivh is not idempotent (it will fail if the package is already installed) and bypasses the dnf_install retry/timeout/error-handling conventions used elsewhere in this script. Consider using dnf_install from an approved repo, or make the RPM installation idempotent (e.g., upgrade semantics or a pre-check) to avoid failures on reruns.
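One way to read the "pre-check" option is sketched below. install_if_missing is a hypothetical wrapper, not an existing helper in this script; the query and install commands are injectable so the sketch runs without rpm or dnf, and on a node they would default to rpm -q and dnf install -y.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper, not part of cse_install_mariner.sh.
# Skips installation when the package is already present, so reruns succeed.
#   $1 package name
#   $2 query command   (assumed default: rpm -q, which exits 0 if installed)
#   $3 install command (assumed default: dnf install -y)
install_if_missing() {
    local package_name="$1"
    local query_cmd="${2:-rpm -q}"
    local install_cmd="${3:-dnf install -y}"
    # Commands are intentionally unquoted so "rpm -q" splits into words.
    if ${query_cmd} "${package_name}" >/dev/null 2>&1; then
        echo "${package_name} already installed, skipping"
        return 0
    fi
    ${install_cmd} "${package_name}"
}
```

In the real script the install step would more likely go through dnf_install to keep the retry and timeout conventions; that wiring is left out here.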
    downloadGridDrivers() {
        # Converged GPU sizes (NVads_A10_v5, NCads_A10_v4) require NVIDIA GRID (vGPU guest)
        # drivers instead of CUDA drivers. These sizes use a "converged" driver to support
        # both CUDA and GRID workloads; installing vanilla CUDA drivers will fail.
        #
        # TODO(mitchzhu): GRID driver RPM is not yet available on PMC (packages.microsoft.com).
        # Once published, replace with:
        #   GRID_PACKAGE=$(dnf repoquery -y --available "nvidia-vgpu-guest-driver*" | \
        #     grep -E "nvidia-vgpu-guest-driver-[0-9]+.*_${KERNEL_VERSION}" | sort -V | tail -n 1)
        #   dnf_install 30 1 600 ${GRID_PACKAGE}
        local grid_rpm="nvidia-vgpu-guest-driver-570.195.03-1_${KERNEL_VERSION}.x86_64.rpm"
        local grid_url="https://github.com/miz060/AgentBaker/releases/download/grid-driver-v570.195.03/nvidia-vgpu-guest-driver-570.195.03-1_6.6.121.1.1.azl3.x86_64.rpm"
        echo "Installing GRID driver: ${grid_rpm}"
        dnf_install 5 10 600 "${grid_url}" || exit $ERR_GPU_DRIVERS_INSTALL_TIMEOUT
    }

downloadGridDrivers downloads and installs a kernel driver RPM directly from a personal GitHub release URL without any integrity verification or use of a trusted package repository. An attacker who compromises or controls that GitHub repository or tag could replace the RPM and gain arbitrary code execution as root on every node that runs this provisioning logic. Use a trusted package source (e.g., PMC/dnf) or at minimum verify a strong checksum or cryptographic signature of the RPM before installation, and avoid pinning to a mutable tag or user-owned repository for production driver distribution.
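As a rough illustration of the "at minimum verify a strong checksum" fallback, a downloaded file could be checked before it is handed to the installer. verify_sha256 is a hypothetical helper, and where the expected digest would be pinned (for example alongside the version in a tracked config file) is an assumption, not something this PR defines.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of cse_install_mariner.sh.
# Succeeds only if the SHA-256 digest of the file equals the expected value.
verify_sha256() {
    local file_path="$1"
    local expected_sha256="$2"
    local actual_sha256
    # Fail early if the file is missing or unreadable.
    actual_sha256=$(sha256sum "${file_path}") || return 1
    # sha256sum prints "<digest>  <file>"; keep only the digest.
    actual_sha256="${actual_sha256%% *}"
    if [ "${actual_sha256}" != "${expected_sha256}" ]; then
        echo "checksum mismatch for ${file_path}" >&2
        return 1
    fi
}
```

A checksum only pins the bytes that were reviewed; it does not replace installing from a signed, trusted repository, which is the reviewer's primary recommendation.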
What this PR does / why we need it:
Converged GPU sizes (NVads_A10_v5, NCads_A10_v4) require NVIDIA GRID vGPU guest drivers instead of standard CUDA drivers. Previously, Azure Linux 3.0 had no GRID driver support and therefore could not support these sizes. This PR adds GRID driver installation logic for Azure Linux, routing converged sizes to the GRID driver path based on NVIDIA_GPU_DRIVER_TYPE while leaving the existing cuda/cuda-open selection unchanged for all other GPU SKUs.
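The routing described here can be pictured roughly as follows. This is a simplified sketch, not the code in the PR: route_gpu_driver_install and downloadCudaDrivers are hypothetical names, and both installers are stubbed so the snippet runs anywhere.

```shell
#!/usr/bin/env bash
# Stubs standing in for the real installers (assumption: the real functions
# perform the downloads; these only report which path was taken).
downloadGridDrivers() { echo "grid path"; }
downloadCudaDrivers() { echo "existing cuda/cuda-open path"; }  # hypothetical name

# Hypothetical dispatcher: only NVIDIA_GPU_DRIVER_TYPE=grid takes the new path;
# every other value (or an unset variable) keeps the existing flow.
route_gpu_driver_install() {
    if [ "${NVIDIA_GPU_DRIVER_TYPE:-}" = "grid" ]; then
        downloadGridDrivers
    else
        downloadCudaDrivers
    fi
}
```

Treating anything other than grid as the existing flow keeps the change additive for non-converged SKUs.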
Which issue(s) this PR fixes:
Fixes #
Validation: