feat: azurelinux add nvidia vgpu driver installation selection #7986
base: main
@@ -119,6 +119,22 @@ should_use_nvidia_open_drivers() {
     return 0
 }
 
+downloadGridDrivers() {
+    # Converged GPU sizes (NVads_A10_v5, NCads_A10_v4) require NVIDIA GRID (vGPU guest)
+    # drivers instead of CUDA drivers. These sizes use a "converged" driver to support
+    # both CUDA and GRID workloads — installing vanilla CUDA drivers will fail.
+    #
+    # TODO(mitchzhu): GRID driver RPM is not yet available on PMC (packages.microsoft.com).
+    # Once published, replace with:
+    #   GRID_PACKAGE=$(dnf repoquery -y --available "nvidia-vgpu-guest-driver*" | \
+    #     grep -E "nvidia-vgpu-guest-driver-[0-9]+.*_${KERNEL_VERSION}" | sort -V | tail -n 1)
+    #   dnf_install 30 1 600 ${GRID_PACKAGE}
+    local grid_rpm="nvidia-vgpu-guest-driver-570.195.03-1_${KERNEL_VERSION}.x86_64.rpm"
+    local grid_url="https://github.com/miz060/AgentBaker/releases/download/grid-driver-v570.195.03/nvidia-vgpu-guest-driver-570.195.03-1_6.6.121.1.1.azl3.x86_64.rpm"
Comment on lines +132 to +133
-    local grid_rpm="nvidia-vgpu-guest-driver-570.195.03-1_${KERNEL_VERSION}.x86_64.rpm"
-    local grid_url="https://github.com/miz060/AgentBaker/releases/download/grid-driver-v570.195.03/nvidia-vgpu-guest-driver-570.195.03-1_6.6.121.1.1.azl3.x86_64.rpm"
+    local kernel_version="${KERNEL_VERSION:-$(uname -r | sed 's/-/./g')}"
+    local grid_rpm="nvidia-vgpu-guest-driver-570.195.03-1_${kernel_version}.x86_64.rpm"
+    local grid_url="https://github.com/miz060/AgentBaker/releases/download/grid-driver-v570.195.03/${grid_rpm}"
Copilot AI, Feb 27, 2026
downloadGridDrivers is downloading and installing an RPM from a personal GitHub release URL. This introduces an untracked external dependency (not in components.json / not from packages.microsoft.com or packages.aks.azure.com) and bypasses repo signature/renovation controls. Please publish the GRID RPM to an approved repo and source it via the normal package/repoquery + dnf_install flow (or add it to components.json if it must be fetched as an artifact), and include integrity verification consistent with other downloads.
-    # TODO(mitchzhu): GRID driver RPM is not yet available on PMC (packages.microsoft.com).
-    # Once published, replace with:
-    #   GRID_PACKAGE=$(dnf repoquery -y --available "nvidia-vgpu-guest-driver*" | \
-    #     grep -E "nvidia-vgpu-guest-driver-[0-9]+.*_${KERNEL_VERSION}" | sort -V | tail -n 1)
-    #   dnf_install 30 1 600 ${GRID_PACKAGE}
-    local grid_rpm="nvidia-vgpu-guest-driver-570.195.03-1_${KERNEL_VERSION}.x86_64.rpm"
-    local grid_url="https://github.com/miz060/AgentBaker/releases/download/grid-driver-v570.195.03/nvidia-vgpu-guest-driver-570.195.03-1_6.6.121.1.1.azl3.x86_64.rpm"
-    echo "Installing GRID driver: ${grid_rpm}"
-    dnf_install 5 10 600 "${grid_url}" || exit $ERR_GPU_DRIVERS_INSTALL_TIMEOUT
+    # GRID driver RPM is published via approved repos; resolve it via repoquery
+    # to ensure we install a package that matches the current kernel version.
+    GRID_PACKAGE=$(dnf repoquery -y --available "nvidia-vgpu-guest-driver*" | \
+        grep -E "^nvidia-vgpu-guest-driver-[0-9]+.*_${KERNEL_VERSION}" | sort -V | tail -n 1)
+    if [ -z "$GRID_PACKAGE" ]; then
+        echo "No NVIDIA GRID package found for kernel ${KERNEL_VERSION}"
+        exit $ERR_MISSING_CUDA_PACKAGE
+    fi
+    echo "Installing GRID driver: ${GRID_PACKAGE}"
+    dnf_install 30 1 600 ${GRID_PACKAGE} || exit $ERR_GPU_DRIVERS_INSTALL_TIMEOUT
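The selection step in the suggestion above (filter by kernel, version-sort, take the last entry) can be tried offline. This is only a sketch: `pick_grid_package` is a hypothetical helper, the sample package list is made up rather than real `dnf repoquery` output, and it deliberately swaps the kernel match to `grep -F` so that the dots in the kernel version are matched literally instead of as regex wildcards.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the PR): read candidate package names on
# stdin and print the newest one built for the given kernel version.
pick_grid_package() {
    local kernel_version="$1"
    # First keep only versioned GRID driver packages, then match the kernel
    # suffix literally (-F), then version-sort and keep the highest entry.
    grep -E '^nvidia-vgpu-guest-driver-[0-9]' \
        | grep -F "_${kernel_version}" \
        | sort -V \
        | tail -n 1
}

# Made-up sample data standing in for `dnf repoquery --available` output.
sample_packages='nvidia-vgpu-guest-driver-550.90.07-1_6.6.121.1.1.azl3.x86_64
nvidia-vgpu-guest-driver-570.195.03-1_6.6.121.1.1.azl3.x86_64
nvidia-vgpu-guest-driver-570.195.03-1_6.6.117.1.1.azl3.x86_64'

selected="$(printf '%s\n' "${sample_packages}" | pick_grid_package "6.6.121.1.1.azl3")"
echo "selected: ${selected:-none}"
```

An empty result corresponds to the "No NVIDIA GRID package found" branch in the suggestion, so the caller should still check for it before installing.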
Copilot AI, Feb 27, 2026
This path uses wget, but installDeps for Mariner/AzureLinux doesn’t install a wget package (and the rest of this script generally uses the existing curl-based retry helpers). Unless the base image guarantees wget is present, this will fail at runtime. Please switch to the existing curl helper(s) or ensure the required downloader is installed before this runs.
-    echo "Installing GRID driver: ${grid_rpm}"
-    dnf_install 5 10 600 "${grid_url}" || exit $ERR_GPU_DRIVERS_INSTALL_TIMEOUT
+    local grid_local_path="/tmp/${grid_rpm}"
+    echo "Downloading GRID driver from ${grid_url} to ${grid_local_path}"
+    retrycmd_if_failure 5 10 600 curl -fSL "${grid_url}" -o "${grid_local_path}" || exit $ERR_GPU_DRIVERS_INSTALL_TIMEOUT
+    echo "Installing GRID driver: ${grid_local_path}"
+    dnf_install 5 10 600 "${grid_local_path}" || exit $ERR_GPU_DRIVERS_INSTALL_TIMEOUT
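If the choice of downloader stays open, one way to avoid the runtime failure described in this comment is to check for the tool before the download starts. A minimal sketch, assuming a hypothetical `require_command` guard that is not part of the existing script; it relies only on the `command -v` builtin.

```shell
#!/usr/bin/env bash
# Hypothetical guard (not part of the PR): succeed only if the named
# executable can be found, and say which one is missing otherwise.
require_command() {
    local tool_name="$1"
    if ! command -v "${tool_name}" >/dev/null 2>&1; then
        echo "required command not found: ${tool_name}" >&2
        return 1
    fi
}

# Example: decide up front whether a curl-based download path is usable.
if require_command curl; then
    echo "curl is available"
else
    echo "curl is missing; install it before attempting the download"
fi
```

Failing early like this keeps the error close to its cause, rather than surfacing as a generic install timeout after the retries are exhausted.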
Copilot AI, Feb 27, 2026
Installing with rpm -ivh is not idempotent (it will fail if the package is already installed) and bypasses the dnf_install retry/timeout/error-handling conventions used elsewhere in this script. Consider using dnf_install from an approved repo, or make the RPM installation idempotent (e.g., upgrade semantics or a pre-check) to avoid failures on reruns.
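The pre-check option mentioned above could look roughly like the following. It is a sketch under assumptions: `install_rpm_if_missing` is a hypothetical name, and it calls plain `dnf install -y` where the real script would presumably go through its own `dnf_install` retry helper.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the PR): install a local RPM only when the
# package is not already present, so reruns become a no-op instead of an error.
install_rpm_if_missing() {
    local package_name="$1"
    local rpm_path="$2"
    # `rpm -q` exits 0 when the named package is installed.
    if rpm -q "${package_name}" >/dev/null 2>&1; then
        echo "${package_name} is already installed; skipping"
        return 0
    fi
    # Assumption: plain dnf is used here for illustration only.
    dnf install -y "${rpm_path}"
}

# Usage (commented out because it needs root and a real RPM file):
# install_rpm_if_missing nvidia-vgpu-guest-driver "/tmp/${grid_rpm}"
```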
Copilot AI, Feb 27, 2026
downloadGridDrivers downloads and installs a kernel driver RPM directly from a personal GitHub release URL without any integrity verification or use of a trusted package repository. An attacker who compromises or controls that GitHub repository or tag could replace the RPM and gain arbitrary code execution as root on every node that runs this provisioning logic. Use a trusted package source (e.g., PMC/dnf) or at minimum verify a strong checksum or cryptographic signature of the RPM before installation, and avoid pinning to a mutable tag or user-owned repository for production driver distribution.
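As an illustration of the "at minimum" option in this comment, a checksum gate before installation might look like the sketch below. `verify_sha256` is a hypothetical helper, and the expected digest is computed locally only to keep the demo self-contained; in practice it would have to be pinned in a trusted place, such as the repository itself.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the PR): succeed only when the file's
# SHA-256 digest equals the expected value.
verify_sha256() {
    local file_path="$1"
    local expected_sha256="$2"
    local actual_sha256
    # Refuse to "verify" against an empty expectation.
    [ -n "${expected_sha256}" ] || return 1
    actual_sha256="$(sha256sum "${file_path}" 2>/dev/null | cut -d ' ' -f 1)"
    [ "${actual_sha256}" = "${expected_sha256}" ]
}

# Demo with a throwaway file; the digest is computed here only for the demo.
demo_file="$(mktemp)"
printf 'placeholder payload\n' > "${demo_file}"
demo_digest="$(sha256sum "${demo_file}" | cut -d ' ' -f 1)"

if verify_sha256 "${demo_file}" "${demo_digest}"; then
    echo "checksum ok"
else
    echo "checksum mismatch; refusing to install" >&2
fi
rm -f "${demo_file}"
```

A checksum only proves the file matches what was pinned; it does not replace the signature and provenance guarantees of a trusted package repository.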
downloadGridDrivers relies on the caller to have set KERNEL_VERSION as a global variable. That hidden coupling makes the function fragile and easier to misuse. Please compute the kernel version inside downloadGridDrivers (or pass it in as a parameter) so the function is self-contained.
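Both options in this comment (a parameter, or computing the value inside the function) can be combined in one small resolver. A sketch, assuming a hypothetical `resolve_kernel_version` helper; the hyphen-to-dot rewrite of `uname -r` is taken from the suggested change on lines +132 to +133 and is assumed to match the RPM naming.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the PR): resolve the kernel version from,
# in order, an explicit argument, the KERNEL_VERSION global, or the running kernel.
resolve_kernel_version() {
    local explicit_version="${1:-}"
    if [ -n "${explicit_version}" ]; then
        echo "${explicit_version}"
    elif [ -n "${KERNEL_VERSION:-}" ]; then
        echo "${KERNEL_VERSION}"
    else
        # Fallback from the suggested change: hyphens in `uname -r` become dots.
        uname -r | sed 's/-/./g'
    fi
}

# Example: an explicit argument wins over any global value.
echo "kernel version: $(resolve_kernel_version "6.6.121.1.1.azl3")"
```

Inside downloadGridDrivers, the result would be stored in a local variable so the function no longer depends on what the caller happened to export.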