34 changes: 30 additions & 4 deletions parts/linux/cloud-init/artifacts/mariner/cse_install_mariner.sh
@@ -119,6 +119,22 @@ should_use_nvidia_open_drivers() {
     return 0
 }
 
+downloadGridDrivers() {
+    # Converged GPU sizes (NVads_A10_v5, NCads_A10_v4) require NVIDIA GRID (vGPU guest)
+    # drivers instead of CUDA drivers. These sizes use a "converged" driver to support
+    # both CUDA and GRID workloads — installing vanilla CUDA drivers will fail.
+    #
+    # TODO(mitchzhu): GRID driver RPM is not yet available on PMC (packages.microsoft.com).
+    # Once published, replace with:
+    # GRID_PACKAGE=$(dnf repoquery -y --available "nvidia-vgpu-guest-driver*" | \
+    # grep -E "nvidia-vgpu-guest-driver-[0-9]+.*_${KERNEL_VERSION}" | sort -V | tail -n 1)
+    # dnf_install 30 1 600 ${GRID_PACKAGE}

Copilot AI Feb 27, 2026

downloadGridDrivers relies on the caller to have set KERNEL_VERSION as a global variable. That hidden coupling makes the function fragile and easier to misuse. Please compute the kernel version inside downloadGridDrivers (or pass it in as a parameter) so the function is self-contained.

Suggested change
-    # dnf_install 30 1 600 ${GRID_PACKAGE}
+    # dnf_install 30 1 600 ${GRID_PACKAGE}
+    local KERNEL_VERSION
+    KERNEL_VERSION=$(uname -r | sed 's/-/./g')
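
As a sketch of the pass-it-as-a-parameter option, the helpers below build the RPM file name from an explicit kernel release and fall back to `uname -r` only when none is given. The names `normalize_kernel_version` and `grid_rpm_name` are hypothetical and not part of the PR; the dash-to-dot conversion mirrors the `sed 's/-/./g'` step that downloadGPUDrivers already applies.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of the PR: convert a kernel release string
# into the dotted form used in the RPM file names (every "-" becomes ".").
normalize_kernel_version() {
    printf '%s\n' "$1" | sed 's/-/./g'
}

# Hypothetical helper, not part of the PR: build the GRID RPM file name.
#   $1 = driver version and release, $2 = kernel release (optional).
# When $2 is omitted, the running kernel reported by `uname -r` is used.
grid_rpm_name() {
    local driver_version="$1"
    local kernel_release="${2:-$(uname -r)}"
    local kernel_version
    kernel_version="$(normalize_kernel_version "${kernel_release}")"
    printf 'nvidia-vgpu-guest-driver-%s_%s.x86_64.rpm\n' \
        "${driver_version}" "${kernel_version}"
}

# Example call with explicit arguments, so no global state is involved.
grid_rpm_name "570.195.03-1" "6.6.121.1-1.azl3"
```

Taking the kernel release as an argument also makes the function easy to exercise in tests without mocking `uname`.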

+    local grid_rpm="nvidia-vgpu-guest-driver-570.195.03-1_${KERNEL_VERSION}.x86_64.rpm"
+    local grid_url="https://github.com/miz060/AgentBaker/releases/download/grid-driver-v570.195.03/nvidia-vgpu-guest-driver-570.195.03-1_6.6.121.1.1.azl3.x86_64.rpm"
Comment on lines +132 to +133

Copilot AI Feb 27, 2026

grid_url hardcodes a specific kernel version (6.6.121.1.1.azl3), while grid_rpm is built from ${KERNEL_VERSION}. If the node kernel differs, the downloaded filename/content won’t match the expected package and installation will fail. Please derive the download URL/filename from ${KERNEL_VERSION} (or select the correct RPM via repoquery once available) so this works across kernel updates.

Suggested change
-    local grid_rpm="nvidia-vgpu-guest-driver-570.195.03-1_${KERNEL_VERSION}.x86_64.rpm"
-    local grid_url="https://github.com/miz060/AgentBaker/releases/download/grid-driver-v570.195.03/nvidia-vgpu-guest-driver-570.195.03-1_6.6.121.1.1.azl3.x86_64.rpm"
+    local kernel_version="${KERNEL_VERSION:-$(uname -r | sed 's/-/./g')}"
+    local grid_rpm="nvidia-vgpu-guest-driver-570.195.03-1_${kernel_version}.x86_64.rpm"
+    local grid_url="https://github.com/miz060/AgentBaker/releases/download/grid-driver-v570.195.03/${grid_rpm}"

+    echo "Installing GRID driver: ${grid_rpm}"
+    dnf_install 5 10 600 "${grid_url}" || exit $ERR_GPU_DRIVERS_INSTALL_TIMEOUT
Comment on lines +127 to +135

Copilot AI Feb 27, 2026

downloadGridDrivers is downloading and installing an RPM from a personal GitHub release URL. This introduces an untracked external dependency (not in components.json / not from packages.microsoft.com or packages.aks.azure.com) and bypasses repo signature/renovation controls. Please publish the GRID RPM to an approved repo and source it via the normal package/repoquery + dnf_install flow (or add it to components.json if it must be fetched as an artifact), and include integrity verification consistent with other downloads.

Suggested change
-    # TODO(mitchzhu): GRID driver RPM is not yet available on PMC (packages.microsoft.com).
-    # Once published, replace with:
-    # GRID_PACKAGE=$(dnf repoquery -y --available "nvidia-vgpu-guest-driver*" | \
-    # grep -E "nvidia-vgpu-guest-driver-[0-9]+.*_${KERNEL_VERSION}" | sort -V | tail -n 1)
-    # dnf_install 30 1 600 ${GRID_PACKAGE}
-    local grid_rpm="nvidia-vgpu-guest-driver-570.195.03-1_${KERNEL_VERSION}.x86_64.rpm"
-    local grid_url="https://github.com/miz060/AgentBaker/releases/download/grid-driver-v570.195.03/nvidia-vgpu-guest-driver-570.195.03-1_6.6.121.1.1.azl3.x86_64.rpm"
-    echo "Installing GRID driver: ${grid_rpm}"
-    dnf_install 5 10 600 "${grid_url}" || exit $ERR_GPU_DRIVERS_INSTALL_TIMEOUT
+    # GRID driver RPM is published via approved repos; resolve it via repoquery
+    # to ensure we install a package that matches the current kernel version.
+    GRID_PACKAGE=$(dnf repoquery -y --available "nvidia-vgpu-guest-driver*" | \
+        grep -E "^nvidia-vgpu-guest-driver-[0-9]+.*_${KERNEL_VERSION}" | sort -V | tail -n 1)
+    if [ -z "$GRID_PACKAGE" ]; then
+        echo "No NVIDIA GRID package found for kernel ${KERNEL_VERSION}"
+        exit $ERR_MISSING_CUDA_PACKAGE
+    fi
+    echo "Installing GRID driver: ${GRID_PACKAGE}"
+    dnf_install 30 1 600 ${GRID_PACKAGE} || exit $ERR_GPU_DRIVERS_INSTALL_TIMEOUT
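
The selection step in this suggestion can be exercised without a package repository by feeding package names on stdin. A minimal sketch, assuming a hypothetical helper `select_grid_package` that is not part of the PR. Unlike the grep above, it escapes the dots in the kernel version so they match literally.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of the PR: read package names on stdin and
# print the newest nvidia-vgpu-guest-driver build for the given kernel version.
select_grid_package() {
    local kernel_version="$1"
    local escaped_version
    # Escape each "." so it matches a literal dot rather than any character.
    escaped_version="$(printf '%s' "${kernel_version}" | sed 's/\./\\./g')"
    # Keep only GRID packages built for this kernel, sort them by version,
    # and print the last (highest) one. Prints nothing when none match.
    grep -E "^nvidia-vgpu-guest-driver-[0-9]+.*_${escaped_version}" \
        | sort -V | tail -n 1
}
```

In the real flow the input would come from `dnf repoquery -y --available "nvidia-vgpu-guest-driver*"`, and an empty result would lead to the `exit $ERR_MISSING_CUDA_PACKAGE` branch shown in the suggestion.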

Comment on lines +134 to +135

Copilot AI Feb 27, 2026

This path uses wget, but installDeps for Mariner/AzureLinux doesn’t install a wget package (and the rest of this script generally uses the existing curl-based retry helpers). Unless the base image guarantees wget is present, this will fail at runtime. Please switch to the existing curl helper(s) or ensure the required downloader is installed before this runs.

Suggested change
-    echo "Installing GRID driver: ${grid_rpm}"
-    dnf_install 5 10 600 "${grid_url}" || exit $ERR_GPU_DRIVERS_INSTALL_TIMEOUT
+    local grid_local_path="/tmp/${grid_rpm}"
+    echo "Downloading GRID driver from ${grid_url} to ${grid_local_path}"
+    retrycmd_if_failure 5 10 600 curl -fSL "${grid_url}" -o "${grid_local_path}" || exit $ERR_GPU_DRIVERS_INSTALL_TIMEOUT
+    echo "Installing GRID driver: ${grid_local_path}"
+    dnf_install 5 10 600 "${grid_local_path}" || exit $ERR_GPU_DRIVERS_INSTALL_TIMEOUT

+}

Copilot AI Feb 27, 2026

Installing with rpm -ivh is not idempotent (it will fail if the package is already installed) and bypasses the dnf_install retry/timeout/error-handling conventions used elsewhere in this script. Consider using dnf_install from an approved repo, or make the RPM installation idempotent (e.g., upgrade semantics or a pre-check) to avoid failures on reruns.
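
One way to get the pre-check behaviour described here is to query the RPM database first and install only when the package is absent. A minimal sketch, assuming a hypothetical wrapper `install_rpm_if_missing` that is not part of the PR; it does not add the retry or timeout handling that `dnf_install` provides.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper, not part of the PR: skip installation when the
# package is already present, so that rerunning the script does not fail.
#   $1 = package name as known to the RPM database, $2 = path to the RPM file.
install_rpm_if_missing() {
    local package_name="$1"
    local rpm_path="$2"
    # `rpm -q NAME` exits 0 only when a package with that name is installed.
    if rpm -q "${package_name}" >/dev/null 2>&1; then
        echo "${package_name} already installed, skipping"
        return 0
    fi
    # -U installs the package, or upgrades it if an older version is present.
    rpm -Uvh "${rpm_path}"
}
```

The wrapper only defines the check; the package name and file path passed to it would be whatever the provisioning script resolves at runtime.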


Comment on lines +122 to +137

Copilot AI Feb 27, 2026

downloadGridDrivers downloads and installs a kernel driver RPM directly from a personal GitHub release URL without any integrity verification or use of a trusted package repository. An attacker who compromises or controls that GitHub repository or tag could replace the RPM and gain arbitrary code execution as root on every node that runs this provisioning logic. Use a trusted package source (e.g., PMC/dnf) or at minimum verify a strong checksum or cryptographic signature of the RPM before installation, and avoid pinning to a mutable tag or user-owned repository for production driver distribution.
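
The checksum option could look like the sketch below: compute the SHA-256 digest of the downloaded file and refuse to continue unless it matches a pinned value. `verify_sha256` is a hypothetical helper that is not part of the PR, and the pinned digest would have to come from a trusted source other than the download location; none is given in the PR.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of the PR: succeed only when the SHA-256
# digest of the file equals the expected lowercase hex string.
#   $1 = path to the downloaded file, $2 = expected digest.
verify_sha256() {
    local file_path="$1"
    local expected_sha256="$2"
    local actual_sha256
    # Reject an empty expectation so that a missing pin can never pass.
    [ -n "${expected_sha256}" ] || return 1
    # sha256sum prints "<digest>  <path>"; keep only the digest field.
    actual_sha256="$(sha256sum "${file_path}" 2>/dev/null | cut -d ' ' -f 1)"
    [ "${actual_sha256}" = "${expected_sha256}" ]
}
```

A caller would run this between the download and the install step and exit on a non-zero status. It complements, but does not replace, installing from a signed package repository.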

 downloadGPUDrivers() {
     # Mariner CUDA rpm name comes in the following format:
     #
@@ -128,15 +144,25 @@ downloadGPUDrivers() {
     # 2. NVIDIA OpenRM driver:
     # cuda-open-%{nvidia gpu driver version}_%{kernel source version}.%{kernel release version}.{mariner rpm postfix}
     #
-    # Legacy GPUs (T4, V100) require proprietary drivers; A100+ use NVIDIA open drivers.
-    # VM SKU is retrieved from IMDS to determine which driver to use.
+    # 3. NVIDIA GRID (vGPU guest) driver for converged GPU sizes:
+    # nvidia-vgpu-guest-driver-%{version}_%{kernel version}.{mariner rpm postfix}
+    #
+    # NVIDIA_GPU_DRIVER_TYPE is set by AgentBaker based on ConvergedGPUDriverSizes map
+    # in gpu_components.go. Converged sizes get "grid"; all others get "cuda".
+    # Legacy GPUs (T4, V100) require proprietary CUDA drivers; A100+ use NVIDIA open drivers.
     KERNEL_VERSION=$(uname -r | sed 's/-/./g')
+    VM_SKU=$(get_compute_sku)
+
+    # Converged GPU sizes use GRID drivers instead of CUDA drivers
+    if [ "$NVIDIA_GPU_DRIVER_TYPE" = "grid" ]; then
+        echo "VM SKU ${VM_SKU} uses NVIDIA GRID driver (converged)"
+        downloadGridDrivers
+        return
+    fi
 
     local driver_ret
     should_use_nvidia_open_drivers
     driver_ret=$?
-    # Get VM SKU for logging (export already done by should_use_nvidia_open_drivers)
-    VM_SKU=$(get_compute_sku)
     if [ "$driver_ret" -eq 2 ]; then
         echo "Failed to determine GPU driver type"
         exit $ERR_MISSING_CUDA_PACKAGE
@@ -186,4 +186,81 @@ Describe 'cse_install_mariner.sh'
             The status should equal 0
         End
     End
+
+    Describe 'downloadGPUDrivers grid vs cuda selection'
+        # Tests the routing logic in downloadGPUDrivers():
+        # NVIDIA_GPU_DRIVER_TYPE="grid" → downloadGridDrivers (converged A10 sizes)
+        # NVIDIA_GPU_DRIVER_TYPE="cuda" → cuda/cuda-open path (all other GPU sizes)
+        #
+        # We mock downloadGridDrivers and the cuda download path to isolate
+        # the selection logic without triggering actual downloads or exits.
+
+        MOCK_VM_SKU=""
+        get_compute_sku() { echo "$MOCK_VM_SKU"; }
+
+        # Track which path was taken
+        GRID_CALLED=""
+        downloadGridDrivers() { GRID_CALLED="true"; }
+
+        # Mock should_use_nvidia_open_drivers to avoid IMDS dependency
+        MOCK_OPEN_RET=0
+        should_use_nvidia_open_drivers() { return "$MOCK_OPEN_RET"; }
+
+        # Mock uname to return a kernel version matching our fake package
+        uname() { echo "6.6.121.1-1.azl3"; }
+
+        # Mock dnf repoquery to return fake packages matching both cuda and cuda-open patterns
+        dnf() {
+            echo "cuda-open-570.195.03-1_6.6.121.1.1.azl3.x86_64"
+            echo "cuda-570.195.03-1_6.6.121.1.1.azl3.x86_64"
+        }
+
+        It 'selects GRID driver path when NVIDIA_GPU_DRIVER_TYPE is grid'
+            NVIDIA_GPU_DRIVER_TYPE="grid"
+            MOCK_VM_SKU="Standard_NV36ads_A10_v5"
+            GRID_CALLED=""
+            When call downloadGPUDrivers
+            The output should include "NVIDIA GRID driver (converged)"
+            The variable GRID_CALLED should equal "true"
+        End
+
+        It 'selects GRID driver path for NCads_A10_v4 converged size'
+            NVIDIA_GPU_DRIVER_TYPE="grid"
+            MOCK_VM_SKU="Standard_NC8ads_A10_v4"
+            GRID_CALLED=""
+            When call downloadGPUDrivers
+            The output should include "NVIDIA GRID driver (converged)"
+            The variable GRID_CALLED should equal "true"
+        End
+
+        It 'selects cuda-open path for A100 when NVIDIA_GPU_DRIVER_TYPE is cuda'
+            NVIDIA_GPU_DRIVER_TYPE="cuda"
+            MOCK_VM_SKU="Standard_ND96asr_v4"
+            MOCK_OPEN_RET=0
+            GRID_CALLED=""
+            When call downloadGPUDrivers
+            The output should include "NVIDIA OpenRM driver (cuda-open)"
+            The variable GRID_CALLED should not equal "true"
+        End
+
+        It 'selects proprietary cuda path for T4 when NVIDIA_GPU_DRIVER_TYPE is cuda'
+            NVIDIA_GPU_DRIVER_TYPE="cuda"
+            MOCK_VM_SKU="Standard_NC4as_T4_v3"
+            MOCK_OPEN_RET=1
+            GRID_CALLED=""
+            When call downloadGPUDrivers
+            The output should include "NVIDIA proprietary driver (cuda)"
+            The variable GRID_CALLED should not equal "true"
+        End
+
+        It 'does not select GRID path when NVIDIA_GPU_DRIVER_TYPE is empty'
+            NVIDIA_GPU_DRIVER_TYPE=""
+            MOCK_VM_SKU="Standard_ND96asr_v4"
+            MOCK_OPEN_RET=0
+            GRID_CALLED=""
+            When call downloadGPUDrivers
+            The output should not include "NVIDIA GRID driver"
+            The variable GRID_CALLED should not equal "true"
+        End
+    End
 End
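
The routing these tests exercise can be summarised in a few lines of plain shell. This is a simplified stand-in rather than the PR's function: `select_driver_path` is a hypothetical name, and its second argument stands for the return code of should_use_nvidia_open_drivers (0 for open drivers, 1 for proprietary drivers, 2 when the type cannot be determined).

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for the selection logic in downloadGPUDrivers; it only
# prints which driver family would be chosen and performs no installation.
#   $1 = value of NVIDIA_GPU_DRIVER_TYPE ("grid", "cuda", or empty)
#   $2 = return code of should_use_nvidia_open_drivers (0, 1, or 2)
select_driver_path() {
    local driver_type="$1"
    local open_ret="$2"
    # Converged sizes short-circuit to GRID before any CUDA decision is made.
    if [ "${driver_type}" = "grid" ]; then
        echo "grid"
        return 0
    fi
    case "${open_ret}" in
        0) echo "cuda-open" ;;          # open (OpenRM) drivers
        1) echo "cuda" ;;               # proprietary drivers
        *) echo "unknown"; return 1 ;;  # driver type could not be determined
    esac
}
```

Each `It` block above corresponds to one combination of the two inputs; an empty driver type falls through to the CUDA branch, which is what the last test asserts.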