From 0dca7e54f05dae2d9123242cd5e5c87439ae8b58 Mon Sep 17 00:00:00 2001 From: Abigail McCarthy <20771501+a-mccarthy@users.noreply.github.com> Date: Wed, 25 Feb 2026 13:39:46 -0500 Subject: [PATCH 1/3] Add docs for 26.3.0 release Signed-off-by: Abigail McCarthy <20771501+a-mccarthy@users.noreply.github.com> --- gpu-operator/cdi.rst | 99 +++++++- gpu-operator/conf.py | 226 ------------------ gpu-operator/getting-started.rst | 21 +- gpu-operator/index.rst | 2 +- .../install-gpu-operator-gov-ready.rst | 18 +- gpu-operator/life-cycle-policy.rst | 49 ++-- .../manifests/output/nri-get-pods-restart.txt | 12 + gpu-operator/platform-support.rst | 103 +++++--- gpu-operator/release-notes.rst | 53 ++++ gpu-operator/versions.json | 5 +- gpu-operator/versions1.json | 4 + 11 files changed, 274 insertions(+), 318 deletions(-) delete mode 100644 gpu-operator/conf.py create mode 100644 gpu-operator/manifests/output/nri-get-pods-restart.txt diff --git a/gpu-operator/cdi.rst b/gpu-operator/cdi.rst index 5c8a9522f..d47e5591f 100644 --- a/gpu-operator/cdi.rst +++ b/gpu-operator/cdi.rst @@ -16,13 +16,15 @@ .. headings # #, * *, =, -, ^, " -############################################################ -Container Device Interface (CDI) Support in the GPU Operator -############################################################ +################################################################################# +Container Device Interface (CDI) and Node Resource Interface (NRI) Plugin Support +################################################################################# -************************************ -About the Container Device Interface -************************************ +This page gives an overview of CDI and NRI Plugin support in the GPU Operator. 
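For orientation before the sections that follow, a CDI specification is a small JSON or YAML file that describes a device and the edits a runtime must make to expose it to a container. The sketch below is purely illustrative (the output directory, device name, and device path are placeholders); on a real node, the NVIDIA Container Toolkit generates these specs for you.

```shell
# Write a minimal, hypothetical CDI spec. Field names (cdiVersion, kind,
# devices, containerEdits, deviceNodes) come from the CDI specification;
# the device name and path here are placeholders, not generated values.
mkdir -p ./cdi-demo
cat > ./cdi-demo/nvidia.yaml <<'EOF'
cdiVersion: "0.6.0"
kind: nvidia.com/gpu
devices:
  - name: "0"
    containerEdits:
      deviceNodes:
        - path: /dev/nvidia0
EOF
# A CDI-enabled runtime resolves a device request such as "nvidia.com/gpu=0"
# against specs like this one:
grep -q 'kind: nvidia.com/gpu' ./cdi-demo/nvidia.yaml && echo "spec kind OK"
```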
+ +************************************** +About Container Device Interface (CDI) +************************************** The `Container Device Interface (CDI) `_ is an open specification for container runtimes that abstracts what access to a device, such as an NVIDIA GPU, means, @@ -31,23 +33,22 @@ ensure that a device is available in a container. CDI simplifies adding support the specification is applicable to all container runtimes that support CDI. Starting with GPU Operator v25.10.0, CDI is used by default for enabling GPU support in containers running on Kubernetes. -Specifically, CDI support in container runtimes, e.g. containerd and cri-o, is used to inject GPU(s) into workload +Specifically, CDI support in container runtimes, like containerd and cri-o, is used to inject GPU(s) into workload containers. This differs from prior GPU Operator releases where CDI was used via a CDI-enabled ``nvidia`` runtime class. Use of CDI is transparent to cluster administrators and application developers. The benefits of CDI are largely to reduce development and support for runtime-specific plugins. -******************************** -Enabling CDI During Installation -******************************** +************ +Enabling CDI +************ CDI is enabled by default during installation in GPU Operator v25.10.0 and later. Follow the instructions for installing the Operator with Helm on the :doc:`getting-started` page. CDI is also enabled by default during a Helm upgrade to GPU Operator v25.10.0 and later. -******************************* Enabling CDI After Installation ******************************* @@ -125,3 +126,79 @@ disable CDI and use the legacy NVIDIA Container Toolkit stack instead with the f nvidia.com/gpu.deploy.operator-validator=true \ nvidia.com/gpu.present=true \ --overwrite + + +.. 
_nri-plugin:
+
+**********************************************
+About the Node Resource Interface (NRI) Plugin
+**********************************************
+
+Node Resource Interface (NRI) is a standardized interface for plugging extensions, called NRI Plugins, into OCI-compatible container runtimes such as CRI-O and containerd.
+NRI Plugins serve as hooks that intercept pod and container lifecycle events and perform functions such as injecting devices (CDI devices, Linux device nodes, device mounts) into a container, applying topology-aware placement strategies, and more.
+For more details on NRI, refer to the `NRI overview `_ in the containerd repository.
+
+When enabled in the GPU Operator, the NRI Plugin, managed by the NVIDIA Container Toolkit, provides an alternative to the ``nvidia`` runtime class for provisioning GPU workload pods.
+It allows the GPU Operator to extend container runtime behavior without modifying the container runtime itself.
+This feature also simplifies deployments on platforms like k3s, k0s, or RKE, because the GPU Operator no longer requires setting values such as ``CONTAINERD_CONFIG``, ``CONTAINERD_SOCKET``, or ``RUNTIME_CONFIG_SOURCE``.
+
+***********************
+Enabling the NRI Plugin
+***********************
+
+The NRI Plugin requires the following:
+
+- CDI to be enabled in the GPU Operator.
+
+- CRI-O v1.34.0 or later, or containerd v1.7.30, v2.1.x, or v2.2.x.
+  If you are not using the latest containerd version, check that both CDI and NRI are enabled in the containerd configuration file before deploying the GPU Operator.
+
+To enable the NRI Plugin during installation, follow the instructions for installing the Operator with Helm on the :doc:`getting-started` page and include the ``--set cdi.nriPluginEnabled=true`` argument in your Helm command.
+
+Enabling the NRI Plugin After Installation
+******************************************
+
+#. Enable the NRI Plugin by modifying the cluster policy:
+
+   .. 
code-block:: console
+
+      $ kubectl patch clusterpolicies.nvidia.com/cluster-policy --type='json' \
+          -p='[{"op": "replace", "path": "/spec/cdi/nriPluginEnabled", "value":true}]'
+
+   *Example Output*
+
+   .. code-block:: output
+
+      clusterpolicy.nvidia.com/cluster-policy patched
+
+#. (Optional) Confirm that the container toolkit and device plugin pods restart:
+
+   .. code-block:: console
+
+      $ kubectl get pods -n gpu-operator
+
+   *Example Output*
+
+   .. literalinclude:: ./manifests/output/nri-get-pods-restart.txt
+      :language: output
+      :emphasize-lines: 6,9
+
+
+************************
+Disabling the NRI Plugin
+************************
+
+Disable the NRI Plugin and use the ``nvidia`` runtime class instead by modifying the cluster policy:
+
+.. code-block:: console
+
+   $ kubectl patch clusterpolicies.nvidia.com/cluster-policy --type='json' \
+       -p='[{"op": "replace", "path": "/spec/cdi/nriPluginEnabled", "value":false}]'
+
+*Example Output*
+
+.. 
code-block:: output + + clusterpolicy.nvidia.com/cluster-policy patched diff --git a/gpu-operator/conf.py b/gpu-operator/conf.py deleted file mode 100644 index 464c78557..000000000 --- a/gpu-operator/conf.py +++ /dev/null @@ -1,226 +0,0 @@ - -import sphinx -import os -import logging -import sys -from string import Template - -logger = logging.getLogger(__name__) - -sys.path += [ - "/work/_repo/deps/repo_docs/omni/repo/docs/include", -] - - -project = "NVIDIA GPU Operator" - -copyright = "2020-2026, NVIDIA Corporation" -author = "NVIDIA Corporation" - -release = "25.10" -root_doc = "index" - -extensions = [ - "sphinx.ext.autodoc", # include documentation from docstrings - "sphinx.ext.ifconfig", # conditional include of text - "sphinx.ext.napoleon", # support for NumPy and Google style docstrings - "sphinx.ext.intersphinx", # link to other projects' documentation - "sphinx.ext.extlinks", # add roles to shorten external links - "myst_parser", # markdown parsing - "sphinxcontrib.mermaid", # create diagrams using text and code - "sphinxcontrib.youtube", # adds youtube:: directive - "sphinxemoji.sphinxemoji", # adds emoji substitutions (e.g. 
|:fire:|) - "sphinx_design", - "repo_docs.ext.inline_only", - "repo_docs.ext.toctree", - "repo_docs.ext.mdinclude", - "repo_docs.ext.include_patch", - "repo_docs.ext.youtube", - "repo_docs.ext.ifconfig", - "repo_docs.ext.source_substitutions", - "repo_docs.ext.mermaid", - "repo_docs.ext.exhale_file_fix", - "repo_docs.ext.output_format_text", - "repo_docs.ext.output_format_latex", - "repo_docs.ext.include_licenses", - "repo_docs.ext.add_templates", - "repo_docs.ext.breadcrumbs", - "repo_docs.ext.metadata", - "repo_docs.ext.confval", - "repo_docs.ext.customize_layout", - "repo_docs.ext.cpp_xrefs", -] - -# automatically add section level labels, up to level 4 -myst_heading_anchors = 4 - - -# configure sphinxcontrib.mermaid as we inject mermaid manually on pages that need it -mermaid_init_js = "" -mermaid_version= "" - - -intersphinx_mapping = {} -exclude_patterns = [ - ".git", - "Thumbs.db", - ".DS_Store", - ".pytest_cache", - "_repo", - "README.md", - "life-cycle-policy.rst", - "_build/docs/secure-services-istio-keycloak", - "_build/docs/openshift", - "_build/docs/gpu-telemetry", - "_build/docs/container-toolkit", - "_build/docs/review", - "_build/docs/partner-validated", - "_build/docs/driver-containers", - "_build/docs/sphinx_warnings.txt", - "_build/docs/kubernetes", - "_build/docs/tmp", - "_build/docs/dra-driver", - "_build/docs/edge", - "_build/docs/gpu-operator/24.9.1", - "_build/docs/gpu-operator/24.12.0", - "_build/docs/gpu-operator/25.3.4", - "_build/docs/gpu-operator/25.3.1", - "_build/docs/gpu-operator/24.9.2", - "_build/docs/gpu-operator/version1.json", - "_build/docs/gpu-operator/24.9", - "_build/docs/gpu-operator/25.3.0", - "_build/docs/gpu-operator/25.3", - "_build/docs/gpu-operator/25.10", -] - -html_theme = "sphinx_rtd_theme" - -html_logo = "/work/assets/nvidia-logo-white.png" -html_favicon = "/work/assets/favicon.ico" - -# If true, links to the reST sources are added to the pages. 
-html_show_sourcelink = False - -html_additional_search_indices = [] - -# If true, the raw source is copied which might be a problem if content is removed with `ifconfig` -html_copy_source = False - -# If true, "Created using Sphinx" is shown in the HTML footer. Default is True. -html_show_sphinx = False - -# Add any paths that contain custom static files (such as style sheets) here, -# relative to this directory. They are copied after the builtin static files, -# so a file named "default.css" will overwrite the builtin "default.css". -html_static_path = [ - "/work/_repo/deps/repo_docs/media", -] - -html_last_updated_fmt = "" - -# https://sphinx-rtd-theme.readthedocs.io/en/stable/configuring.html -html_theme_options = { - "logo_only": True, - "prev_next_buttons_location": None, # our docs aren't a novel... - "navigation_depth": 10, -} - -html_extra_content_head = [' \n '] -html_extra_content_footer = [' \n '] -html_logo_target_url = "" - -html_breadcrumbs_home_url = "" -html_extra_breadcrumbs = [] - -html_css_files = [ - "omni-style.css", - "api-styles.css", -] - -html_js_files = [ - "version.js", - "social-media.js", -] - -# literal blocks default to c++ (useful for Doxygen \code blocks) -highlight_language = 'c++' - - -# add additional tags - - - -source_substitutions = {'minor_version': '25.10', 'version': 'v25.10.1', 'recommended': '580.105.08', 'dra_version': '25.12.0'} -source_substitutions.update({ - 'repo_docs_config': 'debug', - 'repo_docs_platform_target': 'linux-x86_64', - 'repo_docs_platform': 'linux-x86_64', - 'repo_docs_dash_build': '', - 'repo_docs_project': 'gpu-operator', - 'repo_docs_version': '25.10', - 'repo_docs_copyright': '2020-2026, NVIDIA Corporation', - # note: the leading '/' means this is relative to the docs_root (the source directory) - 'repo_docs_api_path': '/../_build/docs/gpu-operator/latest', -}) - -# add global metadata for all built pages -metadata_global = {} - -sphinx_event_handlers = [] -myst_enable_extensions = [ - 
"colon_fence", "dollarmath", -] -templates_path = ['/work/templates'] -extensions.extend([ - "linuxdoc.rstFlatTable", - "sphinx.ext.autosectionlabel", - "sphinx_copybutton", - "sphinx_design", -]) -suppress_warnings = [ 'autosectionlabel.*' ] -pygments_style = 'sphinx' -copybutton_exclude = '.linenos, .gp' - -html_theme = "nvidia_sphinx_theme" -html_copy_source = False -html_show_sourcelink = False -html_show_sphinx = False - -html_domain_indices = False -html_use_index = False -html_extra_path = ["versions1.json"] -html_static_path = ["/work/css"] -html_css_files = ["custom.css"] - -html_theme_options = { - "icon_links": [], - "switcher": { - "json_url": "../versions1.json", - "version_match": release, - }, -} - -highlight_language = 'console' - -intersphinx_mapping = { - "dcgm": ("https://docs.nvidia.com/datacenter/dcgm/latest/", "../work/dcgm-offline.inv"), - "gpuop": ("https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/", - ("_build/docs/gpu-operator/latest/objects.inv", None)), - "ctk": ("https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/", - ("_build/docs/container-toolkit/latest/objects.inv", None)), - "drv": ("https://docs.nvidia.com/datacenter/cloud-native/driver-containers/latest/", - ("_build/docs/driver-containers/latest/objects.inv", None)), - "ocp": ("https://docs.nvidia.com/datacenter/cloud-native/openshift/latest/", - ("_build/docs/openshift/latest/objects.inv", None)), - "edge": ("https://docs.nvidia.com/datacenter/cloud-native/edge/latest/", - ("_build/docs/edge/latest/objects.inv", None)), -} -rst_epilog = ".. |gitlab_mr_url| replace:: Sorry Charlie...not a merge request." -if os.environ.get("CI_MERGE_REQUEST_IID") is not None: - rst_epilog = ".. 
|gitlab_mr_url| replace:: {}/-/merge_requests/{}".format( - os.environ["CI_MERGE_REQUEST_PROJECT_URL"], os.environ["CI_MERGE_REQUEST_IID"]) - -def setup(app): - app.add_config_value('build_name', 'public', 'env') - for (event, handler) in sphinx_event_handlers: - app.connect(event, handler) diff --git a/gpu-operator/getting-started.rst b/gpu-operator/getting-started.rst index 86d59c069..ee5684475 100644 --- a/gpu-operator/getting-started.rst +++ b/gpu-operator/getting-started.rst @@ -142,13 +142,19 @@ To view all the options, run ``helm show values nvidia/gpu-operator``. * - ``cdi.enabled`` - When set to ``true`` (default), the Container Device Interface (CDI) will be used for - injecting GPUs into workload containers. The Operator will no longer configure the `nvidia` - runtime class as the default runtime handler. Instead, native-CDI support in container runtimes - like containerd or cri-o will be leveraged for injecting GPUs into workload containers. - Using CDI aligns the Operator with the recent efforts to standardize how complex devices like GPUs - are exposed to containerized environments. + injecting GPUs into workload containers. + The Operator will no longer configure the ``nvidia`` runtime class as the default runtime handler. + Instead, native-CDI support in container runtimes like containerd or cri-o will be leveraged for injecting GPUs into workload containers. + Refer to the :doc:`cdi` page for more information. - ``true`` + * - ``cdi.nriPluginEnabled`` + - When set to ``true``, the Node Resource Interface (NRI) Plugin will be used for injecting GPUs into workload containers. + In NRI Plugin mode, the NVIDIA Container Toolkit will no longer modify the runtime config. + This feature requires CRI-O v1.34.0 or later or containerd v1.7.30, v2.1.x, or v2.2.x. + Refer to the :doc:`cdi` page for more information. + - ``false`` + * - ``cdi.default`` Deprecated. - This field is deprecated as of v25.10.0 and will be ignored. 
The ``cdi.enabled`` field is set to ``true`` by default in versions 25.10.0 and later. @@ -509,6 +515,11 @@ If you need to specify custom values, refer to the following sample command for --set toolkit.env[2].name=RUNTIME_CONFIG_SOURCE \ --set toolkit.env[2].value="command,file" +.. note:: + + If you are using the NRI Plugin with CDI, you do not need to specify the ``toolkit.env`` options. This will be done automatically by the NRI Plugin. + Refer to the :ref:`NRI Plugin ` documentation, for more information on the feature + These options are defined as follows: CONTAINERD_CONFIG diff --git a/gpu-operator/index.rst b/gpu-operator/index.rst index afa96c50b..640f763d0 100644 --- a/gpu-operator/index.rst +++ b/gpu-operator/index.rst @@ -48,7 +48,7 @@ Custom GPU Driver Parameters precompiled-drivers.rst GPU Driver CRD - Container Device Interface (CDI) Support + CDI and NRI Support .. toctree:: :caption: Sandboxed Workloads diff --git a/gpu-operator/install-gpu-operator-gov-ready.rst b/gpu-operator/install-gpu-operator-gov-ready.rst index 875db7cb0..787615e76 100644 --- a/gpu-operator/install-gpu-operator-gov-ready.rst +++ b/gpu-operator/install-gpu-operator-gov-ready.rst @@ -41,19 +41,19 @@ The government-ready NVIDIA GPU Operator includes the following components: * - Component - Version * - NVIDIA GPU Operator - - v25.10.0 + - v26.3.0 * - NVIDIA GPU Feature Discovery - - 0.18.0 + - 0.18.2 * - NVIDIA Container Toolkit - - 1.18.0 + - 1.19.0 * - NVIDIA Device Plugin - - 0.18.0 + - 0.18.2 * - NVIDIA DCGM-exporter - - 4.4.1-4.6.0 + - v4.5.1-4.8.0 * - NVIDIA MIG Manager - - 0.13.0 + - 0.13.1 * - NVIDIA Driver - - 580.95.05 |fn1|_ + - 580.126.20 |fn1|_ :sup:`1` Hardened for STIG/FIPS compliance @@ -62,7 +62,7 @@ Artifacts for these components are available from the `NVIDIA NGC Catalog `_ (**D**, **R**) - | `580.82.07 `_ - | `575.57.08 `_ - | `570.195.03 `_ - | `550.163.01 `_ - | `535.274.02 `_ - | `590.48.01 `_ - | `580.126.16 `_ (**R**) - | `580.126.09 `_ - | `580.105.08 
`_ (**D**) - | `580.95.05 `_ - | `580.82.07 `_ - | `575.57.08 `_ - | `570.211.01 `_ - | `570.195.03 `_ + | `580.126.20 `_ (**D**, **R**) + | `575.57.08 `_ | `550.163.01 `_ | `535.288.01 `_ - | `535.274.02 `_ - * - NVIDIA Driver Manager for Kubernetes - - `v0.9.0 `__ - `v0.9.1 `__ * - NVIDIA Container Toolkit - - `1.18.0 `__ + - `1.19 `__ * - NVIDIA Kubernetes Device Plugin - - `0.18.0 `__ - - `0.18.1 `__ + - `0.18.2 `__ * - DCGM Exporter - - `v4.4.1-4.6.0 `__ - - `v4.4.2-4.7.0 `__ + - `v4.5.1-4.8.0 `__ * - Node Feature Discovery - `v0.18.2 `__ * - | NVIDIA GPU Feature Discovery | for Kubernetes - - `0.18.1 `__ + - `0.18.2 `__ * - NVIDIA MIG Manager for Kubernetes - - `0.13.0 `__ - `0.13.1 `__ * - DCGM - - `4.4.1 `__ - - `4.4.2-1 `__ + - `4.5.2-1 `__ * - Validator for NVIDIA GPU Operator - - v25.10.0 + - v26.3 * - NVIDIA KubeVirt GPU Device Plugin - `v1.4.0 `__ @@ -153,7 +132,7 @@ Refer to :ref:`Upgrading the NVIDIA GPU Operator` for more information. - `v0.4.1 `__ * - NVIDIA GDS Driver |gds|_ - - `2.26.6 `__ + - `2.27.3 `__ * - NVIDIA Kata Manager for Kubernetes - `v0.2.3 `__ diff --git a/gpu-operator/manifests/output/nri-get-pods-restart.txt b/gpu-operator/manifests/output/nri-get-pods-restart.txt new file mode 100644 index 000000000..7f00e5cc5 --- /dev/null +++ b/gpu-operator/manifests/output/nri-get-pods-restart.txt @@ -0,0 +1,12 @@ +NAME READY STATUS RESTARTS AGE +gpu-feature-discovery-qnw2q 1/1 Running 0 47h +gpu-operator-6d59774ff-hznmr 1/1 Running 0 2d +gpu-operator-node-feature-discovery-master-6d6649d597-7l8bj 1/1 Running 0 2d +gpu-operator-node-feature-discovery-worker-v86vj 1/1 Running 0 2d +nvidia-container-toolkit-daemonset-2768s 1/1 Running 0 2m11s +nvidia-cuda-validator-ls4vc 0/1 Completed 0 47h +nvidia-dcgm-exporter-fxp9h 1/1 Running 0 47h +nvidia-device-plugin-daemonset-dvp4v 1/1 Running 0 2m26s +nvidia-device-plugin-validator-kvxbs 0/1 Completed 0 47h +nvidia-driver-daemonset-m86r7 1/1 Running 0 2d +nvidia-operator-validator-xg98r 1/1 Running 0 
47h diff --git a/gpu-operator/platform-support.rst b/gpu-operator/platform-support.rst index d854388bd..7a887b756 100644 --- a/gpu-operator/platform-support.rst +++ b/gpu-operator/platform-support.rst @@ -31,8 +31,9 @@ Platform Support .. _supported nvidia gpus and systems: +********************************************* Supported NVIDIA Data Center GPUs and Systems ---------------------------------------------- +********************************************* The following NVIDIA data center GPUs are supported on x86 based platforms: @@ -152,6 +153,8 @@ The following NVIDIA data center GPUs are supported on x86 based platforms: +-------------------------+------------------------+-------+ | NVIDIA RTX PRO 6000D | NVIDIA Blackwell | | +-------------------------+------------------------+-------+ + | NVIDIA RTX PRO 4500 | NVIDIA Blackwell | | + +-------------------------+------------------------+-------+ | NVIDIA RTX A6000 | NVIDIA Ampere /Ada | | +-------------------------+------------------------+-------+ | NVIDIA RTX A5000 | NVIDIA Ampere | | @@ -188,6 +191,8 @@ The following NVIDIA data center GPUs are supported on x86 based platforms: +-------------------------+------------------------+-------+ | Product | Architecture | Notes | +=========================+========================+=======+ + | NVIDIA DGX B300 | NVIDIA Blackwell | | + +-------------------------+------------------------+-------+ | NVIDIA DGX B200 | NVIDIA Blackwell | | +-------------------------+------------------------+-------+ | NVIDIA DGX Spark | NVIDIA Blackwell | | @@ -198,8 +203,12 @@ The following NVIDIA data center GPUs are supported on x86 based platforms: +-------------------------+------------------------+-------+ | NVIDIA HGX GB200 NVL72 | NVIDIA Blackwell | | +-------------------------+------------------------+-------+ + | NVIDIA HGX GB200 NVL4 | NVIDIA Blackwell | | + +-------------------------+------------------------+-------+ | NVIDIA HGX GB300 NVL72 | NVIDIA Blackwell | | 
+-------------------------+------------------------+-------+ + | NVIDIA DGX Station | NVIDIA Blackwell | | + +-------------------------+------------------------+-------+ .. note:: @@ -208,8 +217,9 @@ The following NVIDIA data center GPUs are supported on x86 based platforms: .. _gpu-operator-arm-platforms: +***************************** Supported ARM Based Platforms ------------------------------ +***************************** The following NVIDIA data center GPUs are supported: @@ -247,8 +257,9 @@ system that meets the following requirements is supported: .. _Supported Deployment Options, Hypervisors, and NVIDIA vGPU Based Products: +**************************** Supported Deployment Options ----------------------------- +**************************** The GPU Operator has been validated in the following scenarios: @@ -268,8 +279,9 @@ The GPU Operator has been validated in the following scenarios: .. _container-platforms: +**************************************************** Supported Operating Systems and Kubernetes Platforms ----------------------------------------------------- +**************************************************** .. _fn1: #kubernetes-version .. |fn1| replace:: :sup:`1` @@ -277,8 +289,8 @@ Supported Operating Systems and Kubernetes Platforms .. |fn2| replace:: :sup:`2` .. _fn3: #rhel-9 .. |fn3| replace:: :sup:`3` -.. _fn4: #k8s-version -.. |fn4| replace:: :sup:`4` +.. _fn5: #azure-linux-3 +.. 
|fn5| replace:: :sup:`5` The GPU Operator has been validated in the following scenarios: @@ -292,16 +304,17 @@ The GPU Operator has been validated in the following scenarios: * - | Operating | System - - Kubernetes |fn1|_, |fn4|_ + - Kubernetes |fn1|_ - | Red Hat | OpenShift - | VMware vSphere | Kubernetes Service (VKS) - | Rancher Kubernetes - | Engine 2 |fn4|_ - - | Mirantis k0s |fn4|_ + | Engine 2 + - | K3s + - | Mirantis k0s - | Canonical - | MicroK8s |fn4|_ + | MicroK8s - | Nutanix | NKP @@ -312,6 +325,7 @@ The GPU Operator has been validated in the following scenarios: - 1.30---1.35 - - + - - 2.12, 2.13, 2.14 * - Ubuntu 22.04 LTS |fn2|_ @@ -319,6 +333,7 @@ The GPU Operator has been validated in the following scenarios: - - 1.30---1.35 - 1.30---1.35 + - 1.30---1.35 - 1.30---1.35 - 1.33---1.35 - 2.12, 2.13, 2.14, 2.15 @@ -329,6 +344,7 @@ The GPU Operator has been validated in the following scenarios: - - 1.30---1.35 - 1.30---1.35 + - 1.30---1.35 - 1.33---1.35 - @@ -340,10 +356,11 @@ The GPU Operator has been validated in the following scenarios: - - - + - * - | Red Hat | Enterprise - | Linux 9.2, 9.4, 9.6 |fn3|_ + | Linux 10.0, 10.1 - 1.30---1.35 - - @@ -351,6 +368,19 @@ The GPU Operator has been validated in the following scenarios: - - - + - + + * - | Red Hat + | Enterprise + | Linux 9.2, 9.4, 9.6, 9.7, 9.8 |fn3|_ + - 1.30---1.35 + - + - + - 1.30---1.35 + - + - + - + - * - | Red Hat | Enterprise @@ -362,6 +392,7 @@ The GPU Operator has been validated in the following scenarios: - 1.30---1.35 - - + - - 2.12, 2.13, 2.14, 2.15 .. _kubernetes-version: @@ -387,11 +418,6 @@ The GPU Operator has been validated in the following scenarios: Non-precompiled driver containers for Red Hat Enterprise Linux 9.2, 9.4, and 9.6 versions are available for x86 based platforms only. They are not available for ARM based systems. - .. _k8s-version: - - :sup:`4` - Kubernetes v1.35 support was added in v25.10.1 and later. - .. 
note:: |ocp_csp_support| @@ -426,14 +452,23 @@ The GPU Operator has been validated in the following scenarios: - 1.30---1.35 - 1.30---1.35 - - Kubernetes v1.35 support was added in v25.10.1 and later. + * - Azure Linux 3 (Local Program) |fn5|_ + - + - + - 1.30---1.35 + + .. _azure-linux-3: + + :sup:`5` + Azure Linux 3 are available as precompiled drivers and signed vGPU Guest Driver. + .. _supported-precompiled-drivers: +***************************** Supported Precompiled Drivers ------------------------------ +***************************** The GPU Operator has been validated with the following precompiled drivers. See the :doc:`precompiled-drivers` page for more information about using precompiled drivers. @@ -452,14 +487,14 @@ See the :doc:`precompiled-drivers` page for more information about using precomp +----------------------------+------------------------+----------------+---------------------+ - +**************************** Supported Container Runtimes ----------------------------- +**************************** The GPU Operator has been validated for the following container runtimes: +----------------------------+------------------------+----------------+ -| Operating System | Containerd 1.7 - 2.1 | CRI-O | +| Operating System | Containerd 1.7 - 2.2 | CRI-O | +============================+========================+================+ | Ubuntu 20.04 LTS | Yes | Yes | +----------------------------+------------------------+----------------+ @@ -474,9 +509,14 @@ The GPU Operator has been validated for the following container runtimes: | Red Hat Enterprise Linux 9 | Yes | Yes | +----------------------------+------------------------+----------------+ +.. note:: + + If you are planning to use the NRI Plugin, you must use at least CRI-O version v1.34.0 or containerd version v1.7.30, v2.1.x and v2.2.x. + If you are not using the latest containerd version, check that both CDI and NRI are enabled in the containerd configuration file before deploying GPU Operator. 
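The check described in the note above can be made concrete. The sketch below greps a sample containerd configuration fragment for the two relevant settings; the key names (``enable_cdi`` under the CRI plugin and the NRI plugin's ``disable`` flag) are containerd 1.7-era settings, so verify them against the documentation for your containerd version, and on a live node point the greps at ``/etc/containerd/config.toml`` instead of the sample file.

```shell
# Sample containerd config fragment with CDI and NRI both enabled.
cat > sample-containerd-config.toml <<'EOF'
[plugins."io.containerd.grpc.v1.cri"]
  enable_cdi = true
  cdi_spec_dirs = ["/etc/cdi", "/var/run/cdi"]

[plugins."io.containerd.nri.v1.nri"]
  disable = false
EOF
# Both checks should succeed before deploying the GPU Operator with the
# NRI Plugin (on a real node, grep /etc/containerd/config.toml):
grep -q 'enable_cdi = true' sample-containerd-config.toml && echo "CDI: enabled"
grep -q 'disable = false' sample-containerd-config.toml && echo "NRI: enabled"
```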
+************************************************* Support for KubeVirt and OpenShift Virtualization -------------------------------------------------- +************************************************* Red Hat OpenShift Virtualization is based on KubeVirt. @@ -487,13 +527,12 @@ Operating System Kubernetes KubeVirt OpenShift Virtual \ \ | GPU vGPU | GPU vGPU | Passthrough | Passthrough ================ =========== ============= ========= ============= =========== -Ubuntu 20.04 LTS 1.30---1.35 0.36+ 0.59.1+ +Ubuntu 24.04 LTS 1.30---1.35 0.36+ 0.59.1+ Ubuntu 22.04 LTS 1.30---1.35 0.36+ 0.59.1+ +Ubuntu 20.04 LTS 1.30---1.35 0.36+ 0.59.1+ Red Hat Core OS 4.14---4.21 4.14---4.21 ================ =========== ============= ========= ============= =========== -Kubernetes v1.35 support was added in v25.10.1 and later. - You can run GPU passthrough and NVIDIA vGPU in the same cluster as long as you use a software version that meets both requirements. @@ -519,14 +558,15 @@ KubeVirt and OpenShift Virtualization with NVIDIA vGPU is supported on the follo The L40G GPU is excluded. -Note that HGX platforms are not supported. +- NVIDIA HGX GB200 NVL72, GB300 NVL72 on Ubuntu 24.04 LTS. .. note:: KubeVirt with NVIDIA vGPU is supported on ``nodes`` with Linux kernel < 6.0, such as Ubuntu 22.04 ``LTS``. +************************** Support for GPUDirect RDMA --------------------------- +************************** Supported operating systems and NVIDIA GPU Drivers with GPUDirect RDMA. @@ -538,9 +578,9 @@ Supported operating systems and NVIDIA GPU Drivers with GPUDirect RDMA. For information about configuring GPUDirect RDMA, refer to :doc:`gpu-operator-rdma`. - +***************************** Support for GPUDirect Storage ------------------------------ +***************************** Supported operating systems and NVIDIA GPU Drivers with GPUDirect Storage. @@ -560,8 +600,9 @@ Supported operating systems and NVIDIA GPU Drivers with GPUDirect Storage. Not supported with secure boot. 
Supported storage types are local NVMe and remote NFS.
 
+*******************************************
 Additional Supported Tools and Integrations
---------------------------------------------
+*******************************************
 
 Container management tools:
diff --git a/gpu-operator/release-notes.rst b/gpu-operator/release-notes.rst
index bd4634d96..01ed9ff3e 100644
--- a/gpu-operator/release-notes.rst
+++ b/gpu-operator/release-notes.rst
@@ -33,6 +33,59 @@ Refer to the :ref:`GPU Operator Component Matrix` for a list of software compone
 
 ----
 
+.. _v26.3.0:
+
+26.3.0
+=======
+
+New Features
+------------
+
+* Updated software component versions:
+
+  - NVIDIA Driver Manager for Kubernetes v0.9.1
+  - NVIDIA Container Toolkit v1.19.0
+  - NVIDIA DCGM v4.5.2-1
+  - NVIDIA DCGM Exporter v4.5.1-4.8.0
+  - NVIDIA GDS Driver v2.27.3
+  - NVIDIA Kubernetes Device Plugin v0.18.2
+  - NVIDIA MIG Manager for Kubernetes v0.13.1
+  - NVIDIA GPU Feature Discovery for Kubernetes v0.18.2
+
+* Added support for these NVIDIA Data Center GPU Driver versions:
+
+  - 580.126.20 (default)
+
+* Added support for the Node Resource Interface (NRI) Plugin.
+  This is a new way of injecting GPU management CDI devices into operands, replacing the ``nvidia`` runtime class.
+  Enable it by setting the ``cdi.nriPluginEnabled`` field to ``true`` in the ClusterPolicy custom resource or in the Helm chart.
+  When the NRI Plugin is enabled, the GPU Operator no longer requires setting values such as ``CONTAINERD_CONFIG``, ``CONTAINERD_SOCKET``, or ``RUNTIME_CONFIG_SOURCE`` on platforms such as K3s, k0s, and RKE.
+  This feature requires CRI-O v1.34.0 or later, or containerd v1.7.30, v2.1.x, or v2.2.x.
+  If you are not using the latest containerd version, check that both CDI and NRI are enabled in the containerd configuration file before deploying the GPU Operator.
+
+* Added support for KubeVirt vGPU with Ubuntu 24.04 LTS and the VFIO framework. 
+
+* Added support for a vGPU precompiled driver container for Azure Linux.
+
+* Added support for K3s.
+
+* Added support for new MIG profiles with NVIDIA HGX GB300 NVL72.
+
+Improvements
+------------
+
+* Improved the GPU driver container to use a fast-path optimization that avoids unnecessary driver reinstalls when GPU workloads are running. This reduces downtime from minutes to approximately 10 seconds.
 
 .. _v25.10.1:
diff --git a/gpu-operator/versions.json b/gpu-operator/versions.json
index c893de42f..d8e53c86c 100644
--- a/gpu-operator/versions.json
+++ b/gpu-operator/versions.json
@@ -1,7 +1,10 @@
 {
-  "latest": "25.10",
+  "latest": "26.3",
   "versions": [
+    {
+      "version": "26.3"
+    },
     {
       "version": "25.10"
     },
diff --git a/gpu-operator/versions1.json b/gpu-operator/versions1.json
index 3557db032..f29832ee9 100644
--- a/gpu-operator/versions1.json
+++ b/gpu-operator/versions1.json
@@ -1,6 +1,10 @@
 [
   {
     "preferred": "true",
+    "url": "../26.3",
+    "version": "26.3"
+  },
+  {
     "url": "../25.10",
     "version": "25.10"
   },

From 98273f15801f0dd8713309010e92c3d1395c86b5 Mon Sep 17 00:00:00 2001
From: Abigail McCarthy <20771501+a-mccarthy@users.noreply.github.com>
Date: Wed, 25 Feb 2026 15:16:52 -0500
Subject: [PATCH 2/3] add rocky linux

Signed-off-by: Abigail McCarthy <20771501+a-mccarthy@users.noreply.github.com>
---
 gpu-operator/platform-support.rst | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/gpu-operator/platform-support.rst b/gpu-operator/platform-support.rst
index 7a887b756..da8229b84 100644
--- a/gpu-operator/platform-support.rst
+++ b/gpu-operator/platform-support.rst
@@ -395,6 +395,16 @@ The GPU Operator has been validated in the following scenarios:
     -
     - 2.12, 2.13, 2.14, 2.15
 
+   * - | Rocky Linux 9.7
+     - 1.30---1.35
+     -
+     -
+     -
+     -
+     -
+     -
+
 .. 
_kubernetes-version: :sup:`1` From a4293b162ca8ab0a9218ab64a146e9a6b1068cc0 Mon Sep 17 00:00:00 2001 From: Abigail McCarthy <20771501+a-mccarthy@users.noreply.github.com> Date: Fri, 27 Feb 2026 11:42:01 -0500 Subject: [PATCH 3/3] Update platform support, add dynamic mig Signed-off-by: Abigail McCarthy <20771501+a-mccarthy@users.noreply.github.com> --- gpu-operator/getting-started.rst | 4 + gpu-operator/gpu-operator-mig.rst | 81 +++++++++++++++---- gpu-operator/index.rst | 2 +- .../manifests/input/mig-cm-values.yaml | 11 +++ gpu-operator/platform-support.rst | 77 +++++++++--------- gpu-operator/release-notes.rst | 53 +++++++++++- 6 files changed, 174 insertions(+), 54 deletions(-) diff --git a/gpu-operator/getting-started.rst b/gpu-operator/getting-started.rst index ee5684475..c25774723 100644 --- a/gpu-operator/getting-started.rst +++ b/gpu-operator/getting-started.rst @@ -180,6 +180,10 @@ To view all the options, run ``helm show values nvidia/gpu-operator``. Available values are ``Cluster`` (default) or ``Local``. - ``Cluster`` + * - ``dcgmExporter.hostNetwork`` + - When set to ``true``, the DCGM Exporter exposes its metrics port in the host's network namespace. + - ``false`` + * - ``devicePlugin.config`` - Specifies the configuration for the NVIDIA Device Plugin as a config map. diff --git a/gpu-operator/gpu-operator-mig.rst b/gpu-operator/gpu-operator-mig.rst index 365a8f25c..1cdc79f10 100644 --- a/gpu-operator/gpu-operator-mig.rst +++ b/gpu-operator/gpu-operator-mig.rst @@ -34,16 +34,18 @@ Multi-Instance GPU (MIG) enables GPUs based on the NVIDIA Ampere and later archi Refer to the `MIG User Guide `__ for more information about MIG. GPU Operator deploys MIG Manager to manage MIG configuration on nodes in your Kubernetes cluster. +You must enable MIG during installation by choosing a MIG strategy before you can configure MIG. + +Also see the :ref:`architecture section <mig-architecture>` for more information about how MIG is supported in the GPU Operator.
******************************** Enabling MIG During Installation ******************************** +Use the following steps to enable MIG. +The example below sets ``single`` as the MIG strategy. +Set ``mig.strategy`` to ``mixed`` when MIG mode is not enabled on all GPUs on a node. -The following steps use the ``single`` MIG strategy. -Alternatively, you can specify the ``mixed`` strategy. - -Perform the following steps to install the Operator and configure MIG: #. Install the Operator: $ helm install gpu-operator -n gpu-operator --create-namespace \ nvidia/gpu-operator \ --version=${version} \ --set mig.strategy=single - Set ``mig.strategy`` to ``mixed`` when MIG mode is not enabled on all GPUs on a node. - In a CSP environment such as Google Cloud, also specify + In a cloud service provider (CSP) environment such as Google Cloud, also specify ``--set migManager.env[0].name=WITH_REBOOT --set-string migManager.env[0].value=true`` to ensure that the node reboots and can apply the MIG configuration. - MIG Manager supports preinstalled drivers. + MIG Manager supports preinstalled drivers, meaning drivers that you installed directly on the host and that are not managed by the GPU Operator. If drivers are preinstalled, also specify ``--set driver.enabled=false``. Refer to :ref:`mig-with-preinstalled-drivers` for more details. @@ -110,10 +111,22 @@ Perform the following steps to install the Operator and configure MIG: Configuring MIG Profiles ************************ -By default, nodes are labeled with ``nvidia.com/mig.config: all-disabled`` and you must specify the MIG configuration to apply. +By default, nodes are labeled with ``nvidia.com/mig.config: all-disabled``. +To use a profile on a node, you add a label with the desired profile, for example, ``nvidia.com/mig.config=all-1g.10gb``. +MIG Manager uses an auto-generated ``default-mig-parted-config`` config map in the GPU Operator namespace to identify supported MIG profiles.
Refer to the config map when you label the node or customize the config map. + +Introduced in GPU Operator v26.3.0, MIG Manager dynamically generates the MIG configuration for a node at runtime from the available hardware. +On startup, MIG Manager discovers the MIG profiles for each GPU on a node using NVML and writes the configuration to a per-node ConfigMap. +Each ConfigMap contains a complete mig-parted config, including ``all-disabled``, ``all-enabled``, per-profile configs such as ``all-1g.10gb``, and ``all-balanced`` with device-filter support for mixed GPU types. -MIG Manager uses the ``default-mig-parted-config`` config map in the GPU Operator namespace to identify supported MIG profiles. -Refer to the config map when you label the node or customize the config map. +When a new MIG-capable GPU is added to a node, the new GPU is automatically supported. + +While it's recommended that you use the auto-generated MIG configuration file, you can provide your own ConfigMap if you need custom profiles. +When a custom config is supplied, MIG Manager uses it instead of generating one. +You can use the Helm chart to create a ConfigMap from values at install time, or create and reference your own ConfigMap. + +.. note:: + Dynamic MIG configuration might not be available on older drivers, such as R535, because they don't support querying MIG profiles when MIG mode is disabled. In those cases, the GPU Operator uses a static config file for MIG profiles. Example: Single MIG Strategy ============================ @@ -327,18 +340,49 @@ The following steps show how to update a GPU on a node to the ``3g.40gb`` profil } +.. _dynamically-creating-the-mig-configuration-configmap: + +Dynamically Creating the MIG Configuration ConfigMap +==================================================== + +You can have the Helm chart create the MIG configuration ConfigMap at install or upgrade time from values, so you do not need to apply a separate ConfigMap manifest.
The chart creates the ConfigMap when both of the following are set: + +* ``migManager.config.create``: ``true`` +* ``migManager.config.data``: non-empty (for example, a ``config.yaml`` key with mig-parted content) + + Example: Custom MIG Configuration During Installation ===================================================== -By default, the Operator creates the ``default-mig-parted-config`` config map and MIG Manager is configured to read profiles from that config map. +By default, the Operator auto-generates a per-node ``default-mig-parted-config`` ConfigMap. +If you need to use custom profiles, you can create a custom ConfigMap during installation by passing a name and data for the ConfigMap to the Helm command. -You can use the ``values.yaml`` file when you install or upgrade the Operator to create a config map with a custom configuration. +The MIG Manager DaemonSet is then configured to use this ConfigMap instead of the auto-generated one. + +In your ``values.yaml`` file, set ``migManager.config.create`` to ``true``, set ``migManager.config.name``, and add the config map data under ``migManager.config.data``. #. In your ``values.yaml`` file, add the data for the config map, like the following example: .. literalinclude:: manifests/input/mig-cm-values.yaml :language: yaml +.. note:: + Custom ConfigMaps must contain a key named ``config.yaml``. + +#. Install or upgrade the GPU Operator with this values file so the chart creates the ConfigMap: + + .. code-block:: console + + $ helm upgrade --install gpu-operator -n gpu-operator --create-namespace \ + nvidia/gpu-operator --version=${version} \ + -f values.yaml + #.
If the custom configuration specifies more than one instance profile, set the strategy to ``mixed``: .. code-block:: console @@ -354,18 +398,24 @@ You can use the ``values.yaml`` file when you install or upgrade the Operator to $ kubectl label nodes nvidia.com/mig.config=custom-mig --overwrite +.. _example-custom-mig-configuration: + Example: Custom MIG Configuration ================================= -By default, the Operator creates the ``default-mig-parted-config`` config map and MIG Manager is configured to read profiles from that config map. +By default, the Operator creates the ``default-mig-parted-config`` ConfigMap and MIG Manager reads profiles from it. -You can create a config map with a custom configuration if the default profiles do not meet your business needs. +You can instead create and apply a ConfigMap yourself if the default profiles do not meet your needs. #. Create a file, such as ``custom-mig-config.yaml``, with contents like the following example: .. literalinclude:: manifests/input/custom-mig-config.yaml :language: yaml + +.. note:: + Custom ConfigMaps must contain a key named ``config.yaml``. + #. Apply the manifest: .. code-block:: console @@ -523,6 +573,8 @@ Alternatively, you can create a custom config map for use by MIG Manager by perf --set migManager.gpuClientsConfig.name=gpu-clients --set driver.enabled=false +.. _mig-architecture: + ***************** Architecture ***************** @@ -536,6 +588,7 @@ Finally, it applies the MIG reconfiguration and restarts the GPU pods and possib The MIG reconfiguration can also involve rebooting a node if a reboot is required to enable MIG mode. The default MIG profiles are specified in the ``default-mig-parted-config`` config map. +This config map is auto-generated by the Operator on startup and contains the standard MIG profiles for the available GPUs on the node. You can apply one of these profiles to the ``mig.config`` label to trigger a reconfiguration of the MIG geometry.
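A reconfiguration like the one just described is triggered by labeling the node with one of the profile names from the config map; a sketch (the node name ``worker-0`` is illustrative, and the profile must exist in ``default-mig-parted-config``):

```shell
# Sketch: trigger a MIG reconfiguration by applying a profile name from the
# default-mig-parted-config config map as the mig.config label.
# "worker-0" is an illustrative node name; requires a cluster with MIG Manager.
kubectl label nodes worker-0 nvidia.com/mig.config=all-1g.10gb --overwrite

# MIG Manager reports progress through the mig.config.state node label;
# a value of "success" indicates the new geometry was applied.
kubectl get node worker-0 --show-labels | tr ',' '\n' | grep mig.config
```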
MIG Manager uses the `mig-parted `__ tool to apply the configuration diff --git a/gpu-operator/index.rst b/gpu-operator/index.rst index 640f763d0..03fe9fef2 100644 --- a/gpu-operator/index.rst +++ b/gpu-operator/index.rst @@ -41,7 +41,7 @@ :hidden: NVIDIA DRA Driver for GPUs - Multi-Instance GPU + Multi-Instance GPU (MIG) Time-Slicing GPUs gpu-operator-rdma.rst Outdated Kernels diff --git a/gpu-operator/manifests/input/mig-cm-values.yaml b/gpu-operator/manifests/input/mig-cm-values.yaml index 550f19a4b..b3b618b22 100644 --- a/gpu-operator/manifests/input/mig-cm-values.yaml +++ b/gpu-operator/manifests/input/mig-cm-values.yaml @@ -11,7 +11,18 @@ migManager: mig-enabled: false custom-mig: - devices: [0] + mig-enabled: false + - devices: [1] mig-enabled: true mig-devices: "1g.10gb": 2 + - devices: [2] + mig-enabled: true + mig-devices: "2g.20gb": 2 + "3g.40gb": 1 + - devices: [3] + mig-enabled: true + mig-devices: + "3g.40gb": 1 + "4g.40gb": 1 diff --git a/gpu-operator/platform-support.rst b/gpu-operator/platform-support.rst index da8229b84..360f914cf 100644 --- a/gpu-operator/platform-support.rst +++ b/gpu-operator/platform-support.rst @@ -154,6 +154,7 @@ The following NVIDIA data center GPUs are supported on x86 based platforms: | NVIDIA RTX PRO 6000D | NVIDIA Blackwell | | +-------------------------+------------------------+-------+ | NVIDIA RTX PRO 4500 | NVIDIA Blackwell | | + | Blackwell Server Edition| | | +-------------------------+------------------------+-------+ | NVIDIA RTX A6000 | NVIDIA Ampere /Ada | | +-------------------------+------------------------+-------+ @@ -319,38 +320,38 @@ The GPU Operator has been validated in the following scenarios: | NKP * - Ubuntu 20.04 LTS |fn2|_ - - 1.30---1.35 + - 1.32---1.35 - - - 1.30---1.35 - - 1.30---1.35 + - 1.32---1.35 + - 1.32---1.35 - - - - 2.12, 2.13, 2.14 * - Ubuntu 22.04 LTS |fn2|_ - - 1.30---1.35 + - 1.32---1.35 - - - 1.30---1.35 - - 1.30---1.35 - - 1.30---1.35 - - 1.30---1.35 + - 1.32---1.35 + - 
1.32---1.35 + - 1.32---1.35 + - 1.32---1.35 - 1.33---1.35 - 2.12, 2.13, 2.14, 2.15 * - Ubuntu 24.04 LTS - - 1.30---1.35 + - 1.32---1.35 - - - - 1.30---1.35 - - 1.30---1.35 - - 1.30---1.35 + - 1.32---1.35 + - 1.32---1.35 + - 1.32---1.35 - 1.33---1.35 - * - Red Hat Core OS - - - | 4.14---4.21 + - | 4.17---4.21 - - - @@ -361,10 +362,10 @@ The GPU Operator has been validated in the following scenarios: * - | Red Hat | Enterprise | Linux 10.0, 10.1 - - 1.30---1.35 + - 1.32---1.35 - - - - 1.30---1.35 + - 1.32---1.35 - - - @@ -372,11 +373,11 @@ The GPU Operator has been validated in the following scenarios: * - | Red Hat | Enterprise - | Linux 9.2, 9.4, 9.6, 9.7, 9.8 |fn3|_ - - 1.30---1.35 + | Linux 9.2, 9.4, 9.6, 9.7 |fn3|_ + - 1.32---1.35 - - - - 1.30---1.35 + - 1.32---1.35 - - - @@ -386,17 +387,17 @@ The GPU Operator has been validated in the following scenarios: | Enterprise | Linux 8.8, | 8.10 - - 1.30---1.35 + - 1.32---1.35 - - - - 1.30---1.35 + - 1.32---1.35 - - - - 2.12, 2.13, 2.14, 2.15 - * - | Rocky Linux 9.7 - - 1.30---1.35 + * - Rocky Linux 9.7 + - 1.32---1.35 - - - @@ -448,24 +449,24 @@ The GPU Operator has been validated in the following scenarios: | Kubernetes Service * - Ubuntu 20.04 LTS - - 1.30---1.35 - - 1.30---1.35 - - 1.30---1.35 + - 1.32---1.35 + - 1.32---1.35 + - 1.32---1.35 * - Ubuntu 22.04 LTS - - 1.30---1.35 - - 1.30---1.35 - - 1.30---1.35 + - 1.32---1.35 + - 1.32---1.35 + - 1.32---1.35 * - Ubuntu 24.04 LTS - - 1.30---1.35 - - 1.30---1.35 - - 1.30---1.35 + - 1.32---1.35 + - 1.32---1.35 + - 1.32---1.35 * - Azure Linux 3 (Local Program) |fn5|_ - - - - 1.30---1.35 + - 1.32---1.35 .. 
_azure-linux-3: @@ -537,10 +538,10 @@ Operating System Kubernetes KubeVirt OpenShift Virtual \ \ | GPU vGPU | GPU vGPU | Passthrough | Passthrough ================ =========== ============= ========= ============= =========== -Ubuntu 24.04 LTS 1.30---1.35 0.36+ 0.59.1+ -Ubuntu 22.04 LTS 1.30---1.35 0.36+ 0.59.1+ -Ubuntu 20.04 LTS 1.30---1.35 0.36+ 0.59.1+ -Red Hat Core OS 4.14---4.21 4.14---4.21 +Ubuntu 24.04 LTS 1.32---1.35 0.36+ 0.59.1+ +Ubuntu 22.04 LTS 1.32---1.35 0.36+ 0.59.1+ +Ubuntu 20.04 LTS 1.32---1.35 0.36+ 0.59.1+ +Red Hat Core OS 4.17---4.21 4.17---4.21 ================ =========== ============= ========= ============= =========== You can run GPU passthrough and NVIDIA vGPU in the same cluster as long as you use @@ -584,7 +585,7 @@ Supported operating systems and NVIDIA GPU Drivers with GPUDirect RDMA. - Ubuntu 24.04 LTS with Network Operator 25.7.0. - Ubuntu 20.04 and 22.04 LTS with Network Operator 25.7.0. - Red Hat Enterprise Linux 9.2, 9.4, and 9.6 with Network Operator 25.7.0. -- Red Hat OpenShift 4.14 and higher with Network Operator 25.7.0. +- Red Hat OpenShift 4.17 and higher with Network Operator 25.7.0. For information about configuring GPUDirect RDMA, refer to :doc:`gpu-operator-rdma`. @@ -596,7 +597,7 @@ Supported operating systems and NVIDIA GPU Drivers with GPUDirect Storage. - Ubuntu 24.04 LTS Network Operator 25.7.0. - Ubuntu 20.04 and 22.04 LTS with Network Operator 25.7.0. -- Red Hat OpenShift Container Platform 4.14 and higher. +- Red Hat OpenShift Container Platform 4.17 and higher. .. note:: diff --git a/gpu-operator/release-notes.rst b/gpu-operator/release-notes.rst index 01ed9ff3e..09e8c80dd 100644 --- a/gpu-operator/release-notes.rst +++ b/gpu-operator/release-notes.rst @@ -63,6 +63,11 @@ New Features This feature requires CRI-O v1.34.0 or later or containerd v1.7.30, v2.1.x, or v2.2.x. 
If you are not using the latest containerd version, check that both CDI and NRI are enabled in the containerd configuration file before deploying GPU Operator. +* GPU Feature Discovery now uses the NodeFeature API by default instead of feature files to discover GPUs and add GPU node labels to the nodes. + Note that OpenShift clusters do not support the NodeFeature API yet. + +* Added support for dynamic MIG configuration generation. + * Added support for KubeVirt vGPU with Ubuntu 24.04 LTS and the VFIO framework. * Added support for vGPU precompiled driver container for Azure Linux. @@ -71,16 +76,62 @@ New Features * Added support for new MIG profiles with NVIDIA HGX GB300 NVL72. +* Added support for new operating systems: + + - Rocky Linux 9.7 + - Red Hat Enterprise Linux 10.0, 10.1 + - Red Hat Enterprise Linux 9.7 + +* Added support for including extra manifests with the Helm chart. + +* Added support for the DCGM Exporter to expose its metrics port in the host's network namespace. + Enable it by setting ``hostNetwork: true`` in the ClusterPolicy custom resource, or by passing ``--set dcgmExporter.hostNetwork=true`` to the Helm chart. (`PR #1962 `_) + +* Added PodSecurityContext support for DaemonSets. (`PR #2120 `_) + +* https://github.com/NVIDIA/gpu-operator/pull/2014 + Improvements ------------ * Improved the GPU driver container to use fast-path optimization that avoids unnecessary driver reinstalls when GPU workloads are running. This reduces downtime from minutes to ~10 seconds. +* Improvements to GPU Operator resilience. + +* Improved the NVIDIA Kubernetes Device Plugin to avoid unnecessary GPU unbind/rebind operations during rolling updates of the vfio-manager DaemonSet. + This improves the stability of workloads that use GPU passthrough (KubeVirt, Kata Containers). + +* Improved the Upgrade Controller to reduce unnecessary reconciliation in environments with Node Feature Discovery (NFD) enabled.
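The containerd CDI/NRI prerequisite called out in the NRI note above can be verified in the containerd configuration file; a sketch for containerd 1.7 (config version 2 — the plugin section names differ in containerd 2.x, so treat the exact keys as assumptions for your version):

```toml
# Sketch: containerd 1.7 (config version 2) settings the NRI Plugin relies on.
version = 2

[plugins."io.containerd.grpc.v1.cri"]
  # CDI device injection must be enabled in the CRI plugin.
  enable_cdi = true
  cdi_spec_dirs = ["/etc/cdi", "/var/run/cdi"]

[plugins."io.containerd.nri.v1.nri"]
  # NRI is disabled by default in containerd 1.7; set disable = false to turn it on.
  disable = false
```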
+ +* Improved the GPU Operator to deploy on heterogeneous clusters with different operating systems on GPU nodes. + + * Fixed issues where the GPU Operator was not detecting the correct operating system on heterogeneous clusters. Now the GPU Operator uses the OS version labels from GPU worker nodes, added by NFD, when determining which OS-specific paths to use for repository configuration files. (`PR #562 `_, `PR #2138 `_) + +* Performance improvements. (`PR #2113 <https://github.com/NVIDIA/gpu-operator/pull/2113>`_) + + Fixed Issues ------------ -* +* Fixed an issue where driver installations could fail because cached packages were incorrectly referenced. (`PR #592 `_) + +* Fixed a shared state issue that caused incorrect driver images in multi-node-pool clusters. (`PR #1952 `_) + +* Fixed an issue where the GPU Operator was applying driver upgrade annotations when the driver is disabled. (`PR #1968 `_) + +* Fixed an issue where Helm chart device plugin values were not being validated correctly. (`PR #1999 `_) + +* Fixed an issue on OpenShift clusters where the ``dcgm-exporter`` pod was bound to a different Security Context Constraint (SCC) object rather than the ``nvidia-dcgm-exporter`` SCC that the GPU Operator creates. (`PR #2122 `_) + +* Fixed an issue where the GPU Operator was not correctly cleaning up daemonsets. (`PR #2081 <https://github.com/NVIDIA/gpu-operator/pull/2081>`_) + +* Fixed an issue where the GPU Operator was not adding a namespace to ServiceAccount objects. (`PR #2039 `_) + + +Removals and Deprecations +------------------------- +* Marked unused field ``defaultRuntime`` as optional in the ClusterPolicy. (`PR #2000 `_) Known Issues ------------