From 7898ae327db4d8dfbfbb3ac9103b189355de8e04 Mon Sep 17 00:00:00 2001
From: Shiv Tyagi
Date: Thu, 5 Feb 2026 10:45:33 +0000
Subject: [PATCH] Add guide for CDI

This adds a detailed guide about how to use amd-ctk to generate and validate
---
 README.md                            |   2 +-
 docs/container-runtime/cdi-guide.rst | 131 +++++++++++++++++++++++++++
 docs/index.md                        |   1 +
 docs/sphinx/_toc.yml.in              |   2 +
 4 files changed, 135 insertions(+), 1 deletion(-)
 create mode 100644 docs/container-runtime/cdi-guide.rst

diff --git a/README.md b/README.md
index a02f672..896f81a 100644
--- a/README.md
+++ b/README.md
@@ -98,7 +98,7 @@ To install the AMD Container Toolkit on RHEL/CentOS 9 systems, follow these step
     > docker run --rm --runtime=amd -e AMD_VISIBLE_DEVICES=0-3,5,8 rocm/rocm-terminal rocm-smi
     ```
 
-  2. Using [CDI](https://github.com/cncf-tags/container-device-interface) style
+  2. Using [CDI](docs/container-runtime/cdi-guide.rst) style
 
     - First, generate the CDI spec.
 
diff --git a/docs/container-runtime/cdi-guide.rst b/docs/container-runtime/cdi-guide.rst
new file mode 100644
index 0000000..d1073b6
--- /dev/null
+++ b/docs/container-runtime/cdi-guide.rst
@@ -0,0 +1,131 @@
+==========================================
+Support for Container Device Interface
+==========================================
+
+Overview
+========
+
+The `Container Device Interface `_ (CDI) is a standardized specification for exposing specialized hardware devices, such as AMD GPUs, to containers in a runtime-agnostic manner. This works consistently across different container runtimes.
+
+CDI eliminates the need for runtime-specific hooks or shims, like ``amd-container-runtime``, by allowing container runtimes to natively understand and inject device resources into containers.
+
+The ``amd-ctk`` tool provides commands to generate and manage CDI specifications for AMD GPU devices on your system.
+
+Prerequisites
+=============
+
+Before using CDI with AMD GPUs, ensure:
+
+* AMD GPU drivers are properly installed on the host system
+* The ``amd-ctk`` tool is installed
+* Your container runtime supports CDI
+
+Generating CDI Specifications
+==============================
+
+To generate a CDI specification for AMD GPUs on your system, run:
+
+.. code-block:: bash
+
+   sudo amd-ctk cdi generate
+
+This command:
+
+* Scans the system for available AMD GPU devices
+* Creates a CDI specification file at ``/etc/cdi/amd.json``
+* Defines device nodes, mount points, and environment variables needed for each GPU
+
+**Custom Output Location**
+
+To generate the specification in a different location, use the ``--output`` flag:
+
+.. code-block:: bash
+
+   amd-ctk cdi generate --output /path/to/custom/amd.json
+
+Validating CDI Specifications
+==============================
+
+To verify that your CDI specification matches the actual GPU hardware on the system, run:
+
+.. code-block:: bash
+
+   sudo amd-ctk cdi validate
+
+This command:
+
+* Reads the CDI specification from ``/etc/cdi/amd.json``
+* Scans the system for available AMD GPU devices
+* Verifies that the devices defined in the specification accurately reflect the hardware present on the host
+
+**Custom Specification Path**
+
+To validate a specification at a different location, use the ``--path`` flag:
+
+.. code-block:: bash
+
+   amd-ctk cdi validate --path /path/to/custom/amd.json
+
+.. note::
+
+   The ``amd-ctk`` tool requires appropriate permissions to read and write CDI specification files. When operating on the default location (``/etc/cdi``), it requires elevated privileges, hence ``sudo`` is typically needed.
+
+   If you want to operate on a different user-owned location (using the ``--output`` or ``--path`` flags for generation or validation respectively), ``sudo`` can be omitted, provided the user has necessary read/write permissions for that location.
+
+   When using a custom output location, ensure your container runtime is configured to read CDI specifications from that directory. Most runtimes default to ``/etc/cdi`` and ``/var/run/cdi``.
+
+.. important::
+
+   Regenerate the CDI specification whenever you:
+
+   * Add or remove GPU devices
+   * Modify GPU partitioning or configuration
+
+Troubleshooting
+===============
+
+Containers Cannot Access GPUs
+------------------------------
+
+If containers do not see the expected GPU devices:
+
+1. **Validate the specification:**
+
+   .. code-block:: bash
+
+      sudo amd-ctk cdi validate
+
+   If the validation fails, it indicates a mismatch between the CDI specification and the actual hardware. You may need to regenerate the specification in such cases.
+
+2. **Verify runtime configuration:**
+
+   Ensure your container runtime is configured to read CDI specifications from the directory containing ``amd.json``. Check the runtime's CDI configuration settings.
+
+3. **Check file permissions:**
+
+   .. code-block:: bash
+
+      ls -l /etc/cdi/amd.json
+
+   The file should be readable by the container runtime process. If you're using a custom location, ensure the permissions allow the runtime to access it.
+
+4. **Regenerate if hardware changed:**
+
+   If you've added, removed, or reconfigured GPUs, regenerate the specification:
+
+   .. code-block:: bash
+
+      sudo amd-ctk cdi generate
+
+5. **Verify device names:**
+
+   Ensure you're using the correct CDI device names (e.g., ``amd.com/gpu=0``) while requesting devices.
+
+Validation Errors
+-----------------
+
+If ``amd-ctk cdi validate`` reports errors:
+
+* Check that GPU devices are properly detected by the system (verify with ``rocm-smi``, ``amd-smi`` or similar tools)
+* Ensure GPU drivers are correctly installed
+* Regenerate the specification to reflect the current system state
diff --git a/docs/index.md b/docs/index.md
index dd56ca7..c6ad060 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -19,5 +19,6 @@ This documentation site provides information about the AMD Container Toolkit, wh
 - [Docker Compose](container-runtime/docker-compose.rst)
 - [Enroot Pyxis Installation](container-runtime/enroot-pyxis-installation.md)
 - [Support for Docker Swarm](container-runtime/docker-swarm.md)
+- [Support for Container Device Interface](container-runtime/cdi-guide.rst)
 - [GPU Tracker](container-runtime/gpu-tracker.md)
 - [Release Notes](container-runtime/release-notes.rst)
diff --git a/docs/sphinx/_toc.yml.in b/docs/sphinx/_toc.yml.in
index 7fc9f41..a26649a 100644
--- a/docs/sphinx/_toc.yml.in
+++ b/docs/sphinx/_toc.yml.in
@@ -22,6 +22,8 @@ subtrees:
       title: Enroot Pyxis Installation
     - file: container-runtime/docker-swarm.md
       title: Support for Docker Swarm
+    - file: container-runtime/cdi-guide.rst
+      title: Support for Container Device Interface
     - file: container-runtime/gpu-tracker
       title: GPU Tracker
     - file: container-runtime/release-notes.rst
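
Note on the troubleshooting steps in ``cdi-guide.rst`` ("Verify runtime configuration" and "Check file permissions"): those two checks could be combined into a small shell helper, sketched below. This is only an illustration and not part of ``amd-ctk``; the function name ``find_cdi_spec`` is hypothetical, and the sketch assumes the specification file is named ``amd.json`` as in the guide. The readability test applies to the user running the script, which may differ from the user the container runtime runs as.

```shell
#!/usr/bin/env bash
# find_cdi_spec (hypothetical helper, not provided by amd-ctk):
# print the first readable amd.json found in the given directories.
# Returns 0 when a readable spec is found, 1 otherwise.
find_cdi_spec() {
  local dir
  for dir in "$@"; do
    # -r is true only if the file exists and is readable by the current user
    if [ -r "$dir/amd.json" ]; then
      printf '%s\n' "$dir/amd.json"
      return 0
    fi
  done
  return 1
}

# Example call, using the two default directories named in the guide:
#   find_cdi_spec /etc/cdi /var/run/cdi || echo "no readable amd.json found"
```

If the helper finds nothing in the directories the runtime is configured to scan, either the specification was generated elsewhere with ``--output`` or its permissions need adjusting.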