diff --git a/README.md b/README.md index 89ab57086..2673c0158 100644 --- a/README.md +++ b/README.md @@ -8,16 +8,19 @@ The product documentation portal can be found at: https://docs.nvidia.com/datace ## Building the Container This step is optional if your only goal is to build the documentation. -As an alternative to building the container, you can run `docker pull registry.gitlab.com/nvidia/cloud-native/cnt-docs:0.4.0`. +As an alternative to building the container, you can run `docker pull ghcr.io/nvidia/cloud-native-docs:0.5.1`. Refer to to find the most recent tag. If you change the `Dockerfile`, update `CONTAINER_RELEASE_IMAGE` in the `gitlab-ci.yml` file to the new tag and build the container. Use the `Dockerfile` in the repository (under the `docker` directory) to generate the custom doc build container. +Refer to to find the most recent tag. 1. Build the container: ```bash + git clone https://github.com/NVIDIA/cloud-native-docs.git + cd cloud-native-docs docker build --pull \ --tag cnt-doc-builder \ --file docker/Dockerfile . @@ -56,8 +59,8 @@ If you are using WSL on Windows, the URL looks like . -Additionally, the Gitlab CI for this project builds the documentation on every merge into the default branch (`master`). -The documentation from the current default branch (`master`) is available at . +The GitHub CI for this project builds the documentation on every merge into the default branch (`main`). +The documentation from the current default branch (`main`) is available at . Documentation in the default branch is under development and unstable. ## Checking for Broken Links @@ -181,7 +184,7 @@ Only tags are published to docs.nvidia.com. For a "do over," push a tag like `gpu-operator-v25.10-2`. - Always tag the openshift docset and for each new gpu-operator docset release. + Always tag the openshift docset for each new gpu-operator docset release. 1. Push the tag to the repository. @@ -203,7 +206,7 @@ If the commit message includes `/not-latest`, then only the documentation in the 1. Update `.github/workflows/docs-build.yaml` and increment the `env.TAG` value. -1. Update `.gitlab-ci.yml` and set the same value--prefixed by `ghcr.io...`--in the `variables.BUILDER_IMAGE` field. +1. Update `.gitlab-ci.yml` and set the same value (prefixed by `ghcr.io...`) in the `variables.BUILDER_IMAGE` field. 1. Optional: [Build the container and docs](#building-the-container) locally and confirm the update works as intended. @@ -215,12 +218,12 @@ If the commit message includes `/not-latest`, then only the documentation in the 1. After you merge the pull request, the `docs-build.yaml` action detects that the newly incremented `env.TAG` container is not in the registry, builds the container with that tag and pushes it to the GitHub registry. - When you tag a commit to publish, GitLab CI pulls image from the `variables.BUILDER_IMAGE` value, + When you tag a commit to publish, GitHub CI pulls image from the `variables.BUILDER_IMAGE` value, builds the documentation, and that HTML is delivered to docs.nvidia.com. ## License and Contributing This documentation repository is licensed under [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0). -Contributions are welcome. Refer to the [CONTRIBUTING.md](https://gitlab.com/nvidia/cloud-native/cnt-docs/-/blob/master/CONTRIBUTING.md) document for more +Contributions are welcome. 
Refer to the [CONTRIBUTING.md](https://github.com/NVIDIA/cloud-native-docs/blob/main/CONTRIBUTING.md) document for more information on guidelines to follow before contributions can be accepted. diff --git a/container-toolkit/arch-overview.md b/container-toolkit/arch-overview.md index 10835da5b..69735f161 100644 --- a/container-toolkit/arch-overview.md +++ b/container-toolkit/arch-overview.md @@ -78,7 +78,7 @@ This component is included in the `nvidia-container-toolkit` package. This component includes an executable that implements the interface required by a `runC` `prestart` hook. This script is invoked by `runC` after a container has been created, but before it has been started, and is given access to the `config.json` associated with the container -(e.g. this [config.json](https://github.com/opencontainers/runtime-spec/blob/master/config.md#configuration-schema-example=) ). It then takes +(such as this [config.json](https://github.com/opencontainers/runtime-spec/blob/master/config.md#configuration-schema-example=) ). It then takes information contained in the `config.json` and uses it to invoke the `nvidia-container-cli` CLI with an appropriate set of flags. One of the most important flags being which specific GPU devices should be injected into the container. @@ -111,7 +111,7 @@ To use Kubernetes with Docker, you need to configure the Docker `daemon.json` to a reference to the NVIDIA Container Runtime and set this runtime as the default. The NVIDIA Container Toolkit contains a utility to update this file as highlighted in the `docker`-specific installation instructions. -See the {doc}`install-guide` for more information on installing the NVIDIA Container Toolkit on various Linux distributions. +Refer to the {doc}`install-guide` for more information on installing the NVIDIA Container Toolkit on various Linux distributions. ### Package Repository @@ -130,7 +130,7 @@ For the different components: :::{note} As of the release of version `1.6.0` of the NVIDIA Container Toolkit the packages for all components are -published to the `libnvidia-container` `repository ` listed above. For older package versions please see the documentation archives. +published to the `libnvidia-container` `repository ` listed above. For older package versions refer to the documentation archives. ::: Releases of the software are also hosted on `experimental` branch of the repository and are graduated to `stable` after test/validation. To get access to the latest diff --git a/container-toolkit/cdi-support.md b/container-toolkit/cdi-support.md index 105c0cdff..de542d36c 100644 --- a/container-toolkit/cdi-support.md +++ b/container-toolkit/cdi-support.md @@ -1,6 +1,7 @@ % Date: November 11 2022 -% Author: elezar +% Author: elezar (elezar@nvidia.com) +% Author: ArangoGutierrez (eduardoa@nvidia.com) % headings (h1/h2/h3/h4/h5) are # * = - @@ -29,54 +30,134 @@ CDI also improves the compatibility of the NVIDIA container stack with certain f - You installed an NVIDIA GPU Driver. -### Procedure +### Automatic CDI Specification Generation -Two common locations for CDI specifications are `/etc/cdi/` and `/var/run/cdi/`. -The contents of the `/var/run/cdi/` directory are cleared on boot. +As of NVIDIA Container Toolkit `v1.18.0`, the CDI specification is automatically generated and updated by a systemd service called `nvidia-cdi-refresh`. This service: -However, the path to create and use can depend on the container engine that you use. 
+
+- Automatically generates the CDI specification at `/var/run/cdi/nvidia.yaml` when:
+  - The NVIDIA Container Toolkit is installed or upgraded
+  - The NVIDIA GPU drivers are installed or upgraded
+  - The system is rebooted
 
-1. Generate the CDI specification file:
+This ensures that the CDI specifications are up to date for the current driver
+and device configuration and that the CDI devices defined in these specifications are
+available when using native CDI support in container engines such as Docker or Podman.
 
-   ```console
-   $ sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml
-   ```
-
-   The sample command uses `sudo` to ensure that the file at `/etc/cdi/nvidia.yaml` is created.
-   You can omit the `--output` argument to print the generated specification to `STDOUT`.
+Running the following command lists the available CDI devices:
+```console
+nvidia-ctk cdi list
+```
 
-   *Example Output*
+#### Known Limitations
+The `nvidia-cdi-refresh` service does not currently handle the following situations:
 
-   ```output
-   INFO[0000] Auto-detected mode as "nvml"
-   INFO[0000] Selecting /dev/nvidia0 as /dev/nvidia0
-   INFO[0000] Selecting /dev/dri/card1 as /dev/dri/card1
-   INFO[0000] Selecting /dev/dri/renderD128 as /dev/dri/renderD128
-   INFO[0000] Using driver version xxx.xxx.xx
-   ...
-   ```
+- The removal of NVIDIA GPU drivers
+- The reconfiguration of MIG devices
 
-1. (Optional) Check the names of the generated devices:
+For these scenarios, the regeneration of CDI specifications must be [manually triggered](#manual-cdi-specification-generation).
 
-   ```console
-   $ nvidia-ctk cdi list
-   ```
+#### Customizing the Automatic CDI Refresh Service
+You can customize the `nvidia-cdi-refresh` service by adding environment variables to
+`/etc/nvidia-container-toolkit/cdi-refresh.env`. These variables affect the behavior of
+the `nvidia-ctk cdi generate` command that the service runs.
 
-   The following example output is for a machine with a single GPU that does not support MIG.
+For example, to enable debug logging, update the configuration file as follows:
+```bash
+# /etc/nvidia-container-toolkit/cdi-refresh.env
+NVIDIA_CTK_DEBUG=1
+```
 
-   ```output
-   INFO[0000] Found 9 CDI devices
-   nvidia.com/gpu=all
-   nvidia.com/gpu=0
-   ```
+For a complete list of available environment variables, run `nvidia-ctk cdi generate --help`.
 
 ```{important}
-You must generate a new CDI specification after any of the following changes:
+Modifications to the environment file require a systemd reload and a restart of the
+service to take effect:
+```
+
+```console
+$ sudo systemctl daemon-reload
+$ sudo systemctl restart nvidia-cdi-refresh.service
+```
+
+#### Managing the CDI Refresh Service
+
+The `nvidia-cdi-refresh` service consists of two systemd units:
+
+- `nvidia-cdi-refresh.path`: Monitors for changes to the system and triggers the service.
+- `nvidia-cdi-refresh.service`: Generates the CDI specifications for all available devices based on
+  the default configuration and any overrides in the environment file.
+
+These units can be managed using standard systemd commands.
+
+When working as expected, the `nvidia-cdi-refresh.path` unit will be enabled and active, and the
+`nvidia-cdi-refresh.service` will be enabled and have run at least once. 
For example: + +```console +$ sudo systemctl status nvidia-cdi-refresh.path +● nvidia-cdi-refresh.path - Trigger CDI refresh on NVIDIA driver install / uninstall events + Loaded: loaded (/etc/systemd/system/nvidia-cdi-refresh.path; enabled; preset: enabled) + Active: active (waiting) since Fri 2025-06-27 06:04:54 EDT; 1h 47min ago + Triggers: ● nvidia-cdi-refresh.service +``` + +```console +$ sudo systemctl status nvidia-cdi-refresh.service +○ nvidia-cdi-refresh.service - Refresh NVIDIA CDI specification file + Loaded: loaded (/etc/systemd/system/nvidia-cdi-refresh.service; enabled; preset: enabled) + Active: inactive (dead) since Fri 2025-06-27 07:17:26 EDT; 34min ago +TriggeredBy: ● nvidia-cdi-refresh.path + Process: 1317511 ExecStart=/usr/bin/nvidia-ctk cdi generate --output=/var/run/cdi/nvidia.yaml (code=exited, status=0/SUCCESS) + Main PID: 1317511 (code=exited, status=0/SUCCESS) + CPU: 562ms +... +``` + +If these are not enabled as expected, they can be enabled by running: + +```console +$ sudo systemctl enable --now nvidia-cdi-refresh.path +$ sudo systemctl enable --now nvidia-cdi-refresh.service +``` + +#### Troubleshooting CDI Specification Generation and Resolution + +If CDI specifications for available devices are not generated / updated as expected, it is +recommended that the logs for the `nvidia-cdi-refresh.service` be checked. This can be +done by running: + +```console +$ sudo journalctl -u nvidia-cdi-refresh.service +``` + +In most cases, restarting the service should be sufficient to trigger the (re)generation +of CDI specifications: + +```console +$ sudo systemctl restart nvidia-cdi-refresh.service +``` -- You change the device or CUDA driver configuration. -- You use a location such as `/var/run/cdi` that is cleared on boot. +Running: -A configuration change can occur when MIG devices are created or removed, or when the driver is upgraded. +```console +$ nvidia-ctk --debug cdi list +``` +will show a list of available CDI Devices as well as any errors that may have +occurred when loading CDI Specifications from `/etc/cdi` or `/var/run/cdi`. + +### Manual CDI Specification Generation + +As of the NVIDIA Container Toolkit `v1.18.0` the recommended mechanism to regenerate CDI specifications is to restart the `nvidia-cdi-refresh.service`: + +```console +$ sudo systemctl restart nvidia-cdi-refresh.service +``` + +If this does not work, or more flexibility is required, the `nvidia-ctk cdi generate` command +can be used directly: + +```console +$ sudo nvidia-ctk cdi generate --output=/var/run/cdi/nvidia.yaml ``` ## Running a Workload with CDI diff --git a/container-toolkit/docker-specialized.md b/container-toolkit/docker-specialized.md index 77ab31bf6..fab8d6674 100644 --- a/container-toolkit/docker-specialized.md +++ b/container-toolkit/docker-specialized.md @@ -206,7 +206,7 @@ The supported constraints are provided below: - constraint on the compute architectures of the selected GPUs. * - ``brand`` - - constraint on the brand of the selected GPUs (e.g. GeForce, Tesla, GRID). + - constraint on the brand of the selected GPUs (such as GeForce, Tesla, GRID). 
``` Multiple constraints can be expressed in a single environment variable: space-separated constraints are ORed, diff --git a/container-toolkit/index.md b/container-toolkit/index.md index 6c07e25c4..de4a35d14 100644 --- a/container-toolkit/index.md +++ b/container-toolkit/index.md @@ -35,5 +35,5 @@ The NVIDIA Container Toolkit is a collection of libraries and utilities enabling ## License The NVIDIA Container Toolkit (and all included components) is licensed under [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0) and -contributions are accepted with a Developer Certificate of Origin (DCO). See the [contributing](https://github.com/NVIDIA/nvidia-container-toolkit/blob/master/CONTRIBUTING.md) document for +contributions are accepted with a Developer Certificate of Origin (DCO). Refer to the [contributing](https://github.com/NVIDIA/nvidia-container-toolkit/blob/master/CONTRIBUTING.md) document for more information. diff --git a/container-toolkit/install-guide.md b/container-toolkit/install-guide.md index b28fc71d1..cc7edb54d 100644 --- a/container-toolkit/install-guide.md +++ b/container-toolkit/install-guide.md @@ -21,7 +21,7 @@ Alternatively, you can install the driver by [downloading](https://www.nvidia.co ```{note} There is a [known issue](troubleshooting.md#containers-losing-access-to-gpus-with-error-failed-to-initialize-nvml-unknown-error) on systems where `systemd` cgroup drivers are used that cause containers to lose access to requested GPUs when -`systemctl daemon reload` is run. Please see the troubleshooting documentation for more information. +`systemctl daemon reload` is run. Refer to the troubleshooting documentation for more information. ``` (installing-with-apt)= @@ -31,6 +31,12 @@ where `systemd` cgroup drivers are used that cause containers to lose access to ```{note} These instructions [should work](./supported-platforms.md) for any Debian-derived distribution. ``` +1. Install the prerequisites for the instructions below: + ```console + $ sudo apt-get update && sudo apt-get install -y --no-install-recommends \ + curl \ + gnupg2 + ``` 1. Configure the production repository: @@ -78,6 +84,12 @@ where `systemd` cgroup drivers are used that cause containers to lose access to These instructions [should work](./supported-platforms.md) for many RPM-based distributions. ``` +1. Install the prerequisites for the instructions below: + ```console + $ sudo dnf install -y \ + curl + ``` + 1. Configure the production repository: ```console @@ -186,8 +198,10 @@ follow these steps: $ sudo nvidia-ctk runtime configure --runtime=containerd ``` - The `nvidia-ctk` command modifies the `/etc/containerd/config.toml` file on the host. - The file is updated so that containerd can use the NVIDIA Container Runtime. + By default, the `nvidia-ctk` command creates a `/etc/containerd/conf.d/99-nvidia.toml` + drop-in config file and modifies (or creates) the `/etc/containerd/config.toml` file + to ensure that the `imports` config option is updated accordingly. The drop-in file + ensures that containerd can use the NVIDIA Container Runtime. 1. Restart containerd: @@ -201,7 +215,7 @@ No additional configuration is needed. You can just run `nerdctl run --gpus=all`, with root or without root. You do not need to run the `nvidia-ctk` command mentioned above for Kubernetes. -See also the [nerdctl documentation](https://github.com/containerd/nerdctl/blob/main/docs/gpu.md). +Refer to the [nerdctl documentation](https://github.com/containerd/nerdctl/blob/main/docs/gpu.md) for more information. 
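+
+As a quick sanity check (an illustrative sketch only: the drop-in path matches the description above, and the image tag is the CUDA `vectoradd` sample used elsewhere in this documentation), you can inspect the generated containerd drop-in file and run a test container with nerdctl:
+
+```console
+$ cat /etc/containerd/conf.d/99-nvidia.toml   # inspect the drop-in created by nvidia-ctk
+$ nerdctl run --rm --gpus=all nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1-ubuntu20.04
+```
+
+If the runtime is configured correctly, the sample prints `Test PASSED`.
+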

### Configuring CRI-O

@@ -211,8 +225,8 @@ See also the [nerdctl documentation](https://github.com/containerd/nerdctl/blob/
   $ sudo nvidia-ctk runtime configure --runtime=crio
   ```

-   The `nvidia-ctk` command modifies the `/etc/crio/crio.conf` file on the host.
-   The file is updated so that CRI-O can use the NVIDIA Container Runtime.
+   By default, the `nvidia-ctk` command creates a `/etc/crio/conf.d/99-nvidia.toml`
+   drop-in config file. The drop-in file ensures that CRI-O can use the NVIDIA Container Runtime.

1. Restart the CRI-O daemon:

@@ -229,7 +243,6 @@ See also the [nerdctl documentation](https://github.com/containerd/nerdctl/blob/

For Podman, NVIDIA recommends using [CDI](./cdi-support.md) for accessing NVIDIA devices in containers.

-
## Next Steps

- [](./sample-workload.md)
\ No newline at end of file
diff --git a/container-toolkit/release-notes.md b/container-toolkit/release-notes.md
index e3ab5bfbd..3b0d72594 100644
--- a/container-toolkit/release-notes.md
+++ b/container-toolkit/release-notes.md
@@ -8,6 +8,65 @@
This document describes the new features, improvements, fixes and known issues for the NVIDIA Container Toolkit.

+## NVIDIA Container Toolkit 1.18.0
+
+This release of the NVIDIA Container Toolkit `v1.18.0` is a feature release with the following high-level changes:
+- The default mode of the NVIDIA Container Runtime has been updated to make use
+  of a just-in-time-generated CDI specification instead of defaulting to the `legacy` mode.
+- Added a systemd unit to generate CDI specifications for available devices automatically. This allows
+  native CDI support in container engines such as Docker and Podman to be used without additional steps.
+
+### Deprecation Notices
+- The OCI `hook`-based config mode for cri-o is now deprecated. Updating the cri-o config through a
+  drop-in config file is now the recommended mechanism to configure this container engine.
+- The `chmod` CDI hook is deprecated. It was implemented as a workaround for a `crun` issue that has
+  been resolved for some time now. The inclusion of this hook can still be
+  triggered when explicitly generating CDI specifications.
+- The `legacy` mode of the NVIDIA Container Runtime is deprecated. It is no longer the _default_ mode
+  when the `nvidia-container-runtime` is used. It is still supported for use cases where it is
+  _required_.
+
+### Packaging Changes
+- The Container Toolkit now requires that the version of the `libnvidia-container*` libraries _exactly_ match the
+  version of the `nvidia-container-toolkit*` packages.
+- This release no longer publishes `ppc64le` packages.
+
+### Fixes and Features
+- Added automatic generation of CDI specifications for available devices.
+- Update the behaviour of the `update-ldcache` hook to ALWAYS create an ldcache in the container
+  even if ldconfig is not present in the container being run.
+- Disable the injection of the `chmod` CDI hook by default. The inclusion of this hook can still be
+  triggered when explicitly generating CDI specifications.
+- The generated CDI specification will include `.so` (development) symlinks for ALL driver libraries
+  if these exist on the host.
+- The `nvidia-ctk cdi generate` command loads select settings from the `config.toml` file when generating
+  CDI specifications.
+- Allow CDI hooks to be explicitly disabled or enabled when using the `nvidia-ctk cdi generate` command
+  or the `nvcdi` API. 
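+
+As an illustrative post-upgrade check (not part of the upstream change list; both commands are described on the CDI support page of this documentation), you can confirm that automatic CDI specification generation is active and that devices are visible:
+
+```console
+$ sudo systemctl status nvidia-cdi-refresh.path nvidia-cdi-refresh.service   # refresh units should be enabled
+$ nvidia-ctk cdi list                                                        # devices from /var/run/cdi/nvidia.yaml
+```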
+ +#### Enhancements to libnvidia-container +- Add clock_gettime to the set of allowed syscalls under seccomp. This allows ldconfig from distributions + such as Arch Linux to be run in the container. + +#### Enhancements to container-toolkit Container Images +- Switched to a single image (based on a distroless base image) for all target plaforms. +- Default to use drop-in config files to add `nvidia` runtime definitions to containerd and cri-o. + +### Included Packages + +The following packages are included: + +- `nvidia-container-toolkit 1.18.0` +- `nvidia-container-toolkit-base 1.18.0` +- `libnvidia-container-tools 1.18.0` +- `libnvidia-container1 1.18.0` + +The following `container-toolkit` containers are included: + +- `nvcr.io/nvidia/k8s/container-toolkit:v1.18.0` +- `nvcr.io/nvidia/k8s/container-toolkit:v1.18.0-packaging` + + ## NVIDIA Container Toolkit 1.17.8 This release of the NVIDIA Container Toolkit `v1.17.8` is a bugfix release. @@ -44,7 +103,7 @@ This release of the NVIDIA Container Toolkit `v1.17.7` is a bugfix and minor fea ### Fixes and Features - Fixed mode detection on Thor-based systems. With this change, the runtime mode correctly resolves to `csv`. - Fixed the resolution of libraries in the LDCache on ARM. This fixes CDI spec generation on ARM-based systems using NVML. -- Added a `nvidia-container-runtime-modes.legacy.cuda-compat-mode` option to provide finer control of how CUDA Forward Compatibility is handled. The default value (`ldconfig`) fixes CUDA Compatibility Support in cases where only the NVIDIA Container Runtime Hook is used (e.g. the Docker `--gpus` command line flag). +- Added a `nvidia-container-runtime-modes.legacy.cuda-compat-mode` option to provide finer control of how CUDA Forward Compatibility is handled. The default value (`ldconfig`) fixes CUDA Compatibility Support in cases where only the NVIDIA Container Runtime Hook is used (such as the Docker `--gpus` command line flag). - Improved the `update-ldcache` hook to run in isolated namespaces. This improves hook security. @@ -255,7 +314,7 @@ The following packages are included: - `libnvidia-container-tools 1.17.2` - `libnvidia-container1 1.17.2` -The following `container-toolkit` conatiners are included: +The following `container-toolkit` containers are included: - `nvcr.io/nvidia/k8s/container-toolkit:v1.17.2-ubi8` - `nvcr.io/nvidia/k8s/container-toolkit:v1.17.2-ubuntu20.04` (also as `nvcr.io/nvidia/k8s/container-toolkit:v1.17.2`) @@ -327,7 +386,7 @@ The following `container-toolkit` conatiners are included: - Added support for requesting IMEX channels as volume mounts. - Added a `disable-imex-channel-creation` feature flag to disable the creation of IMEX channel device nodes when creating a container. - Added IMEX channel device nodes to the CDI specifications in `management` mode. -- Added the creation of select driver symlinks (e.g. `libcuda.so`) in CDI specification generation to match the behavior in the `legacy` mode. +- Added the creation of select driver symlinks (such as `libcuda.so`) in CDI specification generation to match the behavior in the `legacy` mode. ### Enhancements to container-toolkit Container Images @@ -370,7 +429,7 @@ The following `container-toolkit` conatiners are included: ### Fixes and Features - Excluded `libnvidia-allocator` from graphics mounts. This fixes a bug that leaks mounts when a container is started with bi-directional mount propagation. -- Used empty string for default `runtime-config-override`. This removes a redundant warning for runtimes (e.g. 
Docker) where this is not applicable. +- Used empty string for default `runtime-config-override`. This removes a redundant warning for runtimes (such as Docker) where this is not applicable. ### Enhancements to container-toolkit Container Images @@ -807,7 +866,7 @@ The following `container-toolkit` containers are included: ### Fixes and Features -- Fixed a bug which would cause the update of an ldcache in the container to fail for images that do no use ldconfig (e.g. `busybox`). +- Fixed a bug which would cause the update of an ldcache in the container to fail for images that do no use ldconfig (such as `busybox`). - Fixed a bug where a failure to determine the CUDA driver version would cause the container to fail to start if `NVIDIA_DRIVER_CAPABILITIES` included `graphics` or `display` on Debian systems. - Fixed CDI specification generation on Debian systems. @@ -1001,7 +1060,7 @@ Note that this release does not include an update to `nvidia-docker2` and is com - Add `cdi` mode to NVIDIA Container Runtime - Add discovery of GPUDirect Storage (`nvidia-fs*`) devices if the `NVIDIA_GDS` environment variable of the container is set to `enabled` - Add discovery of MOFED Infiniband devices if the `NVIDIA_MOFED` environment variable of the container is set to `enabled` -- Add `nvidia-ctk runtime configure` command to configure the Docker config file (e.g. `/etc/docker/daemon.json`) for use with the NVIDIA Container Runtime. +- Add `nvidia-ctk runtime configure` command to configure the Docker config file (such as `/etc/docker/daemon.json`) for use with the NVIDIA Container Runtime. #### specific to libnvidia-container @@ -1086,7 +1145,7 @@ The following packages have also been updated to depend on `nvidia-container-too - Bump `libtirpc` to `1.3.2` - Fix bug when running host ldconfig using glibc compiled with a non-standard prefix - Add `libcudadebugger.so` to list of compute libraries -- \[WSL2\] Fix segmentation fault on WSL2s system with no adpaters present (e.g. `/dev/dxg` missing) +- \[WSL2\] Fix segmentation fault on WSL2s system with no adpaters present (such as `/dev/dxg` missing) - Ignore pending MIG mode when checking if a device is MIG enabled - \[WSL2\] Fix bug where `/dev/dxg` is not mounted when `NVIDIA_DRIVER_CAPABILITIES` does not include "compute" diff --git a/container-toolkit/sample-workload.md b/container-toolkit/sample-workload.md index 3b19550a7..fe6be7444 100644 --- a/container-toolkit/sample-workload.md +++ b/container-toolkit/sample-workload.md @@ -21,7 +21,7 @@ you can verify your installation by running a sample workload. ## Running a Sample Workload with Podman -After you install and configura the toolkit (including [generating a CDI specification](cdi-support.md)) and install an NVIDIA GPU Driver, +After you install and configure the toolkit (including [generating a CDI specification](cdi-support.md)) and install an NVIDIA GPU Driver, you can verify your installation by running a sample workload. - Run a sample CUDA container: diff --git a/container-toolkit/supported-platforms.md b/container-toolkit/supported-platforms.md index af7401240..cc7efabee 100644 --- a/container-toolkit/supported-platforms.md +++ b/container-toolkit/supported-platforms.md @@ -22,17 +22,16 @@ Recent NVIDIA Container Toolkit releases are tested and expected to work on thes | Ubuntu 24.04 | X | | X | -## Please report issues +## Report issues Our qualification-testing procedures are constantly evolving and we might miss -certain problems. 
Please -[report](https://github.com/NVIDIA/nvidia-container-toolkit/issues) issues in +certain problems. [Report](https://github.com/NVIDIA/nvidia-container-toolkit/issues) issues in particular as they occur on a platform listed above. ## Other Linux distributions -Releases may work on more platforms than indicated in the table above (e.g., on distribution versions older and newer than listed). +Releases may work on more platforms than indicated in the table above (such as on distribution versions older and newer than listed). Give things a try and we invite you to [report](https://github.com/NVIDIA/nvidia-container-toolkit/issues) any issue observed even if your Linux distribution is not listed. ---- diff --git a/container-toolkit/troubleshooting.md b/container-toolkit/troubleshooting.md index c8588ddfd..813e0c80d 100644 --- a/container-toolkit/troubleshooting.md +++ b/container-toolkit/troubleshooting.md @@ -124,9 +124,9 @@ Review the SELinux policies on your system. ## Containers losing access to GPUs with error: "Failed to initialize NVML: Unknown Error" -When using the NVIDIA Container Runtime Hook (i.e. the Docker `--gpus` flag or +When using the NVIDIA Container Runtime Hook (that is, the Docker `--gpus` flag or the NVIDIA Container Runtime in `legacy` mode) to inject requested GPUs and driver -libraries into a container, the hook makes modifications, including setting up cgroup access, to the container without the low-level runtime (e.g. `runc`) being aware of these changes. +libraries into a container, the hook makes modifications, including setting up cgroup access, to the container without the low-level runtime (such as `runc`) being aware of these changes. The result is that updates to the container may remove access to the requested GPUs. When the container loses access to the GPU, you will see the following error message from the console output: diff --git a/container-toolkit/versions.json b/container-toolkit/versions.json index aef68cb71..e75f6b450 100644 --- a/container-toolkit/versions.json +++ b/container-toolkit/versions.json @@ -1,7 +1,10 @@ { - "latest": "1.17.8", + "latest": "1.18.0", "versions": [ + { + "version": "1.18.0" + }, { "version": "1.17.8" }, diff --git a/container-toolkit/versions1.json b/container-toolkit/versions1.json index bed065609..11387480a 100644 --- a/container-toolkit/versions1.json +++ b/container-toolkit/versions1.json @@ -1,6 +1,10 @@ [ { "preferred": "true", + "url": "../1.18.0", + "version": "1.18.0" + }, + { "url": "../1.17.8", "version": "1.17.8" }, diff --git a/gpu-operator/getting-started.rst b/gpu-operator/getting-started.rst index e0098a994..4fa422858 100644 --- a/gpu-operator/getting-started.rst +++ b/gpu-operator/getting-started.rst @@ -284,7 +284,7 @@ To view all the options, run ``helm show values nvidia/gpu-operator``. - Specifies the default type of workload for the cluster, one of ``container``, ``vm-passthrough``, or ``vm-vgpu``. Setting ``vm-passthrough`` or ``vm-vgpu`` can be helpful if you plan to run all or mostly virtual machines in your cluster. - Refer to :doc:`KubeVirt `, :doc:`Kata Containers `, or :doc:`Confidential Containers `. + Refer to :doc:`KubeVirt `, or :doc:`Kata Containers `. - ``container`` * - ``toolkit.enabled`` @@ -435,7 +435,7 @@ If you want to use custom driver container images, such as version 465.27, then you can build a custom driver container image. Follow these steps: - Rebuild the driver container by specifying the ``$DRIVER_VERSION`` argument when building the Docker image. 
For - reference, the driver container Dockerfiles are available on the Git repository at https://gitlab.com/nvidia/container-images/driver. + reference, the driver container Dockerfiles are available on the Git repository at https://github.com/NVIDIA/gpu-driver-container/. - Build the container using the appropriate Dockerfile. For example: .. code-block:: console diff --git a/gpu-operator/gpu-operator-confidential-containers.rst b/gpu-operator/gpu-operator-confidential-containers.rst deleted file mode 100644 index 0073e446d..000000000 --- a/gpu-operator/gpu-operator-confidential-containers.rst +++ /dev/null @@ -1,809 +0,0 @@ -.. license-header - SPDX-FileCopyrightText: Copyright (c) 2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved. - SPDX-License-Identifier: Apache-2.0 - - Licensed under the Apache License, Version 2.0 (the "License"); - you may not use this file except in compliance with the License. - You may obtain a copy of the License at - - http://www.apache.org/licenses/LICENSE-2.0 - - Unless required by applicable law or agreed to in writing, software - distributed under the License is distributed on an "AS IS" BASIS, - WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - See the License for the specific language governing permissions and - limitations under the License. - -.. headings (h1/h2/h3/h4/h5) are # * = - - -################################################## -GPU Operator with Confidential Containers and Kata -################################################## - - -***************************************** -About Support for Confidential Containers -***************************************** - -.. note:: Technology Preview features are not supported in production environments - and are not functionally complete. - Technology Preview features provide early access to upcoming product features, - enabling customers to test functionality and provide feedback during the development process. - These releases may not have any documentation, and testing is limited. - -Confidential containers is the cloud-native approach of confidential computing. -Confidential computing extends the practice of securing data in transit and data at rest by -adding the practice of securing data in use. - -Confidential computing is a technology that isolates sensitive data in NVIDIA GPUs and a protected CPU enclave during processing. -Confidential computing relies on hardware features such as Intel SGX, Intel TDX, and AMD SEV to provide the *trusted execution environment* (TEE). -The TEE provides embedded encryption keys and an embedded attestation mechanism to ensure that keys are only accessible by authorized application code. - -The following high-level diagram shows some fundamental concepts for confidential containers with the NVIDIA GPU Operator: - -- containerd is configured to run a Kata runtime to start virtual machines. -- Kata starts the virtual machines using an NVIDIA optimized Linux kernel and NVIDIA provided initial RAM disk -- Before the containers run in the virtual machine, a guest pre-start hook runs the local verifier - that is part of the NVIDIA Attestation SDK. - -.. 
figure:: ./graphics/gpu-op-confidential-containers.svg - :width: 920px - - High-Level Logical Diagram of Software Components and Communication Paths - -************ -Requirements -************ - -Refer to the *Confidential Computing Deployment Guide* at the -`https://docs.nvidia.com/confidential-computing `__ website -for information about supported NVIDIA GPUs, such as the NVIDIA Hopper H100. - -The following topics in the deployment guide apply to a cloud-native environment: - -* Hardware selection and initial hardware configuration, such as BIOS settings. - -* Host operating system selection, initial configuration, and validation. - -The remaining configuration topics in the deployment guide do not apply to a cloud-native environment. -NVIDIA GPU Operator performs the actions that are described in these topics. - - -*********************** -Key Software Components -*********************** - -NVIDIA GPU Operator brings together the following software components to -simplify managing the software required for confidential computing and deploying confidential container workloads: - -Confidential Containers Operator - The Operator manages installing and deploying a runtime that can run Kata Containers with QEMU. - -NVIDIA Kata Manager for Kubernetes - GPU Operator deploys NVIDIA Kata Manager for Kubernetes, ``k8s-kata-manager``. - The manager performs the following functions: - - * Manages the ``kata-qemu-nvidia-gpu-snp`` runtime class. - * Configures containerd to use the runtime class. - * Manages the Kata artifacts such as Linux kernel images and initial RAM disks. - -NVIDIA Confidential Computing Manager for Kubernetes - GPU Operator deploys the manager, ``k8s-cc-manager``, to set the confidential computing mode on the NVIDIA GPUs. - -Node Feature Discovery (NFD) - When you install NVIDIA GPU Operator for confidential computing, you must specify the ``nfd.nodefeaturerules=true`` option. - This option directs the Operator to install node feature rules that detect CPU security features and the NVIDIA GPU hardware. - You can confirm the rules are installed by running ``kubectl get nodefeaturerules nvidia-nfd-node-featurerules``. - - On nodes that have an NVIDIA Hopper family GPU and either Intel TDX or AMD SEV-SNP, NFD adds labels to the node - such as ``"feature.node.kubernetes.io/cpu-security.sev.snp.enabled": "true"`` and ``"nvidia.com/cc.capable": "true"``. - NVIDIA GPU Operator only deploys the operands for confidential containers on nodes that have the - ``"nvidia.com/cc.capable": "true"`` label. - - -About NVIDIA Confidential Computing Manager -=========================================== - -You can set the default confidential computing mode of the NVIDIA GPUs by setting the -``ccManager.defaultMode=`` option. -The default value is ``off``. -You can set this option when you install NVIDIA GPU Operator or afterward by modifying the -``cluster-policy`` instance of the ``ClusterPolicy`` object. - -When you change the mode, the manager performs the following actions: - -* Evicts the other GPU Operator operands from the node. - - However, the manager does not drain user workloads. - You must make sure ensure that no user workloads running on the node before you change the mode. - -* Unbinds the GPU from the VFIO PCI device driver. - -* Changes the mode and resets the GPU. - -* Reschedules the other GPU Operator operands. 
- - -NVIDIA Confidential Computing Manager Configuration -=================================================== - -The following part of the cluster policy shows the fields related to the manager: - -.. code-block:: yaml - - ccManager: - enabled: true - defaultMode: "off" - repository: nvcr.io/nvidia/cloud-native - image: k8s-cc-manager - version: v0.1.0 - imagePullPolicy: IfNotPresent - imagePullSecrets: [] - env: - - name: CC_CAPABLE_DEVICE_IDS - value: "0x2331,0x2322" - resources: {} - - -**************************** -Limitations and Restrictions -**************************** - -* GPUs are available to containers as a single GPU in passthrough mode only. - Multi-GPU passthrough and vGPU are not supported. - -* Support is limited to initial installation and configuration only. - Upgrade and configuration of existing clusters to configure confidential computing is not supported. - -* Support for confidential computing environments is limited to the implementation described on this page. - -* NVIDIA supports the Operator and confidential computing with the containerd runtime only. - -* The Operator supports performing local attestation only. - - -******************************* -Cluster Topology Considerations -******************************* - -You can configure all the worker nodes in your cluster for confidential containers or you configure some -nodes for confidential containers and the others for traditional containers. -Consider the following example. - -Node A is configured to run traditional containers. - -Node B is configured to run confidential containers. - -Node A receives the following software components: - -- ``NVIDIA Driver Manager for Kubernetes`` -- to install the data-center driver. -- ``NVIDIA Container Toolkit`` -- to ensure that containers can access GPUs. -- ``NVIDIA Device Plugin for Kubernetes`` -- to discover and advertise GPU resources to kubelet. -- ``NVIDIA DCGM and DCGM Exporter`` -- to monitor GPUs. -- ``NVIDIA MIG Manager for Kubernetes`` -- to manage MIG-capable GPUs. -- ``Node Feature Discovery`` -- to detect CPU, kernel, and host features and label worker nodes. -- ``NVIDIA GPU Feature Discovery`` -- to detect NVIDIA GPUs and label worker nodes. - -Node B receives the following software components: - -- ``NVIDIA Kata Manager for Kubernetes`` -- to manage the NVIDIA artifacts such as the - NVIDIA optimized Linux kernel image and initial RAM disk. -- ``NVIDIA Confidential Computing Manager for Kubernetes`` -- to manage the confidential - computing mode of the NVIDIA GPU on the node. -- ``NVIDIA Sandbox Device Plugin`` -- to discover and advertise the passthrough GPUs to kubelet. -- ``NVIDIA VFIO Manager`` -- to load the vfio-pci device driver and bind it to all GPUs on the node. -- ``Node Feature Discovery`` -- to detect CPU security features, NVIDIA GPUs, and label worker nodes. - - -************* -Prerequisites -************* - -* Refer to the *Confidential Computing Deployment Guide* for the following prerequisites: - - * You selected and configured your hardware and BIOS to support confidential computing. - - * You installed and configured an operating system to support confidential computing. - - * You validated that the Linux kernel is SNP-aware. - -* Your hosts are configured to enable hardware virtualization and Access Control Services (ACS). - With some AMD CPUs and BIOSes, ACS might be grouped under Advanced Error Reporting (AER). - Enabling these features is typically performed by configuring the host BIOS. 
- -* Your hosts are configured to support IOMMU. - - If the output from running ``ls /sys/kernel/iommu_groups`` includes ``0``, ``1``, and so on, - then your host is configured for IOMMU. - - If the host is not configured or you are unsure, add the ``intel_iommu=on`` Linux kernel command-line argument. - For most Linux distributions, you add the argument to the ``/etc/default/grub`` file: - - .. code-block:: text - - ... - GRUB_CMDLINE_LINUX_DEFAULT="quiet intel_iommu=on modprobe.blacklist=nouveau" - ... - - On Ubuntu systems, run ``sudo update-grub`` after making the change to configure the bootloader. - On other systems, you might need to run ``sudo dracut`` after making the change. - Refer to the documentation for your operating system. - Reboot the host after configuring the bootloader. - -* You have a Kubernetes cluster and you have cluster administrator privileges. - - -****************************************** -Overview of Installation and Configuration -****************************************** - -Installing and configuring your cluster to support the NVIDIA GPU Operator with confidential containers is as follows: - -#. Label the worker nodes that you want to use with confidential containers. - - This step ensures that you can continue to run traditional container workloads with GPU or vGPU workloads on some nodes in your cluster. - Alternatively, you can set the default sandbox workload to ``vm-passthrough`` to run confidential containers on all worker nodes when you install the GPU Operator. - -#. Install the Confidential Containers Operator. - - This step installs the Operator and also the Kata Containers runtime that NVIDIA uses for confidential containers. - -#. Install the NVIDIA GPU Operator. - - You install the Operator and specify options to deploy the operands that are required for confidential containers. - -After installation, you can change the confidential computing mode and run a sample workload. - -.. |project-name| replace:: Confidential Containers - -.. start-install-coco-operator - -******************************************** -Install the Confidential Containers Operator -******************************************** - -Perform the following steps to install and verify the Confidential Containers Operator: - -#. Label the nodes to run virtual machines in containers. - Label only the nodes that you want to run with |project-name|. - - .. code-block:: console - - $ kubectl label node nvidia.com/gpu.workload.config=vm-passthrough - -#. Set the Operator version in an environment variable: - - .. code-block:: console - - $ export VERSION=v0.7.0 - -#. Install the Operator: - - .. code-block:: console - - $ kubectl apply -k "github.com/confidential-containers/operator/config/release?ref=${VERSION}" - - *Example Output* - - .. 
code-block:: output - - namespace/confidential-containers-system created - customresourcedefinition.apiextensions.k8s.io/ccruntimes.confidentialcontainers.org created - serviceaccount/cc-operator-controller-manager created - role.rbac.authorization.k8s.io/cc-operator-leader-election-role created - clusterrole.rbac.authorization.k8s.io/cc-operator-manager-role created - clusterrole.rbac.authorization.k8s.io/cc-operator-metrics-reader created - clusterrole.rbac.authorization.k8s.io/cc-operator-proxy-role created - rolebinding.rbac.authorization.k8s.io/cc-operator-leader-election-rolebinding created - clusterrolebinding.rbac.authorization.k8s.io/cc-operator-manager-rolebinding created - clusterrolebinding.rbac.authorization.k8s.io/cc-operator-proxy-rolebinding created - configmap/cc-operator-manager-config created - service/cc-operator-controller-manager-metrics-service created - deployment.apps/cc-operator-controller-manager create - -#. Optional: View the pods and services in the ``confidential-containers-system`` namespace: - - .. code-block:: console - - $ kubectl get pod,svc -n confidential-containers-system - - *Example Output* - - .. code-block:: output - - NAME READY STATUS RESTARTS AGE - pod/cc-operator-controller-manager-c98c4ff74-ksb4q 2/2 Running 0 2m59s - - NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE - service/cc-operator-controller-manager-metrics-service ClusterIP 10.98.221.141 8443/TCP 2m59s - -#. Install the sample Confidential Containers runtime by creating the manifests and then editing the node selector so - that the runtime is installed only on the labelled nodes. - - #. Create a local copy of the manifests in a file that is named ``ccruntime.yaml``: - - .. code-block:: console - - $ kubectl apply --dry-run=client -o yaml \ - -k "github.com/confidential-containers/operator/config/samples/ccruntime/default?ref=${VERSION}" > ccruntime.yaml - - #. Edit the ``ccruntime.yaml`` file and set the node selector as follows: - - .. code-block:: yaml - - apiVersion: confidentialcontainers.org/v1beta1 - kind: CcRuntime - metadata: - annotations: - ... - spec: - ccNodeSelector: - matchLabels: - nvidia.com/gpu.workload.config: "vm-passthrough" - ... - - #. Apply the modified manifests: - - .. code-block:: console - - $ kubectl apply -f ccruntime.yaml - - *Example Output* - - .. code-block:: output - - ccruntime.confidentialcontainers.org/ccruntime-sample created - - Wait a few minutes for the Operator to create the base runtime classes. - -#. Optional: View the runtime classes: - - .. code-block:: console - - $ kubectl get runtimeclass - - *Example Output* - - .. code-block:: output - - NAME HANDLER AGE - kata kata 13m - kata-clh kata-clh 13m - kata-clh-tdx kata-clh-tdx 13m - kata-qemu kata-qemu 13m - kata-qemu-sev kata-qemu-sev 13m - kata-qemu-snp kata-qemu-snp 13m - kata-qemu-tdx kata-qemu-tdx 13m - -.. end-install-coco-operator - -******************************* -Install the NVIDIA GPU Operator -******************************* - -Procedure -========= - -Perform the following steps to install the Operator for use with confidential containers: - -#. Add and update the NVIDIA Helm repository: - - .. code-block:: console - - $ helm repo add nvidia https://helm.ngc.nvidia.com/nvidia \ - && helm repo update - - -#. Specify at least the following options when you install the Operator. - If you want to run |project-name| by default on all worker nodes, also specify ``--set sandboxWorkloads.defaultWorkload=vm-passthough``. - - .. 
code-block:: console - - $ helm install --wait --generate-name \ - -n gpu-operator --create-namespace \ - nvidia/gpu-operator \ - --version=${version} \ - --set sandboxWorkloads.enabled=true \ - --set kataManager.enabled=true \ - --set ccManager.enabled=true \ - --set nfd.nodefeaturerules=true - - - *Example Output* - - .. code-block:: output - - NAME: gpu-operator - LAST DEPLOYED: Tue Jul 25 19:19:07 2023 - NAMESPACE: gpu-operator - STATUS: deployed - REVISION: 1 - TEST SUITE: None - - -Verification -============ - -#. Verify that the Kata Manager, Confidential Computing Manager, and VFIO Manager operands are running: - - .. code-block:: console - - $ kubectl get pods -n gpu-operator - - *Example Output* - - .. code-block:: output - :emphasize-lines: 5,6,9 - - NAME READY STATUS RESTARTS AGE - gpu-operator-57bf5d5769-nb98z 1/1 Running 0 6m21s - gpu-operator-node-feature-discovery-master-b44f595bf-5sjxg 1/1 Running 0 6m21s - gpu-operator-node-feature-discovery-worker-lwhdr 1/1 Running 0 6m21s - nvidia-cc-manager-yzbw7 1/1 Running 0 3m36s - nvidia-kata-manager-bw5mb 1/1 Running 0 3m36s - nvidia-sandbox-device-plugin-daemonset-cr4s6 1/1 Running 0 2m37s - nvidia-sandbox-validator-9wjm4 1/1 Running 0 2m37s - nvidia-vfio-manager-vg4wp 1/1 Running 0 3m36s - -#. Verify that the ``kata-qemu-nvidia-gpu`` and ``kata-qemu-nvidia-gpu-snp`` runtime classes are available: - - .. code-block:: console - - $ kubectl get runtimeclass - - *Example Output* - - .. code-block:: output - :emphasize-lines: 6, 7 - - NAME HANDLER AGE - kata kata 37m - kata-clh kata-clh 37m - kata-clh-tdx kata-clh-tdx 37m - kata-qemu kata-qemu 37m - kata-qemu-nvidia-gpu kata-qemu-nvidia-gpu 96s - kata-qemu-nvidia-gpu-snp kata-qemu-nvidia-gpu-snp 96s - kata-qemu-sev kata-qemu-sev 37m - kata-qemu-snp kata-qemu-snp 37m - kata-qemu-tdx kata-qemu-tdx 37m - nvidia nvidia 97s - - -#. Optional: If you have host access to the worker node, you can perform the following steps: - - #. Confirm that the host uses the ``vfio-pci`` device driver for GPUs: - - .. code-block:: console - - $ lspci -nnk -d 10de: - - *Example Output* - - .. code-block:: output - :emphasize-lines: 3 - - 65:00.0 3D controller [0302]: NVIDIA Corporation xxxxxxx [xxx] [10de:xxxx] (rev xx) - Subsystem: NVIDIA Corporation xxxxxxx [xxx] [10de:xxxx] - Kernel driver in use: vfio-pci - Kernel modules: nvidiafb, nouveau - - #. Confirm that NVIDIA Kata Manager installed the ``kata-qemu-nvidia-gpu-snp`` runtime class files: - - .. code-block:: console - - $ ls -1 /opt/nvidia-gpu-operator/artifacts/runtimeclasses/kata-qemu-nvidia-gpu-snp/ - - *Example Output* - - .. code-block:: output - - 5.19.2.tar.gz - config-5.19.2-109-nvidia-gpu-sev - configuration-kata-qemu-nvidia-gpu-snp.toml - dpkg.sbom.list - kata-ubuntu-jammy-nvidia-gpu.initrd - vmlinuz-5.19.2-109-nvidia-gpu-sev - ... - - -**************************************** -Managing the Confidential Computing Mode -**************************************** - -Three modes are supported: - -- ``on`` -- Enable confidential computing. -- ``off`` -- Disable confidential computing. -- ``devtools`` -- Development mode for software development and debugging. - -You can set a cluster-wide default mode and you can set the mode on individual nodes. -The mode that you set on a node has higher precedence than the cluster-wide default mode. - - -Setting a Cluster-Wide Default Mode -=================================== - -To set a cluster-wide mode, specify the ``ccManager.defaultMode`` field like the following example: - -.. 
code-block:: console - - $ kubectl patch clusterpolicies.nvidia.com/cluster-policy \ - -p '{"spec": {"ccManager": {"defaultMode": "on"}}}' - - -Setting a Node-Level Mode -========================= - -To set a node-level mode, apply the ``nvidia.com/cc.mode=`` label like the following example: - -.. code-block:: console - - $ kubectl label node nvidia.com/cc.mode=on --overwrite - -The mode that you set on a node has higher precedence than the cluster-wide default mode. - -Verifying a Mode Change -======================= - -To verify that changing the mode was successful, a cluster-wide or node-level change, -view the ``nvidia.com/cc.mode.state`` node label: - -.. code-block:: console - - $ kubectl get node -o json | \ - jq '.items[0].metadata.labels | with_entries(select(.key | startswith("nvidia.com/cc.mode.state)))' - -The label is set to either ``success`` or ``failed``. - - -********************* -Run a Sample Workload -********************* - -A pod specification for a confidential computing requires the following: - -* Specify the ``kata-qemu-nvidia-gpu-snp`` runtime class. - -* Specify a passthrough GPU resource. - -#. Determine the passthrough GPU resource names: - - .. code-block:: console - - kubectl get nodes -l nvidia.com/gpu.present -o json | \ - jq '.items[0].status.allocatable | - with_entries(select(.key | startswith("nvidia.com/"))) | - with_entries(select(.value != "0"))' - - *Example Output* - - .. code-block:: output - - { - "nvidia.com/GH100_H100_PCIE": "1" - } - -#. Create a file, such as ``cuda-vectoradd-coco.yaml``, like the following example: - - .. code-block:: yaml - :emphasize-lines: 6, 15 - - apiVersion: v1 - kind: Pod - metadata: - name: cuda-vectoradd-coco - annotations: - cdi.k8s.io/gpu: "nvidia.com/pgpu=0" - io.katacontainers.config.hypervisor.default_memory: "16384" - spec: - runtimeClassName: kata-qemu-nvidia-gpu-snp - restartPolicy: OnFailure - containers: - - name: cuda-vectoradd - image: "nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1-ubuntu20.04" - resources: - limits: - "nvidia.com/GH100_H100_PCIE": 1 - - The ``io.katacontainers.config.hypervisor.default_memory`` annotation starts the VM with 16 GB of memory. - Modify the value to accommodate your workload. - -#. Create the pod: - - .. code-block:: console - - $ kubectl apply -f cuda-vectoradd-coco.yaml - -#. View the logs from pod: - - .. code-block:: console - - $ kubectl logs -n default cuda-vectoradd-coco - - *Example Output* - - .. code-block:: output - - [Vector addition of 50000 elements] - Copy input data from the host memory to the CUDA device - CUDA kernel launch with 196 blocks of 256 threads - Copy output data from the CUDA device to the host memory - Test PASSED - Done - -#. Delete the pod: - - .. code-block:: console - - $ kubectl delete -f cuda-vectoradd-coco.yaml - - -Refer to :ref:`About the Pod Annotation` for information about the pod annotation. - -Troubleshooting Workloads -========================= - -If the sample workload does not run, confirm that you labelled nodes to run virtual machines in containers: - -.. code-block:: console - - $ kubectl get nodes -l nvidia.com/gpu.workload.config=vm-passthrough - -*Example Output* - -.. 
code-block:: output - - NAME STATUS ROLES AGE VERSION - kata-worker-1 Ready 10d v1.27.3 - kata-worker-2 Ready 10d v1.27.3 - kata-worker-3 Ready 10d v1.27.3 - - - -*********** -Attestation -*********** - -About Attestation -================= - -With confidential computing, *attestation* is the assertion that the hardware and -software is trustworthy. - -The Kata runtime uses the ``kata-ubuntu-jammy-nvidia-gpu.initrd`` initial RAM disk file -that NVIDIA Kata Manager for Kubernetes downloaded from NVIDIA Container Registry, nvcr.io. -The initial RAM disk includes an NVIDIA verifier tool that runs as a container guest pre-start hook. -When the attestation is successful, the GPU is set in the ``Ready`` state. -On failure, containers still start, but CUDA applications fail with a ``system not initialized`` error. - -Refer to *NVIDIA Hopper Confidential Computing Attestation Verifier* at https://docs.nvidia.com/confidential-computing -for more information about attestation. - - -Accessing the VM of a Scheduled Confidential Container -====================================================== - -You do not need to access the VM as a routine task. -Accessing the VM is useful for troubleshooting or performing lower-level verification about the confidential computing mode. - -This task requires host access to the Kubernetes node that is running the container. - -#. Determine the Kubernetes node and pod sandbox ID: - - .. code-block:: console - - $ kubectl describe pod - -#. Access the Kubernetes node. - Using secure shell is typical. - -#. Access the Kata runtime: - - .. code-block:: console - - $ kata-runtime exec - -Viewing the GPU Ready State -=========================== - -After you access the VM, you can run ``nvidia-smi conf-compute -grs``: - -.. code-block:: output - - Confidential Compute GPUs Ready state: ready - - -Viewing the Confidential Computing Mode -======================================= - -After you access the VM, you can run ``nvidia-smi conf-compute -f`` to view the mode: - -.. code-block:: output - - CC status: ON - - -Verifying That Attestation Is Successful -======================================== - -After you access the VM, you can run the following commands to verify that attestation is successful: - -.. code-block:: console - - # source /gpu-attestation/nv-venv/bin/activate - # python3 /gpu-attestation/nv_attestation_sdk/tests/SmallGPUTest.py - -*Example Output* - -.. code-block:: output - - [SmallGPUTest] node name : thisNode1 - [['LOCAL_GPU_CLAIMS', , , '', '', '']] - [SmallGPUTest] call attest() - expecting True - Number of GPUs available : 1 - ----------------------------------- - Fetching GPU 0 information from GPU driver. - VERIFYING GPU : 0 - Driver version fetched : 535.86.05 - VBIOS version fetched : 96.00.5e.00.01 - Validating GPU certificate chains. - GPU attestation report certificate chain validation successful. - The certificate chain revocation status verification successful. - Authenticating attestation report - The nonce in the SPDM GET MEASUREMENT request message is matching with the generated nonce. - Driver version fetched from the attestation report : 535.86.05 - VBIOS version fetched from the attestation report : 96.00.5e.00.01 - Attestation report signature verification successful. - Attestation report verification successful. - Authenticating the RIMs. - Authenticating Driver RIM - Schema validation passed. - driver RIM certificate chain verification successful. - The certificate chain revocation status verification successful. 
- driver RIM signature verification successful. - Driver RIM verification successful - Authenticating VBIOS RIM. - RIM Schema validation passed. - vbios RIM certificate chain verification successful. - The certificate chain revocation status verification successful. - vbios RIM signature verification successful. - VBIOS RIM verification successful - Comparing measurements (runtime vs golden) - The runtime measurements are matching with the golden measurements. - GPU is in the expected state. - GPU 0 verified successfully. - attestation result: True - claims list:: {'x-nv-gpu-availability': True, 'x-nv-gpu-attestation-report-available': ... - True - [SmallGPUTest] token : [["JWT", "eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.e..."], - {"LOCAL_GPU_CLAIMS": "eyJhbGciOiJIUzI1NiIsInR5cCI..."}] - [SmallGPUTest] call validate_token() - expecting True - True - - -Troubleshooting Attestation -=========================== - -To troubleshoot attestation failures, access the VM and view the logs in the ``/var/log/`` directory. - -To troubleshoot virtual machine failures, access the Kubernetes node and view logs with the ``journalctl`` command. - -.. code-block:: console - - $ sudo journalctl -u containerd -f - -The Kata agent communicates with the virtcontainers library on the host by using the VSOCK port. -The communication is recorded to the system journal on the host. -When you view the logs, refer to logs with a ``kata`` or ``virtcontainers`` prefix. - - -******************** -Additional Resources -******************** - -* NVIDIA Confidential Computing documentation is available at https://docs.nvidia.com/confidential-computing. - -* NVIDIA Verifier Tool is part of the nvTrust project. - Refer to https://github.com/NVIDIA/nvtrust/tree/main/guest_tools/gpu_verifiers/local_gpu_verifier - for more information. - diff --git a/gpu-operator/gpu-operator-kata.rst b/gpu-operator/gpu-operator-kata.rst index bb05c2c35..b8ef34290 100644 --- a/gpu-operator/gpu-operator-kata.rst +++ b/gpu-operator/gpu-operator-kata.rst @@ -223,10 +223,6 @@ Installing and configuring your cluster to support the NVIDIA GPU Operator with This step ensures that you can continue to run traditional container workloads with GPU or vGPU workloads on some nodes in your cluster. Alternatively, you can set the default sandbox workload to ``vm-passthrough`` to run confidential containers on all worker nodes. -#. Install the Confidential Containers Operator. - - This step installs the Operator and also the Kata Containers runtime that NVIDIA uses for Kata Containers. - #. Install the NVIDIA GPU Operator. You install the Operator and specify options to deploy the operands that are required for Kata Containers. @@ -235,10 +231,6 @@ After installation, you can run a sample workload. .. |project-name| replace:: Kata Containers -.. include:: gpu-operator-confidential-containers.rst - :start-after: start-install-coco-operator - :end-before: end-install-coco-operator - ******************************* Install the NVIDIA GPU Operator diff --git a/gpu-operator/gpu-operator-kubevirt.rst b/gpu-operator/gpu-operator-kubevirt.rst index 09824259a..4e97bcc32 100644 --- a/gpu-operator/gpu-operator-kubevirt.rst +++ b/gpu-operator/gpu-operator-kubevirt.rst @@ -534,46 +534,39 @@ Open a terminal and clone the driver container image repository. .. 
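Before you run the clone commands that follow, you can optionally confirm that the package you extracted from the vGPU software ZIP file contains the KVM vGPU Manager installer, because a later step copies that ``.run`` file into the build context. This is a minimal check; ``<extracted-zip-dir>`` is a placeholder for wherever you unpacked the ZIP file.

.. code-block:: console

   $ ls <extracted-zip-dir>/*-vgpu-kvm.run  # <extracted-zip-dir> is a placeholder for your extraction directory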
code-block:: console - $ git clone https://gitlab.com/nvidia/container-images/driver - $ cd driver + $ git clone https://github.com/NVIDIA/gpu-driver-container.git + $ cd gpu-driver-container -Change to the vgpu-manager directory for your OS. We use Ubuntu 20.04 as an example. +#. Copy the NVIDIA vGPU manager from your extracted ZIP file to the operating system version you want to build the image for: + * We use Ubuntu 22.04 as an example. -.. code-block:: console - - $ cd vgpu-manager/ubuntu20.04 - -.. note:: + Copy ``/\*-vgpu-kvm.run`` to ``vgpu-manager/ubuntu22.04/``. - For Red Hat OpenShift, run ``cd vgpu-manager/rhel8`` to use the ``rhel8`` folder instead. + .. code-block:: console -Copy the NVIDIA vGPU Manager from your extracted zip file + $ cp /*-vgpu-kvm.run vgpu-manager/ubuntu22.04/ -.. code-block:: console +.. note:: - $ cp /*-vgpu-kvm.run ./ + For Red Hat OpenShift, use a directory that includes ``rhel`` in the directory name. For example, ``vgpu-manager/rhel8``. | Set the following environment variables: | ``PRIVATE_REGISTRY`` - name of private registry used to store driver image -| ``VERSION`` - NVIDIA vGPU Manager version downloaded from NVIDIA Software Portal -| ``OS_TAG`` - this must match the Guest OS version. In the following example ``ubuntu20.04`` is used. For Red Hat OpenShift this should be set to ``rhcos4.x`` where x is the supported minor OCP version. -| ``CUDA_VERSION`` - CUDA base image version to build the driver image with. +| ``VGPU_HOST_DRIVER_VERSION`` - NVIDIA vGPU Manager version downloaded from NVIDIA Software Portal +| ``OS_TAG`` - this must match the Guest OS version. In the following example ``ubuntu22.04`` is used. For Red Hat OpenShift this should be set to ``rhcos4.x`` where x is the supported minor OCP version. .. code-block:: console - $ export PRIVATE_REGISTRY=my/private/registry VERSION=510.73.06 OS_TAG=ubuntu20.04 CUDA_VERSION=11.7.1 + $ export PRIVATE_REGISTRY=my/private/registry VGPU_HOST_DRIVER_VERSION=580.82.07 OS_TAG=ubuntu22.04 Build the NVIDIA vGPU Manager image. .. code-block:: console - $ docker build \ - --build-arg DRIVER_VERSION=${VERSION} \ - --build-arg CUDA_VERSION=${CUDA_VERSION} \ - -t ${PRIVATE_REGISTRY}/vgpu-manager:${VERSION}-${OS_TAG} . + $ VGPU_HOST_DRIVER_VERSION=${VGPU_HOST_DRIVER_VERSION} IMAGE_NAME=${PRIVATE_REGISTRY}/vgpu-manager make build-vgpuhost-${OS_TAG} Push NVIDIA vGPU Manager image to your private registry. .. code-block:: console - $ docker push ${PRIVATE_REGISTRY}/vgpu-manager:${VERSION}-${OS_TAG} + $ VGPU_HOST_DRIVER_VERSION=${VGPU_HOST_DRIVER_VERSION} IMAGE_NAME=${PRIVATE_REGISTRY}/vgpu-manager make push-vgpuhost-${OS_TAG} diff --git a/gpu-operator/index.rst b/gpu-operator/index.rst index 5b38b4e74..5eecd4d9c 100644 --- a/gpu-operator/index.rst +++ b/gpu-operator/index.rst @@ -56,8 +56,7 @@ KubeVirt Kata Containers - Confidential Containers and Kata - + .. toctree:: :caption: Specialized Networks :titlesonly: diff --git a/gpu-operator/install-gpu-operator-vgpu.rst b/gpu-operator/install-gpu-operator-vgpu.rst index bd9a3c1cb..e80cd573d 100644 --- a/gpu-operator/install-gpu-operator-vgpu.rst +++ b/gpu-operator/install-gpu-operator-vgpu.rst @@ -104,7 +104,7 @@ Perform the following steps to build and push a container image that includes th .. code-block:: console - $ git clone https://github.com/NVIDIA/gpu-driver-container + $ git clone https://github.com/NVIDIA/gpu-driver-container.git .. 
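   Before you continue with the remaining build steps, you can confirm that the NVIDIA vGPU guest driver installer that you downloaded as part of the NVIDIA vGPU software package is available locally. This is a minimal sketch; the file name pattern is illustrative and the exact name depends on the vGPU software release.

   .. code-block:: console

      $ ls NVIDIA-Linux-x86_64-*-grid.run  # illustrative file name pattern for the vGPU guest driver installer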
code-block:: console diff --git a/gpu-operator/platform-support.rst b/gpu-operator/platform-support.rst index f1144559f..d07c96305 100644 --- a/gpu-operator/platform-support.rst +++ b/gpu-operator/platform-support.rst @@ -188,6 +188,8 @@ The following NVIDIA data center GPUs are supported on x86 based platforms: +=========================+========================+ | NVIDIA DGX B200 | NVIDIA Blackwell | +-------------------------+------------------------+ + | NVIDIA DGX Spark | NVIDIA Blackwell | + +-------------------------+------------------------+ | NVIDIA HGX B200 | NVIDIA Blackwell | +-------------------------+------------------------+ | NVIDIA HGX GB200 NVL72 | NVIDIA Blackwell | diff --git a/gpu-operator/precompiled-drivers.rst b/gpu-operator/precompiled-drivers.rst index 3b9afcf56..a7a880424 100644 --- a/gpu-operator/precompiled-drivers.rst +++ b/gpu-operator/precompiled-drivers.rst @@ -240,11 +240,11 @@ you can perform the following steps to build and run a container image. .. code-block:: console - $ git clone https://gitlab.com/nvidia/container-images/driver + $ git clone https://github.com/NVIDIA/gpu-driver-container.git .. code-block:: console - $ cd driver + $ cd gpu-driver-container #. Change directory to the operating system name and version under the driver directory: diff --git a/gpu-operator/release-notes.rst b/gpu-operator/release-notes.rst index 740cbc98e..bf4f33deb 100644 --- a/gpu-operator/release-notes.rst +++ b/gpu-operator/release-notes.rst @@ -38,19 +38,7 @@ Refer to the :ref:`GPU Operator Component Matrix` for a list of software compone 25.3.4 ====== -Fixed Issues ------------- - -* Fixed an issue where the GPU Operator failed to render the nvidia-container-toolkit DaemonSet correctly when a custom value for ``CONTAINERD_SOCKET`` was provided as input. - Specifically, the hostPath volumes were not included in the DaemonSet. - Refer to GitHub issue https://github.com/NVIDIA/gpu-operator/issues/1694 for more details. - -.. _v25.3.3: - -25.3.3 -====== - -.. _v25.3.3-new-features: +.. _v25.3.4-new-features: New Features ------------ @@ -77,6 +65,19 @@ Fixed Issues * Fixed an issue where user-supplied environment variables configured in ClusterPolicy were not getting set in the rendered DaemonSet. User-supplied environment variables now take precedence over environment variables set by the ClusterPolicy controller. + +.. _v25.3.3: + +25.3.3 +====== + +Fixed Issues +------------ + +* Fixed an issue where the GPU Operator failed to render the nvidia-container-toolkit DaemonSet correctly when a custom value for ``CONTAINERD_SOCKET`` was provided as input. + Specifically, the hostPath volumes were not included in the DaemonSet. + Refer to GitHub issue https://github.com/NVIDIA/gpu-operator/issues/1694 for more details. + .. _v25.3.2: 25.3.2 diff --git a/openshift/gpu-operator-with-precompiled-drivers.rst b/openshift/gpu-operator-with-precompiled-drivers.rst index 7e96c284a..de44cacf7 100644 --- a/openshift/gpu-operator-with-precompiled-drivers.rst +++ b/openshift/gpu-operator-with-precompiled-drivers.rst @@ -63,13 +63,13 @@ Perform the following steps to build a custom driver image for use with Red Hat .. code-block:: console - $ git clone https://gitlab.com/nvidia/container-images/driver + $ git clone https://github.com/NVIDIA/gpu-driver-container.git #. Change to the ``rhel8/precompiled`` directory under the cloned repository. You can build precompiled driver images for versions 8 and 9 of RHEL from this directory: .. 
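   Precompiled driver images are built for a specific kernel version, so before you change to the build directory with the next command, you may want to record the kernel version that your worker nodes run so that you can check it against the driver image you build. This is a minimal sketch; ``<worker-node>`` is a placeholder for one of your own node names.

   .. code-block:: console

      $ oc debug node/<worker-node> -- chroot /host uname -r  # <worker-node> is a placeholder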
code-block:: console - $ cd driver/rhel8/precompiled + $ cd gpu-driver-container/rhel8/precompiled #. Create a Red Hat Customer Portal Activation Key and note your Red Hat Subscription Management (RHSM) organization ID. These are to install packages during a build. Save the values to files such as ``$HOME/rhsm_org`` and ``$HOME/rhsm_activationkey``: diff --git a/openshift/mig-ocp.rst b/openshift/mig-ocp.rst index 3ada8433f..ef65cb36d 100644 --- a/openshift/mig-ocp.rst +++ b/openshift/mig-ocp.rst @@ -108,7 +108,7 @@ The NVIDIA GPU Operator exposes GPUs to Kubernetes as extended resources that ca Version 1.8 and greater of the NVIDIA GPU Operator supports updating the **Strategy** in the ClusterPolicy after deployment. -The `default configmap `_ defines the combination of single (homogeneous) and mixed (heterogeneous) profiles that are supported for A100-40GB, A100-80GB and A30-24GB. +The `default configmap `_ defines the combination of single (homogeneous) and mixed (heterogeneous) profiles that are supported for A100-40GB, A100-80GB and A30-24GB. The configmap allows administrators to declaratively define a set of possible MIG configurations they would like applied to all GPUs on a node. The tables below describe these configurations: @@ -301,7 +301,7 @@ Creating and applying a custom MIG configuration Follow the guidance below to create a new slicing profile. -#. Prepare a custom ``configmap`` resource file for example ``custom_configmap.yaml``. Use the `configmap `_ as guidance to help you build that custom configuration. For more documentation about the file format see `mig-parted `_. +#. Prepare a custom ``configmap`` resource file for example ``custom_configmap.yaml``. Use the `configmap `_ as guidance to help you build that custom configuration. For more documentation about the file format see `mig-parted `_. .. note:: For a list of all supported combinations and placements of profiles on A100 and A30, refer to the section on `supported profiles `_. diff --git a/openshift/openshift-virtualization.rst b/openshift/openshift-virtualization.rst index f07da06d5..747626d8f 100644 --- a/openshift/openshift-virtualization.rst +++ b/openshift/openshift-virtualization.rst @@ -245,31 +245,28 @@ Use the following steps to build the vGPU Manager container and push it to a pri .. code-block:: console - $ git clone https://gitlab.com/nvidia/container-images/driver - $ cd driver + $ git clone https://github.com/NVIDIA/gpu-driver-container.git + $ cd gpu-driver-container -#. Change to the ``vgpu-manager`` directory for your OS: +#. Copy the NVIDIA vGPU manager from your extracted ZIP file to the operating system version you want to build the image for: + * We use RHEL 8 as an example. - .. code-block:: console - - $ cd vgpu-manager/rhel8 - -#. Copy the NVIDIA vGPU Manager from your extracted zip file: + Copy ``/\*-vgpu-kvm.run`` to ``vgpu-manager/rhel8/``. .. code-block:: console - $ cp /*-vgpu-kvm.run ./ + $ cp /*-vgpu-kvm.run vgpu-manager/rhel8/ #. Set the following environment variables. * ``PRIVATE_REGISTRY`` - Name of the private registry used to store the driver image. - * ``VERSION`` - The NVIDIA vGPU Manager version downloaded from the NVIDIA Software Portal. + * ``VGPU_HOST_DRIVER_VERSION`` - The NVIDIA vGPU Manager version downloaded from the NVIDIA Software Portal. * ``OS_TAG`` - This must match the Guest OS version. For RedHat OpenShift, specify ``rhcos4.x`` where _x_ is the supported minor OCP version. .. 
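   If you are not sure which minor OpenShift version to use for the ``rhcos4.x`` portion of ``OS_TAG`` in the export command that follows, you can query the cluster version first. This is a minimal sketch that assumes you are logged in to the cluster with ``oc``.

   .. code-block:: console

      $ oc get clusterversion version -o jsonpath='{.status.desired.version}'  # for example, a 4.18.z cluster corresponds to OS_TAG=rhcos4.18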
code-block:: console - $ export PRIVATE_REGISTRY=my/private/registry VERSION=510.73.06 OS_TAG=rhcos4.11 + $ export PRIVATE_REGISTRY=my/private/registry VGPU_HOST_DRIVER_VERSION=580.82.07 OS_TAG=rhcos4.18 .. note:: @@ -280,15 +277,13 @@ Use the following steps to build the vGPU Manager container and push it to a pri .. code-block:: console - $ docker build \ - --build-arg DRIVER_VERSION=${VERSION} \ - -t ${PRIVATE_REGISTRY}/vgpu-manager:${VERSION}-${OS_TAG} . + $ VGPU_HOST_DRIVER_VERSION=${VGPU_HOST_DRIVER_VERSION} IMAGE_NAME=${PRIVATE_REGISTRY}/vgpu-manager make build-vgpuhost-${OS_TAG} #. Push the NVIDIA vGPU Manager image to your private registry. .. code-block:: console - $ docker push ${PRIVATE_REGISTRY}/vgpu-manager:${VERSION}-${OS_TAG} + $ VGPU_HOST_DRIVER_VERSION=${VGPU_HOST_DRIVER_VERSION} IMAGE_NAME=${PRIVATE_REGISTRY}/vgpu-manager make push-vgpuhost-${OS_TAG} .. _install-the-gpu-operator: @@ -425,7 +420,7 @@ As a cluster administrator, you can create a ClusterPolicy using the OpenShift C * Under *env*, fill in **image** with ``vgpu-manager`` and the **version** with your driver version. * Expand the **NVIDIA vGPU Device Manager config** section and make sure that the **enabled** checkbox is checked. - If you are only using GPU passthrough, you don't need to fill these sections out. + If you are only using GPU passthrough, you do not need to fill these sections out. * Expand the **VFIO Manager config** section and select the **enabled** checkbox. * Optionally, in the **Sandbox Workloads config** section, set **defaultWorkload** to ``vm-passthrough`` if you want passthrough to be the default mode. @@ -687,7 +682,7 @@ Switching vGPU device configuration after one has been successfully applied assu To apply a new configuration after GPU Operator install, simply update the ``nvidia.com/vgpu.config`` node label. -Let's run through an example on a system with two **A10** GPUs. +The following example shows a system with two **A10** GPUs. .. code-block:: console @@ -704,7 +699,7 @@ After installing the GPU Operator as detailed in the previous sections and witho "nvidia.com/NVIDIA_A10-12Q": "4" } -If instead you want to create **A10-4Q** devices, we can label the node like such: +If instead you want to create **A10-4Q** devices, label the node as follows: .. 
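After you apply the node label shown in the next command, you can confirm that the new devices are advertised by inspecting the allocatable resources on the node. This is a minimal sketch; ``<node-name>`` is a placeholder for one of your worker nodes.

.. code-block:: console

   $ oc get node <node-name> -o json | \
       jq '.status.allocatable | with_entries(select(.key | startswith("nvidia.com/")))'  # <node-name> is a placeholder

If the new configuration was applied, the output lists a resource such as ``nvidia.com/NVIDIA_A10-4Q`` with a non-zero count.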
code-block:: console diff --git a/openshift/versions.json b/openshift/versions.json index 09ff3c30a..eb9ca1765 100644 --- a/openshift/versions.json +++ b/openshift/versions.json @@ -1,24 +1,24 @@ { - "latest": "25.3.2", + "latest": "25.3.4", "versions": [ { - "version": "25.3.2" + "version": "25.3.4" }, { - "version": "25.3.1" + "version": "25.3.3" }, { - "version": "25.3.0" + "version": "25.3.2" }, { - "version": "24.9.2" + "version": "25.3.1" }, { - "version": "24.9.1" + "version": "25.3.0" }, { - "version": "24.9.0" + "version": "24.9.2" } ] } diff --git a/openshift/versions1.json b/openshift/versions1.json index ce456edb0..b1673edb6 100644 --- a/openshift/versions1.json +++ b/openshift/versions1.json @@ -1,6 +1,14 @@ [ { "preferred": "true", + "url": "../25.3.4", + "version": "25.3.4" + }, + { + "url": "../25.3.3", + "version": "25.3.3" + }, + { "url": "../25.3.2", "version": "25.3.2" }, @@ -15,13 +23,5 @@ { "url": "../24.9.2", "version": "24.9.2" - }, - { - "url": "../24.9.1", - "version": "24.9.1" - }, - { - "url": "../24.9.0", - "version": "24.9.0" } -] \ No newline at end of file +] diff --git a/partner-validated/index.rst b/partner-validated/index.rst index 24a84eae8..985d70773 100644 --- a/partner-validated/index.rst +++ b/partner-validated/index.rst @@ -74,10 +74,10 @@ You provide the following: * Document and contribute the exact software stack that you self-validated. Refer to the - `PARTNER-VALIDATED-TEMPLATE.rst file `__ + `PARTNER-VALIDATED-TEMPLATE.rst file `__ in the ``partner-validated`` directory of the documentation repository as a starting point. Open a pull request to the repository with your update. - Refer to the `CONTRIBUTING.md file `__ + Refer to the `CONTRIBUTING.md file `__ in the root directory of the documentation repository for information about contributing documentation. * Run the self-validated configuration and then share the outcome with NVIDIA by providing diff --git a/repo.toml b/repo.toml index 7f9ed0afa..dca02eda6 100644 --- a/repo.toml +++ b/repo.toml @@ -102,8 +102,8 @@ project_build_order = [ docs_root = "${root}/container-toolkit" project = "container-toolkit" name = "NVIDIA Container Toolkit" -version = "1.17.8" -source_substitutions = {version = "1.17.8"} +version = "1.18.0" +source_substitutions = {version = "1.18.0"} copyright_start = 2020 redirects = [ { path="concepts.html", target="index.html" }, @@ -226,7 +226,7 @@ output_format = "linkcheck" docs_root = "${root}/openshift" project = "gpu-operator-openshift" name = "NVIDIA GPU Operator on Red Hat OpenShift Container Platform" -version = "25.3.2" +version = "25.3.4" copyright_start = 2020 sphinx_exclude_patterns = [ "get-entitlement.rst",