
Commit ca5798c

Merge branch 'NVIDIA:main' into docs/265
2 parents: 82c73f9 + b69fe43

21 files changed (+65, -88 lines)

.gitlab-ci.yml

Lines changed: 2 additions & 2 deletions
@@ -138,7 +138,7 @@ publish_docs:
   script:
     - echo "Pushing docs live to https://docs.nvidia.com/datacenter/cloud-native"
     - |+
-      if [[ "${CI_COMMIT_REF_NAME}" =~ (.+)-v([0-9]+\.[0-9]+\.[0-9]+) ]]; then
+      if [[ "${CI_COMMIT_REF_NAME}" =~ (.+)-v([0-9]+\.[0-9]+(\.[a-zA-Z0-9]+)?) ]]; then
         export DOCSET="${BASH_REMATCH[1]}"
         export VERSION="${BASH_REMATCH[2]}"
       fi
@@ -148,7 +148,7 @@ publish_docs:
         exit 1
       fi
     - |+
-      if [[ "${CI_COMMIT_MESSAGE}" =~ $'\n/not-latest\n' ]]; then
+      if [[ "${CI_COMMIT_MESSAGE}" =~ $'/not-latest\n' ]]; then
         export FORCE_LATEST=false
       fi
     - echo "Publishing docs for ${DOCSET} and version ${VERSION}"

container-toolkit/install-guide.md

Lines changed: 1 addition & 1 deletion
@@ -44,7 +44,7 @@ where `systemd` cgroup drivers are used that cause containers to lose access to
 Optionally, configure the repository to use experimental packages:
 
 ```console
-$ sed -i -e '/experimental/ s/^#//g' /etc/apt/sources.list.d/nvidia-container-toolkit.list
+$ sudo sed -i -e '/experimental/ s/^#//g' /etc/apt/sources.list.d/nvidia-container-toolkit.list
 ```
 
 1. Update the packages list from the repository:
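For context on the change above: the ``sed`` expression uncomments any line mentioning ``experimental``, and editing a file under ``/etc`` requires root, hence the added ``sudo``. A sketch against an illustrative repository list file:

```console
$ cat /etc/apt/sources.list.d/nvidia-container-toolkit.list
deb https://nvidia.github.io/libnvidia-container/stable/deb/$(ARCH) /
#deb https://nvidia.github.io/libnvidia-container/experimental/deb/$(ARCH) /
$ sudo sed -i -e '/experimental/ s/^#//g' /etc/apt/sources.list.d/nvidia-container-toolkit.list
$ grep experimental /etc/apt/sources.list.d/nvidia-container-toolkit.list
deb https://nvidia.github.io/libnvidia-container/experimental/deb/$(ARCH) /
```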

gpu-operator/amazon-eks.rst

Lines changed: 3 additions & 4 deletions
@@ -102,11 +102,10 @@ without any limitations, you perform the following high-level actions:
    the instance type to meet your needs:
 
    * Table of accelerated computing
-     `instance types <https://aws.amazon.com/ec2/instance-types/#Accelerated_Computing>`_
+     `instance types <https://aws.amazon.com/ec2/instance-types/accelerated-computing/>`_
      for information about GPU model and count, RAM, and storage.
 
-   * Table of
-     `maximum network interfaces <https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-eni.html#enis-acceleratedcomputing>`_
+   * `Maximum IP addresses per network interface <https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/AvailableIpPerENI.html>`_
      for accelerated computing instance types.
      Make sure the instance type supports enough IP addresses for your workload.
      For example, the ``g4dn.xlarge`` instance type supports ``29`` IP addresses for pods on the node.
@@ -132,7 +131,7 @@ Prerequisites
    and `Configuring the AWS CLI <https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-configure.html>`_
    in the AWS CLI documentation.
 * You installed the ``eksctl`` CLI if you prefer it as your client application.
-  The CLI is available from https://eksctl.io/introduction/#installation.
+  The CLI is available from https://docs.aws.amazon.com/eks/latest/userguide/install-kubectl.html#eksctl-install-update.
 * You have the AMI value from https://cloud-images.ubuntu.com/aws-eks/.
 * You have the EC2 instance type to use for your nodes.
 
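As a sanity check on the ``29`` figure quoted for ``g4dn.xlarge``: it matches the usual EKS max-pods arithmetic, assuming the instance's 3 ENIs with 10 IPv4 addresses each from the AWS table linked above:

```console
$ # max pods = ENIs x (IPv4 addresses per ENI - 1) + 2
$ echo $(( 3 * (10 - 1) + 2 ))
29
```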

gpu-operator/custom-driver-params.rst

Lines changed: 2 additions & 1 deletion
@@ -49,7 +49,8 @@ To pass custom parameters, execute the following steps.
 Example using ``nvidia-uvm`` module
 -----------------------------------
 
-This example shows the High Memory Mode being disabled in the ``nvidia-uvm`` module.
+This example shows the Heterogeneous Memory Management (HMM) being disabled in the ``nvidia-uvm`` module.
+Refer to `Simplifying GPU Application Development with Heterogeneous Memory Management <https://developer.nvidia.com/blog/simplifying-gpu-application-development-with-heterogeneous-memory-management/>`_ for more information about HMM.
 
 #. Create a configuration file named ``nvidia-uvm.conf``:
 
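A sketch of what that configuration file might contain for this example; ``uvm_disable_hmm`` is the parameter name used by the open NVIDIA kernel modules, so verify it against your driver release:

```console
$ cat nvidia-uvm.conf
options nvidia-uvm uvm_disable_hmm=1
```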
gpu-operator/dra-cds.rst

Lines changed: 2 additions & 1 deletion
@@ -49,7 +49,8 @@ For more detail on the security properties of a ComputeDomain, see `Security <dr
 A deeper dive: related resources
 ================================
 
-For more background on how ComputeDomains facilitate orchestrating MNNVL workloads on Kubernetes, see `this doc <https://docs.google.com/document/d/1PrdDofsPFVJuZvcv-vtlI9n2eAh-YVf_fRQLIVmDwVY/edit?tab=t.0#heading=h.qkogm924v5so>`_ and `this slide deck <https://docs.google.com/presentation/d/1Xupr8IZVAjs5bNFKJnYaK0LE7QWETnJjkz6KOfLu87E/edit?pli=1&slide=id.g28ac369118f_0_1647#slide=id.g28ac369118f_0_1647>`_.
+For more background on how ComputeDomains facilitate orchestrating MNNVL workloads on Kubernetes, refer to the `Kubernetes support for GH200 / GB200 <https://docs.google.com/document/d/1PrdDofsPFVJuZvcv-vtlI9n2eAh-YVf_fRQLIVmDwVY/edit?tab=t.0#heading=h.nfp9friarxam>`_ document
+and the `Supporting GB200 on Kubernetes <https://docs.google.com/presentation/d/1Xupr8IZVAjs5bNFKJnYaK0LE7QWETnJjkz6KOfLu87E/edit?pli=1&slide=id.g373e0ebfa8e_1_142#slide=id.g373e0ebfa8e_1_142>`_ slide deck.
 For an outlook on planned improvements on the ComputeDomain concept, please refer to `this document <https://github.com/NVIDIA/k8s-dra-driver-gpu/releases/tag/v25.3.0-rc.3>`_.
 
 Details about IMEX and its relationship to NVLink may be found in `NVIDIA's IMEX guide <https://docs.nvidia.com/multi-node-nvlink-systems/imex-guide/overview.html>`_, and in `NVIDIA's NVLink guide <https://docs.nvidia.com/multi-node-nvlink-systems/mnnvl-user-guide/overview.html#internode-memory-exchange-service>`_.

gpu-operator/dra-gpus.rst

Lines changed: 1 addition & 1 deletion
@@ -12,7 +12,7 @@ NVIDIA DRA Driver for GPUs
 GPU allocation
 **************
 
-Compared to `traditional GPU allocation <https://kubernetes.io/docs/tasks/manage-gpus/scheduling-gpus/#using-device-plugins/>`_ using coarse-grained count-based requests, the GPU allocation side of this driver enables fine-grained control and powerful features long desired by the community, such as:
+Compared to `traditional GPU allocation <https://kubernetes.io/docs/tasks/manage-gpus/scheduling-gpus/#using-device-plugins>`_ using coarse-grained count-based requests, the GPU allocation side of this driver enables fine-grained control and powerful features long desired by the community, such as:
 
 #. Controlled sharing of individual GPUs between multiple pods and/or containers.
 #. GPU selection via complex constraints expressed via `CEL <https://kubernetes.io/docs/reference/using-api/cel/>`_.
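To make the CEL point concrete, here is a minimal sketch of a ResourceClaim using the Kubernetes DRA structured-parameters API; the device class and attribute names below are assumptions for illustration, not confirmed identifiers from this driver:

```console
$ cat <<'EOF' | kubectl apply -f -
apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaim
metadata:
  name: one-a100
spec:
  devices:
    requests:
    - name: gpu
      deviceClassName: gpu.nvidia.com   # assumed class name
      selectors:
      - cel:
          # assumed attribute name; consult the driver's published attributes
          expression: device.attributes["gpu.nvidia.com"].productName == "NVIDIA A100-SXM4-40GB"
EOF
```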

gpu-operator/dra-intro-install.rst

Lines changed: 1 addition & 1 deletion
@@ -48,7 +48,7 @@ Prerequisites
 =============
 
 - Kubernetes v1.32 or newer.
-- DRA and corresponding API groups must be enabled (`see Kubernetes docs <https://kubernetes.io/docs/concepts/scheduling-eviction/dynamic-resource-allocation/#enabling-dynamic-resource-allocation>`_).
+- DRA and corresponding API groups must be enabled (`see Kubernetes docs <https://kubernetes.io/docs/tasks/configure-pod-container/assign-resources/set-up-dra-cluster/#enable-dra>`_).
 - `CDI <https://github.com/cncf-tags/container-device-interface?tab=readme-ov-file#how-to-configure-cdi>`_ must be enabled in the underlying container runtime (such as containerd or CRI-O).
 - NVIDIA GPU Driver 565 or later.
 
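One quick way to confirm that the ``resource.k8s.io`` API group is actually being served by the API server:

```console
$ kubectl api-resources --api-group=resource.k8s.io
```

An empty result means DRA, or the corresponding API group, is not enabled on the cluster.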
gpu-operator/getting-started.rst

Lines changed: 1 addition & 1 deletion
@@ -168,7 +168,7 @@ To view all the options, run ``helm show values nvidia/gpu-operator``.
      - ``true``
 
    * - ``dcgmExporter.service.internalTrafficPolicy``
-     - Specifies the `internalTrafficPolicy <https://kubernetes.io/docs/concepts/services-networking/service/#internal-traffic-policy>`_ for the DCGM Exporter service.
+     - Specifies the `internalTrafficPolicy <https://kubernetes.io/docs/concepts/services-networking/service/#traffic-policies>`_ for the DCGM Exporter service.
        Available values are ``Cluster`` (default) or ``Local``.
      - ``Cluster``
 
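Like any other chart option, the value can be set at install or upgrade time; the release and namespace names below are illustrative:

```console
$ helm install gpu-operator nvidia/gpu-operator \
    --namespace gpu-operator --create-namespace \
    --set dcgmExporter.service.internalTrafficPolicy=Local
```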

gpu-operator/gpu-operator-kubevirt.rst

Lines changed: 2 additions & 2 deletions
@@ -70,7 +70,7 @@ Assumptions, constraints, and dependencies
 
 * The GPU Operator will not automate the installation of NVIDIA drivers inside KubeVirt virtual machines with GPUs/vGPUs attached.
 
-* Users must manually add all passthrough GPU and vGPU resources to the ``permittedDevices`` list in the KubeVirt CR before assigning them to KubeVirt virtual machines. Refer to the `KubeVirt documentation <https://kubevirt.io/user-guide/virtual_machines/host-devices/#listing-permitted-devices>`_ for more information.
+* Users must manually add all passthrough GPU and vGPU resources to the ``permittedDevices`` list in the KubeVirt CR before assigning them to KubeVirt virtual machines. Refer to the `KubeVirt documentation <https://kubevirt.io/user-guide/compute/host-devices/#listing-permitted-devices>`_ for more information.
 
 * MIG-backed vGPUs are not supported.
 
@@ -512,7 +512,7 @@ Building the NVIDIA vGPU Manager image
 
 This section covers building the NVIDIA vGPU Manager container image and pushing it to a private registry.
 
-Download the vGPU Software from the `NVIDIA Licensing Portal <https://nvid.nvidia.com/dashboard/#/dashboard>`_.
+Download the vGPU Software from the `NVIDIA Licensing Portal <https://stg.ui.licensing.nvidia.com/>`_.
 
 * Login to the NVIDIA Licensing Portal and navigate to the **Software Downloads** section.
 * The NVIDIA vGPU Software is located in the **Software Downloads** section of the NVIDIA Licensing Portal.
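As a sketch of the ``permittedDevices`` requirement above, passthrough GPUs are declared in the KubeVirt CR under ``spec.configuration.permittedHostDevices``; the PCI vendor:device ID and resource name below are illustrative (a Tesla T4), so substitute the values for your hardware:

```console
$ kubectl patch kubevirt kubevirt -n kubevirt --type merge --patch '
spec:
  configuration:
    permittedHostDevices:
      pciHostDevices:
      - pciVendorSelector: "10DE:1EB8"
        resourceName: "nvidia.com/TU104GL_Tesla_T4"
'
```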

gpu-operator/gpu-operator-rdma.rst

Lines changed: 1 addition & 1 deletion
@@ -99,7 +99,7 @@ The prerequisites for configuring GPUDirect RDMA or GPUDirect Storage depend on
 * ``pciPassthru.64bitMMIOSizeGB = 128``
 
 For information about configuring the settings, refer to the
-`Deploy an AI-Ready Enterprise Platform on vSphere 7 <https://core.vmware.com/resource/deploy-ai-ready-vsphere-7#vm-settings-A>`_
+`Deploy an AI-Ready Enterprise Platform on vSphere 7 <https://www.vmware.com/docs/deploy-an-ai-ready-enterprise-platform-on-vsphere-7-update-2#vm-settings-A>`_
 document from VMWare.
 
 **************************
