diff --git a/doc/source/operations/gpu-in-openstack.rst b/doc/source/operations/gpu-in-openstack.rst index 1fd99d30d..6198edcaa 100644 --- a/doc/source/operations/gpu-in-openstack.rst +++ b/doc/source/operations/gpu-in-openstack.rst @@ -2,6 +2,132 @@ Support for GPUs in OpenStack ============================= +PCI Passthrough +############### + +Prerequisite - BIOS Configuration +--------------------------------- + +On an Intel system: + +* Enable ``VT-x`` in the BIOS for virtualisation support. +* Enable ``VT-d`` in the BIOS for IOMMU support. + +On an AMD system: + +* Enable ``AMD-v`` in the BIOS for virtualisation support. +* Enable ``AMD-Vi`` (also just called ``IOMMU`` on older hardware) in the BIOS + for IOMMU support. + +It may be possible to configure passthrough without these settings, though +stability or performance may be affected. + +Host and Service Configuration +------------------------------ + +PCI passthrough GPU variables can be found in the +``etc/kayobe/stackhpc-compute.yml`` file. + +The ``gpu_group_map`` is a dictionary mapping inventory groups to GPU types. +This is used to determine which GPU types each compute node should pass through +to OpenStack. The keys are group names, the values are a list of GPU types. + +Possible GPU types are defined in the ``stackhpc_gpu_data`` dictionary. It +contains data for many common GPUs. If you have a GPU that is not included, +extend the dictionary following the same pattern. + +The ``resource_name`` is the name that will be used in the flavor extra specs. +Resource names can be overridden, e.g. ``a100_80_resource_name: "big_gpu"``. + +Example configuration for three groups containing A100s, V100s, and both: + +.. 
code-block:: yaml + :caption: $KAYOBE_CONFIG_PATH/stackhpc-compute.yml + + gpu_group_map: + compute_a100: + - a100_80 + compute_v100: + - v100_32 + compute_multi_gpu: + - a100_80 + - v100_32 + +All groups in the ``gpu_group_map`` must also be added to +``kolla_overcloud_inventory_top_level_group_map`` in ``etc/kayobe/kolla.yml``. +Always include the Kayobe defaults unless you know what you are doing. + +When ``gpu_group_map`` is populated, the ``pci-passthrough.yml`` playbook will +be added as a pre-hook to ``kayobe overcloud host configure``. Either run host +configuration or trigger the playbook manually: + +.. code-block:: console + + kayobe overcloud host configure --limit compute_a100,compute_v100,compute_multi_gpu + # OR + kayobe playbook run --playbook $KAYOBE_CONFIG_PATH/ansible/pci-passthrough.yml --limit compute_a100,compute_v100,compute_multi_gpu + +The playbook will apply the necessary configuration and reboot the hosts if +required. + +Once host configuration is complete, deploy Nova: + +.. code-block:: console + + kayobe overcloud service deploy -kt nova + +Create a flavor +--------------- + +For example, to request two of the GPUs with alias **v100_32**: + +.. code-block:: text + + openstack flavor set m1.medium-gpu --property "pci_passthrough:alias"="v100_32:2" + +This can also be defined in the openstack-config repository. + +Add ``extra_specs`` to the flavor in ``etc/openstack-config/openstack-config.yml``: + +.. code-block:: console + + cd src/openstack-config + vim etc/openstack-config/openstack-config.yml + + name: "m1.medium-gpu" + ram: 4096 + disk: 40 + vcpus: 2 + extra_specs: + "pci_passthrough:alias": "v100_32:2" + +Invoke configuration playbooks afterwards: + +.. code-block:: console + + source src/kayobe-config/etc/kolla/public-openrc.sh + source venvs/openstack/bin/activate + tools/openstack-config --vault-password-file + +Create instance with GPU passthrough +------------------------------------ + +.. 
code-block:: text + + openstack server create --flavor m1.medium-gpu --image ubuntu22.04 --wait test-pci + +Testing GPU in a Guest VM +------------------------- + +The Nvidia drivers must be installed first. For example, on an Ubuntu guest: + +.. code-block:: text + + sudo apt install nvidia-headless-440 nvidia-utils-440 nvidia-compute-utils-440 + +The ``nvidia-smi`` command will generate detailed output if the driver has +loaded successfully. + + Virtual GPUs ############ @@ -147,7 +273,7 @@ hosts can automatically be mapped to these groups by configuring .. _NVIDIA Role Configuration: Role Configuration -^^^^^^^^^^^^^^^^^^ +------------------ Configure the VGPU devices: @@ -193,7 +319,7 @@ Configure the VGPU devices: .. _NVIDIA Kolla Ansible Configuration: Kolla-Ansible configuration -^^^^^^^^^^^^^^^^^^^^^^^^^^^ +--------------------------- See upstream documentation: `Kolla Ansible configuration `__ then follow the rest. @@ -241,12 +367,12 @@ You will need to reconfigure nova for this change to be applied: kayobe overcloud service deploy -kt nova --kolla-limit compute_vgpu Openstack flavors -^^^^^^^^^^^^^^^^^ +----------------- See upstream documentation: `OpenStack flavors `__ NVIDIA License Server -^^^^^^^^^^^^^^^^^^^^^ +--------------------- The Nvidia delegated license server is a virtual machine based appliance. You simply need to boot an instance using the image supplied on the NVIDIA Licensing portal. This can be done on the OpenStack cloud itself. The @@ -323,7 +449,7 @@ Booting the VM: Manual VM driver and licence configuration -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ +------------------------------------------ vGPU client VMs need to be configured with Nvidia drivers to run GPU workloads. The host drivers should already be applied to the hypervisor. @@ -393,7 +519,7 @@ includes the drivers and licencing token. Alternatively, an image can be created using Diskimage Builder. 
Disk image builder recipe to automatically license VGPU on boot -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ +--------------------------------------------------------------- `stackhpc-image-elements `__ provides a ``nvidia-vgpu`` element to configure the nvidia-gridd service in VGPU mode. This allows you to boot VMs that automatically license themselves. @@ -471,7 +597,7 @@ into your openstack-config repository and vault encrypt it. The ``file`` lookup the file (as shown in the example above). Testing vGPU VMs -^^^^^^^^^^^^^^^^ +---------------- vGPU VMs can be validated using the following test workload. The test should succeed if the VM is correctly licenced and drivers are correctly installed for @@ -531,266 +657,10 @@ Example output: Test passed Changing VGPU device types -^^^^^^^^^^^^^^^^^^^^^^^^^^ +-------------------------- See upstream documentation: `Changing VGPU device types `__ -PCI Passthrough -############### - -This guide has been developed for Nvidia GPUs and CentOS 8. - -See `Kayobe Ops `_ for -a playbook implementation of host setup for GPU. - -BIOS Configuration Requirements -------------------------------- - -On an Intel system: - -* Enable `VT-x` in the BIOS for virtualisation support. -* Enable `VT-d` in the BIOS for IOMMU support. - -Hypervisor Configuration Requirements -------------------------------------- - -Find the GPU device IDs -^^^^^^^^^^^^^^^^^^^^^^^ - -From the host OS, use ``lspci -nn`` to find the PCI vendor ID and -device ID for the GPU device and supporting components. These are -4-digit hex numbers. - -For example: - -.. code-block:: text - - 01:00.0 VGA compatible controller [0300]: NVIDIA Corporation GM204M [GeForce GTX 980M] [10de:13d7] (rev a1) (prog-if 00 [VGA controller]) - 01:00.1 Audio device [0403]: NVIDIA Corporation GM204 High Definition Audio Controller [10de:0fbb] (rev a1) - -In this case the vendor ID is ``10de``, display ID is ``13d7`` and audio ID is ``0fbb``. 
- -Alternatively, for an Nvidia Quadro RTX 6000: - -.. code-block:: yaml - - # NVIDIA Quadro RTX 6000/8000 PCI device IDs - vendor_id: "10de" - display_id: "1e30" - audio_id: "10f7" - usba_id: "1ad6" - usba_class: "0c0330" - usbc_id: "1ad7" - usbc_class: "0c8000" - -These parameters will be used for device-specific configuration. - -Kernel Ramdisk Reconfiguration -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -The ramdisk loaded during kernel boot can be extended to include the -vfio PCI drivers and ensure they are loaded early in system boot. - -.. code-block:: yaml - - - name: Template dracut config - blockinfile: - path: /etc/dracut.conf.d/gpu-vfio.conf - block: | - add_drivers+="vfio vfio_iommu_type1 vfio_pci vfio_virqfd" - owner: root - group: root - mode: 0660 - create: true - become: true - notify: - - Regenerate initramfs - - reboot - -The handler for regenerating the Dracut initramfs is: - -.. code-block:: yaml - - - name: Regenerate initramfs - shell: |- - #!/bin/bash - set -eux - dracut -v -f /boot/initramfs-$(uname -r).img $(uname -r) - become: true - -Kernel Boot Parameters -^^^^^^^^^^^^^^^^^^^^^^ - -Set the following kernel parameters by adding to -``GRUB_CMDLINE_LINUX_DEFAULT`` or ``GRUB_CMDLINE_LINUX`` in -``/etc/default/grub.conf``. We can use the -`stackhpc.grubcmdline `_ -role from Ansible Galaxy: - -.. code-block:: yaml - - - name: Add vfio-pci.ids kernel args - include_role: - name: stackhpc.grubcmdline - vars: - kernel_cmdline: - - intel_iommu=on - - iommu=pt - - "vfio-pci.ids={{ vendor_id }}:{{ display_id }},{{ vendor_id }}:{{ audio_id }}" - kernel_cmdline_remove: - - iommu - - intel_iommu - - vfio-pci.ids - -Kernel Device Management -^^^^^^^^^^^^^^^^^^^^^^^^ - -In the hypervisor, we must prevent kernel device initialisation of -the GPU and prevent drivers from loading for binding the GPU in the -host OS. We do this using ``udev`` rules: - -.. 
code-block:: yaml - - - name: Template udev rules to blacklist GPU usb controllers - blockinfile: - # We want this to execute as soon as possible - path: /etc/udev/rules.d/99-gpu.rules - block: | - #Remove NVIDIA USB xHCI Host Controller Devices, if present - ACTION=="add", SUBSYSTEM=="pci", ATTR{vendor}=="0x{{ vendor_id }}", ATTR{class}=="0x{{ usba_class }}", ATTR{remove}="1" - #Remove NVIDIA USB Type-C UCSI devices, if present - ACTION=="add", SUBSYSTEM=="pci", ATTR{vendor}=="0x{{ vendor_id }}", ATTR{class}=="0x{{ usbc_class }}", ATTR{remove}="1" - owner: root - group: root - mode: 0644 - create: true - become: true - -Kernel Drivers -^^^^^^^^^^^^^^ - -Prevent the ``nouveau`` kernel driver from loading by -blacklisting the module: - -.. code-block:: yaml - - - name: Blacklist nouveau - blockinfile: - path: /etc/modprobe.d/blacklist-nouveau.conf - block: | - blacklist nouveau - options nouveau modeset=0 - mode: 0664 - owner: root - group: root - create: true - become: true - notify: - - reboot - - Regenerate initramfs - -Ensure that the ``vfio`` drivers are loaded into the kernel on boot: - -.. code-block:: yaml - - - name: Add vfio to modules-load.d - blockinfile: - path: /etc/modules-load.d/vfio.conf - block: | - vfio - vfio_iommu_type1 - vfio_pci - vfio_virqfd - owner: root - group: root - mode: 0664 - create: true - become: true - notify: reboot - -Once this code has taken effect (after a reboot), the VFIO kernel drivers should be loaded on boot: - -.. 
code-block:: text - - # lsmod | grep vfio - vfio_pci 49152 0 - vfio_virqfd 16384 1 vfio_pci - vfio_iommu_type1 28672 0 - vfio 32768 2 vfio_iommu_type1,vfio_pci - irqbypass 16384 5 vfio_pci,kvm - - # lspci -nnk -s 3d:00.0 - 3d:00.0 VGA compatible controller [0300]: NVIDIA Corporation GM107GL [Tesla M10] [10de:13bd] (rev a2) - Subsystem: NVIDIA Corporation Tesla M10 [10de:1160] - Kernel driver in use: vfio-pci - Kernel modules: nouveau - -IOMMU should be enabled at kernel level as well - we can verify that on the compute host: - -.. code-block:: text - - # docker exec -it nova_libvirt virt-host-validate | grep IOMMU - QEMU: Checking for device assignment IOMMU support : PASS - QEMU: Checking if IOMMU is enabled by kernel : PASS - -OpenStack Nova configuration ----------------------------- - -See upsteram Nova documentation: `Attaching physical PCI devices to guests `__ - -Configure a flavor -^^^^^^^^^^^^^^^^^^ - -For example, to request two of the GPUs with alias **a1** - -.. code-block:: text - - openstack flavor set m1.medium --property "pci_passthrough:alias"="a1:2" - - -This can be also defined in the openstack-config repository - -add extra_specs to flavor in etc/openstack-config/openstack-config.yml: - -.. code-block:: console - - cd src/openstack-config - vim etc/openstack-config/openstack-config.yml - - name: "m1.medium-gpu" - ram: 4096 - disk: 40 - vcpus: 2 - extra_specs: - "pci_passthrough:alias": "a1:2" - -Invoke configuration playbooks afterwards: - -.. code-block:: console - - source src/kayobe-config/etc/kolla/public-openrc.sh - source venvs/openstack/bin/activate - tools/openstack-config --vault-password-file - -Create instance with GPU passthrough -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -.. code-block:: text - - openstack server create --flavor m1.medium-gpu --image ubuntu22.04 --wait test-pci - -Testing GPU in a Guest VM -------------------------- - -The Nvidia drivers must be installed first. For example, on an Ubuntu guest: - -.. 
code-block:: text - - sudo apt install nvidia-headless-440 nvidia-utils-440 nvidia-compute-utils-440 - -The ``nvidia-smi`` command will generate detailed output if the driver has loaded -successfully. - Further Reference ----------------- diff --git a/etc/kayobe/ansible/pci-passthrough.yml b/etc/kayobe/ansible/pci-passthrough.yml new file mode 100644 index 000000000..59803ccf3 --- /dev/null +++ b/etc/kayobe/ansible/pci-passthrough.yml @@ -0,0 +1,142 @@ +--- +- name: Enable GPU passthrough + hosts: "{{ (gpu_group_map | default({})).keys() }}" + vars: + # This playbook will execute after nodes are deployed + # and before overcloud host configure - we can't assume + # users and venvs exist. + ansible_user: "{{ bootstrap_user }}" + ansible_ssh_common_args: "-o StrictHostKeyChecking=no" + ansible_python_interpreter: "/usr/bin/python3" + vfio_pci_ids: |- + {% set gpu_list = [] %} + {% set output = [] %} + {% for gpu_group in gpu_group_map | dict2items | default([]) %} + {% if gpu_group.key in group_names %} + {% set _ = gpu_list.append(gpu_group.value) %} + {% endif %} + {% endfor %} + {% for item in gpu_list | flatten | unique %} + {% set _ = output.append(stackhpc_gpu_data[item]['vendor_id'] + ':' + stackhpc_gpu_data[item]['product_id']) %} + {% endfor %} + {{ output | join(',') }} + reboot_timeout_s: "{{ 20 * 60 }}" + tasks: + - name: Template dracut config + ansible.builtin.blockinfile: + path: /etc/dracut.conf.d/gpu-vfio.conf + block: | + add_drivers+="vfio vfio_iommu_type1 vfio_pci vfio_virqfd" + owner: root + group: root + mode: 0660 + create: true + become: true + notify: + - Regenerate initramfs + - reboot + + - name: Add vfio to modules-load.d + ansible.builtin.blockinfile: + path: /etc/modules-load.d/vfio.conf + block: | + vfio + vfio_iommu_type1 + vfio_pci + vfio_virqfd + owner: root + group: root + mode: 0664 + create: true + become: true + notify: reboot + + - name: Blacklist nouveau + ansible.builtin.blockinfile: + path: 
/etc/modprobe.d/blacklist-nouveau.conf + block: | + blacklist nouveau + options nouveau modeset=0 + mode: 0664 + owner: root + group: root + create: true + become: true + notify: + - reboot + - Regenerate initramfs + + - name: Ignore unsupported model specific registers + # Occasionally, applications running in the VM may crash unexpectedly, + # whereas they would run normally on a physical machine. If, while + # running dmesg -wH, you encounter an error mentioning MSR, the reason + # for those crashes is that KVM injects a General protection fault (GPF) + # when the guest tries to access unsupported Model-specific registers + # (MSRs) - this often results in guest applications/OS crashing. A + # number of those issues can be solved by passing the ignore_msrs=1 + # option to the KVM module, which will ignore unimplemented MSRs. + # source: https://wiki.archlinux.org/index.php/QEMU + ansible.builtin.blockinfile: + path: /etc/modprobe.d/kvm.conf + block: | + options kvm ignore_msrs=Y + # This option is not available in centos 7 as the kernel is too old, + # but it can help with dmesg spam in newer kernels (centos8?). 
Sample + # dmesg log message: + # [ +0.000002] kvm [8348]: vcpu0, guest rIP: 0xffffffffb0a767fa ignored rdmsr: 0x619 + # options kvm report_ignored_msrs=N + mode: 0664 + owner: root + group: root + create: true + become: true + notify: reboot + + - name: Add vfio-pci.ids kernel args + ansible.builtin.include_role: + name: stackhpc.linux.grubcmdline + vars: + kernel_cmdline: + - intel_iommu=on + - iommu=pt + - "vfio-pci.ids={{ vfio_pci_ids }}" + kernel_cmdline_remove: + - iommu + - intel_iommu + - vfio-pci.ids + + handlers: + - name: Regenerate initramfs (RedHat) + listen: Regenerate initramfs + ansible.builtin.shell: |- + #!/bin/bash + set -eux + dracut -v -f /boot/initramfs-$(uname -r).img $(uname -r) + become: true + changed_when: true + when: ansible_facts.os_family == 'RedHat' + + - name: Regenerate initramfs (Debian) + listen: Regenerate initramfs + ansible.builtin.shell: |- + #!/bin/bash + set -eux + update-initramfs -u -k $(uname -r) + become: true + changed_when: true + when: ansible_facts.os_family == 'Debian' + + - name: Reboot + listen: reboot + become: true + ansible.builtin.reboot: + reboot_timeout: "{{ reboot_timeout_s }}" + search_paths: + # Systems running molly-guard hang waiting for confirmation before rebooting without this. 
+ - /lib/molly-guard + # Default list: + - /sbin + - /bin + - /usr/sbin + - /usr/bin + - /usr/local/sbin diff --git a/etc/kayobe/hooks/overcloud-host-configure/pre.d/pci-passthrough.yml b/etc/kayobe/hooks/overcloud-host-configure/pre.d/pci-passthrough.yml new file mode 120000 index 000000000..ffdf55f6a --- /dev/null +++ b/etc/kayobe/hooks/overcloud-host-configure/pre.d/pci-passthrough.yml @@ -0,0 +1 @@ +../../../ansible/pci-passthrough.yml \ No newline at end of file diff --git a/etc/kayobe/kolla.yml b/etc/kayobe/kolla.yml index 4f64fb775..3f42667e6 100644 --- a/etc/kayobe/kolla.yml +++ b/etc/kayobe/kolla.yml @@ -485,6 +485,24 @@ kolla_build_args: {} # * groups: A list of kayobe ansible groups to map to this kolla-ansible group. # * vars: A dict mapping variable names to values for hosts in this # kolla-ansible group. +# NOTE(Alex-Welsh): If you want to extend the map rather than replace it, you +# must include the Kayobe defaults in the mapping. +# Standard Kayobe defaults: +# compute: +# groups: +# - "compute" +# control: +# groups: +# - "controllers" +# monitoring: +# groups: +# - "controllers" +# network: +# groups: +# - "controllers" +# storage: +# groups: +# - "controllers" #kolla_overcloud_inventory_top_level_group_map: # List of names of top level kolla-ansible groups. Any of these groups which @@ -499,7 +517,9 @@ kolla_build_args: {} # List of names of additional host variables to pass through from kayobe hosts # to kolla-ansible hosts, if set. See also # kolla_overcloud_inventory_pass_through_host_vars_map. -#kolla_overcloud_inventory_pass_through_host_vars_extra: +kolla_overcloud_inventory_pass_through_host_vars_extra: + - stackhpc_gpu_data + - gpu_group_map # List of names of host variables to pass through from kayobe hosts to # kolla-ansible hosts, if set. 
See also diff --git a/etc/kayobe/kolla/config/nova/nova-api.conf b/etc/kayobe/kolla/config/nova/nova-api.conf new file mode 100644 index 000000000..59e3a6102 --- /dev/null +++ b/etc/kayobe/kolla/config/nova/nova-api.conf @@ -0,0 +1,4 @@ +[pci] +{% for item in gpu_group_map | dict2items | map(attribute='value') | flatten | unique | list %} +alias = { "vendor_id":"{{ stackhpc_gpu_data[item].vendor_id }}", "product_id":"{{ stackhpc_gpu_data[item].product_id }}", "device_type":"{{ stackhpc_gpu_data[item].device_type }}", "name":"{{ stackhpc_gpu_data[item].resource_name }}" } +{% endfor %} diff --git a/etc/kayobe/kolla/config/nova/nova-compute.conf b/etc/kayobe/kolla/config/nova/nova-compute.conf new file mode 100644 index 000000000..5f8593dde --- /dev/null +++ b/etc/kayobe/kolla/config/nova/nova-compute.conf @@ -0,0 +1,13 @@ +[pci] +{% raw %} +{% set gpu_list = [] %} +{% for gpu_group in gpu_group_map | dict2items | default([]) %} +{% if gpu_group.key in group_names %} +{% set _ = gpu_list.append(gpu_group.value) %} +{% endif %} +{% endfor %} +{% for item in gpu_list | flatten | unique %} +device_spec = { "vendor_id":"{{ stackhpc_gpu_data[item].vendor_id }}", "product_id":"{{ stackhpc_gpu_data[item].product_id }}" } +alias = { "vendor_id":"{{ stackhpc_gpu_data[item].vendor_id }}", "product_id":"{{ stackhpc_gpu_data[item].product_id }}", "device_type":"{{ stackhpc_gpu_data[item].device_type }}", "name":"{{ stackhpc_gpu_data[item].resource_name }}" } +{% endfor %} +{% endraw %} diff --git a/etc/kayobe/kolla/config/nova/nova-scheduler.conf b/etc/kayobe/kolla/config/nova/nova-scheduler.conf new file mode 100644 index 000000000..f41bd8548 --- /dev/null +++ b/etc/kayobe/kolla/config/nova/nova-scheduler.conf @@ -0,0 +1,7 @@ +[filter_scheduler] +# Default list plus PciPassthroughFilter +# NOTE(Upgrade): defaults may change in each release. 
Default values can be +# checked here: +# https://docs.openstack.org/nova/latest/configuration/sample-config.html +enabled_filters = ComputeFilter,ComputeCapabilitiesFilter,ImagePropertiesFilter,ServerGroupAntiAffinityFilter,ServerGroupAffinityFilter,PciPassthroughFilter +available_filters = nova.scheduler.filters.all_filters diff --git a/etc/kayobe/stackhpc-compute.yml b/etc/kayobe/stackhpc-compute.yml new file mode 100644 index 000000000..5e86b0030 --- /dev/null +++ b/etc/kayobe/stackhpc-compute.yml @@ -0,0 +1,103 @@ +--- +# StackHPC compute node configuration + +# Map of inventory groups to GPU types. +# This is used to determine which GPU types each compute node should pass +# through to OpenStack. +# Keys are group names, values are a list of GPU types. +# Groups must be added to kolla_overcloud_inventory_top_level_group_map. +# GPU types must be keys in stackhpc_gpu_data. +# Example GPU group map: +# gpu_group_map: +# compute_a100: +# - a100_80 +# compute_v100: +# - v100_32 +# compute_multi_gpu: +# - a100_80 +# - v100_32 +gpu_group_map: {} + +# Dict mapping GPUs to PCI data. +# Resource names are used to identify the device in placement, and can be +# edited to match deployment-specific naming conventions. +# The default list covers many common GPUs, but can be extended as needed. 
+stackhpc_gpu_data: + # Nvidia H100 SXM5 80GB + h100_80_sxm: + resource_name: "{{ h100_80_sxm_resource_name | default('h100_80_sxm')}}" + vendor_id: "10de" + product_id: "2330" + device_type: "type-PF" + # Nvidia A100 SXM4 80GB + a100_80_sxm: + resource_name: "{{ a100_80_sxm_resource_name | default('a100_80_sxm')}}" + vendor_id: "10de" + product_id: "20b2" + device_type: "type-PF" + # Nvidia A100 SXM4 40GB + a100_40_sxm: + resource_name: "{{ a100_40_sxm_resource_name | default('a100_40_sxm')}}" + vendor_id: "10de" + product_id: "20b0" + device_type: "type-PF" + # Nvidia A100 PCI 80GB + a100_80: + resource_name: "{{ a100_80_resource_name | default('a100_80')}}" + vendor_id: "10de" + product_id: "20b5" + device_type: "type-PF" + # Nvidia A100 PCI 40GB + a100_40: + resource_name: "{{ a100_40_resource_name | default('a100_40')}}" + vendor_id: "10de" + product_id: "20f1" + device_type: "type-PF" + # Nvidia V100 SXM3 32GB + v100_32_sxm3: + resource_name: "{{ v100_32_sxm3_resource_name | default('v100_32_sxm3')}}" + vendor_id: "10de" + product_id: "1db8" + device_type: "type-PCI" + # Nvidia V100 SXM2 32GB + v100_32_sxm2: + resource_name: "{{ v100_32_sxm2_resource_name | default('v100_32_sxm2')}}" + vendor_id: "10de" + product_id: "1db5" + device_type: "type-PCI" + # Nvidia V100 PCI 32GB + v100_32: + resource_name: "{{ v100_32_resource_name | default('v100_32')}}" + vendor_id: "10de" + product_id: "1db6" + device_type: "type-PCI" + # Nvidia RTX A6000 + a6000: + resource_name: "{{ a6000_resource_name | default('a6000')}}" + vendor_id: "10de" + product_id: "2230" + device_type: "type-PCI" + # Nvidia A40 + a40: + resource_name: "{{ a40_resource_name | default('a40')}}" + vendor_id: "10de" + product_id: "2235" + device_type: "type-PF" + # Nvidia T4 + t4: + resource_name: "{{ t4_resource_name | default('t4')}}" + vendor_id: "10de" + product_id: "1eb8" + device_type: "type-PF" + # Nvidia L40 + l40: + resource_name: "{{ l40_resource_name | default('l40')}}" + vendor_id: "10de" + 
product_id: "26b5" + device_type: "type-PF" + # Nvidia L40s + l40s: + resource_name: "{{ l40s_resource_name | default('l40s')}}" + vendor_id: "10de" + product_id: "26b9" + device_type: "type-PF" diff --git a/releasenotes/notes/pci-passthrough-support-0c7e62585aaf2c23.yaml b/releasenotes/notes/pci-passthrough-support-0c7e62585aaf2c23.yaml new file mode 100644 index 000000000..eae2d774b --- /dev/null +++ b/releasenotes/notes/pci-passthrough-support-0c7e62585aaf2c23.yaml @@ -0,0 +1,8 @@ +--- +features: + - | + Added templates and a playbook to simplify configuration of PCI passthrough + GPUs. GPU types can be mapped to inventory groups with the + ``gpu_group_map`` variable, which will configure the host and Nova + automatically. A list of supported GPUs can be found in + ``etc/kayobe/stackhpc-compute.yml`` under ``stackhpc_gpu_data``.
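Reviewer note: the ``vfio_pci_ids`` Jinja expression in ``pci-passthrough.yml`` is dense, so a plain-Python sketch of the same logic may help when checking what kernel argument a given host will receive. This is an illustration only, not part of the change; the sample ``gpu_group_map`` is hypothetical, though the vendor/product IDs are taken from ``stackhpc_gpu_data``.

```python
# Mirror of the vfio_pci_ids logic: collect the GPU types mapped to the
# groups a host belongs to, de-duplicate them (like the `unique` filter),
# and join "vendor:product" pairs into a vfio-pci.ids value.

# Subset of stackhpc_gpu_data (IDs as in etc/kayobe/stackhpc-compute.yml).
stackhpc_gpu_data = {
    "a100_80": {"vendor_id": "10de", "product_id": "20b5"},
    "v100_32": {"vendor_id": "10de", "product_id": "1db6"},
}

# Hypothetical deployment mapping, as in the documentation example.
gpu_group_map = {
    "compute_a100": ["a100_80"],
    "compute_v100": ["v100_32"],
    "compute_multi_gpu": ["a100_80", "v100_32"],
}

def vfio_pci_ids(group_names):
    """Return the vfio-pci.ids kernel argument value for a host."""
    gpu_types = []
    for group, types in gpu_group_map.items():
        if group in group_names:
            gpu_types.extend(types)
    # De-duplicate while preserving order, like `flatten | unique`.
    seen = []
    for gpu in gpu_types:
        if gpu not in seen:
            seen.append(gpu)
    return ",".join(
        stackhpc_gpu_data[g]["vendor_id"] + ":" + stackhpc_gpu_data[g]["product_id"]
        for g in seen
    )

print(vfio_pci_ids(["compute_multi_gpu"]))  # 10de:20b5,10de:1db6
```

A host in both single-GPU groups gets the same de-duplicated result as one in ``compute_multi_gpu``, which is why the playbook flattens and uniquifies before joining.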