|
| 1 | +.. include:: vars.rst |
| 2 | + |
| 3 | +============================= |
| 4 | +Support for GPUs in OpenStack |
| 5 | +============================= |
| 6 | + |
| 7 | +This guide has been developed for Nvidia GPUs and CentOS 8. |
| 8 | + |
| 9 | +See `Kayobe Ops <https://github.com/stackhpc/kayobe-ops>`_ for |
| 10 | +a playbook implementation of host setup for GPU. |
| 11 | + |
| 12 | +BIOS Configuration Requirements |
| 13 | +------------------------------- |
| 14 | + |
| 15 | +On an Intel system: |
| 16 | + |
| 17 | +* Enable `VT-x` in the BIOS for virtualisation support. |
| 18 | +* Enable `VT-d` in the BIOS for IOMMU support. |
| 19 | + |
| 20 | +Hypervisor Configuration Requirements |
| 21 | +------------------------------------- |
| 22 | + |
| 23 | +Find the GPU device IDs |
| 24 | +^^^^^^^^^^^^^^^^^^^^^^^ |
| 25 | + |
| 26 | +From the host OS, use ``lspci -nn`` to find the PCI vendor ID and |
| 27 | +device ID for the GPU device and supporting components. These are |
| 28 | +4-digit hex numbers. |
| 29 | + |
| 30 | +For example: |
| 31 | + |
| 32 | +.. code-block:: text |
| 33 | +
|
| 34 | + 01:00.0 VGA compatible controller [0300]: NVIDIA Corporation GM204M [GeForce GTX 980M] [10de:13d7] (rev a1) (prog-if 00 [VGA controller]) |
| 35 | + 01:00.1 Audio device [0403]: NVIDIA Corporation GM204 High Definition Audio Controller [10de:0fbb] (rev a1) |
| 36 | +
|
| 37 | +In this case the vendor ID is ``10de``, display ID is ``13d7`` and audio ID is ``0fbb``. |
| 38 | + |
| 39 | +Alternatively, for an Nvidia Quadro RTX 6000: |
| 40 | + |
| 41 | +.. code-block:: yaml |
| 42 | +
|
| 43 | + # NVIDIA Quadro RTX 6000/8000 PCI device IDs |
| 44 | + vendor_id: "10de" |
| 45 | + display_id: "1e30" |
| 46 | + audio_id: "10f7" |
| 47 | + usba_id: "1ad6" |
| 48 | + usba_class: "0c0330" |
| 49 | + usbc_id: "1ad7" |
| 50 | + usbc_class: "0c8000" |
| 51 | +
|
| 52 | +These parameters will be used for device-specific configuration. |
| 53 | + |
| 54 | +Kernel Ramdisk Reconfiguration |
| 55 | +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ |
| 56 | + |
| 57 | +The ramdisk loaded during kernel boot can be extended to include the |
| 58 | +vfio PCI drivers and ensure they are loaded early in system boot. |
| 59 | + |
| 60 | +.. code-block:: yaml |
| 61 | +
|
| 62 | + - name: Template dracut config |
| 63 | + blockinfile: |
| 64 | + path: /etc/dracut.conf.d/gpu-vfio.conf |
| 65 | + block: | |
| 66 | + add_drivers+="vfio vfio_iommu_type1 vfio_pci vfio_virqfd" |
| 67 | + owner: root |
| 68 | + group: root |
| 69 | + mode: 0660 |
| 70 | + create: true |
| 71 | + become: true |
| 72 | + notify: |
| 73 | + - Regenerate initramfs |
| 74 | + - reboot |
| 75 | +
|
| 76 | +The handler for regenerating the Dracut initramfs is: |
| 77 | + |
| 78 | +.. code-block:: yaml |
| 79 | +
|
| 80 | + - name: Regenerate initramfs |
| 81 | + shell: |- |
| 82 | + #!/bin/bash |
| 83 | + set -eux |
| 84 | + dracut -v -f /boot/initramfs-$(uname -r).img $(uname -r) |
| 85 | + become: true |
| 86 | +
|
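The tasks in this section also notify a ``reboot`` handler, which is not
shown here. A minimal sketch, assuming the built-in Ansible ``reboot``
module suits the hypervisors in question (the timeout value is an
illustrative assumption, not taken from the original playbooks):

.. code-block:: yaml

    # Hypothetical handler; its name must match the "reboot" notify entries.
    - name: reboot
      reboot:
        reboot_timeout: 900  # assumed value; adjust for slow-booting hardware
      become: true

Ansible runs handlers in the order in which they are defined, not the order
in which they are notified, so a handler like this would need to be defined
after ``Regenerate initramfs`` for the new ramdisk to be in place before the
reboot.
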
| 87 | +Kernel Boot Parameters |
| 88 | +^^^^^^^^^^^^^^^^^^^^^^ |
| 89 | + |
| 90 | +Set the following kernel parameters by adding them to |
| 91 | +``GRUB_CMDLINE_LINUX_DEFAULT`` or ``GRUB_CMDLINE_LINUX`` in |
| 92 | +``/etc/default/grub``. We can use the |
| 93 | +`stackhpc.grubcmdline <https://galaxy.ansible.com/stackhpc/grubcmdline>`_ |
| 94 | +role from Ansible Galaxy: |
| 95 | + |
| 96 | +.. code-block:: yaml |
| 97 | +
|
| 98 | + - name: Add vfio-pci.ids kernel args |
| 99 | + include_role: |
| 100 | + name: stackhpc.grubcmdline |
| 101 | + vars: |
| 102 | + kernel_cmdline: |
| 103 | + - intel_iommu=on |
| 104 | + - iommu=pt |
| 105 | + - "vfio-pci.ids={{ vendor_id }}:{{ display_id }},{{ vendor_id }}:{{ audio_id }}" |
| 106 | + kernel_cmdline_remove: |
| 107 | + - iommu |
| 108 | + - intel_iommu |
| 109 | + - vfio-pci.ids |
| 110 | +
|
| 111 | +Kernel Device Management |
| 112 | +^^^^^^^^^^^^^^^^^^^^^^^^ |
| 113 | + |
| 114 | +In the hypervisor, we must prevent the kernel from initialising the |
| 115 | +GPU and prevent host OS drivers from binding to it. We do this using |
| 116 | +``udev`` rules: |
| 117 | + |
| 118 | +.. code-block:: yaml |
| 119 | +
|
| 120 | + - name: Template udev rules to blacklist GPU usb controllers |
| 121 | + blockinfile: |
| 122 | + # We want this to execute as soon as possible |
| 123 | + path: /etc/udev/rules.d/99-gpu.rules |
| 124 | + block: | |
| 125 | + #Remove NVIDIA USB xHCI Host Controller Devices, if present |
| 126 | + ACTION=="add", SUBSYSTEM=="pci", ATTR{vendor}=="0x{{ vendor_id }}", ATTR{class}=="0x{{ usba_class }}", ATTR{remove}="1" |
| 127 | + #Remove NVIDIA USB Type-C UCSI devices, if present |
| 128 | + ACTION=="add", SUBSYSTEM=="pci", ATTR{vendor}=="0x{{ vendor_id }}", ATTR{class}=="0x{{ usbc_class }}", ATTR{remove}="1" |
| 129 | + owner: root |
| 130 | + group: root |
| 131 | + mode: 0644 |
| 132 | + create: true |
| 133 | + become: true |
| 134 | +
|
| 135 | +Kernel Drivers |
| 136 | +^^^^^^^^^^^^^^ |
| 137 | + |
| 138 | +Prevent the ``nouveau`` kernel driver from loading by |
| 139 | +blacklisting the module: |
| 140 | + |
| 141 | +.. code-block:: yaml |
| 142 | +
|
| 143 | + - name: Blacklist nouveau |
| 144 | + blockinfile: |
| 145 | + path: /etc/modprobe.d/blacklist-nouveau.conf |
| 146 | + block: | |
| 147 | + blacklist nouveau |
| 148 | + options nouveau modeset=0 |
| 149 | + mode: 0664 |
| 150 | + owner: root |
| 151 | + group: root |
| 152 | + create: true |
| 153 | + become: true |
| 154 | + notify: |
| 155 | + - reboot |
| 156 | + - Regenerate initramfs |
| 157 | +
|
| 158 | +Ensure that the ``vfio`` drivers are loaded into the kernel on boot: |
| 159 | + |
| 160 | +.. code-block:: yaml |
| 161 | +
|
| 162 | + - name: Add vfio to modules-load.d |
| 163 | + blockinfile: |
| 164 | + path: /etc/modules-load.d/vfio.conf |
| 165 | + block: | |
| 166 | + vfio |
| 167 | + vfio_iommu_type1 |
| 168 | + vfio_pci |
| 169 | + vfio_virqfd |
| 170 | + owner: root |
| 171 | + group: root |
| 172 | + mode: 0664 |
| 173 | + create: true |
| 174 | + become: true |
| 175 | + notify: reboot |
| 176 | +
|
| 177 | +Once this configuration has taken effect (after a reboot), the VFIO kernel drivers should be loaded and bound to the GPU: |
| 178 | + |
| 179 | +.. code-block:: text |
| 180 | +
|
| 181 | + # lsmod | grep vfio |
| 182 | + vfio_pci 49152 0 |
| 183 | + vfio_virqfd 16384 1 vfio_pci |
| 184 | + vfio_iommu_type1 28672 0 |
| 185 | + vfio 32768 2 vfio_iommu_type1,vfio_pci |
| 186 | + irqbypass 16384 5 vfio_pci,kvm |
| 187 | +
|
| 188 | + # lspci -nnk -s 3d:00.0 |
| 189 | + 3d:00.0 VGA compatible controller [0300]: NVIDIA Corporation GM107GL [Tesla M10] [10de:13bd] (rev a2) |
| 190 | + Subsystem: NVIDIA Corporation Tesla M10 [10de:1160] |
| 191 | + Kernel driver in use: vfio-pci |
| 192 | + Kernel modules: nouveau |
| 193 | +
|
| 194 | +IOMMU should also be enabled at the kernel level. We can verify that on the compute host: |
| 195 | + |
| 196 | +.. code-block:: text |
| 197 | +
|
| 198 | + # docker exec -it nova_libvirt virt-host-validate | grep IOMMU |
| 199 | + QEMU: Checking for device assignment IOMMU support : PASS |
| 200 | + QEMU: Checking if IOMMU is enabled by kernel : PASS |
| 201 | +
|
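The kernel arguments themselves can also be checked from Ansible. The
following is a sketch only, assuming the parameters from the Kernel Boot
Parameters section; the task names and the registered variable
``proc_cmdline`` are illustrative:

.. code-block:: yaml

    # Read the command line of the running kernel without reporting a change.
    - name: Read the running kernel command line
      command: cat /proc/cmdline
      register: proc_cmdline
      changed_when: false

    # Fail early if any of the expected arguments is missing.
    - name: Check that the IOMMU and vfio-pci arguments are present
      assert:
        that:
          - "'intel_iommu=on' in proc_cmdline.stdout"
          - "'iommu=pt' in proc_cmdline.stdout"
          - "'vfio-pci.ids=' in proc_cmdline.stdout"
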
| 202 | +OpenStack Nova configuration |
| 203 | +---------------------------- |
| 204 | + |
| 205 | +Configure nova-scheduler |
| 206 | +^^^^^^^^^^^^^^^^^^^^^^^^ |
| 207 | + |
| 208 | +The nova-scheduler service must be configured to enable the ``PciPassthroughFilter``. |
| 209 | +To enable it, add it to the list of filters in the Kolla-Ansible configuration file |
| 210 | +``etc/kayobe/kolla/config/nova.conf``, for instance: |
| 211 | + |
| 212 | +.. code-block:: ini |
| 213 | +
|
| 214 | + [filter_scheduler] |
| 215 | + available_filters = nova.scheduler.filters.all_filters |
| 216 | + enabled_filters = AvailabilityZoneFilter, ComputeFilter, ComputeCapabilitiesFilter, ImagePropertiesFilter, ServerGroupAntiAffinityFilter, ServerGroupAffinityFilter, PciPassthroughFilter |
| 217 | +
|
| 218 | +Configure nova-compute |
| 219 | +^^^^^^^^^^^^^^^^^^^^^^ |
| 220 | + |
| 221 | +Configuration can be applied in flexible ways using Kolla-Ansible's |
| 222 | +methods for `inventory-driven customisation of configuration |
| 223 | +<https://docs.openstack.org/kayobe/latest/configuration/reference/kolla-ansible.html#service-configuration>`_. |
| 224 | +The following configuration could be added to |
| 225 | +``etc/kayobe/kolla/config/nova/nova-compute.conf`` to enable PCI |
| 226 | +passthrough of GPU devices for hosts in a group named ``compute_gpu``. |
| 227 | +Again, the 4-digit PCI Vendor ID and Device ID extracted from ``lspci |
| 228 | +-nn`` can be used here to specify the GPU device(s). |
| 229 | + |
| 230 | +.. code-block:: jinja |
| 231 | +
|
| 232 | + [pci] |
| 233 | + {% raw %} |
| 234 | + {% if inventory_hostname in groups['compute_gpu'] %} |
| 235 | + # We could support multiple models of GPU. |
| 236 | + # This can be done more selectively using different inventory groups. |
| 237 | + # GPU models defined here: |
| 238 | + # NVidia Tesla V100 16GB |
| 239 | + # NVidia Tesla V100 32GB |
| 240 | + # NVidia Tesla P100 16GB |
| 241 | + passthrough_whitelist = [{ "vendor_id":"10de", "product_id":"1db4" }, |
| 242 | + { "vendor_id":"10de", "product_id":"1db5" }, |
| 243 | + { "vendor_id":"10de", "product_id":"15f8" }] |
| 244 | + alias = { "vendor_id":"10de", "product_id":"1db4", "device_type":"type-PCI", "name":"gpu-v100-16" } |
| 245 | + alias = { "vendor_id":"10de", "product_id":"1db5", "device_type":"type-PCI", "name":"gpu-v100-32" } |
| 246 | + alias = { "vendor_id":"10de", "product_id":"15f8", "device_type":"type-PCI", "name":"gpu-p100" } |
| 247 | + {% endif %} |
| 248 | + {% endraw %} |
| 249 | +
|
| 250 | +Configure nova-api |
| 251 | +^^^^^^^^^^^^^^^^^^ |
| 252 | + |
| 253 | +``pci.alias`` also needs to be configured on the controller. |
| 254 | +This configuration should match the configuration found on the compute nodes. |
| 255 | +Add it to the Kolla-Ansible configuration file |
| 256 | +``etc/kayobe/kolla/config/nova/nova-api.conf``, for instance: |
| 257 | + |
| 258 | +.. code-block:: ini |
| 259 | +
|
| 260 | + [pci] |
| 261 | + alias = { "vendor_id":"10de", "product_id":"1db4", "device_type":"type-PCI", "name":"gpu-v100-16" } |
| 262 | + alias = { "vendor_id":"10de", "product_id":"1db5", "device_type":"type-PCI", "name":"gpu-v100-32" } |
| 263 | + alias = { "vendor_id":"10de", "product_id":"15f8", "device_type":"type-PCI", "name":"gpu-p100" } |
| 264 | +
|
| 265 | +Reconfigure nova service |
| 266 | +^^^^^^^^^^^^^^^^^^^^^^^^ |
| 267 | + |
| 268 | +.. code-block:: text |
| 269 | +
|
| 270 | + kayobe overcloud service reconfigure --kolla-tags nova --kolla-skip-tags common --skip-prechecks |
| 271 | +
|
| 272 | +Configure a flavor |
| 273 | +^^^^^^^^^^^^^^^^^^ |
| 274 | + |
| 275 | +For example, to request two of the GPUs with alias ``gpu-p100``: |
| 276 | + |
| 277 | +.. code-block:: text |
| 278 | +
|
| 279 | + openstack flavor set m1.medium --property "pci_passthrough:alias"="gpu-p100:2" |
| 280 | +
|
| 281 | +
|
| 282 | +This can also be defined in the |project_config| repository: |
| 283 | +|project_config_source_url| |
| 284 | + |
| 285 | +Add ``extra_specs`` to the flavor in etc/|project_config|/|project_config|.yml: |
| 286 | + |
| 287 | +.. code-block:: console |
| 288 | + :substitutions: |
| 289 | +
|
| 290 | + admin# cd |base_path|/src/|project_config| |
| 291 | + admin# vim etc/|project_config|/|project_config|.yml |
| 292 | +
|
| 293 | + name: "m1.medium" |
| 294 | + ram: 4096 |
| 295 | + disk: 40 |
| 296 | + vcpus: 2 |
| 297 | + extra_specs: |
| 298 | + "pci_passthrough:alias": "gpu-p100:2" |
| 299 | +
|
| 300 | +Invoke configuration playbooks afterwards: |
| 301 | + |
| 302 | +.. code-block:: console |
| 303 | + :substitutions: |
| 304 | +
|
| 305 | + admin# source |base_path|/src/|kayobe_config|/etc/kolla/public-openrc.sh |
| 306 | + admin# source |base_path|/venvs/|project_config|/bin/activate |
| 307 | + admin# tools/|project_config| --vault-password-file |vault_password_file_path| |
| 308 | +
|
| 309 | +Create instance with GPU passthrough |
| 310 | +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ |
| 311 | + |
| 312 | +.. code-block:: text |
| 313 | +
|
| 314 | + openstack server create --flavor m1.medium --image ubuntu2004 --wait test-pci |
| 315 | +
|
| 316 | +Testing GPU in a Guest VM |
| 317 | +------------------------- |
| 318 | + |
| 319 | +The Nvidia drivers must be installed first. For example, on an Ubuntu guest: |
| 320 | + |
| 321 | +.. code-block:: text |
| 322 | +
|
| 323 | + sudo apt install nvidia-headless-440 nvidia-utils-440 nvidia-compute-utils-440 |
| 324 | +
|
| 325 | +The ``nvidia-smi`` command will generate detailed output if the driver has loaded |
| 326 | +successfully. |
| 327 | + |
| 328 | +Further Reference |
| 329 | +----------------- |
| 330 | + |
| 331 | +For PCI Passthrough and GPUs in OpenStack: |
| 332 | + |
| 333 | +* Consumer-grade GPUs: https://gist.github.com/claudiok/890ab6dfe76fa45b30081e58038a9215 |
| 334 | +* https://www.jimmdenton.com/gpu-offloading-openstack/ |
| 335 | +* https://docs.openstack.org/nova/latest/admin/pci-passthrough.html |
| 336 | +* https://docs.openstack.org/nova/latest/admin/virtual-gpu.html (vGPU only) |
| 337 | +* Tesla models in OpenStack: https://egallen.com/openstack-nvidia-tesla-gpu-passthrough/ |
| 338 | +* https://wiki.archlinux.org/index.php/PCI_passthrough_via_OVMF |
| 339 | +* https://www.kernel.org/doc/Documentation/Intel-IOMMU.txt |
| 340 | +* https://access.redhat.com/documentation/en-us/red_hat_virtualization/4.1/html/installation_guide/appe-configuring_a_hypervisor_host_for_pci_passthrough |
| 341 | +* https://www.gresearch.co.uk/article/utilising-the-openstack-placement-service-to-schedule-gpu-and-nvme-workloads-alongside-general-purpose-instances/ |