|
| 1 | +# Validated Architecture - Nvidia-Mdev |
| 2 | + |
| 3 | +This document describes the CRs and deployment workflow to create an |
| 4 | +environment with EDPM Compute Nodes capable of supplying Nvidia mediated |
| 5 | +devices (Mdevs). Mdevs allow multiple guests to share the same physical GPU |
| 6 | +card on the hypervisor. The deployment also defines Custom Traits and maps |
| 7 | +them to different resource providers by passing the definitions via |
| 8 | +provider.yaml through a ConfigMap. |
| 9 | + |
| 10 | +## Purpose |
| 11 | + |
| 12 | +This topology is primarily used to verify environments that provide Nvidia |
| 13 | +Mdevs and to confirm that guests can use the resource correctly. Note that |
| 14 | +this type of deployment cannot be simulated with nested virtualization and |
| 15 | +requires real baremetal hosts. |
| 16 | + |
| 17 | +## Environment |
| 18 | + |
| 19 | +### Nodes |
| 20 | + |
| 21 | +| Role | Machine Type | Count | |
| 22 | +| --------------------------- | ------------ | ----- | |
| 23 | +| Compact OpenShift | VM | 3 |
| 24 | +| OpenStack Baremetal Compute | Baremetal | 2 | |
| 25 | + |
| 26 | +### Networks |
| 27 | + |
| 28 | +| Name | Type | Interface | CIDR | |
| 29 | +| ------------ | -------- | --------- | --------------- | |
| 30 | +| Provisioning | untagged | nic1 | 172.23.0.0/24 | |
| 31 | +| Machine | untagged | nic2 | 192.168.51.0/20 | |
| 32 | +| RH OSP | trunk | nic3 | | |
| 33 | + |
| 34 | + |
| 35 | +#### VLAN networks in RH OSP |
| 36 | + |
| 37 | +| Name | Type | CIDR | |
| 38 | +| ----------- | ----------- | ----------------- | |
| 39 | +| ctlplane | untagged | 192.168.122.0/24 | |
| 40 | +| internalapi | VLAN tagged | 172.17.0.0/24 | |
| 41 | +| storage | VLAN tagged | 172.18.0.0/24 | |
| 42 | +| storagemgmt | VLAN tagged | 172.20.0.0/24 | |
| 43 | +| tenant | VLAN tagged | 172.19.0.0/24 | |
| 44 | + |
| 45 | +#### Nova Mdev Configuration |
| 46 | + |
| 47 | +To deploy vGPU devices of different types and to support live migration, |
| 48 | +apply the following configuration to Nova. |
| 49 | + |
| 50 | +```YAML |
| 51 | +--- |
| 52 | +apiVersion: v1 |
| 53 | +data: |
| 54 | + 25-cpu-pinning-nova.conf: | |
| 55 | + [libvirt] |
| 56 | + live_migration_completion_timeout = 0 |
| 57 | + live_migration_downtime = 500000 |
| 58 | + live_migration_downtime_steps = 3 |
| 59 | + live_migration_downtime_delay = 3 |
| 60 | + live_migration_permit_post_copy = false |
| 61 | + [devices] |
| 62 | + enabled_vgpu_types=nvidia-228,nvidia-229 |
| 63 | + [vgpu_nvidia-228] |
| 64 | + device_addresses=0000:82:00.0 |
| 65 | + [vgpu_nvidia-229] |
| 66 | + device_addresses=0000:04:00.0 |
| 67 | +kind: ConfigMap |
| 68 | +metadata: |
| 69 | + name: cpu-pinning-nova |
| 70 | + namespace: openstack |
| 71 | +``` |
| 72 | + |
| 73 | +#### Provider.yaml |
| 74 | + |
| 75 | +To make it easy to use multiple Mdev types in an environment when creating |
| 76 | +flavors, we can associate traits with specific resource providers. With |
| 77 | +provider.yaml we can map those traits and apply them as part of a deployment. |
| 78 | + |
| 79 | +```YAML |
| 80 | +--- |
| 81 | +apiVersion: v1 |
| 82 | +data: |
| 83 | + provider.yaml: | |
| 84 | + meta: |
| 85 | + schema_version: "1.0" |
| 86 | + providers: |
| 87 | + - identification: |
| 88 | + name: edpm-compute-0.ctlplane.example.com_pci_0000_04_00_0 |
| 89 | + traits: |
| 90 | + additional: |
| 91 | + - CUSTOM_NVIDIA_229 |
| 92 | + - identification: |
| 93 | + name: edpm-compute-0.ctlplane.example.com_pci_0000_82_00_0 |
| 94 | + traits: |
| 95 | + additional: |
| 96 | + - CUSTOM_NVIDIA_228 |
| 97 | + - identification: |
| 98 | + name: edpm-compute-1.ctlplane.example.com_pci_0000_04_00_0 |
| 99 | + traits: |
| 100 | + additional: |
| 101 | + - CUSTOM_NVIDIA_229 |
| 102 | + - identification: |
| 103 | + name: edpm-compute-1.ctlplane.example.com_pci_0000_82_00_0 |
| 104 | + traits: |
| 105 | + additional: |
| 106 | + - CUSTOM_NVIDIA_228 |
| 107 | +kind: ConfigMap |
| | +metadata: |
| 108 | + name: compute-provider |
| 109 | + namespace: openstack |
| 110 | +--- |
| 111 | +apiVersion: dataplane.openstack.org/v1beta1 |
| 112 | +kind: OpenStackDataPlaneService |
| 113 | +metadata: |
| 114 | + name: compute-provider |
| 115 | + namespace: openstack |
| 116 | +spec: |
| 117 | + addCertMounts: false |
| 118 | + caCerts: combined-ca-bundle |
| 119 | + dataSources: |
| 120 | + - configMapRef: |
| 121 | + name: compute-provider |
| 122 | + - configMapRef: |
| 123 | + name: cpu-pinning-nova |
| 124 | + - configMapRef: |
| 125 | + name: sriov-nova |
| 126 | + - secretRef: |
| 127 | + name: nova-cell1-compute-config |
| 128 | + - secretRef: |
| 129 | + name: nova-migration-ssh-key |
| 130 | + edpmServiceType: nova |
| 131 | + playbook: osp.edpm.nova |
| 132 | + tlsCerts: |
| 133 | + default: |
| 134 | + contents: |
| 135 | + - dnsnames |
| 136 | + - ips |
| 137 | + issuer: osp-rootca-issuer-internal |
| 138 | + networks: |
| 139 | + - ctlplane |
| 140 | +--- |
| 141 | +apiVersion: dataplane.openstack.org/v1beta1 |
| 142 | +kind: OpenStackDataPlaneDeployment |
| 143 | +metadata: |
| 144 | + name: edpm-deployment-post-driver |
| 145 | + namespace: openstack |
| 146 | +spec: |
| 147 | + ansibleExtraVars: |
| 148 | + edpm_reboot_strategy: force |
| 149 | + nodeSets: |
| 150 | + - openstack-edpm |
| 151 | + preserveJobs: true |
| 152 | + servicesOverride: |
| 153 | + - reboot-os |
| 154 | + - compute-provider |
| 155 | +``` |
| 156 | + |
| 157 | +## Stages |
| 158 | +All stages must be executed in the order listed below. Everything is required unless otherwise indicated. |
| 159 | + |
| 160 | +1. [Install the OpenStack K8S operators and their dependencies](../../common/) |
| 161 | +2. [Configure networking and deploy the OpenStack control plane](control-plane.md) |
| 162 | +3. [Configure and deploy the initial dataplane](edpm-pre.md) |
| 163 | +4. [Update the dataplane to deploy the necessary vGPU Mdev requirements](edpm-post.md) |
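
The resource provider names in provider.yaml appear to follow the pattern `<hostname>_pci_<PCI address with ":" and "." replaced by "_">`, and each trait mirrors the vGPU type configured for that address in the Nova ConfigMap. The sketch below generates the `providers` list from those two inputs so the two ConfigMaps stay consistent. The naming pattern and the trait derivation are inferred from the examples in this README, and the helper names (`provider_name`, `trait_for`, `build_providers`) are hypothetical, not part of any OpenStack tooling.

```python
# Sketch only: derives provider.yaml entries from the vGPU settings shown in
# this README. The naming pattern is inferred from the examples above.

def provider_name(hostname: str, pci_address: str) -> str:
    """Return '<hostname>_pci_<address>' with ':' and '.' turned into '_'."""
    return f"{hostname}_pci_{pci_address.replace(':', '_').replace('.', '_')}"


def trait_for(vgpu_type: str) -> str:
    """Map a vGPU type such as 'nvidia-228' to 'CUSTOM_NVIDIA_228'."""
    return "CUSTOM_" + vgpu_type.upper().replace("-", "_")


def build_providers(hostnames, vgpu_devices):
    """Build the provider.yaml structure as a plain dict.

    vgpu_devices maps a vGPU type to the list of PCI addresses that
    expose it, mirroring the [vgpu_<type>] groups in the Nova config.
    """
    providers = []
    for host in hostnames:
        for vgpu_type, addresses in vgpu_devices.items():
            for address in addresses:
                providers.append({
                    "identification": {"name": provider_name(host, address)},
                    "traits": {"additional": [trait_for(vgpu_type)]},
                })
    return {"meta": {"schema_version": "1.0"}, "providers": providers}


# Values taken from the ConfigMaps in this README.
HOSTS = [
    "edpm-compute-0.ctlplane.example.com",
    "edpm-compute-1.ctlplane.example.com",
]
VGPU_DEVICES = {
    "nvidia-228": ["0000:82:00.0"],
    "nvidia-229": ["0000:04:00.0"],
}

doc = build_providers(HOSTS, VGPU_DEVICES)
for provider in doc["providers"]:
    print(provider["identification"]["name"], provider["traits"]["additional"])
```

The resulting dict can be serialized with any YAML library and placed under the `provider.yaml` key of the `compute-provider` ConfigMap. Generating it from the same mapping used for `device_addresses` avoids a trait being attached to the wrong GPU.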