
Commit 41714d5

Merge pull request #455 from jamepark4/va-nvidia-mdev
Va nvidia mdev

This VA allows for the deployment of an environment that supports providing multiple Mdev types to guests. One of the key workflow differences between this deployment and other VAs/DTs is that, after the initial deployment, it needs to:

- leverage CIFMW to install the necessary Nvidia drivers
- create a provider.yaml to map traits to resource providers
- reboot the computes

This workflow is not considered the universally official way of installing/deploying a vGPU-capable environment, since the procedure can change depending on the underlying hardware, but this is the procedure we plan to use when testing with our own equipment.

Reviewed-by: Andrew Bays <[email protected]>
Reviewed-by: jamepark4 <[email protected]>
Reviewed-by: John Fulton <[email protected]>
2 parents dbef3cf + c2db68f commit 41714d5

37 files changed: +1494 -0 lines changed

automation/vars/nvidia-mdev.yaml

Lines changed: 81 additions & 0 deletions
@@ -0,0 +1,81 @@
---
vas:
  nvidia-mdev:
    stages:
      - path: examples/va/nvidia-mdev/control-plane/nncp
        wait_conditions:
          - >-
            oc -n openstack wait nncp
            -l osp/nncm-config-type=standard
            --for jsonpath='{.status.conditions[0].reason}'=SuccessfullyConfigured
            --timeout=60s
        values:
          - name: network-values
            src_file: values.yaml
        build_output: nncp.yaml

      - path: examples/va/nvidia-mdev/control-plane
        wait_conditions:
          - >-
            oc -n openstack wait osctlplane controlplane --for condition=Ready
            --timeout=1200s
        values:
          - name: network-values
            src_file: nncp/values.yaml
          - name: service-values
            src_file: service-values.yaml
        build_output: control-plane.yaml

      - path: examples/va/nvidia-mdev/edpm/nodeset
        wait_conditions:
          - >-
            oc -n openstack wait
            osdpns openstack-edpm --for condition=SetupReady
            --timeout=60m
        values:
          - name: edpm-nodeset-values
            src_file: values.yaml
        build_output: nodeset.yaml

      - path: examples/va/nvidia-mdev/edpm/deployment
        wait_conditions:
          - >-
            oc -n openstack wait
            osdpns openstack-edpm --for condition=Ready
            --timeout=60m
        values:
          - name: edpm-deployment-values
            src_file: values.yaml
        build_output: deployment.yaml
        post_stage_run:
          - name: Run phase 1 playbook
            type: playbook
            # As a reminder, the job needs to set the nvidia driver URL
            source: "../../playbooks/nvidia-mdev-phase1.yml"
            inventory: "${HOME}/ci-framework-data/artifacts/zuul_inventory.yml"
          - name: Run phase 2 playbook
            type: playbook
            source: "../../playbooks/nvidia-mdev-phase2.yml"
            inventory: "${HOME}/ci-framework-data/artifacts/zuul_inventory.yml"

      - path: examples/va/nvidia-mdev/edpm-post-driver/nodeset
        wait_conditions:
          - >-
            oc -n openstack wait
            osdpns openstack-edpm --for condition=Ready
            --timeout=10m
        values:
          - name: edpm-provider-values
            src_file: values.yaml
        build_output: compute-provider-service.yaml

      - path: examples/va/nvidia-mdev/edpm-post-driver/deployment
        wait_conditions:
          - >-
            oc -n openstack wait
            osdpd edpm-deployment-post-driver --for condition=Ready
            --timeout=20m
        values:
          - name: edpm-deployment-post-driver
            src_file: values.yaml
        build_output: post-driver-deployment.yaml
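
Read alongside the manual steps in control-plane.md, each stage entry appears to correspond to a build, apply, wait sequence. The following is a minimal sketch of that sequence, not the CI framework's actual implementation: `run_stage` is a hypothetical helper, and the handling of `values`/`src_file` and `post_stage_run` is deliberately left out.

```shell
# Sketch only: run_stage is a hypothetical helper, not part of the CI framework.
#
# run_stage PATH BUILD_OUTPUT WAIT_COMMAND
#   PATH          kustomize directory (the stage's `path`)
#   BUILD_OUTPUT  file that receives the rendered CRs (the stage's `build_output`)
#   WAIT_COMMAND  one entry from the stage's `wait_conditions`, as a single string
run_stage() {
  local path="$1" build_output="$2" wait_command="$3"

  # Render the CRs for this stage
  kustomize build "$path" > "$build_output" || return 1

  # Apply the rendered CRs to the cluster
  oc apply -f "$build_output" || return 1

  # The wait condition is stored as one folded string, so evaluate it as a command line
  eval "$wait_command"
}
```

For example, the first stage could be driven with `run_stage examples/va/nvidia-mdev/control-plane/nncp nncp.yaml "oc -n openstack wait nncp -l osp/nncm-config-type=standard --for jsonpath='{.status.conditions[0].reason}'=SuccessfullyConfigured --timeout=60s"`.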

examples/va/nvidia-mdev/.gitignore

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
control-plane.yaml

examples/va/nvidia-mdev/README.md

Lines changed: 163 additions & 0 deletions
@@ -0,0 +1,163 @@
# Validated Architecture - Nvidia-Mdev

This document describes the CRs and deployment workflow to create an
environment with EDPM Compute Nodes capable of supplying Nvidia mediated
devices (Mdevs). Mdevs allow multiple guests to share the same physical GPU
card on the hypervisor. The deployment also defines Custom Traits and maps
them to different resource providers by passing the definitions in a
provider.yaml file through a ConfigMap.

## Purpose

This topology is used primarily to verify environments that provide Nvidia
Mdevs and to confirm that guests can use the resource correctly. Note that
this type of deployment cannot be simulated with nested virtualization and
requires real baremetal hosts.

## Environment

### Nodes

| Role                        | Machine Type | Count |
| --------------------------- | ------------ | ----- |
| Compact OpenShift           | vm           | 3     |
| OpenStack Baremetal Compute | Baremetal    | 2     |

### Networks

| Name         | Type     | Interface | CIDR            |
| ------------ | -------- | --------- | --------------- |
| Provisioning | untagged | nic1      | 172.23.0.0/24   |
| Machine      | untagged | nic2      | 192.168.51.0/20 |
| RH OSP       | trunk    | nic3      |                 |

#### VLAN networks in RH OSP

| Name        | Type        | CIDR              |
| ----------- | ----------- | ----------------- |
| ctlplane    | untagged    | 192.168.122.0/24  |
| internalapi | VLAN tagged | 172.17.0.0/24     |
| storage     | VLAN tagged | 172.18.0.0/24     |
| storagemgmt | VLAN tagged | 172.20.0.0/24     |
| tenant      | VLAN tagged | 172.19.0.0/24     |

#### Nova Mdev Configuration

To deploy vGPU devices of different types and to support live migration,
apply the configuration below to Nova.

```YAML
---
apiVersion: v1
data:
  25-cpu-pinning-nova.conf: |
    [libvirt]
    live_migration_completion_timeout = 0
    live_migration_downtime = 500000
    live_migration_downtime_steps = 3
    live_migration_downtime_delay = 3
    live_migration_permit_post_copy = false
    [devices]
    enabled_vgpu_types=nvidia-228,nvidia-229
    [vgpu_nvidia-228]
    device_addresses=0000:82:00.0
    [vgpu_nvidia-229]
    device_addresses=0000:04:00.0
kind: ConfigMap
metadata:
  name: cpu-pinning-nova
  namespace: openstack
```
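
The type names and PCI addresses above depend on the hardware and driver in use. As a hedged sketch, assuming the installed driver exposes the kernel's standard mdev sysfs layout (`/sys/class/mdev_bus/<address>/mdev_supported_types`), the following helper lists the types a device offers. `list_mdev_types` is a hypothetical name, and the optional second argument exists only so the function can be pointed at a different root for experimentation.

```shell
# list_mdev_types PCI_ADDRESS [SYSFS_ROOT]
# Prints one line per supported mdev type: <type id> <tab> <name> <tab> <available instances>.
# Assumes the standard kernel mdev sysfs layout; prints nothing if the device exposes no types.
list_mdev_types() {
  local addr="$1" root="${2:-/sys}"
  local dir="$root/class/mdev_bus/$addr/mdev_supported_types"
  local type_dir

  for type_dir in "$dir"/*/; do
    # Skip the unexpanded glob when the directory is missing or empty
    [ -d "$type_dir" ] || continue
    printf '%s\t%s\t%s\n' \
      "$(basename "$type_dir")" \
      "$(cat "$type_dir/name")" \
      "$(cat "$type_dir/available_instances")"
  done
}
```

Running `list_mdev_types 0000:82:00.0` on a compute node after the driver is installed would be one way to confirm that the type configured for that address is actually offered by the card.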

#### Provider.yaml

To make it easy to use multiple Mdev types in an environment when creating
flavors, we can associate traits with specific resource providers. With
provider.yaml we can map those traits and apply them as part of a deployment.

```YAML
---
apiVersion: v1
data:
  provider.yaml: |
    meta:
      schema_version: "1.0"
    providers:
      - identification:
          name: edpm-compute-0.ctlplane.example.com_pci_0000_04_00_0
        traits:
          additional:
            - CUSTOM_NVIDIA_229
      - identification:
          name: edpm-compute-0.ctlplane.example.com_pci_0000_82_00_0
        traits:
          additional:
            - CUSTOM_NVIDIA_228
      - identification:
          name: edpm-compute-1.ctlplane.example.com_pci_0000_04_00_0
        traits:
          additional:
            - CUSTOM_NVIDIA_229
      - identification:
          name: edpm-compute-1.ctlplane.example.com_pci_0000_82_00_0
        traits:
          additional:
            - CUSTOM_NVIDIA_228
kind: ConfigMap
metadata:
  name: compute-provider
  namespace: openstack
---
apiVersion: dataplane.openstack.org/v1beta1
kind: OpenStackDataPlaneService
metadata:
  name: compute-provider
  namespace: openstack
spec:
  addCertMounts: false
  caCerts: combined-ca-bundle
  dataSources:
    - configMapRef:
        name: compute-provider
    - configMapRef:
        name: cpu-pinning-nova
    - configMapRef:
        name: sriov-nova
    - secretRef:
        name: nova-cell1-compute-config
    - secretRef:
        name: nova-migration-ssh-key
  edpmServiceType: nova
  playbook: osp.edpm.nova
  tlsCerts:
    default:
      contents:
        - dnsnames
        - ips
      issuer: osp-rootca-issuer-internal
      networks:
        - ctlplane
---
apiVersion: dataplane.openstack.org/v1beta1
kind: OpenStackDataPlaneDeployment
metadata:
  name: edpm-deployment-post-driver
  namespace: openstack
spec:
  ansibleExtraVars:
    edpm_reboot_strategy: force
  nodeSets:
    - openstack-edpm
  preserveJobs: true
  servicesOverride:
    - reboot-os
    - compute-provider
```
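
The provider names in the example follow the pattern `<hostname>_pci_<address>`, with the `:` and `.` of the PCI address replaced by `_`. The sketch below reproduces that pattern and prints the flavor extra specs that would request one vGPU from a provider carrying a given trait. Both function names are hypothetical, and the pattern is inferred from the example rather than taken from a specification, so verify the names against the resource providers your deployment actually reports.

```shell
# provider_name HOSTNAME PCI_ADDRESS
# Builds a resource provider name in the form used by the provider.yaml example:
#   edpm-compute-0.ctlplane.example.com + 0000:82:00.0
#   -> edpm-compute-0.ctlplane.example.com_pci_0000_82_00_0
provider_name() {
  local host="$1" addr="$2"
  # Replace ':' and '.' in the PCI address with '_'
  printf '%s_pci_%s\n' "$host" "$(printf '%s' "$addr" | tr ':.' '__')"
}

# flavor_properties TRAIT
# Prints flavor extra specs that request one VGPU resource and require TRAIT
# on the provider that supplies it.
flavor_properties() {
  local trait="$1"
  printf '%s\n' "--property resources:VGPU=1 --property trait:${trait}=required"
}
```

With a flavor named `vgpu-228` (a placeholder), the output could be used as `openstack flavor set vgpu-228 $(flavor_properties CUSTOM_NVIDIA_228)` so that guests using the flavor land on providers tagged for the `nvidia-228` type.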

## Stages

All stages must be executed in the order listed below. Everything is required unless otherwise indicated.

1. [Install the OpenStack K8S operators and their dependencies](../../common/)
2. [Configure networking and deploy the OpenStack control plane](control-plane.md)
3. [Configure and deploy the initial dataplane](edpm-pre.md)
4. [Update the dataplane to deploy the necessary vGPU Mdev requirements](edpm-post.md)
Lines changed: 54 additions & 0 deletions
@@ -0,0 +1,54 @@
# Configure networking and deploy the OpenStack control plane

## Assumptions

- A storage class called `local-storage` should already exist.

## Initialize

Switch to the "openstack" namespace
```
oc project openstack
```
Change to the nvidia-mdev directory
```
cd architecture/examples/va/nvidia-mdev
```
Edit the [control-plane/nncp/values.yaml](control-plane/nncp/values.yaml) and
[control-plane/service-values.yaml](control-plane/service-values.yaml) files to suit
your environment.
```
vi control-plane/nncp/values.yaml
vi control-plane/service-values.yaml
```

## Apply node network configuration

Generate the node network configuration
```
kustomize build control-plane/nncp > nncp.yaml
```
Apply the NNCP CRs
```
oc apply -f nncp.yaml
```
Wait for NNCPs to be available
```
oc wait nncp -l osp/nncm-config-type=standard --for jsonpath='{.status.conditions[0].reason}'=SuccessfullyConfigured --timeout=300s
```

## Apply networking and control-plane configuration

Generate the control-plane and networking CRs.
```
kustomize build control-plane > control-plane.yaml
```
Apply the CRs
```
oc apply -f control-plane.yaml
```

Wait for control plane to be available
```
oc wait osctlplane controlplane --for condition=Ready --timeout=600s
```
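
If the final wait times out, it can help to see which conditions are still unmet. The helper below is a sketch, not part of the documented procedure: `unready_conditions` is a hypothetical name, and it assumes the resource reports standard `status.conditions` entries with `type`, `status`, `reason`, and `message` fields.

```shell
# unready_conditions KIND NAME
# Prints type, reason and message for every condition whose status is not "True".
# Uses the current project, as selected earlier with `oc project openstack`.
unready_conditions() {
  local kind="$1" name="$2"
  oc get "$kind" "$name" \
    -o jsonpath='{range .status.conditions[?(@.status!="True")]}{.type}{"\t"}{.reason}{"\t"}{.message}{"\n"}{end}'
}
```

For example, `unready_conditions osctlplane controlplane` would list the control plane conditions that are still pending.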
Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
control-plane.yaml
