Commit c2db68f

sbauza and abays authored and committed
Add nvidia-mdev mutli-type VA
Co-authored-by: Andrew Bays <[email protected]>
1 parent dbef3cf commit c2db68f

37 files changed: +1494 -0 lines changed

automation/vars/nvidia-mdev.yaml

Lines changed: 81 additions & 0 deletions
@@ -0,0 +1,81 @@
---
vas:
  nvidia-mdev:
    stages:
      - path: examples/va/nvidia-mdev/control-plane/nncp
        wait_conditions:
          - >-
            oc -n openstack wait nncp
            -l osp/nncm-config-type=standard
            --for jsonpath='{.status.conditions[0].reason}'=SuccessfullyConfigured
            --timeout=60s
        values:
          - name: network-values
            src_file: values.yaml
        build_output: nncp.yaml

      - path: examples/va/nvidia-mdev/control-plane
        wait_conditions:
          - >-
            oc -n openstack wait osctlplane controlplane --for condition=Ready
            --timeout=1200s
        values:
          - name: network-values
            src_file: nncp/values.yaml
          - name: service-values
            src_file: service-values.yaml
        build_output: control-plane.yaml

      - path: examples/va/nvidia-mdev/edpm/nodeset
        wait_conditions:
          - >-
            oc -n openstack wait
            osdpns openstack-edpm --for condition=SetupReady
            --timeout=60m
        values:
          - name: edpm-nodeset-values
            src_file: values.yaml
        build_output: nodeset.yaml

      - path: examples/va/nvidia-mdev/edpm/deployment
        wait_conditions:
          - >-
            oc -n openstack wait
            osdpns openstack-edpm --for condition=Ready
            --timeout=60m
        values:
          - name: edpm-deployment-values
            src_file: values.yaml
        build_output: deployment.yaml
        post_stage_run:
          - name: Run phase 1 playbook
            type: playbook
            # As a reminder, the job needs to set the nvidia driver URL
            source: "../../playbooks/nvidia-mdev-phase1.yml"
            inventory: "${HOME}/ci-framework-data/artifacts/zuul_inventory.yml"
          - name: Run phase 2 playbook
            type: playbook
            source: "../../playbooks/nvidia-mdev-phase2.yml"
            inventory: "${HOME}/ci-framework-data/artifacts/zuul_inventory.yml"

      - path: examples/va/nvidia-mdev/edpm-post-driver/nodeset
        wait_conditions:
          - >-
            oc -n openstack wait
            osdpns openstack-edpm --for condition=Ready
            --timeout=10m
        values:
          - name: edpm-provider-values
            src_file: values.yaml
        build_output: compute-provider-service.yaml

      - path: examples/va/nvidia-mdev/edpm-post-driver/deployment
        wait_conditions:
          - >-
            oc -n openstack wait
            osdpd edpm-deployment-post-driver --for condition=Ready
            --timeout=20m
        values:
          - name: edpm-deployment-post-driver
            src_file: values.yaml
        build_output: post-driver-deployment.yaml

examples/va/nvidia-mdev/.gitignore

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
control-plane.yaml

examples/va/nvidia-mdev/README.md

Lines changed: 163 additions & 0 deletions
@@ -0,0 +1,163 @@
# Validated Architecture - Nvidia-Mdev

This document describes the CRs and deployment workflow to create an
environment with EDPM Compute Nodes capable of supplying Nvidia mediated
devices (Mdevs). Mdevs allow multiple guests to share the same physical GPU
card on the hypervisor. The deployment also defines Custom Traits and maps
them to different resource providers by passing the definitions in
provider.yaml through a ConfigMap.

## Purpose

This topology is primarily used to verify environments that provide Nvidia
Mdevs and to confirm that guests can use the resource correctly. This type of
deployment cannot be simulated with nested virtualization and requires real
baremetal hosts.

## Environment

### Nodes

| Role                        | Machine Type | Count |
| --------------------------- | ------------ | ----- |
| Compact OpenShift           | VM           | 3     |
| OpenStack Baremetal Compute | Baremetal    | 2     |

### Networks

| Name         | Type     | Interface | CIDR            |
| ------------ | -------- | --------- | --------------- |
| Provisioning | untagged | nic1      | 172.23.0.0/24   |
| Machine      | untagged | nic2      | 192.168.51.0/20 |
| RH OSP       | trunk    | nic3      |                 |


#### VLAN networks in RH OSP

| Name        | Type        | CIDR             |
| ----------- | ----------- | ---------------- |
| ctlplane    | untagged    | 192.168.122.0/24 |
| internalapi | VLAN tagged | 172.17.0.0/24    |
| storage     | VLAN tagged | 172.18.0.0/24    |
| storagemgmt | VLAN tagged | 172.20.0.0/24    |
| tenant      | VLAN tagged | 172.19.0.0/24    |

#### Nova Mdev Configuration

To deploy vGPU devices of different types and to be able to live migrate
instances that use them, apply the configuration below to Nova.

```YAML
---
apiVersion: v1
data:
  25-cpu-pinning-nova.conf: |
    [libvirt]
    live_migration_completion_timeout = 0
    live_migration_downtime = 500000
    live_migration_downtime_steps = 3
    live_migration_downtime_delay = 3
    live_migration_permit_post_copy = false
    [devices]
    enabled_vgpu_types=nvidia-228,nvidia-229
    [vgpu_nvidia-228]
    device_addresses=0000:82:00.0
    [vgpu_nvidia-229]
    device_addresses=0000:04:00.0
kind: ConfigMap
metadata:
  name: cpu-pinning-nova
  namespace: openstack
```
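
The PCI addresses listed under `device_addresses` identify the same physical GPUs that show up as resource providers in the provider.yaml section of this README, where each provider name is the compute hostname followed by `_pci_` and the address with `:` and `.` replaced by `_`. As a minimal sketch, assuming that naming pattern holds on your hosts, a small helper can derive the expected provider name. The function name `mdev_provider_name` is hypothetical and not part of this repository.

```shell
# Hypothetical helper: build the expected resource provider name for a
# physical GPU from the compute hostname and its PCI address.
# Assumes the <hostname>_pci_<address with ':' and '.' turned into '_'>
# pattern seen in the provider.yaml example of this README.
mdev_provider_name() {
    host=$1
    pci_addr=$2
    printf '%s_pci_%s\n' "$host" "$(printf '%s' "$pci_addr" | tr ':.' '__')"
}

# Example: the nvidia-228 device from the ConfigMap above.
mdev_provider_name edpm-compute-0.ctlplane.example.com 0000:82:00.0
```

Comparing the result against the names reported by Placement is one way to catch a mismatch between nova.conf and provider.yaml before deploying.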

#### Provider.yaml

To make it easy to select a specific Mdev type when creating flavors, we can
associate traits with specific resource providers. With provider.yaml we can
map those traits and apply them as part of a deployment.

```YAML
---
apiVersion: v1
data:
  provider.yaml: |
    meta:
      schema_version: "1.0"
    providers:
      - identification:
          name: edpm-compute-0.ctlplane.example.com_pci_0000_04_00_0
        traits:
          additional:
            - CUSTOM_NVIDIA_229
      - identification:
          name: edpm-compute-0.ctlplane.example.com_pci_0000_82_00_0
        traits:
          additional:
            - CUSTOM_NVIDIA_228
      - identification:
          name: edpm-compute-1.ctlplane.example.com_pci_0000_04_00_0
        traits:
          additional:
            - CUSTOM_NVIDIA_229
      - identification:
          name: edpm-compute-1.ctlplane.example.com_pci_0000_82_00_0
        traits:
          additional:
            - CUSTOM_NVIDIA_228
kind: ConfigMap
metadata:
  name: compute-provider
  namespace: openstack
---
apiVersion: dataplane.openstack.org/v1beta1
kind: OpenStackDataPlaneService
metadata:
  name: compute-provider
  namespace: openstack
spec:
  addCertMounts: false
  caCerts: combined-ca-bundle
  dataSources:
    - configMapRef:
        name: compute-provider
    - configMapRef:
        name: cpu-pinning-nova
    - configMapRef:
        name: sriov-nova
    - secretRef:
        name: nova-cell1-compute-config
    - secretRef:
        name: nova-migration-ssh-key
  edpmServiceType: nova
  playbook: osp.edpm.nova
  tlsCerts:
    default:
      contents:
        - dnsnames
        - ips
      issuer: osp-rootca-issuer-internal
      networks:
        - ctlplane
---
apiVersion: dataplane.openstack.org/v1beta1
kind: OpenStackDataPlaneDeployment
metadata:
  name: edpm-deployment-post-driver
  namespace: openstack
spec:
  ansibleExtraVars:
    edpm_reboot_strategy: force
  nodeSets:
    - openstack-edpm
  preserveJobs: true
  servicesOverride:
    - reboot-os
    - compute-provider
```
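
Once the traits are applied, a flavor can target one Mdev type by requesting a `VGPU` resource together with the matching custom trait. The following is a sketch only: the helper name `create_vgpu_flavor`, the flavor name, and the vCPU, RAM, and disk sizes are assumptions chosen for illustration, and the `OPENSTACK` variable is a hypothetical override that lets the command be printed instead of executed.

```shell
# Hypothetical helper: create a flavor that requests one vGPU from a
# resource provider carrying the given custom trait.
# OPENSTACK can be overridden (for example with "echo") for a dry run.
create_vgpu_flavor() {
    flavor=$1
    trait=$2
    ${OPENSTACK:-openstack} flavor create "$flavor" \
        --vcpus 2 --ram 4096 --disk 20 \
        --property resources:VGPU=1 \
        --property "trait:${trait}=required"
}

# Dry run: print the command instead of calling the OpenStack CLI.
OPENSTACK=echo
create_vgpu_flavor vgpu-228 CUSTOM_NVIDIA_228
```

Unsetting `OPENSTACK` runs the real client. A second flavor using `CUSTOM_NVIDIA_229` would land guests on the other GPU type.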

## Stages
All stages must be executed in the order listed below. Everything is required unless otherwise indicated.

1. [Install the OpenStack K8S operators and their dependencies](../../common/)
2. [Configure networking and deploy the OpenStack control plane](control-plane.md)
3. [Configure and deploy the initial dataplane](edpm-pre.md)
4. [Update the dataplane to deploy the necessary vGPU Mdev requirements](edpm-post.md)
Lines changed: 54 additions & 0 deletions
@@ -0,0 +1,54 @@
# Configure networking and deploy the OpenStack control plane

## Assumptions

- A storage class called `local-storage` should already exist.

## Initialize

Switch to the "openstack" namespace
```
oc project openstack
```
Change to the nvidia-mdev directory
```
cd architecture/examples/va/nvidia-mdev
```
Edit the [control-plane/nncp/values.yaml](control-plane/nncp/values.yaml) and
[control-plane/service-values.yaml](control-plane/service-values.yaml) files to suit
your environment.
```
vi control-plane/nncp/values.yaml
vi control-plane/service-values.yaml
```

## Apply node network configuration

Generate the node network configuration
```
kustomize build control-plane/nncp > nncp.yaml
```
Apply the NNCP CRs
```
oc apply -f nncp.yaml
```
Wait for NNCPs to be available
```
oc wait nncp -l osp/nncm-config-type=standard --for jsonpath='{.status.conditions[0].reason}'=SuccessfullyConfigured --timeout=300s
```

## Apply networking and control-plane configuration

Generate the control-plane and networking CRs.
```
kustomize build control-plane > control-plane.yaml
```
Apply the CRs
```
oc apply -f control-plane.yaml
```

Wait for control plane to be available
```
oc wait osctlplane controlplane --for condition=Ready --timeout=600s
```
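
Both stages in this guide follow the same build, apply, and wait pattern, so they can be wrapped in a small script. This is a sketch under stated assumptions, not part of the repository: the helper names `run` and `apply_stage` and the `DRY_RUN` variable are hypothetical, and the example runs in dry-run mode so that it only prints the commands.

```shell
# Hypothetical helpers wrapping the build / apply / wait steps of this guide.
# Set DRY_RUN=1 to print the commands instead of running them.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# apply_stage <kustomize dir> <output file> <oc wait arguments...>
apply_stage() {
    stage_dir=$1
    stage_out=$2
    shift 2
    if [ "${DRY_RUN:-0}" = "1" ]; then
        printf 'kustomize build %s > %s\n' "$stage_dir" "$stage_out"
    else
        kustomize build "$stage_dir" > "$stage_out" || return 1
    fi
    run oc apply -f "$stage_out" || return 1
    run oc wait "$@"
}

# Dry run of the two stages described above.
DRY_RUN=1
apply_stage control-plane/nncp nncp.yaml \
    nncp -l osp/nncm-config-type=standard \
    --for jsonpath='{.status.conditions[0].reason}'=SuccessfullyConfigured \
    --timeout=300s
apply_stage control-plane control-plane.yaml \
    osctlplane controlplane --for condition=Ready --timeout=600s
```

With `DRY_RUN` unset, each stage stops at the first failing step, so the control plane is not applied if the NNCP wait times out when the calls are chained with `&&`.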
Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
control-plane.yaml
