Commit 4cac8fd

Merge pull request #551 from jamepark4/add_mdev_install_srvc
Create example install-nvidia service

Migrating the work from CIFMW [1] to the architecture repo. Rather than having CIFMW be responsible for updating the EDPMs and installing the required Nvidia driver, have the procedure handled by a composable service defined in the architecture repo.

[1] openstack-k8s-operators/ci-framework#2637

Reviewed-by: John Fulton <[email protected]>
Reviewed-by: jamepark4 <[email protected]>
2 parents 043627e + 7443555 commit 4cac8fd

6 files changed

+114
-116
lines changed


automation/vars/nvidia-mdev.yaml

Lines changed: 1 addition & 11 deletions
Original file line number | Diff line number | Diff line change
@@ -31,7 +31,7 @@ vas:
3131
- >-
3232
oc -n openstack wait
3333
osdpns openstack-edpm --for condition=SetupReady
34-
--timeout=60m
34+
--timeout=90m
3535
values:
3636
- name: edpm-nodeset-values
3737
src_file: values.yaml
@@ -47,16 +47,6 @@ vas:
4747
- name: edpm-deployment-values
4848
src_file: values.yaml
4949
build_output: deployment.yaml
50-
post_stage_run:
51-
- name: Run phase 1 playbook
52-
type: playbook
53-
# As a reminder, the job needs to set the nvidia driver URL
54-
source: "../../playbooks/nvidia-mdev-phase1.yml"
55-
inventory: "${HOME}/ci-framework-data/artifacts/zuul_inventory.yml"
56-
- name: Run phase 2 playbook
57-
type: playbook
58-
source: "../../playbooks/nvidia-mdev-phase2.yml"
59-
inventory: "${HOME}/ci-framework-data/artifacts/zuul_inventory.yml"
6050

6151
- path: examples/va/nvidia-mdev/edpm-post-driver/nodeset
6252
wait_conditions:

examples/va/nvidia-mdev/README.md

Lines changed: 22 additions & 82 deletions
Original file line number | Diff line number | Diff line change
@@ -45,7 +45,9 @@ virtualization and requires real baremetal hosts.
4545
#### Nova Mdev Configuration
4646

4747
To deploy vGPU devices comprised of different types as well as the capacity to
48-
live migrate, you would need the below configuration applied to Nova.
48+
live migrate, define what mdev types should be enabled and map the respective
49+
mdev types to their PCI address(es). The example below uses nvidia-228 and
50+
nvidia-229 as the types.
4951

5052
```YAML
5153
---
@@ -70,94 +72,32 @@ metadata:
7072
namespace: openstack
7173
```
7274
75+
#### Openstack Dataplane Composable Service
76+
77+
An Openstack Dataplane service can be used to customize how the GPU cards need to
78+
be installed on the EDPM nodes. An example OSDPS can be seen
79+
[here](../../../va/nvidia-mdev/edpm/nodeset/nova_sriov.yaml). With the OSDPS
80+
configured, the operator needs to add the service to the
81+
list of services for the Openstack Dataplane NodeSet.
82+
83+
**Note:** The example listed is not an officially supported procedure for
84+
installing Nvidia GPUs in RHOSO and is meant to be purely an example of how
85+
to leverage OSDPS. Please reference Nvidia's documentation when creating a
86+
procedure to install GPUs.
87+
7388
#### Provider.yaml
7489
7590
In order to easily take advantage of multiple Mdev types in an environment when
7691
creating flavors, we can associate traits to specific resource providers. With
77-
provier.yaml we can map those traits and apply them as part of a deployment.
78-
79-
```YAML
80-
---
81-
apiVersion: v1
82-
data:
83-
provider.yaml: |
84-
meta:
85-
schema_version: "1.0"
86-
providers:
87-
- identification:
88-
name: edpm-compute-0.ctlplane.example.com_pci_0000_04_00_0
89-
traits:
90-
additional:
91-
- CUSTOM_NVIDIA_229
92-
- identification:
93-
name: edpm-compute-0.ctlplane.example.com_pci_0000_82_00_0
94-
traits:
95-
additional:
96-
- CUSTOM_NVIDIA_228
97-
- identification:
98-
name: edpm-compute-1.ctlplane.example.com_pci_0000_04_00_0
99-
traits:
100-
additional:
101-
- CUSTOM_NVIDIA_229
102-
- identification:
103-
name: edpm-compute-1.ctlplane.example.com_pci_0000_82_00_0
104-
traits:
105-
additional:
106-
- CUSTOM_NVIDIA_228
107-
kind: ConfigMap
108-
name: compute-provider
109-
namespace: openstack
110-
---
111-
apiVersion: dataplane.openstack.org/v1beta1
112-
kind: OpenStackDataPlaneService
113-
metadata:
114-
name: compute-provider
115-
namespace: openstack
116-
spec:
117-
addCertMounts: false
118-
caCerts: combined-ca-bundle
119-
dataSources:
120-
- configMapRef:
121-
name: compute-provider
122-
- configMapRef:
123-
name: cpu-pinning-nova
124-
- configMapRef:
125-
name: sriov-nova
126-
- secretRef:
127-
name: nova-cell1-compute-config
128-
- secretRef:
129-
name: nova-migration-ssh-key
130-
edpmServiceType: nova
131-
playbook: osp.edpm.nova
132-
tlsCerts:
133-
default:
134-
contents:
135-
- dnsnames
136-
- ips
137-
issuer: osp-rootca-issuer-internal
138-
networks:
139-
- ctlplane
140-
---
141-
apiVersion: dataplane.openstack.org/v1beta1
142-
kind: OpenStackDataPlaneDeployment
143-
metadata:
144-
name: edpm-deployment-post-driver
145-
namespace: openstack
146-
spec:
147-
ansibleExtraVars:
148-
edpm_reboot_strategy: force
149-
nodeSets:
150-
- openstack-edpm
151-
preserveJobs: true
152-
servicesOverride:
153-
- reboot-os
154-
- compute-provider
155-
```
92+
provider.yaml we can map those traits and apply them as part of a deployment.
93+
An example definition can be found [here](edpm-post-driver/nodeset/values.yaml)
94+
that associates different custom traits to different RPs.
15695
15796
## Stages
158-
All stages must be executed in the order listed below. Everything is required unless otherwise indicated.
97+
All stages must be executed in the order listed below. Everything is required
98+
unless otherwise indicated.
15999
160100
1. [Install the OpenStack K8S operators and their dependencies](../../common/)
161101
2. [Configuring networking and deploy the OpenStack control plane](control-plane.md)
162102
3. [Configure and deploy the initial dataplane](edpm-pre.md)
163-
4. [Update Dataplane to deploy necessary vGPU MDev requirements](edpm-post.md)
103+
4. [Update Dataplane to reboot EDPM nodes and optionally apply provider.yaml](edpm-post.md)
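
As a rough illustration of the composable-service step described in the README changes above, adding the service to a NodeSet might look like the fragment below. This is a minimal sketch, not a complete manifest: the NodeSet name `openstack-edpm` and the service names are taken from elsewhere in this commit, and every other required NodeSet field is omitted.

```YAML
---
# Sketch only: a partial OpenStackDataPlaneNodeSet showing where the
# install-nvidia service is listed. Other required spec fields are omitted.
apiVersion: dataplane.openstack.org/v1beta1
kind: OpenStackDataPlaneNodeSet
metadata:
  name: openstack-edpm
  namespace: openstack
spec:
  services:
    # Services that precede these in values.yaml are not shown in this diff.
    - neutron-sriov
    - libvirt
    - nova-custom-sriov
    - install-nvidia
```

In this repo the list is rendered from `examples/va/nvidia-mdev/edpm/nodeset/values.yaml`, so in practice the entry would be added there and not directly on the NodeSet.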

examples/va/nvidia-mdev/edpm-post.md

Lines changed: 10 additions & 23 deletions
Original file line number | Diff line number | Diff line change
@@ -1,26 +1,9 @@
1-
# Download Nvidia drivers to EDPM Nodes and apply follow up deployment
1+
# Reboot EDPM Nodes and optionally apply provider.yaml
22

33
## Assumptions
44

55
- Initial [dataplane](edpm-pre.md) deployment has finalized and is successful
66

7-
## Apply necessary Nvidia configurations to EDPM Nodes
8-
### The following commands should be executed on every EDPM Node expected to provide vGPU MDevs
9-
Log into EDPM Node and update or create a blacklist file in /etc/modeprobe
10-
```
11-
cat /etc/modprobe.d/blacklist-nouveau.conf
12-
blacklist nouveau
13-
options nouveau modeset=0
14-
```
15-
Download the relevant Nvidia driver for your hardware and install per Nvidia's
16-
instructions.
17-
18-
Regenerate initramfs
19-
```
20-
dracut --force
21-
grub2-mkconfig -o /boot/grub2/grub.cfg --update-bls-cmdline
22-
```
23-
247
## Create a post deployment to finalize Nvidia configuration
258
Log out of EDPMs and return to architecture repo on the controller.
269

@@ -29,9 +12,9 @@ cd architecture/examples/va/nvidia-mdev/edpm-post-driver
2912
```
3013

3114
### Optional: Create a provider.yaml
32-
Create a configmap for the provider.yaml to map CUSTOM_TRAITS to the relevant
33-
resource providers and their MDevs, then create the corresponding service that
34-
will apply the configMap.
15+
Create a configmap for the ```provider.yaml``` to map ```CUSTOM_TRAITS``` to
16+
the relevant resource providers and their MDevs, then create the corresponding
17+
service that will apply the configMap.
3518

3619
Update the post [nodeset](edpm-post-driver/nodeset/values.yaml) values to how
3720
you wish to map resource provider to traits.
@@ -42,11 +25,15 @@ kustomize build nodeset > compute-provider-service.yaml
4225
oc apply -f compute-provider-service.yaml
4326
```
4427

45-
## Reboot EDPM Nodes and optionaly apply provider.yaml
46-
In order finish Nvidia Driver installation the EDPM Nodes will need a final
28+
## Update post deployment configuration and apply
29+
In order to finish Nvidia Driver installation the EDPM Nodes will need a final
4730
reboot. This will require a new deployment that will run ```reboot-os``` on the
4831
relevant EDPM Nodes.
4932

33+
If applying the ```provider.yaml``` configuration via OSDPS from the previous
34+
optional step, then add the service ```compute-provider``` to the list of
35+
services as well.
36+
5037
Update [deployment](edpm-post-driver/deployment/values.yaml) values to suit
5138
your environment and to include provider.yaml if using.
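
For orientation, the post-driver deployment described here can be sketched as below. The fields are taken from the inline example this commit removes from the README; treat it as an illustration of the shape, not as the rendered output of `edpm-post-driver/deployment`.

```YAML
---
# Sketch based on the example removed from README.md in this commit.
apiVersion: dataplane.openstack.org/v1beta1
kind: OpenStackDataPlaneDeployment
metadata:
  name: edpm-deployment-post-driver
  namespace: openstack
spec:
  ansibleExtraVars:
    edpm_reboot_strategy: force
  nodeSets:
    - openstack-edpm
  preserveJobs: true
  servicesOverride:
    - reboot-os
    # Optional: include only when applying provider.yaml
    - compute-provider
```

Leaving `compute-provider` out of `servicesOverride` gives the reboot-only variant.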
5239
```

examples/va/nvidia-mdev/edpm/nodeset/kustomization.yaml

Lines changed: 12 additions & 0 deletions
Original file line number | Diff line number | Diff line change
@@ -72,3 +72,15 @@ replacements:
7272
- data.03-sriov-nova\.conf
7373
options:
7474
create: true
75+
- source:
76+
kind: ConfigMap
77+
name: edpm-nodeset-values
78+
fieldPath: data.nova.mdev.nvidia_mdev_driver_url
79+
targets:
80+
- select:
81+
kind: ConfigMap
82+
name: nvidia-url
83+
fieldPaths:
84+
- data.nvidia_mdev_driver_url
85+
options:
86+
create: true
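
To make the effect of this replacement concrete: it copies `data.nova.mdev.nvidia_mdev_driver_url` from the `edpm-nodeset-values` ConfigMap into the `nvidia-url` ConfigMap. Assuming the placeholder URL from `values.yaml` in this commit is left unchanged and no other transformer modifies the object (the kustomization may, for instance, also set a namespace), the rendered ConfigMap would look roughly like this:

```YAML
---
# Expected shape of the nvidia-url ConfigMap after `kustomize build`,
# assuming the example URL from values.yaml is left unchanged.
apiVersion: v1
kind: ConfigMap
metadata:
  name: nvidia-url
data:
  nvidia_mdev_driver_url: http://example.nvidia.com/path/to/driver/example-rhel-host-driver-xyz.rpm
```

The `install-nvidia` service lists this ConfigMap under `dataSources`, which appears to be how the playbook is able to read the URL from `/var/lib/openstack/configs/install-nvidia/nvidia_mdev_driver_url`.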

examples/va/nvidia-mdev/edpm/nodeset/values.yaml

Lines changed: 3 additions & 0 deletions
Original file line number | Diff line number | Diff line change
@@ -136,6 +136,7 @@ data:
136136
- neutron-sriov
137137
- libvirt
138138
- nova-custom-sriov
139+
- install-nvidia
139140
nova:
140141
compute:
141142
conf: |
@@ -158,3 +159,5 @@ data:
158159
# CHANGEME
159160
[pci]
160161
device_spec = {"vendor_id":"8086", "product_id":"1572", "address": "0000:19:00.3", "physical_network":"sriov-phy4", "trusted":"true"}
162+
mdev:
163+
nvidia_mdev_driver_url: http://example.nvidia.com/path/to/driver/example-rhel-host-driver-xyz.rpm

va/nvidia-mdev/edpm/nodeset/nova_sriov.yaml

Lines changed: 66 additions & 0 deletions
Original file line number | Diff line number | Diff line change
@@ -1,3 +1,6 @@
1+
# Note: The OSDPS shown below is not an official procedure for installing
2+
# Nvidia drivers; it is only an example of how to leverage OSDPS to customize
3+
# the installation process.
14
---
25
apiVersion: v1
36
kind: ConfigMap
@@ -13,6 +16,13 @@ metadata:
1316
data:
1417
03-sriov-nova.conf: _replaced_
1518
---
19+
apiVersion: v1
20+
kind: ConfigMap
21+
metadata:
22+
name: nvidia-url
23+
data:
24+
nvidia_mdev_driver_url: _replaced_
25+
---
1626
apiVersion: dataplane.openstack.org/v1beta1
1727
kind: OpenStackDataPlaneService
1828
metadata:
@@ -39,3 +49,59 @@ spec:
3949
- ctlplane
4050
issuer: osp-rootca-issuer-internal
4151
caCerts: combined-ca-bundle
52+
---
53+
apiVersion: dataplane.openstack.org/v1beta1
54+
kind: OpenStackDataPlaneService
55+
metadata:
56+
name: install-nvidia
57+
namespace: openstack
58+
spec:
59+
dataSources:
60+
- configMapRef:
61+
name: nvidia-url
62+
playbookContents: |
63+
- name: Install Nvidia Driver
64+
hosts: all
65+
tasks:
66+
- name: Blacklist nouveau
67+
become: true
68+
ansible.builtin.copy:
69+
dest: "/etc/modprobe.d/blacklist-nouveau.conf"
70+
mode: "0644"
71+
content: |-
72+
blacklist nouveau
73+
options nouveau modeset=0
74+
force: false
75+
register: _blacklist_nouveau
76+
- name: Get the Nvidia Driver URL
77+
delegate_to: localhost
78+
ansible.builtin.set_fact:
79+
nvidia_rpm_url: "{{ lookup('file', '/var/lib/openstack/configs/install-nvidia/nvidia_mdev_driver_url') | from_yaml }}"
80+
- name: Gather the package facts
81+
ansible.builtin.package_facts:
82+
manager: auto
83+
- name: Install nvidia driver RPM either from path or URL
84+
become: true
85+
ansible.builtin.dnf:
86+
name: "{{ nvidia_rpm_url }}"
87+
state: present
88+
disable_gpg_check: true
89+
when: nvidia_rpm_url not in ansible_facts.packages
90+
register: _nvidia_driver_install
91+
- name: Check if grub2-mkconfig has --update-bls-cmdline option
92+
ansible.builtin.shell:
93+
cmd: grub2-mkconfig --help | grep '\-\-update-bls-cmdline'
94+
ignore_errors: true
95+
register: check_update_bls_cmdline
96+
changed_when: false
97+
- name: Regenerate initramfs
98+
become: true
99+
ansible.builtin.command: "{{ item }}"
100+
loop:
101+
- 'dracut --force'
102+
- >-
103+
grub2-mkconfig -o /boot/grub2/grub.cfg
104+
{{ '--update-bls-cmdline'
105+
if check_update_bls_cmdline.rc == 0
106+
else '' }}
107+
when: _blacklist_nouveau.changed or _nvidia_driver_install.changed
