Commit bf676f4

Document Ansible Dell Boot Order Playbooks.
This documents the approach, usage and backend mechanics of our Dell Ansible boot interface order playbooks. I am also utilizing the domain variable in quads-validate-boot-order as it's not extrapolated from our /opt/quads/conf/quads.yaml configuration file like it should be. Fixes: #206 Change-Id: I7bde9286d3810dec51849eb72007235360ca114f
1 parent 1774d3d commit bf676f4

File tree

3 files changed

+157
-3
lines changed

README.md

Lines changed: 5 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -32,6 +32,7 @@ Automate scheduling and end-to-end provisioning of servers and networks.
3232
* [Foreman Hammer CLI](#foreman-hammer-cli)
3333
* [Ansible CMDB](#ansible-cmdb)
3434
* [QUADS Move Command](#quads-move-command)
35+
* [Ansible Dell Boot Order Playbooks](#ansible-dell-boot-order-playbooks)
3536
* [QUADS Usage Documentation](#quads-usage-documentation)
3637
* [How Provisioning Works](#how-provisioning-works)
3738
* [QUADS Move Host Command](#quads-move-host-command)
@@ -83,7 +84,7 @@ Automate scheduling and end-to-end provisioning of servers and networks.
8384
- The scheduling functionality can be used standalone, but you'll want a provisioning backend like [Foreman](https://theforeman.org/) to take full advantage of QUADS scheduling, automation and provisioning capabilities.
8485
- To utilize the automatic wiki/docs generation we use [Wordpress](https://hobo.house/2016/08/30/auto-generating-server-infrastructure-documentation-with-python-wordpress-foreman/) but anything that accepts markdown via an API should work.
8586
- Switch/VLAN automation is done on Juniper Switches in [Q-in-Q VLANs](http://www.jnpr.net/techpubs/en_US/junos14.1/topics/concept/qinq-tunneling-qfx-series.html), but commandsets can easily be extended to support other network switch models.
86-
- We use Ansible for optional Dell and SuperMicro playbooks to toggle boot order and PXE flags to accomodate OpenStack deployments via Ironic/Triple-O.
87+
- We use [Ansible](https://github.com/redhat-performance/quads/tree/master/ansible) for optional Dell playbooks to toggle boot order and PXE flags to accommodate OpenStack deployments via Ironic/Triple-O.
8788
- The package [ansible-cmdb](https://github.com/fboender/ansible-cmdb) needs to be available if you want to see per assignment Ansible facts of the inventory. It can be obtained from [here](https://github.com/fboender/ansible-cmdb/releases)
8889

8990
## QUADS Workflow
@@ -263,6 +264,9 @@ yum install ansible https://github.com/fboender/ansible-cmdb/releases/download/1
263264
#### QUADS Move Command
264265
- QUADS relies on calling an external script, trigger or workflow to enact the actual provisioning of machines. You can look at and modify our [move-and-rebuild-host](https://github.com/redhat-performance/quads/blob/master/bin/move-and-rebuild-host.sh) script to suit your environment for this purpose. Read more about this in the [move-host-command](https://github.com/redhat-performance/quads#quads-move-host-command) section below.
265266

267+
#### Ansible Dell Boot Order Playbooks
268+
- For Dell bare-metal systems we employ [optional boot interface order Ansible playbooks](https://github.com/redhat-performance/quads/tree/master/ansible), both to juggle interface order via racadm for OpenStack deployments and to give users a way to set and manage their Dell BIOS boot interface order settings directly from Foreman.
269+
266270
## QUADS Usage Documentation
267271

268272
- Define the various cloud environments

ansible/README.md

Lines changed: 148 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,148 @@
1+
# Ansible Boot Order Playbooks for Dell Systems
2+
3+
This directory contains a set of Ansible boot order playbooks for Dell systems to arrange their boot interface order to either a `foreman` or `director` (OpenStack / PXE off internal interfaces) ordering scheme using the Dell racadm tool.
4+
5+
This allows users to manage and enforce the physical BIOS interface/boot order of Dell bare-metal servers via Foreman host parameters without having to do anything manually.
6+
7+
This will be superseded by [badfish](https://github.com/redhat-performance/badfish), a Redfish-based Python tool. For now we manage this with Ansible, Foreman host parameters and racadm within QUADS.
8+
9+
This functionality is optional, but we find it useful as we do a lot of OpenStack and internal (private VLAN) PXE-driven application workloads and generally try to do as much systems/network prep for transient tenants as possible ahead of time so they can concentrate on their actual performance and scale testing.
10+
11+
To utilize the functionality here take a look at the [Dell interface order config](https://github.com/redhat-performance/quads/blob/master/ansible/idrac_interfaces.yml) we reference that maps BIOS boot interface order to our constructs.
12+
13+
## How it Works
14+
* Foreman-managed systems within QUADS need the following host parameter set: `nullos: true` or `nullos: false`. The parameter name can be defined in [your QUADS conf.yaml](https://github.com/redhat-performance/quads/blob/master/conf/quads.yml#L183) via the `foreman_director_parameter:` variable. Our default is `nullos:`, which we'll be referring to in this document.
15+
16+
```
17+
hammer host set-parameter --host host01.example.com --name nullos --value true
18+
```
19+
* A persistent directory, `/opt/quads/data/bootstate/`, holds one stub file per host FQDN. Each file contains a single string, either `director` or `foreman`, depending on the value of the Foreman host parameter set above.
20+
21+
| Host Parameter | Value | Boot Order | Playbook that Runs |
22+
|----------------|:-------| -----------:|-------------------:|
23+
| nullos | true | director |[Dell r620 Example](https://github.com/redhat-performance/quads/blob/master/ansible/racadm-setup-boot-r620-director.yml) |
24+
| nullos | false | foreman |[Dell r620 Example](https://github.com/redhat-performance/quads/blob/master/ansible/racadm-setup-boot-r620-foreman.yml) |
25+
26+
* A [cronjob](https://github.com/redhat-performance/quads/blob/master/cron/quads#L11) runs the [quads-validate-boot-order](https://github.com/redhat-performance/quads/blob/master/bin/quads-validate-boot-order.sh) tool, which creates both the `/opt/quads/data/boot` and `/opt/quads/data/bootstate` directory structure on the QUADS host. It then writes each QUADS-managed host's current Foreman host parameter value (translated to foreman or director as above) into `/opt/quads/data/bootstate/$HOSTFQDN`.
27+
28+
* Another [cronjob](https://github.com/redhat-performance/quads/blob/master/cron/quads#L10) runs the [quads-boot-order](https://github.com/redhat-performance/quads/blob/master/bin/quads-boot-order.sh) tool and runs the appropriate [ansible playbook](https://github.com/redhat-performance/quads/blob/master/ansible/idrac_interfaces.yml) depending on the Dell system type (we expect the model name somewhere in the hostname). If you have a different boot order, combination or variation this is the file you want to edit.
29+
30+
* An ansible inventory file is generated for each host so playbook interface ordering is run in parallel across your fleet.
31+
* A healthy QUADS environment will have **no files** under `/opt/quads/data/boot` and only each host's current boot state reflected in `/opt/quads/data/bootstate/`. This means that there are no pending interface order change actions.
32+
33+
| ../boot/$FQDN contains | ../bootstate/$FQDN contains | Action Taken |
34+
|------------------------|-----------------------------|--------------|
35+
| file empty | foreman | None |
36+
| foreman | foreman | None |
37+
| foreman | director | Run director playbook until successful |
38+
| director | director | None |
39+
| director | foreman | Run foreman playbook until successful |
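
The parameter-to-boot-order mapping described above can be sketched in shell. This is a minimal illustration and not the actual `quads-validate-boot-order` code: the function names `boot_order_for_param` and `write_bootstate` are hypothetical, and the data directory is passed as an argument instead of being read from the QUADS configuration.

```shell
#!/bin/bash
# Hypothetical helper: translate the Foreman host parameter value
# (nullos true/false) into the boot order string kept in bootstate.
boot_order_for_param() {
    case "$1" in
        true)  echo "director" ;;
        false) echo "foreman" ;;
        *)     echo "unknown parameter value: $1" >&2; return 1 ;;
    esac
}

# Hypothetical helper: record a host's boot order in
# <data_dir>/bootstate/<fqdn>. Nothing is written for invalid values.
write_bootstate() {
    local data_dir=$1 fqdn=$2 value=$3 order
    order=$(boot_order_for_param "$value") || return 1
    mkdir -p "$data_dir/bootstate"
    echo "$order" > "$data_dir/bootstate/$fqdn"
}
```

For example, `write_bootstate /opt/quads/data host01.example.com true` would leave `director` in that host's bootstate stub.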
40+
41+
42+
### Boot Order Mechanics
43+
44+
Interface boot order for each Dell host is managed via its Foreman host parameter in one central place and this is authoritative.
45+
46+
Of note, in the [quads-validate-boot-order](https://github.com/redhat-performance/quads/blob/master/bin/quads-validate-boot-order.sh) tool there is the concept of `build_state` and `current_state`.
47+
48+
`build_state` refers to whether a system has been marked for build by the user (upon reboot Foreman would kickstart provision the host). `current_state` is the current value of the Foreman host parameter defined by `foreman_director_parameter:` (in our case `nullos`), either true or false.
49+
50+
The following logic applies to the relationship between `current_state` and `build_state` in the [quads-validate-boot-order](https://github.com/redhat-performance/quads/blob/master/bin/quads-validate-boot-order.sh) tool:
51+
52+
| Host Parameter | Value | Boot Order | Host Marked for Build? | Action |
53+
|----------------|-------|------------|------|--------|
54+
| nullos | true | director | yes | do not change boot order if build flag present |
55+
| nullos | false | foreman | no | nothing, assume this is intentional |
56+
57+
In other words, if your Dell system is marked for build in Foreman, then even if you have `nullos: true` the tool will not create an Ansible worker file in `/opt/quads/data/boot/$FQDN` until the system has successfully been built or the build flag is toggled off.
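
A minimal sketch of that guard for a host with `nullos: true`, assuming the caller already knows the recorded boot state and whether the build flag is set. The function name `should_queue_director` and its arguments are hypothetical; the real tool gathers these values from Foreman via hammer and from the bootstate stubs.

```shell
#!/bin/bash
# Hypothetical guard for a host whose parameter is nullos: true.
# Succeeds (exit 0) only when a director worker file should be created.
#   $1: recorded boot state ("foreman" or "director")
#   $2: "yes" if the host is marked for build in Foreman, otherwise "no"
should_queue_director() {
    local current_state=$1 marked_for_build=$2
    # Already in director order: nothing is pending.
    if [ "$current_state" = "director" ]; then
        return 1
    fi
    # Build flag present: leave boot order alone until the build finishes.
    if [ "$marked_for_build" = "yes" ]; then
        return 1
    fi
    return 0
}
```

A caller could then wrap the stub creation in `if should_queue_director "$current_state" "$build_flag"; then ...; fi`.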
58+
59+
### Provisioning Mechanics
60+
61+
In our examples we define the [move-and-rebuild-host.sh](https://github.com/redhat-performance/quads/blob/master/bin/move-and-rebuild-host.sh) as our `/opt/quads/bin/quads-cli --move-hosts --path-to-command` which calls our systems and network provisioning workflow. Currently we flop interface ordering around to accommodate initial systems provisioning and then ultimately a default interface boot order, aiming primarily to accommodate OpenStack because it's the most demanding in terms of PXE order when using Triple-O/Ironic.
62+
63+
* Systems set to be provisioned have their interface ordering stub created in `/opt/quads/data/boot/$FQDN` as `foreman`
64+
* Ansible playbooks fire off per [quads-boot-order](https://github.com/redhat-performance/quads/blob/master/bin/quads-boot-order.sh)
65+
* Once the Foreman `build: 0` state is achieved (system has been provisioned with an OS successfully), systems have their interface ordering stub created in `/opt/quads/data/boot/$FQDN` as `director`.
66+
* Ansible playbooks fire off again to swap interface order back to `director` ordering, [respective to their system type](https://github.com/redhat-performance/quads/blob/master/ansible/idrac_interfaces.yml).
67+
* Once the values of `/opt/quads/data/boot/$FQDN` and `/opt/quads/data/bootstate/$FQDN` match, this part is completed.
68+
* `/opt/quads/data/boot/$FQDN` should be empty for each host completing their provisioning lifecycle.
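
One way to check whether hosts have finished this lifecycle is to look for non-empty stubs in the boot directory. The helper below is a hedged sketch: `pending_boot_changes` is a hypothetical name, and it assumes each stub holds a single word (`foreman` or `director`) as described above.

```shell
#!/bin/bash
# Hypothetical helper: list hosts that still have a non-empty stub under
# <data_dir>/boot, together with the boot order they are waiting for.
pending_boot_changes() {
    local data_dir=$1 stub
    for stub in "$data_dir"/boot/*; do
        # -s is true only for existing, non-empty files; this also skips
        # the unexpanded glob when the directory is empty.
        [ -s "$stub" ] || continue
        printf '%s %s\n' "$(basename "$stub")" "$(cat "$stub")"
    done
}
```

Running `pending_boot_changes /opt/quads/data` on a healthy environment should print nothing.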
69+
70+
#### instackenv.json and OpenStack
71+
72+
By default the first host out of a cloud assignment has `nullos: false` set in its Foreman host parameter, which corresponds with `/opt/quads/data/bootstate/$FQDN` maintained as `foreman`. This is because we typically associate this node in OpenStack deployments with an Undercloud node and users may need to reprovision occasionally without the complexity of swapping boot ordering or setting a one-time boot method via the iDRAC interface or badfish.
73+
74+
As an additional courtesy to OpenStack users, we also auto-generate and keep up to date an instackenv.json (OpenStack Triple-O installer answer file) via a [cronjob](https://github.com/redhat-performance/quads/blob/master/cron/quads#L12) and associated [make-instackenv-json](https://github.com/redhat-performance/quads/blob/master/bin/make-instackenv-json.sh) tool.
75+
76+
* The first machine in a cloud assignment has `nullos: false` which omits it from the instackenv.json
77+
* Setting any host's Foreman host parameter to `nullos: false` will omit that machine from the instackenv.json
78+
* If you want to do PXE services on one of your internal, QUADS-managed interfaces you will want to maintain `nullos: true` on all your machines as the director-style interface boot ordering permits this behavior.
79+
80+
### Known Issues
81+
82+
Sometimes racadm via Ansible simply cannot set the boot interface order correctly and will just keep retrying over and over. If after several hours you still have $FQDN stub files in `/opt/quads/data/boot/` and `ansible-playbook` processes running, chances are it will never complete (trust us, we've just let them go .. for science!).
83+
84+
The reasons behind this are varied:
85+
86+
* Dell / racadm sometimes caches the output of `racadm get BIOS.BiosBootSettings.BootSeq`, so it perpetually feeds Ansible incorrect data to operate on. **This is a vendor bug** and we've filed a few cases with Dell about this to no avail.
87+
* Sometimes JobQueue gets hung. Ansible tries to force clear it, but sometimes only a FLEA power drain will address it (unplug both PDUs, hold the power button for 60-120 seconds to drain all residual power, then power back on, or use our [PDU power on/off tools](https://github.com/redhat-performance/quads/blob/master/docs/pdu-setup.md)).
88+
* Other vendor problems, hardware issues, ghosts?
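
To spot stubs that have been sitting around for hours, something like the following may help. It is only a sketch: `stale_boot_stubs` is a hypothetical name, the 240 minute default is an arbitrary stand-in for "several hours", it assumes a stub's modification time reflects when the change was queued, and `-maxdepth` and `-mmin` are find extensions available in GNU and BSD find.

```shell
#!/bin/bash
# Hypothetical helper: print stub files under <data_dir>/boot whose
# modification time is more than <minutes> minutes in the past.
stale_boot_stubs() {
    local data_dir=$1 minutes=${2:-240}
    find "$data_dir/boot" -maxdepth 1 -type f -mmin +"$minutes" 2>/dev/null
}
```

Any host listed by `stale_boot_stubs /opt/quads/data` is a candidate for the manual workaround.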
89+
90+
#### Workarounds
91+
92+
If you find yourself in a state where playbooks have had ample time to run and they just aren't doing the job, you can simply fake the interface order state and fix it manually.
93+
94+
* Disable the cron jobs that check/manage boot interface order or utilize their settings:
95+
96+
```
97+
###* * * * * /opt/quads/bin/quads-boot-order.sh 1>/dev/null 2>&1
98+
###* * * * * /opt/quads/bin/quads-validate-boot-order.sh 1>/dev/null 2>&1
99+
###* * * * * /opt/quads/bin/make-instackenv-json.sh 1>/dev/null 2>&1
100+
```
101+
102+
* Make a list of potential problem host(s)
103+
104+
```
105+
(cd /opt/quads/data/boot && ls *.com) > /tmp/FIX-HOSTS.txt
106+
```
107+
108+
* Next, find which cloud(s) and proper settings your hosts should be set to:
109+
110+
```
111+
cd /opt/quads/data/boot
112+
113+
for host in $(ls); do echo "====================="; printf "$host = "; cat $host; echo "state for $host = $(cat /opt/quads/data/bootstate/$host)"; echo "cloud is currently $(/opt/quads/bin/quads-cli --ls-schedule --host $host | grep Default | awk '{print $3}')"; echo "foreman says $(hammer host info --name $host | grep -i nullos)"; done
114+
```
115+
116+
* Then simply match what the setting should be (e.g. director instead of foreman)
117+
118+
```
119+
cd /opt/quads/data/boot/
120+
121+
for host in $(ls); do echo "director" > /opt/quads/data/bootstate/$host; done
122+
```
123+
124+
* Remove the `/opt/quads/data/boot/` stubs
125+
126+
```
127+
cd /opt/quads/data/boot/
128+
129+
rm -f *.com
130+
```
131+
132+
* Investigate, fix, check your hosts.
133+
134+
```
135+
sed 's|^|https://mgmt-|' /tmp/FIX-HOSTS.txt
136+
```
137+
138+
* Enable cronjobs once all is well.
139+
140+
## Future Improvements
141+
142+
We're trying to automate around very old, inflexible legacy tools and BIOS/queueing systems riddled with decade-old bugs and oddities using Ansible, K.I.S.S. record/state keeping and enough user self-service to save ourselves a lot of time and effort. Newer systems like iDRAC9 tend to have less of an issue here and none of this is a problem for SuperMicro as they only adopt a bare-bones IPMI version 2.0 spec which doesn't even allow you to juggle selective network interfaces in between localdisk for your boot order.
143+
144+
Like all things it can be improved. Here's where we'll be going soon:
145+
146+
* Move entirely to [badfish](https://github.com/redhat-performance/badfish) and the Redfish API
147+
* No longer juggle interface order to provision, instead utilize a one-shot boot approach (only PXE to provision on the Foreman interface when it needs to happen)
148+
* Enforce only the `director` interface order, and only when machines are reclaimed but before they go out for another assignment.

bin/quads-validate-boot-order.sh

Lines changed: 4 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -16,6 +16,7 @@ data_dir=${quads["data_dir"]}
1616
quads=${quads["install_dir"]}/bin/quads-cli
1717
foreman_param=${quads["foreman_director_parameter"]}
1818
lockdir=$data_dir/lock
19+
domain=${quads["domain"]}
1920

2021
[ ! -d $lockdir ] && mkdir -p $lockdir
2122

@@ -38,7 +39,8 @@ if [ ! -d $data_dir/bootstate ]; then
3839
mkdir $data_dir/bootstate
3940
fi
4041

41-
for h in $(hammer host list --search params.${foreman_param}=true | grep redhat.com | awk '{ print $3 }' ) ; do
42+
# ensure we don't flop boot order until host is built if marked for build in nullos: true or director setting
43+
for h in $(hammer host list --search params.${foreman_param}=true | grep ${domain} | awk '{ print $3 }' ) ; do
4244
if [ -f $data_dir/bootstate/$h ]; then
4345
current_state=$(cat $data_dir/bootstate/$h)
4446
if [ "$current_state" != "director" ]; then
@@ -50,7 +52,7 @@ for h in $(hammer host list --search params.${foreman_param}=true | grep redhat.
5052
fi
5153
done
5254

53-
for h in $(hammer host list --search params.${foreman_param}=false | grep redhat.com | awk '{ print $3 }' ) ; do
55+
for h in $(hammer host list --search params.${foreman_param}=false | grep ${domain} | awk '{ print $3 }' ) ; do
5456
if [ -f $data_dir/bootstate/$h ]; then
5557
current_state=$(cat $data_dir/bootstate/$h)
5658
if [ "$current_state" != "foreman" ]; then
