=============================
Hardware Inventory Management
=============================

At its lowest level, hardware inventory is managed in the Bifrost service.

Reconfiguring Control Plane Hardware
------------------------------------

If a server's hardware or firmware configuration is changed, it should be
re-inspected in Bifrost before it is redeployed into service. A single server
can be re-inspected as follows:

.. code-block:: console

   kayobe# kayobe overcloud hardware inspect --limit <Hostname>

.. _enrolling-new-hypervisors:

Enrolling New Hypervisors
-------------------------

New hypervisors can be added to the Bifrost inventory by using its discovery
capabilities. Assuming that the new hypervisors have IPMI enabled and are
configured to network boot on the provisioning network, the following commands
will instruct them to PXE boot. The nodes will boot the Ironic Python Agent
kernel and ramdisk, which is configured to extract hardware information and
send it to Bifrost. Note that IPMI credentials can be found in the encrypted
file located at ``${KAYOBE_CONFIG_PATH}/secrets.yml``.

.. code-block:: console

   bifrost# ipmitool -I lanplus -U <ipmi username> -H <Hostname>-ipmi chassis bootdev pxe

If the nodes are off, power them on:

.. code-block:: console

   bifrost# ipmitool -I lanplus -U <ipmi username> -H <Hostname>-ipmi power on

If the nodes are on, reset them:

.. code-block:: console

   bifrost# ipmitool -I lanplus -U <ipmi username> -H <Hostname>-ipmi power reset

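When several new hypervisors are enrolled at once, the per-host ``ipmitool``
invocations above can be scripted. A minimal dry-run sketch, assuming
hypothetical hostnames ``cpt01``..``cpt03`` and IPMI user ``admin`` (the real
credentials live in ``${KAYOBE_CONFIG_PATH}/secrets.yml``; remove the ``echo``
to actually issue the commands):

```shell
# Dry-run batch enrolment: hostnames and the IPMI user are placeholders.
IPMI_USER="admin"                  # assumed IPMI username
for host in cpt01 cpt02 cpt03; do  # assumed new hypervisor names
    # "echo" keeps this a dry run; drop it to execute the commands
    echo ipmitool -I lanplus -U "$IPMI_USER" -H "${host}-ipmi" chassis bootdev pxe
    echo ipmitool -I lanplus -U "$IPMI_USER" -H "${host}-ipmi" power on
done
```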
Once the nodes have booted and completed introspection, they should be visible
in Bifrost:

.. code-block:: console

   bifrost# baremetal node list --provision-state enroll
   +--------------------------------------+-----------------------+---------------+-------------+--------------------+-------------+
   | UUID                                 | Name                  | Instance UUID | Power State | Provisioning State | Maintenance |
   +--------------------------------------+-----------------------+---------------+-------------+--------------------+-------------+
   | da0c61af-b411-41b9-8909-df2509f2059b | example-hypervisor-01 | None          | power off   | enroll             | False       |
   +--------------------------------------+-----------------------+---------------+-------------+--------------------+-------------+

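For scripting, node names can be extracted from a listing like the one above.
A sketch using ``awk`` on a captured table (the heredoc stands in for real
``baremetal node list`` output; in practice, pipe the command itself into
``awk``):

```shell
# Pull the Name column out of a saved "baremetal node list" table.
# Table rows start with "|"; field 3 (splitting on "|") is the node name.
names=$(awk -F'|' '/^\|/ && $3 !~ /Name/ {gsub(/ /, "", $3); print $3}' <<'EOF'
+--------------------------------------+-----------------------+
| UUID                                 | Name                  |
+--------------------------------------+-----------------------+
| da0c61af-b411-41b9-8909-df2509f2059b | example-hypervisor-01 |
+--------------------------------------+-----------------------+
EOF
)
echo "$names"
```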
After editing ``${KAYOBE_CONFIG_PATH}/overcloud.yml`` to add these new hosts to
the correct groups, import them into Kayobe's inventory with:

.. code-block:: console

   kayobe# kayobe overcloud inventory discover

We can then provision and configure them:

.. code-block:: console

   kayobe# kayobe overcloud provision --limit <Hostname>
   kayobe# kayobe overcloud host configure --limit <Hostname>
   kayobe# kayobe overcloud service deploy --limit <Hostname> --kolla-limit <Hostname>

Replacing a Failing Hypervisor
------------------------------

To replace a failing hypervisor, proceed as follows:

* :ref:`Disable the hypervisor to avoid scheduling any new instance on it <taking-a-hypervisor-out-of-service>`
* :ref:`Evacuate all instances <evacuating-all-instances>`
* :ref:`Set the node to maintenance mode in Bifrost <set-bifrost-maintenance-mode>`
* Physically fix or replace the node
* It may be necessary to re-inspect the node if hardware was changed (this will require deprovisioning and reprovisioning)
* If the node was replaced or reprovisioned, follow :ref:`enrolling-new-hypervisors`

To deprovision an existing hypervisor, run:

.. code-block:: console

   kayobe# kayobe overcloud deprovision --limit <Hypervisor hostname>

.. warning::

   Always use ``--limit`` with ``kayobe overcloud deprovision`` on a production
   system. Running this command without a limit will deprovision all overcloud
   hosts.

.. _evacuating-all-instances:

Evacuating all instances
------------------------

.. code-block:: console

   admin# openstack server evacuate $(openstack server list --host <Hypervisor hostname> --format value --column ID)

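``openstack server evacuate`` operates on one server at a time, so when the
hypervisor hosts many instances it can be clearer to loop over the IDs
individually. A dry-run sketch with placeholder IDs (remove the ``echo`` to
execute):

```shell
# Evacuate instances one at a time; "echo" keeps this a dry run.
# The IDs below are placeholders; in practice populate them with:
#   ids=$(openstack server list --host <Hypervisor hostname> -f value -c ID)
ids="11111111-aaaa 22222222-bbbb"
for id in $ids; do
    echo openstack server evacuate "$id"
done
```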
You should now check the status of all the instances that were running on that
hypervisor. They should all show the status ``ACTIVE``. This can be verified
with:

.. code-block:: console

   admin# openstack server show <instance uuid>

Troubleshooting
+++++++++++++++

Servers that have been shut down
********************************

If there are any instances in ``SHUTOFF`` state, they won't be migrated, but
you can use ``openstack server migrate`` for them once the live migration is
finished.

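The ``SHUTOFF`` instances can be picked out of a server listing by status. A
sketch over captured output (the heredoc stands in for hypothetical
``openstack server list --host <Hypervisor hostname> -f value -c ID -c Status``
output):

```shell
# Select instance IDs whose status is SHUTOFF; these need
# "openstack server migrate" rather than evacuation.
shutoff=$(awk '$2 == "SHUTOFF" {print $1}' <<'EOF'
0b8b4cb1-aaaa ACTIVE
7c1d2e3f-bbbb SHUTOFF
9a8b7c6d-cccc ACTIVE
EOF
)
echo "$shutoff"
```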
Also, if a VM does heavy memory access, it may take a very long time to
migrate (Nova tries to incrementally increase the expected downtime, but is
quite conservative). You can use ``openstack server migration force complete
--os-compute-api-version 2.22 <instance_uuid> <migration_id>`` to trigger the
final move.

You can get the migration ID via ``openstack server migration list --server
<instance_uuid>``.

For more details see:
http://www.danplanet.com/blog/2016/03/03/evacuate-in-nova-one-command-to-confuse-us-all/

Flavors have changed
********************

If the size of the flavors has changed, some instances will also fail to
migrate as the process needs manual confirmation. You can do this with:

.. code-block:: console

   admin# openstack server resize confirm <instance-uuid>

The symptom to look out for is that the server is showing a status of
``VERIFY_RESIZE``, as shown in this snippet of ``openstack server show
<instance-uuid>``:

.. code-block:: console

   | status | VERIFY_RESIZE |

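When many servers are stuck in ``VERIFY_RESIZE``, the confirmations can be
scripted. A dry-run sketch over a captured listing (the heredoc stands in for
hypothetical ``openstack server list -f value -c ID -c Status`` output; remove
the ``echo`` to execute):

```shell
# Find servers waiting on resize confirmation and print (dry run)
# the confirm command for each.
pending=$(awk '$2 == "VERIFY_RESIZE" {print $1}' <<'EOF'
3f2a1b0c-aaaa ACTIVE
5d4c3b2a-bbbb VERIFY_RESIZE
EOF
)
for id in $pending; do
    echo openstack server resize confirm "$id"
done
```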
.. _set-bifrost-maintenance-mode:

Set maintenance mode on a node in Bifrost
+++++++++++++++++++++++++++++++++++++++++

.. code-block:: console

   seed# docker exec -it bifrost_deploy /bin/bash
   (bifrost-deploy)[root@seed bifrost-base]# export OS_CLOUD=bifrost
   (bifrost-deploy)[root@seed bifrost-base]# baremetal node maintenance set <Hostname>

.. _unset-bifrost-maintenance-mode:

Unset maintenance mode on a node in Bifrost
+++++++++++++++++++++++++++++++++++++++++++

.. code-block:: console

   seed# docker exec -it bifrost_deploy /bin/bash
   (bifrost-deploy)[root@seed bifrost-base]# export OS_CLOUD=bifrost
   (bifrost-deploy)[root@seed bifrost-base]# baremetal node maintenance unset <Hostname>

Detect hardware differences with ADVise
=======================================

Extract Bifrost introspection data
----------------------------------

The ADVise tool assumes that hardware introspection data has already been
gathered in JSON format. The ``extra-hardware`` disk image builder element
must be enabled when building the IPA image for the required data to be
available.

To build an IPA image with ``extra-hardware``, edit ``ipa.yml`` and add the
following:

.. code-block:: yaml

   # Whether to build IPA images from source.
   ipa_build_images: true

   # List of additional Diskimage Builder (DIB) elements to use when building
   # IPA images. Default is none.
   ipa_build_dib_elements_extra:
     - "extra-hardware"

   # List of additional inspection collectors to run.
   ipa_collectors_extra:
     - "extra-hardware"

Extract the introspection data from Bifrost with Kayobe. JSON files will be
saved to ``${KAYOBE_CONFIG_PATH}/overcloud-introspection-data``:

.. code-block:: console

   kayobe# kayobe overcloud introspection data save

Using ADVise
------------

Hardware information captured during the Ironic introspection process can be
analysed to detect hardware differences, such as mismatches in firmware
versions or missing storage devices. The `ADVise <https://github.com/stackhpc/ADVise>`__
tool can be used for this purpose.

The Ansible playbook ``advise-run.yml`` can be found at
``${KAYOBE_CONFIG_PATH}/ansible/advise-run.yml``.

The playbook will:

1. Install ADVise and its dependencies.
2. Run the mungetout utility to extract the required information from the
   introspection data, ready for use with ADVise.
3. Run ADVise on the data.

.. code-block:: console

   cd ${KAYOBE_CONFIG_PATH}
   ansible-playbook ${KAYOBE_CONFIG_PATH}/ansible/advise-run.yml

The playbook has the following optional parameters:

- venv: path to the virtual environment to use. Default:
  ``"~/venvs/advise-review"``
- input_dir: path to the hardware introspection data. Default:
  ``"{{ lookup('env', 'PWD') }}/overcloud-introspection-data"``
- output_dir: path to where results should be saved. Default:
  ``"{{ lookup('env', 'PWD') }}/review"``
- advise_pattern: regular expression specifying which introspection data
  files should be analysed. Default: ``".*.eval"``

Example command to run the tool on data about the compute nodes in a system,
where compute nodes are named cpt01, cpt02, cpt03, and so on:

.. code-block:: console

   ansible-playbook advise-run.yml -e advise_pattern='(cpt)(.*)(.eval)'

.. note::

   The mungetout utility will always use the file extension ``.eval``.

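The ``advise_pattern`` expression is matched against introspection file names,
so it can be sanity-checked locally before running the playbook. A sketch with
hypothetical file names:

```shell
# Check which file names the example pattern would select:
# compute nodes (cpt*) match, a controller (ctrl01) does not.
pattern='(cpt)(.*)(.eval)'
matched=$(printf '%s\n' cpt01.eval cpt02.eval ctrl01.eval | grep -E "$pattern")
echo "$matched"
```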
Using the results
-----------------

The ADVise tool will output a selection of results under
``output_dir/results``. These include:

- ``.html`` files to display network visualisations of any hardware
  differences.
- The folder ``Paired_Comparisons``, which contains information on the shared
  and differing fields found between the systems. This is a reflection of the
  network visualisation webpage, with more detail as to what the differences
  are.
- ``_summary``, a listing of how the systems can be grouped into sets of
  identical hardware.
- ``_performance``, the results of analysing the benchmarking data gathered.
- ``_perf_summary``, a subset of the performance metrics, showing only
  potentially anomalous data, such as where variance is too high or individual
  nodes have been found to over- or underperform.

To view the visualised results, it is recommended to copy the introspection
data and review directories to your local machine, then run the ADVise
playbook locally with the data.