=============================
Hardware Inventory Management
=============================

At its lowest level, hardware inventory is managed in the Bifrost service.

Reconfiguring Control Plane Hardware
------------------------------------

If a server's hardware or firmware configuration is changed, it should be
re-inspected in Bifrost before it is redeployed into service. A single server
can be re-inspected as follows:

.. code-block:: console

   kayobe# kayobe overcloud hardware inspect --limit <Host name>

.. _enrolling-new-hypervisors:

Enrolling New Hypervisors
-------------------------

New hypervisors can be added to the Bifrost inventory by using its discovery
capabilities. Assuming that new hypervisors have IPMI enabled and are
configured to network boot on the provisioning network, the following commands
will instruct them to PXE boot. The nodes will boot into the Ironic Python
Agent kernel and ramdisk, which is configured to extract hardware information
and send it to Bifrost. Note that IPMI credentials can be found in the
encrypted file located at ``${KAYOBE_CONFIG_PATH}/secrets.yml``.

.. code-block:: console

   bifrost# ipmitool -I lanplus -U <ipmi username> -H <Hostname>-ipmi chassis bootdev pxe

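The ``secrets.yml`` file is normally encrypted with Ansible Vault; assuming
that convention, it can be viewed with:

.. code-block:: console

   kayobe# ansible-vault view ${KAYOBE_CONFIG_PATH}/secrets.yml
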
If the nodes are powered off, power them on:

.. code-block:: console

   bifrost# ipmitool -I lanplus -U <ipmi username> -H <Hostname>-ipmi power on

If the nodes are powered on, reset them:

.. code-block:: console

   bifrost# ipmitool -I lanplus -U <ipmi username> -H <Hostname>-ipmi power reset

Once the nodes have booted and completed introspection, they should be visible
in Bifrost:

.. code-block:: console

   bifrost# baremetal node list --provision-state enroll
   +--------------------------------------+-----------------------+---------------+-------------+--------------------+-------------+
   | UUID                                 | Name                  | Instance UUID | Power State | Provisioning State | Maintenance |
   +--------------------------------------+-----------------------+---------------+-------------+--------------------+-------------+
   | da0c61af-b411-41b9-8909-df2509f2059b | example-hypervisor-01 | None          | power off   | enroll             | False       |
   +--------------------------------------+-----------------------+---------------+-------------+--------------------+-------------+

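To examine a newly discovered node in more detail, a command such as the
following can be used:

.. code-block:: console

   bifrost# baremetal node show example-hypervisor-01
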
Edit ``${KAYOBE_CONFIG_PATH}/overcloud.yml`` to add these new hosts to the
correct groups.
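
In a typical Kayobe configuration, group membership is defined via the
``overcloud_group_hosts_map`` variable; the following is a hypothetical
excerpt (group and host names depend on your configuration):

.. code-block:: yaml

   # Hypothetical example: place the new host in the "compute" group.
   overcloud_group_hosts_map:
     compute:
       - example-hypervisor-01

Then import the new hosts into Kayobe's inventory with: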

.. code-block:: console

   kayobe# kayobe overcloud inventory discover

We can then provision and configure them:

.. code-block:: console

   kayobe# kayobe overcloud provision --limit <Hostname>
   kayobe# kayobe overcloud host configure --limit <Hostname>
   kayobe# kayobe overcloud service deploy --limit <Hostname> --kolla-limit <Hostname>

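After the services are deployed, it is worth confirming that the new
hypervisor has registered itself with Nova (a sketch, assuming admin
credentials):

.. code-block:: console

   admin# openstack compute service list --service nova-compute
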
Replacing a Failing Hypervisor
------------------------------

To replace a failing hypervisor, proceed as follows:

* :ref:`Disable the hypervisor to avoid scheduling any new instance on it <taking-a-hypervisor-out-of-service>`
* :ref:`Evacuate all instances <evacuating-all-instances>`
* :ref:`Set the node to maintenance mode in Bifrost <set-bifrost-maintenance-mode>`
* Physically fix or replace the node
* It may be necessary to re-inspect the node if hardware was changed (this will require deprovisioning and reprovisioning)
* If the node was replaced or reprovisioned, follow :ref:`enrolling-new-hypervisors`

To deprovision an existing hypervisor, run:

.. code-block:: console

   kayobe# kayobe overcloud deprovision --limit <Hypervisor hostname>

.. warning::

   Always use ``--limit`` with ``kayobe overcloud deprovision`` on a production
   system. Running this command without a limit will deprovision all overcloud
   hosts.

.. _evacuating-all-instances:

Evacuating all instances
------------------------

.. code-block:: console

   admin# openstack server evacuate $(openstack server list --host <Hypervisor hostname> --format value --column ID)

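Note that ``openstack server evacuate`` acts on one server at a time, so if
the hypervisor hosted several instances a shell loop may be more robust (a
sketch, assuming a Bash-compatible shell):

.. code-block:: console

   admin# for id in $(openstack server list --host <Hypervisor hostname> --format value --column ID); do openstack server evacuate $id; done
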
You should now check the status of all the instances that were running on that
hypervisor. They should all show the status ``ACTIVE``. This can be verified
with:

.. code-block:: console

   admin# openstack server show <instance uuid>

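To spot any instances that failed to recover, one option is to filter on
status (a sketch, assuming admin credentials):

.. code-block:: console

   admin# openstack server list --all-projects --status ERROR
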
Troubleshooting
+++++++++++++++

Servers that have been shut down
********************************

If there are any instances that are ``SHUTOFF`` they won't be migrated, but
you can use ``openstack server migrate`` for them once the live migration is
finished.

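A minimal sketch of cold-migrating a stopped instance (the resize must be
confirmed once the migration completes):

.. code-block:: console

   admin# openstack server migrate <instance_uuid>
   admin# openstack server resize confirm <instance_uuid>
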
Also, if a VM is making heavy use of memory, it may take a very long time to
migrate (Nova tries to incrementally increase the expected downtime, but is
quite conservative). You can use ``openstack server migration force complete
--os-compute-api-version 2.22 <instance_uuid> <migration_id>`` to trigger the
final move.

You can find the migration ID via ``openstack server migration list --server <instance_uuid>``.
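
Putting the two commands together:

.. code-block:: console

   admin# openstack server migration list --server <instance_uuid>
   admin# openstack server migration force complete --os-compute-api-version 2.22 <instance_uuid> <migration_id>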

For more details see:
http://www.danplanet.com/blog/2016/03/03/evacuate-in-nova-one-command-to-confuse-us-all/

Flavors have changed
********************

If the size of the flavors has changed, some instances will also fail to
migrate, as the process needs manual confirmation. You can do this with:

.. code-block:: console

   admin# openstack server resize confirm <instance-uuid>

The symptom to look out for is that the server shows a status of
``VERIFY_RESIZE``, as in this snippet of ``openstack server show <instance-uuid>``:

.. code-block:: console

   | status | VERIFY_RESIZE |

.. _set-bifrost-maintenance-mode:

Set maintenance mode on a node in Bifrost
+++++++++++++++++++++++++++++++++++++++++

.. code-block:: console

   seed# docker exec -it bifrost_deploy /bin/bash
   (bifrost-deploy)[root@seed bifrost-base]# export OS_CLOUD=bifrost
   (bifrost-deploy)[root@seed bifrost-base]# baremetal node maintenance set <Hostname>

.. _unset-bifrost-maintenance-mode:

Unset maintenance mode on a node in Bifrost
+++++++++++++++++++++++++++++++++++++++++++

.. code-block:: console

   seed# docker exec -it bifrost_deploy /bin/bash
   (bifrost-deploy)[root@seed bifrost-base]# export OS_CLOUD=bifrost
   (bifrost-deploy)[root@seed bifrost-base]# baremetal node maintenance unset <Hostname>


Detect hardware differences with ADVise
=======================================

Extract Bifrost introspection data
----------------------------------

The ADVise tool assumes that hardware introspection data has already been
gathered in JSON format. The ``extra-hardware`` disk image builder element
must be enabled when building the IPA image for the required data to be
available.

To build an IPA image with ``extra-hardware``, edit ``ipa.yml`` and add the
following:

.. code-block:: yaml

   # Whether to build IPA images from source.
   ipa_build_images: true

   # List of additional Diskimage Builder (DIB) elements to use when building
   # IPA images. Default is none.
   ipa_build_dib_elements_extra:
     - "extra-hardware"

   # List of additional inspection collectors to run.
   ipa_collectors_extra:
     - "extra-hardware"

Extract the introspection data from Bifrost with Kayobe. JSON files will be
saved to ``${KAYOBE_CONFIG_PATH}/overcloud-introspection-data``:

.. code-block:: console

   kayobe# kayobe overcloud introspection data save

Using ADVise
------------

Hardware information captured during the Ironic introspection process can be
analysed to detect hardware differences, such as mismatches in firmware
versions or missing storage devices. The `ADVise <https://github.com/stackhpc/ADVise>`__
tool can be used for this purpose.

The Ansible playbook ``advise-run.yml`` can be found at
``${KAYOBE_CONFIG_PATH}/ansible/advise-run.yml``.

The playbook will:

1. Install ADVise and its dependencies.
2. Run the ``mungetout`` utility to extract the required information from the
   introspection data, ready for use with ADVise.
3. Run ADVise on the data.

.. code-block:: console

   kayobe# cd ${KAYOBE_CONFIG_PATH}
   kayobe# ansible-playbook ${KAYOBE_CONFIG_PATH}/ansible/advise-run.yml

The playbook has the following optional parameters:

- ``venv``: path to the virtual environment to use. Default: ``"~/venvs/advise-review"``
- ``input_dir``: path to the hardware introspection data. Default: ``"{{ lookup('env', 'PWD') }}/overcloud-introspection-data"``
- ``output_dir``: path to where results should be saved. Default: ``"{{ lookup('env', 'PWD') }}/review"``
- ``advise_pattern``: regular expression specifying which introspection data files should be analysed. Default: ``".*.eval"``

Example command to run the tool on data about the compute nodes in a system
where the compute nodes are named cpt01, cpt02, cpt03, and so on:

.. code-block:: console

   kayobe# ansible-playbook advise-run.yml -e advise_pattern='(cpt)(.*)(.eval)'

.. note::

   The ``mungetout`` utility always uses the file extension ``.eval``.

Using the results
-----------------

The ADVise tool will output a selection of results under ``output_dir/results``. These include:

- ``.html`` files to display network visualisations of any hardware differences.
- The folder ``Paired_Comparisons``, which contains information on the shared and differing fields found between the systems. This is a reflection of the network visualisation webpage, with more detail as to what the differences are.
- ``_summary``, a listing of how the systems can be grouped into sets of identical hardware.
- ``_performance``, the results of analysing the benchmarking data gathered.
- ``_perf_summary``, a subset of the performance metrics, showing only potentially anomalous data, such as where variance is too high or individual nodes have been found to over- or underperform.

To view the visualised results, it is recommended to copy the introspection
data and review directories to your local machine and run the ADVise playbook
locally on that data.
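
For example, assuming SSH access from your local machine (host and paths are
illustrative):

.. code-block:: console

   localhost$ rsync -av <kayobe host>:<path to overcloud-introspection-data>/ ./overcloud-introspection-data/
   localhost$ rsync -av <kayobe host>:<path to review directory>/ ./review/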