Commit bf023be

Add openstack operation docs

=======================
Operating Control Plane
=======================

Backup of the OpenStack Control Plane
=====================================

As the backup procedure is constantly changing, it is normally best to check
the upstream documentation for an up-to-date procedure. Here is a high-level
overview of the key things you need to back up:

Controllers
-----------

* `Back up SQL databases <https://docs.openstack.org/kayobe/latest/administration/overcloud.html#performing-database-backups>`__
* `Back up configuration in /etc/kolla <https://docs.openstack.org/kayobe/latest/administration/overcloud.html#saving-overcloud-service-configuration>`__

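Both of these are wrapped by Kayobe commands. A minimal sketch, assuming a
Kayobe virtualenv is active and the Kayobe configuration environment has been
sourced (check the linked documentation for the current syntax):

.. code-block:: console

   # Dump the MariaDB/Galera databases from the controllers.
   kayobe# kayobe overcloud database backup

   # Save the generated service configuration under /etc/kolla on the
   # overcloud hosts to the Ansible control host.
   kayobe# kayobe overcloud service configuration save
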
Compute
-------

The compute nodes can largely be thought of as ephemeral, but you do need to
make sure you have migrated any instances and disabled the hypervisor before
decommissioning or making any disruptive configuration change.

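A minimal sketch of draining a hypervisor before disruptive work, assuming
admin credentials are loaded (exact client flags vary between OpenStack
client versions):

.. code-block:: console

   # Stop new instances from being scheduled to the hypervisor.
   admin# openstack compute service set --disable --disable-reason maintenance <Hypervisor name> nova-compute

   # List the instances still running there, then live migrate them away.
   admin# openstack server list --all-projects --host <Hypervisor name>
   admin# openstack server migrate --live-migration <VM name>
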
Monitoring
----------

* `Back up InfluxDB <https://docs.influxdata.com/influxdb/v1.8/administration/backup_and_restore/>`__
* `Back up ElasticSearch <https://www.elastic.co/guide/en/elasticsearch/reference/current/backup-cluster-data.html>`__
* `Back up Prometheus <https://prometheus.io/docs/prometheus/latest/querying/api/#snapshot>`__

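For example, a backup of the time series databases might look like the
following sketch. The container names and the Prometheus port are assumptions
based on kolla-ansible defaults, and the Prometheus snapshot API is only
available when the server runs with ``--web.enable-admin-api``:

.. code-block:: console

   # InfluxDB 1.x: take a portable backup of all databases.
   monitoring0# docker exec influxdb influxd backup -portable /var/lib/influxdb/backup

   # Prometheus: request an on-disk snapshot of the TSDB.
   monitoring0# curl -XPOST http://localhost:9090/api/v1/admin/tsdb/snapshot
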
Seed
----

* `Back up bifrost <https://docs.openstack.org/kayobe/latest/administration/seed.html#database-backup-restore>`__

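A rough sketch of dumping the bifrost database on the seed host; the container
name ``bifrost_deploy`` and the exact ``mysqldump`` options are assumptions, so
follow the linked Kayobe seed documentation for the supported procedure:

.. code-block:: console

   seed# docker exec bifrost_deploy \
           mysqldump --all-databases --single-transaction > bifrost-backup.sql
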
Ansible control host
--------------------

* Back up service VMs such as the seed VM

Control Plane Monitoring
========================

The control plane has been configured to collect logs centrally using the EFK
stack (Elasticsearch, Fluentd and Kibana).

Telemetry monitoring of the control plane is performed by Prometheus. Metrics
are collected by Prometheus exporters, which run either on all hosts (e.g. the
node exporter) or on specific hosts (e.g. the memcached exporter on the
controllers, or the OpenStack exporter on the monitoring hosts). These
exporters are scraped by the Prometheus server.

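To see which exporters are currently being scraped, the Prometheus targets API
can be queried. This is a sketch; the port (9090 is the Prometheus default) and
the availability of ``jq`` on the monitoring host are assumptions:

.. code-block:: console

   monitoring0# curl -s http://localhost:9090/api/v1/targets | \
                  jq -r '.data.activeTargets[].labels.job' | sort | uniq -c
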
Configuring Prometheus Alerts
-----------------------------

Alerts are defined in code and stored in Kayobe configuration. See ``*.rules``
files in ``${KAYOBE_CONFIG_PATH}/kolla/config/prometheus`` as a model to add
custom rules.

Silencing Prometheus Alerts
---------------------------

Sometimes alerts must be silenced because the root cause cannot be resolved
right away, such as when hardware is faulty. For example, an unreachable
hypervisor will produce several alerts:

* ``InstanceDown`` from Node Exporter
* ``OpenStackServiceDown`` from the OpenStack exporter, which reports the
  status of the ``nova-compute`` agent on the host
* ``PrometheusTargetMissing`` from several Prometheus exporters

Rather than silencing each alert one by one for a specific host, a silence can
apply to multiple alerts using a reduced list of labels. :ref:`Log into
Alertmanager <prometheus-alertmanager>`, click on the ``Silence`` button next
to an alert and adjust the matcher list to keep only the
``instance=<hostname>`` label. Then, create another silence to match
``hostname=<hostname>`` (this is required because, for the OpenStack exporter,
the instance is the host running the monitoring service rather than the host
being monitored).

.. note::

   After creating the silence, you may get redirected to a 404 page. This is a
   `known issue <https://github.com/prometheus/alertmanager/issues/1377>`__
   when running several Alertmanager instances behind HAProxy.

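If the ``amtool`` CLI is available, equivalent silences can also be created
from the command line. A sketch, assuming Alertmanager is reachable on
``localhost:9093`` from the monitoring host:

.. code-block:: console

   # Silence alerts whose instance label matches the failed hypervisor.
   monitoring0# amtool silence add instance=<hostname> \
        --alertmanager.url=http://localhost:9093 \
        --comment="<hostname> down for hardware replacement" --duration=24h

   # Silence OpenStack exporter alerts, which carry the hostname label instead.
   monitoring0# amtool silence add hostname=<hostname> \
        --alertmanager.url=http://localhost:9093 \
        --comment="<hostname> down for hardware replacement" --duration=24h
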
Generating Alerts from Metrics
++++++++++++++++++++++++++++++

Alerts are defined in code and stored in Kayobe configuration. See ``*.rules``
files in ``${KAYOBE_CONFIG_PATH}/kolla/config/prometheus`` as a model to add
custom rules.

Control Plane Shutdown Procedure
================================

Overview
--------

* Verify the integrity of clustered components (RabbitMQ, Galera, Keepalived).
  They should all report a healthy status.
* Put the node into maintenance mode in bifrost to prevent it from
  automatically powering back on (see the sketch below)
* Shut down nodes one at a time gracefully using ``systemctl poweroff``

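A sketch of the maintenance and power-off steps; the ``bifrost_deploy``
container name and the ``bifrost`` cloud entry are assumptions based on a
typical Kayobe seed deployment:

.. code-block:: console

   # On the seed, mark the node as in maintenance so bifrost/ironic does not
   # power it back on.
   seed# docker exec -it bifrost_deploy bash
   (bifrost_deploy)# export OS_CLOUD=bifrost
   (bifrost_deploy)# openstack baremetal node maintenance set <node>

   # Then shut the host down gracefully from the host itself.
   host# sudo systemctl poweroff
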
Controllers
-----------

If you are restarting the controllers, it is best to do this one controller at
a time to avoid the clustered components losing quorum.

Checking Galera state
+++++++++++++++++++++

On each controller perform the following:

.. code-block:: console

   [stack@controller0 ~]$ docker exec -i mariadb mysql -u root -p -e "SHOW STATUS LIKE 'wsrep_local_state_comment'"
   Variable_name Value
   wsrep_local_state_comment Synced

The password can be found using:

.. code-block:: console

   kayobe# ansible-vault view ${KAYOBE_CONFIG_PATH}/kolla/passwords.yml \
       --vault-password-file <Vault password file path> | grep ^database

Checking RabbitMQ
+++++++++++++++++

RabbitMQ health is determined using the command ``rabbitmqctl cluster_status``:

.. code-block:: console

   [stack@controller0 ~]$ docker exec rabbitmq rabbitmqctl cluster_status
   Cluster status of node rabbit@controller0 ...
   [{nodes,[{disc,['rabbit@controller0','rabbit@controller1',
                   'rabbit@controller2']}]},
    {running_nodes,['rabbit@controller1','rabbit@controller2',
                    'rabbit@controller0']},
    {cluster_name,<<"rabbit@controller2">>},
    {partitions,[]},
    {alarms,[{'rabbit@controller1',[]},
             {'rabbit@controller2',[]},
             {'rabbit@controller0',[]}]}]

Checking Keepalived
+++++++++++++++++++

On each of the (for example) three controllers:

.. code-block:: console

   [stack@controller0 ~]$ docker logs keepalived

Two of the instances should show:

.. code-block:: console

   VRRP_Instance(kolla_internal_vip_51) Entering BACKUP STATE

and the other:

.. code-block:: console

   VRRP_Instance(kolla_internal_vip_51) Entering MASTER STATE

Ansible Control Host
--------------------

The Ansible control host is not enrolled in bifrost. This node may run services
such as the seed virtual machine, which will need to be gracefully powered
down.

Compute
-------

If you are shutting down a single hypervisor, it is advisable to migrate all of
the instances to another machine first, to avoid downtime for tenants. See
:ref:`evacuating-all-instances`.

.. ifconfig:: deployment['ceph_managed']

   Ceph
   ----

   The following guide provides a good overview:
   https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/8/html/director_installation_and_usage/sect-rebooting-ceph

Shutting down the seed VM
-------------------------

.. code-block:: console

   kayobe# virsh shutdown <Seed node>

.. _full-shutdown:

Full shutdown
-------------

In case a full shutdown of the system is required, we advise using the
following order:

* Perform a graceful shutdown of all virtual machine instances
* Shut down compute nodes
* Shut down monitoring node
* Shut down network nodes (if separate from controllers)
* Shut down controllers
* Shut down Ceph nodes (if applicable)
* Shut down seed VM
* Shut down Ansible control host

Rebooting a node
----------------

Example: Reboot all compute hosts apart from compute0:

.. code-block:: console

   kayobe# kayobe overcloud host command run --limit 'compute:!compute0' -b --command "shutdown -r"

References
----------

* https://galeracluster.com/library/training/tutorials/restarting-cluster.html

Control Plane Power on Procedure
================================

Overview
--------

* Remove the node from maintenance mode in bifrost (see the sketch below)
* Bifrost should automatically power on the node via IPMI
* Check that all Docker containers are running
* Check Kibana for any messages with log level ERROR or equivalent

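A sketch of the first and third steps; as in the shutdown procedure, the
``bifrost_deploy`` container name and the ``bifrost`` cloud entry are
assumptions:

.. code-block:: console

   # On the seed, take the node out of maintenance so bifrost manages its
   # power state again.
   (bifrost_deploy)# openstack baremetal node maintenance unset <node>

   # Once a host is back up, look for containers that are not running.
   host# docker ps --all --filter status=exited
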
Controllers
-----------

If all of the servers were shut down at the same time, it is necessary to run a
script to recover the database once they have all started up. This can be done
with the following command:

.. code-block:: console

   kayobe# kayobe overcloud database recover

Ansible Control Host
--------------------

The Ansible control host is not enrolled in Bifrost and will have to be powered
on manually.

Seed VM
-------

The seed VM (and any other service VM) should start automatically when the seed
hypervisor is powered on. If it does not, it can be started with:

.. code-block:: console

   kayobe# virsh start seed-0

Full power on
-------------

Follow the steps in :ref:`full-shutdown`, but in reverse order.

Shutting Down / Restarting Monitoring Services
----------------------------------------------

Shutting down
+++++++++++++

Log into the monitoring host(s):

.. code-block:: console

   kayobe# ssh stack@monitoring0

Stop all Docker containers:

.. code-block:: console

   monitoring0# for i in `docker ps -q`; do docker stop $i; done

Shut down the node:

.. code-block:: console

   monitoring0# sudo shutdown -h

Starting up
+++++++++++

The monitoring service containers will automatically start when the monitoring
node is powered back on.

Software Updates
================

Update Packages on Control Plane
--------------------------------

OS packages can be updated with:

.. code-block:: console

   kayobe# kayobe overcloud host package update --limit <Hypervisor node> --packages '*'
   kayobe# kayobe overcloud seed package update --packages '*'

See https://docs.openstack.org/kayobe/latest/administration/overcloud.html#updating-packages

Minor Upgrades to OpenStack Services
------------------------------------

* Pull the latest changes from the upstream stable branch into your own ``kolla`` fork (if applicable)
* Update ``kolla_openstack_release`` in ``etc/kayobe/kolla.yml`` (unless using the default)
* Update tags for the images in ``etc/kayobe/kolla/globals.yml`` to use the new value of ``kolla_openstack_release``
* Rebuild container images
* Pull container images to overcloud hosts
* Run ``kayobe overcloud service upgrade`` (see the sketch below)

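The last three steps map onto Kayobe commands. A minimal sketch, assuming the
images are built on the control host and pushed to the local registry:

.. code-block:: console

   kayobe# kayobe overcloud container image build --push
   kayobe# kayobe overcloud container image pull
   kayobe# kayobe overcloud service upgrade
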
For more information, see: https://docs.openstack.org/kayobe/latest/upgrading.html

Troubleshooting
===============

Deploying to a Specific Hypervisor
----------------------------------

To test creating an instance on a specific hypervisor, *as an admin-level user*
you can specify the hypervisor name as part of an extended availability zone
description.

To see the list of hypervisor names:

.. code-block:: console

   admin# openstack hypervisor list

To boot an instance on a specific hypervisor:

.. code-block:: console

   admin# openstack server create --flavor <Flavour name> --network <Network name> --key-name <key> --image <Image name> --availability-zone nova::<Hypervisor name> <VM name>

Cleanup Procedures
==================

OpenStack services can sometimes fail to remove all resources correctly. This
is the case with Magnum, which fails to clean up users in its domain after
clusters are deleted. `A patch has been submitted to stable branches
<https://review.opendev.org/#/q/Ibadd5b57fe175bb0b100266e2dbcc2e1ea4efcf9>`__.
Until this fix becomes available, if Magnum is in use, administrators can
perform the following cleanup procedure regularly:

.. code-block:: console

   admin# for user in $(openstack user list --domain magnum -f value -c Name | grep -v magnum_trustee_domain_admin); do
       if openstack coe cluster list -c uuid -f value | grep -q $(echo $user | sed 's/_[0-9a-f]*$//'); then
           echo "$user still in use, not deleting"
       else
           openstack user delete --domain magnum $user
       fi
   done

OpenSearch indexes retention
============================

To alter the default rotation values for OpenSearch, edit
``${KAYOBE_CONFIG_PATH}/kolla/globals.yml``:

.. code-block:: yaml

   # Duration after which index is closed (default 30)
   opensearch_soft_retention_period_days: 90

   # Duration after which index is deleted (default 60)
   opensearch_hard_retention_period_days: 180

Reconfigure OpenSearch with the new values:

.. code-block:: console

   kayobe# kayobe overcloud service reconfigure --kolla-tags opensearch

For more information, see the `upstream documentation
<https://docs.openstack.org/kolla-ansible/latest/reference/logging-and-monitoring/central-logging-guide.html#applying-log-retention-policies>`__.