=======================
Operating Control Plane
=======================

Backup of the OpenStack Control Plane
=====================================

As the backup procedure is constantly changing, it is normally best to check
the upstream documentation for an up-to-date procedure. Here is a high-level
overview of the key things you need to back up:

Controllers
-----------

* `Back up SQL databases <https://docs.openstack.org/kayobe/latest/administration/overcloud.html#performing-database-backups>`__
* `Back up configuration in /etc/kolla <https://docs.openstack.org/kayobe/latest/administration/overcloud.html#saving-overcloud-service-configuration>`__

Compute
-------

The compute nodes can largely be thought of as ephemeral, but you do need to
make sure you have migrated any instances and disabled the hypervisor before
decommissioning or making any disruptive configuration change.

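As a rough sketch, assuming admin credentials are loaded and using placeholder
names (exact client options vary between OpenStack client versions), a
hypervisor can be disabled and its instances listed for migration before
disruptive work:

.. code-block:: console

   admin# openstack compute service set --disable --disable-reason maintenance <Hypervisor name> nova-compute
   admin# openstack server list --all-projects --host <Hypervisor name>
   admin# openstack server migrate --live-migration <VM name>
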
Monitoring
----------

* `Back up InfluxDB <https://docs.influxdata.com/influxdb/v1.8/administration/backup_and_restore/>`__
* `Back up ElasticSearch <https://www.elastic.co/guide/en/elasticsearch/reference/current/backup-cluster-data.html>`__
* `Back up Prometheus <https://prometheus.io/docs/prometheus/latest/querying/api/#snapshot>`__

Seed
----

* `Back up bifrost <https://docs.openstack.org/kayobe/latest/administration/seed.html#database-backup-restore>`__

Ansible control host
--------------------

* Back up service VMs such as the seed VM

Control Plane Monitoring
========================

The control plane has been configured to collect logs centrally using the EFK
stack (Elasticsearch, Fluentd and Kibana).

Telemetry monitoring of the control plane is performed by Prometheus. Metrics
are collected by Prometheus exporters, which run either on all hosts (e.g. the
node exporter) or on specific hosts (e.g. the memcached exporter on
controllers and the OpenStack exporter on monitoring hosts). These exporters
are scraped by the Prometheus server.

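To verify that an exporter is up, its metrics endpoint can be queried
directly. This is a quick sketch assuming the default node exporter port
(9100); adjust the host and port for other exporters:

.. code-block:: console

   kayobe# curl -s http://<Controller hostname>:9100/metrics | head
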
Configuring Prometheus Alerts
-----------------------------

Alerts are defined in code and stored in Kayobe configuration. See ``*.rules``
files in ``${KAYOBE_CONFIG_PATH}/kolla/config/prometheus`` as a model to add
custom rules.

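As an illustrative sketch only (the alert name, threshold and file name here
are hypothetical, not part of the shipped rules), a custom rules file such as
``custom.rules`` could look like this:

.. code-block:: yaml

   groups:
     - name: custom
       rules:
         - alert: HighRootDiskUsage
           expr: 100 * node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"} < 10
           for: 15m
           labels:
             severity: warning
           annotations:
             summary: "Root filesystem on {{ $labels.instance }} is below 10% free space"
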
Silencing Prometheus Alerts
---------------------------

Sometimes alerts must be silenced because the root cause cannot be resolved
right away, such as when hardware is faulty. For example, an unreachable
hypervisor will produce several alerts:

* ``InstanceDown`` from Node Exporter
* ``OpenStackServiceDown`` from the OpenStack exporter, which reports status of
  the ``nova-compute`` agent on the host
* ``PrometheusTargetMissing`` from several Prometheus exporters

Rather than silencing each alert one by one for a specific host, a silence can
apply to multiple alerts using a reduced list of labels. :ref:`Log into
Alertmanager <prometheus-alertmanager>`, click on the ``Silence`` button next
to an alert and adjust the matcher list to keep only the
``instance=<hostname>`` label. Then, create another silence to match
``hostname=<hostname>`` (this is required because, for the OpenStack exporter,
the instance is the host running the monitoring service rather than the host
being monitored).

.. note::

   After creating the silence, you may get redirected to a 404 page. This is a
   `known issue <https://github.com/prometheus/alertmanager/issues/1377>`__
   when running several Alertmanager instances behind HAProxy.

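Silences can also be created from the command line with ``amtool``, if it is
available. This is a sketch only, assuming the Alertmanager API is reachable
at the URL shown (adjust it to your deployment):

.. code-block:: console

   kayobe# amtool silence add --alertmanager.url=http://<Internal VIP>:9093 \
             --comment="Faulty hypervisor" --duration=24h instance=<hostname>
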
Control Plane Shutdown Procedure
================================

Overview
--------

* Verify integrity of clustered components (RabbitMQ, Galera, Keepalived). They
  should all report a healthy status.
* Put the node into maintenance mode in Bifrost to prevent it from
  automatically powering back on (see the sketch after this list).
* Shut down nodes one at a time gracefully using ``systemctl poweroff``.

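A sketch of putting a node into maintenance mode, assuming Bifrost runs in the
usual ``bifrost_deploy`` container on the seed and that a ``bifrost`` cloud is
defined in its ``clouds.yaml``:

.. code-block:: console

   seed# docker exec -it bifrost_deploy bash
   bifrost# export OS_CLOUD=bifrost
   bifrost# openstack baremetal node maintenance set <hostname>
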
Controllers
-----------

If you are restarting the controllers, it is best to do this one controller at
a time to avoid the clustered components losing quorum.

Checking Galera state
+++++++++++++++++++++

On each controller perform the following:

.. code-block:: console

   [stack@controller0 ~]$ docker exec -i mariadb mysql -u root -p -e "SHOW STATUS LIKE 'wsrep_local_state_comment'"
   Variable_name Value
   wsrep_local_state_comment Synced

The password can be found using:

.. code-block:: console

   kayobe# ansible-vault view ${KAYOBE_CONFIG_PATH}/kolla/passwords.yml \
       --vault-password-file <Vault password file path> | grep ^database
|
Checking RabbitMQ
+++++++++++++++++

RabbitMQ health is determined using the command ``rabbitmqctl cluster_status``:

.. code-block:: console

   [stack@controller0 ~]$ docker exec rabbitmq rabbitmqctl cluster_status
   Cluster status of node rabbit@controller0 ...
   [{nodes,[{disc,['rabbit@controller0','rabbit@controller1',
                   'rabbit@controller2']}]},
    {running_nodes,['rabbit@controller1','rabbit@controller2',
                    'rabbit@controller0']},
    {cluster_name,<<"rabbit@controller2">>},
    {partitions,[]},
    {alarms,[{'rabbit@controller1',[]},
             {'rabbit@controller2',[]},
             {'rabbit@controller0',[]}]}]
|
Checking Keepalived
+++++++++++++++++++

On each of the controllers (three, for example), check the Keepalived logs:

.. code-block:: console

   [stack@controller0 ~]$ docker logs keepalived

Two of the instances should show:

.. code-block:: console

   VRRP_Instance(kolla_internal_vip_51) Entering BACKUP STATE

and the other:

.. code-block:: console

   VRRP_Instance(kolla_internal_vip_51) Entering MASTER STATE
|
Ansible Control Host
--------------------

The Ansible control host is not enrolled in Bifrost. This node may run services
such as the seed virtual machine which will need to be gracefully powered down.

Compute
-------

If you are shutting down a single hypervisor, to avoid downtime for tenants it
is advisable to migrate all of the instances to another machine. See
:ref:`evacuating-all-instances`.

.. ifconfig:: deployment['ceph_managed']

   Ceph
   ----

   The following guide provides a good overview:
   https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/8/html/director_installation_and_usage/sect-rebooting-ceph

Shutting down the seed VM
-------------------------

.. code-block:: console

   kayobe# virsh shutdown <Seed node>

.. _full-shutdown:

Full shutdown
-------------

In case a full shutdown of the system is required, we advise using the
following order:

* Perform a graceful shutdown of all virtual machine instances (see the sketch
  after this list)
* Shut down compute nodes
* Shut down monitoring node
* Shut down network nodes (if separate from controllers)
* Shut down controllers
* Shut down Ceph nodes (if applicable)
* Shut down seed VM
* Shut down Ansible control host

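A possible sketch for the first step, assuming admin credentials and that all
ACTIVE instances may simply be stopped (projects should be warned beforehand):

.. code-block:: console

   admin# for server in $(openstack server list --all-projects --status ACTIVE -f value -c ID); do
              openstack server stop $server
          done
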
Rebooting a node
----------------

Example: Reboot all compute hosts apart from compute0:

.. code-block:: console

   kayobe# kayobe overcloud host command run --limit 'compute:!compute0' -b --command "shutdown -r"

References
----------

* https://galeracluster.com/library/training/tutorials/restarting-cluster.html

Control Plane Power on Procedure
================================

Overview
--------

* Remove the node from maintenance mode in Bifrost
* Bifrost should automatically power on the node via IPMI
* Check that all Docker containers are running (see the sketch after this list)
* Check Kibana for any messages with log level ERROR or equivalent

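A quick sketch for the container check, run from the Ansible control host and
assuming the standard ``overcloud`` group covers all control plane hosts; any
output indicates containers that have exited:

.. code-block:: console

   kayobe# kayobe overcloud host command run --limit overcloud -b \
             --command "docker ps --filter status=exited"
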
Controllers
-----------

If all of the servers were shut down at the same time, it is necessary to run a
script to recover the database once they have all started up. This can be done
with the following command:

.. code-block:: console

   kayobe# kayobe overcloud database recover

Ansible Control Host
--------------------

The Ansible control host is not enrolled in Bifrost and will have to be powered
on manually.

Seed VM
-------

The seed VM (and any other service VM) should start automatically when the seed
hypervisor is powered on. If it does not, it can be started with:

.. code-block:: console

   kayobe# virsh start seed-0
|
Full power on
-------------

Follow the order in :ref:`full-shutdown`, but in reverse.

Shutting Down / Restarting Monitoring Services
----------------------------------------------

Shutting down
+++++++++++++

Log into the monitoring host(s):

.. code-block:: console

   kayobe# ssh stack@monitoring0

Stop all Docker containers:

.. code-block:: console

   monitoring0# for i in `docker ps -q`; do docker stop $i; done

Shut down the node:

.. code-block:: console

   monitoring0# sudo shutdown -h

Starting up
+++++++++++

The monitoring service containers will start automatically when the monitoring
node is powered back on.

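To confirm that they have started, a simple check (assuming SSH access as the
``stack`` user) is:

.. code-block:: console

   kayobe# ssh stack@monitoring0 docker ps
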
Software Updates
================

Update Packages on Control Plane
--------------------------------

OS packages can be updated with:

.. code-block:: console

   kayobe# kayobe overcloud host package update --limit <Hypervisor node> --packages '*'
   kayobe# kayobe overcloud seed package update --packages '*'

See https://docs.openstack.org/kayobe/latest/administration/overcloud.html#updating-packages

Minor Upgrades to OpenStack Services
------------------------------------

* Pull latest changes from upstream stable branch to your own ``kolla`` fork (if applicable)
* Update ``kolla_openstack_release`` in ``etc/kayobe/kolla.yml`` (unless using default)
* Update tags for the images in ``etc/kayobe/kolla/globals.yml`` to use the new value of ``kolla_openstack_release``
* Rebuild container images (see the sketch after this list)
* Pull container images to overcloud hosts
* Run ``kayobe overcloud service upgrade``

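As a sketch of the final three steps (options such as whether images are
pushed to a registry vary between deployments):

.. code-block:: console

   kayobe# kayobe overcloud container image build --push
   kayobe# kayobe overcloud container image pull
   kayobe# kayobe overcloud service upgrade
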
For more information, see: https://docs.openstack.org/kayobe/latest/upgrading.html

Troubleshooting
===============

Deploying to a Specific Hypervisor
----------------------------------

To test creating an instance on a specific hypervisor, *as an admin-level user*
you can specify the hypervisor name as part of an extended availability zone
description.

To see the list of hypervisor names:

.. code-block:: console

   admin# openstack hypervisor list

To boot an instance on a specific hypervisor:

.. code-block:: console

   admin# openstack server create --flavor <Flavour name> --network <Network name> --key-name <key> --image <Image name> --availability-zone nova::<Hypervisor name> <VM name>
|
Cleanup Procedures
==================

OpenStack services can sometimes fail to remove all resources correctly. This
is the case with Magnum, which fails to clean up users in its domain after
clusters are deleted. `A patch has been submitted to stable branches
<https://review.opendev.org/#/q/Ibadd5b57fe175bb0b100266e2dbcc2e1ea4efcf9>`__.
Until this fix becomes available, if Magnum is in use, administrators can
perform the following cleanup procedure regularly:

.. code-block:: console

   admin# for user in $(openstack user list --domain magnum -f value -c Name | grep -v magnum_trustee_domain_admin); do
              if openstack coe cluster list -c uuid -f value | grep -q $(echo $user | sed 's/_[0-9a-f]*$//'); then
                  echo "$user still in use, not deleting"
              else
                  openstack user delete --domain magnum $user
              fi
          done
|
OpenSearch index retention
==========================

To alter the default retention values for OpenSearch, edit
``${KAYOBE_CONFIG_PATH}/kolla/globals.yml``:

.. code-block:: yaml

   # Duration after which index is closed (default 30)
   opensearch_soft_retention_period_days: 90
   # Duration after which index is deleted (default 60)
   opensearch_hard_retention_period_days: 180

Reconfigure OpenSearch with the new values:

.. code-block:: console

   kayobe# kayobe overcloud service reconfigure --kolla-tags opensearch
|
For more information, see the `upstream documentation
<https://docs.openstack.org/kolla-ansible/latest/reference/logging-and-monitoring/central-logging-guide.html#applying-log-retention-policies>`__.