
Commit 348e154

Merge pull request #21 from stackhpc/monitoring
Update monitoring documentation
2 parents: 4b0776f + 77b01ee

File tree

3 files changed: +48 additions, -155 deletions

source/introduction.rst

Lines changed: 0 additions & 12 deletions
@@ -60,12 +60,6 @@ A command that must be run within the Bifrost service container, hosted on the s
 
     A command that can be run (as superuser) from a running compute instance.
 
-``monasca#``
-
-    A command that must be run with OpenStack control plane admin credentials
-    loaded, and the Monasca client and supporting modules available (whether in a
-    virtualenv or installed in the OS libraries).
-
 Glossary of Terms
 -----------------
 
@@ -130,12 +124,6 @@ Glossary of Terms
     Multi-Chassis Link Aggregate - a method of providing multi-pathing and
     multi-switch redundancy in layer-2 networks.
 
-Monasca
-    OpenStack’s monitoring service (“Monitoring as a Service at Scale”).
-    Logging, telemetry and events from the infrastructure, control plane and
-    user projects can be submitted and processed by Monasca.
-    https://docs.openstack.org/monasca-api/latest/
-
 Neutron
     OpenStack’s networking service.
     https://docs.openstack.org/neutron/latest/

source/operations_and_monitoring.rst

Lines changed: 46 additions & 143 deletions
@@ -7,12 +7,12 @@ Operations and Monitoring
 Access to Kibana
 ================
 
-OpenStack control plane logs are aggregated from all servers by Monasca and
+OpenStack control plane logs are aggregated from all servers by Fluentd and
 stored in ElasticSearch. The control plane logs can be accessed from
 ElasticSearch using Kibana, which is available at the following URL:
 |kibana_url|
 
-To login, use the ``kibana`` user. The password is auto-generated by
+To log in, use the ``kibana`` user. The password is auto-generated by
 Kolla-Ansible and can be extracted from the encrypted passwords file
 (|kolla_passwords|):
 
@@ -24,19 +24,32 @@ Kolla-Ansible and can be extracted from the encrypted passwords file
 Access to Grafana
 =================
 
-Monasca metrics can be visualised in Grafana dashboards. Monasca Grafana can be
+Control plane metrics can be visualised in Grafana dashboards. Grafana can be
 found at the following address: |grafana_url|
 
-Grafana uses Keystone authentication. To login, use valid OpenStack user
-credentials.
+To log in, use the |grafana_username| user. The password is auto-generated by
+Kolla-Ansible and can be extracted from the encrypted passwords file
+(|kolla_passwords|):
+
+.. code-block:: console
+   :substitutions:
+
+   kayobe# ansible-vault view ${KAYOBE_CONFIG_PATH}/kolla/passwords.yml --vault-password-file |vault_password_file_path| | grep ^grafana_admin_password
+
+Access to Prometheus Alertmanager
+=================================
 
-To visualise control plane metrics, you will need one of the following roles in
-the ``monasca_control_plane`` project:
+Control plane alerts can be visualised and managed in Alertmanager, which can
+be found at the following address: |alertmanager_url|
 
-* ``admin``
-* ``monasca-user``
-* ``monasca-read-only-user``
-* ``monasca-editor``
+To log in, use the ``admin`` user. The password is auto-generated by
+Kolla-Ansible and can be extracted from the encrypted passwords file
+(|kolla_passwords|):
+
+.. code-block:: console
+   :substitutions:
+
+   kayobe# ansible-vault view ${KAYOBE_CONFIG_PATH}/kolla/passwords.yml --vault-password-file |vault_password_file_path| | grep ^prometheus_alertmanager_password
 
 Migrating virtual machines
 ==========================
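Both new passwords are recovered the same way: decrypt the passwords file and select a single top-level YAML key with an anchored grep. As a hedged sketch of what those pipelines filter, using an invented stand-in for the decrypted file (real keys and values come from Kolla-Ansible's generated ``passwords.yml``):

```shell
# Hypothetical stand-in for the output of `ansible-vault view ... passwords.yml`;
# the real file is encrypted and holds many more keys.
cat > /tmp/passwords-decrypted.yml <<'EOF'
database_password: example-db-secret
grafana_admin_password: example-grafana-secret
prometheus_alertmanager_password: example-alertmanager-secret
EOF

# The ^ anchor ensures only the intended top-level key matches,
# not a substring appearing elsewhere in the file.
grep '^grafana_admin_password' /tmp/passwords-decrypted.yml
grep '^prometheus_alertmanager_password' /tmp/passwords-decrypted.yml
```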
@@ -246,6 +259,7 @@ Monitoring
 
 * `Back up InfluxDB <https://docs.influxdata.com/influxdb/v1.8/administration/backup_and_restore/>`__
 * `Back up ElasticSearch <https://www.elastic.co/guide/en/elasticsearch/reference/current/backup-cluster-data.html>`__
+* `Back up Prometheus <https://prometheus.io/docs/prometheus/latest/querying/api/#snapshot>`__
 
 Seed
 ----
@@ -260,137 +274,21 @@ Ansible control host
 Control Plane Monitoring
 ========================
 
-Monasca has been configured to collect logs and metrics across the control
-plane. It provides a single point where control plane monitoring and telemetry
-data can be analysed and correlated.
-
-Metrics are collected per server via the `Monasca Agent
-<https://opendev.org/openstack/monasca-agent>`__. The Monasca Agent is deployed
-and configured by Kolla Ansible.
-
-Logging to Monasca is done via a `Fluentd output plugin
-<https://github.com/monasca/fluentd-monasca>`__.
-
-Configuring Monasca Alerts
---------------------------
-
-Generating Metrics from Specific Log Messages
-+++++++++++++++++++++++++++++++++++++++++++++
-
-If you wish to generate alerts for specific log messages, you must first
-generate metrics from those log messages. Metrics are generated from the
-transformed logs queue in Kafka. The Monasca log metrics service reads log
-messages from this queue, transforms them into metrics and then writes them to
-the metrics queue.
-
-The rules which govern this transformation are defined in the logstash config
-file. This file can be configured via kayobe. To do this, edit
-``etc/kayobe/kolla/config/monasca/log-metrics.conf``, for example:
-
-.. code-block:: text
-
-   # Create events from specific log signatures
-   filter {
-     if "Another thread already created a resource provider" in [log][message] {
-       mutate {
-         add_field => { "[log][dimensions][event]" => "hat" }
-       }
-     } else if "My string here" in [log][message] {
-       mutate {
-         add_field => { "[log][dimensions][event]" => "my_new_alert" }
-       }
-     }
-   }
-
-Reconfigure Monasca:
-
-.. code-block:: text
-
-   kayobe# kayobe overcloud service reconfigure --kolla-tags monasca
-
-Verify that logstash doesn't complain about your modification. On each node
-running the ``monasca-log-metrics`` service, the logs can be inspected in the
-Kolla logs directory, under the ``logstash`` folder:
-``/var/log/kolla/logstash``.
-
-Metrics will now be generated from the configured log messages. To generate
-alerts/notifications from your new metric, follow the next section.
-
-Generating Monasca Alerts from Metrics
-++++++++++++++++++++++++++++++++++++++
+The control plane has been configured to collect logs centrally using the EFK
+stack (Elasticsearch, Fluentd and Kibana).
 
-Firstly, we will configure alarms and notifications. This should be done via
-the Monasca client. More detailed documentation is available in the `Monasca
-API specification
-<https://github.com/openstack/monasca-api/blob/master/docs/monasca-api-spec.md#alarm-definitions-and-alarms>`__.
-This document provides an overview of common use-cases.
+Telemetry monitoring of the control plane is performed by Prometheus. Metrics
+are collected by Prometheus exporters, which are either running on all hosts
+(e.g. node exporter), on specific hosts (e.g. controllers for the memcached
+exporter or monitoring hosts for the OpenStack exporter). These exporters are
+scraped by the Prometheus server.
 
-To create a Slack notification, first obtain the URL for the notification hook
-from Slack, and configure the notification as follows:
+Configuring Prometheus Alerts
+-----------------------------
 
-.. code-block:: console
-
-   monasca# monasca notification-create stackhpc_slack SLACK https://hooks.slack.com/services/UUID
-
-You can view notifications at any time by invoking:
-
-.. code-block:: console
-
-   monasca# monasca notification-list
-
-To create an alarm with an associated notification:
-
-.. code-block:: console
-
-   monasca# monasca alarm-definition-create multiple_nova_compute \
-       '(count(log.event.multiple_nova_compute{}, deterministic)>0)' \
-       --description "Multiple nova compute instances detected" \
-       --severity HIGH --alarm-actions $NOTIFICATION_ID
-
-By default one alarm will be created for all hosts. This is typically useful
-when you are looking at the overall state of some hosts. For example in the
-screenshot below the ``db_mon_log_high_mem_usage`` alarm has previously
-triggered on a number of hosts, but is currently below threshold.
-
-If you wish to have an alarm created per host you can use the ``--match-by``
-option and specify the hostname dimension. For example:
-
-.. code-block:: console
-
-   monasca# monasca alarm-definition-create multiple_nova_compute \
-       '(count(log.event.multiple_nova_compute{}, deterministic)>0)' \
-       --description "Multiple nova compute instances detected" \
-       --severity HIGH --alarm-actions $NOTIFICATION_ID \
-       --match-by hostname
-
-Creating an alarm per host can be useful when alerting on one off events such
-as log messages which need to be actioned individually. Once the issue has been
-investigated and fixed, the alarm can be deleted on a per host basis.
-
-For example, in the case of monitoring for file system corruption one might
-define a metric from the system logs alerting on XFS file system corruption, or
-ECC memory errors. These metrics may only be generated once, but it is
-important that they are not ignored. Therefore, in the example below, the last
-operator is used so that the alarm is evaluated against the last metric
-associated with the log message. Since for log metrics the value of this metric
-is always greater than 0, this alarm can only be reset by deleting it (which
-can be accomplished by clicking on the dustbin icon in Monasca Grafana). By
-ensuring that the alarm has to be manually deleted and will not reset to the OK
-status, important errors can be tracked.
-
-.. code-block:: console
-
-   monasca# monasca alarm-definition-create xfs_errors \
-       '(last(log.event.xfs_errors_detected{}, deterministic)>0)' \
-       --description "XFS errors detected on host" \
-       --severity HIGH --alarm-actions $NOTIFICATION_ID \
-       --match-by hostname
-
-It is also possible to update existing alarms. For example, to update, or add
-multiple notifications to an alarm:
-
-.. code-block:: console
-
-   monasca# monasca alarm-definition-patch $ALARM_ID --alarm-actions $NOTIFICATION_ID --alarm-actions $NOTIFICATION_ID_2
+Alerts are defined in code and stored in Kayobe configuration. See ``*.rules``
+files in ``${KAYOBE_CONFIG_PATH}/kolla/config/prometheus`` as a model to add
+custom rules.
 
 Control Plane Shutdown Procedure
 ================================
@@ -683,21 +581,26 @@ perform the following cleanup procedure regularly:
 
 Elasticsearch indexes retention
 ===============================
-To enable and alter default rotation values for Elasticsearch Curator edit ``${KAYOBE_CONFIG_PATH}/kolla/globals.yml`` - This applies both to Monasca and Central Logging configurations.
+
+To enable and alter default rotation values for Elasticsearch Curator, edit
+``${KAYOBE_CONFIG_PATH}/kolla/globals.yml``:
 
 .. code-block:: console
 
    # Allow Elasticsearch Curator to apply a retention policy to logs
    enable_elasticsearch_curator: true
+
    # Duration after which index is closed
    elasticsearch_curator_soft_retention_period_days: 90
+
    # Duration after which index is deleted
    elasticsearch_curator_hard_retention_period_days: 180
 
-Reconfigure elasticsearch with new values:
+Reconfigure Elasticsearch with new values:
 
 .. code-block:: console
 
-   kayobe overcloud service reconfigure --kolla-tags elasticsearch --kolla-skip-tags common --skip-precheck
+   kayobe overcloud service reconfigure --kolla-tags elasticsearch
 
-For more information see `upstream documentation <https://docs.openstack.org/kolla-ansible/ussuri/reference/logging-and-monitoring/central-logging-guide.html#curator>`__
+For more information see the `upstream documentation
+<https://docs.openstack.org/kolla-ansible/latest/reference/logging-and-monitoring/central-logging-guide.html#curator>`__.
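The soft and hard retention periods express a two-stage policy: an index is closed once it is older than the soft period and deleted once it is older than the hard period. A minimal Python sketch of that policy, using the example values from the config above (illustrative logic only, not Curator's implementation; the exact boundary handling is an assumption):

```python
from datetime import date

# Values mirror the globals.yml example above.
SOFT_DAYS = 90   # index closed after this many days
HARD_DAYS = 180  # index deleted after this many days

def index_action(index_date: date, today: date) -> str:
    """Decide what a two-stage retention policy does with a daily index."""
    age = (today - index_date).days
    if age > HARD_DAYS:
        return "delete"
    if age > SOFT_DAYS:
        return "close"
    return "keep"

today = date(2021, 7, 1)
print(index_action(date(2021, 6, 1), today))   # 30 days old: keep
print(index_action(date(2021, 3, 1), today))   # 122 days old: close
print(index_action(date(2020, 12, 1), today))  # 212 days old: delete
```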

source/vars.rst

Lines changed: 2 additions & 0 deletions
@@ -1,3 +1,4 @@
+.. |alertmanager_url| replace:: https://openstack.acme.example:9093
 .. |base_path| replace:: ~/kayobe-env
 .. |chat_system| replace:: Slack
 .. |control_host_access| replace:: |control_host| is used as the Ansible control host. Each operator uses their own account on this host, but with a shared SSH key stored as ``~/.ssh/id_rsa``.
@@ -9,6 +10,7 @@
 .. |flavor_name| replace:: m1.tiny
 .. |floating_ip_access| replace:: from acme-seed-hypervisor and the rest of the Acme network
 .. |grafana_url| replace:: https://openstack.acme.example:3000
+.. |grafana_username| replace:: ``grafana_local_admin``
 .. |horizon_access| replace:: via the Internet.
 .. |horizon_theme_clone_url| replace:: https://github.com/acme-openstack/horizon-theme.git
 .. |horizon_theme_name| replace:: acme
