Merge pull request #56782 from skrthomas/OSDOCS-5285

skrthomas · web-flow · commit 332c379888fd · 2023-04-18T09:55:39.000-04:00
OSDOCS-5285: Infra Health alert for NetObserv Operator
diff --git a/_topic_maps/_topic_map.yml b/_topic_maps/_topic_map.yml
@@ -1442,6 +1442,8 @@ Topics:
     File: configuring-operator
   - Name: Observing the network traffic
     File: observing-network-traffic
+  - Name: Monitoring the Network Observability Operator
+    File: network-observability-operator-monitoring
   - Name: API reference
     File: flowcollector-api
   - Name: JSON flows format reference
diff --git a/modules/network-observability-disabling-health-alerts.adoc b/modules/network-observability-disabling-health-alerts.adoc
@@ -0,0 +1,25 @@
+// Module included in the following assemblies:
+//
+// * network_observability/network-observability-operator-monitoring.adoc
+
+:_content-type: PROCEDURE
+[id="network-observability-disable-alerts_{context}"]
+= Disabling health alerts
+You can opt out of health alerting by editing the `FlowCollector` resource:
+
+. In the web console, navigate to *Operators* -> *Installed Operators*.
+. Under the *Provided APIs* heading for the *NetObserv Operator*, select *Flow Collector*. 
+. Select *cluster* then select the *YAML* tab. 
+. Add `spec.processor.metrics.disableAlerts` to disable health alerts, as in the following YAML sample:
+[source,yaml]
+----
+apiVersion: flows.netobserv.io/v1alpha1
+kind: FlowCollector
+metadata:
+  name: cluster
+spec:
+  processor:
+    metrics:
+      disableAlerts: [NetObservLokiError, NetObservNoFlows] <1>
+----
+<1> You can specify a single alert or a list that contains both alert types to disable.
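+
+As a hedged illustration of the callout above, the following sketch disables only one of the two alerts and leaves the other active. It assumes the same `FlowCollector` resource named `cluster` and the same `v1alpha1` API version shown in the previous sample:
+
+[source,yaml]
+----
+apiVersion: flows.netobserv.io/v1alpha1
+kind: FlowCollector
+metadata:
+  name: cluster
+spec:
+  processor:
+    metrics:
+      # Only the "no flows" alert is disabled; NetObservLokiError remains active.
+      disableAlerts: [NetObservNoFlows]
+----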
diff --git a/modules/network-observability-lokistack-ingestion-query.adoc b/modules/network-observability-lokistack-ingestion-query.adoc
@@ -1,11 +1,11 @@
 // Module included in the following assemblies:
 
 // * networking/network_observability/installing-operators.adoc
-:_content-type: PROCEDURE
+:_content-type: CONCEPT
 [id="network-observability-lokistack-configuring-ingestion{context}"]
 
-= Configuring LokiStack ingestion
-The LokiStack instance comes with default settings according to the configured size. It is possible to override some of these settings, such as the ingestion and query limits. You might want to update them if you get Loki errors showing up in the Console plugin, or in `flowlogs-pipeline` logs.
+= LokiStack ingestion limits and health alerts
+The LokiStack instance comes with default settings according to the configured size. You can override some of these settings, such as the ingestion and query limits. You might want to update them if Loki errors appear in the Console plugin or in the `flowlogs-pipeline` logs. An automatic alert in the web console notifies you when these limits are reached.
 
 Here is an example of configured limits:
 
@@ -23,29 +23,4 @@ spec:
         maxEntriesLimitPerQuery: 10000
         maxQuerySeries: 3000
 ----
-Refer to the LokiStack API reference for more information on these settings.
-
-A good practice is to define an alert, to get notified when these limits are reached. In the example below, the alert uses a metric provided by the Loki operator, `loki_request_duration_seconds_count`:
-
-[source,yaml]
-----
-
-apiVersion: monitoring.coreos.com/v1
-kind: PrometheusRule
-metadata:
-  name: loki-alerts
-  namespace: openshift-operators-redhat
-spec:
-  groups:
-  - name: LokiRateLimitAlerts
-    rules:
-    - alert: LokiTenantRateLimit
-      annotations:
-        message: |-
-          {{ $labels.job }} {{ $labels.route }} is experiencing 429 errors.
-        summary: "At any number of requests are responded with the rate limit error code."
-      expr: sum(irate(loki_request_duration_seconds_count{status_code="429"}[1m])) by (job, namespace, route) / sum(irate(loki_request_duration_seconds_count[1m])) by (job, namespace, route) * 100 > 0
-      for: 10s
-      labels:
-        severity: warning
-----
+For more information about these settings, see the link:https://loki-operator.dev/docs/api.md/#loki-grafana-com-v1-IngestionLimitSpec[LokiStack API reference]. 
diff --git a/modules/network-observability-viewing-alerts.adoc b/modules/network-observability-viewing-alerts.adoc
@@ -0,0 +1,21 @@
+// Module included in the following assemblies:
+//
+// * network_observability/network-observability-operator-monitoring.adoc
+
+:_content-type: PROCEDURE
+[id="network-observability-alert-dashboard_{context}"]
+= Viewing health information
+
+You can access metrics about the health and resource usage of the Network Observability Operator from the *Dashboards* page in the web console. If an alert is triggered, a health alert banner that directs you to the dashboard can appear on the *Network Traffic* and *Home* pages. Alerts are generated in the following cases:
+
+* The *NetObservLokiError* alert occurs if the `flowlogs-pipeline` workload is dropping flows because of Loki errors, for example when the Loki ingestion rate limit has been reached.
+* The *NetObservNoFlows* alert occurs if no flows are ingested for a certain amount of time.
+
+.Prerequisites
+
+* You have the Network Observability Operator installed.
+* You have access to the cluster as a user with the `cluster-admin` role or with view permissions for all projects.
+
+.Procedure
+
+. From the *Administrator* perspective in the web console, navigate to *Observe* -> *Dashboards*.
+. From the *Dashboards* dropdown, select *Netobserv/Health*. 
+Metrics about the health of the Operator are displayed on the page. 
diff --git a/networking/network_observability/network-observability-operator-monitoring.adoc b/networking/network_observability/network-observability-operator-monitoring.adoc
@@ -0,0 +1,13 @@
+:_content-type: ASSEMBLY
+[id="network-observability-operator-monitoring"]
+= Monitoring the Network Observability Operator
+include::_attributes/common-attributes.adoc[]
+:context: network_observability
+
+toc::[]
+
+You can use the web console to monitor alerts related to the health of the Network Observability Operator. 
+
+
+include::modules/network-observability-viewing-alerts.adoc[leveloffset=+1]
+include::modules/network-observability-disabling-health-alerts.adoc[leveloffset=+2]