Commit 332c379

Merge pull request #56782 from skrthomas/OSDOCS-5285
OSDOCS-5285: Infra Health alert for NetObserv Operator
2 parents 82e9789 + 249593e commit 332c379

5 files changed

+65
-29
lines changed

_topic_maps/_topic_map.yml

Lines changed: 2 additions & 0 deletions
@@ -1442,6 +1442,8 @@ Topics:
14421442
File: configuring-operator
14431443
- Name: Observing the network traffic
14441444
File: observing-network-traffic
1445+
- Name: Monitoring the Network Observability Operator
1446+
File: network-observability-operator-monitoring
14451447
- Name: API reference
14461448
File: flowcollector-api
14471449
- Name: JSON flows format reference
Lines changed: 25 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,25 @@
1+
// Module included in the following assemblies:
2+
//
3+
// * network_observability/network-observability-operator-monitoring.adoc
4+
5+
:_content-type: PROCEDURE
6+
[id="network-observability-disable-alerts_{context}"]
7+
= Disabling health alerts
8+
You can opt out of health alerting by editing the `FlowCollector` resource:
9+
10+
. In the web console, navigate to *Operators* -> *Installed Operators*.
11+
. Under the *Provided APIs* heading for the *NetObserv Operator*, select *Flow Collector*.
12+
. Select *cluster* then select the *YAML* tab.
13+
. Add `spec.processor.metrics.disableAlerts` to disable health alerts, as in the following YAML sample:
14+
[source,yaml]
15+
----
16+
apiVersion: flows.netobserv.io/v1alpha1
17+
kind: FlowCollector
18+
metadata:
19+
  name: cluster
20+
spec:
21+
  processor:
22+
    metrics:
23+
      disableAlerts: [NetObservLokiError, NetObservNoFlows] <1>
24+
----
25+
<1> You can specify a single alert type or a list containing both alert types to disable.
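
As a minimal sketch of the single-alert case mentioned in the callout, and assuming the same `flows.netobserv.io/v1alpha1` schema as the sample above, the list might contain only one alert name. The choice of `NetObservNoFlows` here is purely illustrative:

[source,yaml]
----
apiVersion: flows.netobserv.io/v1alpha1
kind: FlowCollector
metadata:
  name: cluster
spec:
  processor:
    metrics:
      # Illustrative: only NetObservNoFlows is listed, so NetObservLokiError is not disabled.
      disableAlerts: [NetObservNoFlows]
----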
Lines changed: 4 additions & 29 deletions
@@ -1,11 +1,11 @@
11
// Module included in the following assemblies:
22

33
// * networking/network_observability/installing-operators.adoc
4-
:_content-type: PROCEDURE
4+
:_content-type: CONCEPT
55
[id="network-observability-lokistack-configuring-ingestion{context}"]
66

7-
= Configuring LokiStack ingestion
8-
The LokiStack instance comes with default settings according to the configured size. It is possible to override some of these settings, such as the ingestion and query limits. You might want to update them if you get Loki errors showing up in the Console plugin, or in `flowlogs-pipeline` logs.
7+
= LokiStack ingestion limits and health alerts
8+
The LokiStack instance comes with default settings according to the configured size. It is possible to override some of these settings, such as the ingestion and query limits. You might want to update them if you get Loki errors showing up in the Console plugin, or in `flowlogs-pipeline` logs. An automatic alert in the web console notifies you when these limits are reached.
99

1010
Here is an example of configured limits:
1111

@@ -23,29 +23,4 @@ spec:
2323
maxEntriesLimitPerQuery: 10000
2424
maxQuerySeries: 3000
2525
----
26-
Refer to the LokiStack API reference for more information on these settings.
27-
28-
A good practice is to define an alert, to get notified when these limits are reached. In the example below, the alert uses a metric provided by the Loki operator, `loki_request_duration_seconds_count`:
29-
30-
[source,yaml]
31-
----
32-
33-
apiVersion: monitoring.coreos.com/v1
34-
kind: PrometheusRule
35-
metadata:
36-
name: loki-alerts
37-
namespace: openshift-operators-redhat
38-
spec:
39-
groups:
40-
- name: LokiRateLimitAlerts
41-
rules:
42-
- alert: LokiTenantRateLimit
43-
annotations:
44-
message: |-
45-
{{ $labels.job }} {{ $labels.route }} is experiencing 429 errors.
46-
summary: "At any number of requests are responded with the rate limit error code."
47-
expr: sum(irate(loki_request_duration_seconds_count{status_code="429"}[1m])) by (job, namespace, route) / sum(irate(loki_request_duration_seconds_count[1m])) by (job, namespace, route) * 100 > 0
48-
for: 10s
49-
labels:
50-
severity: warning
51-
----
26+
For more information about these settings, see the link:https://loki-operator.dev/docs/api.md/#loki-grafana-com-v1-IngestionLimitSpec[LokiStack API reference].
Lines changed: 21 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,21 @@
1+
// Module included in the following assemblies:
2+
//
3+
// * network_observability/network-observability-operator-monitoring.adoc
4+
5+
:_content-type: PROCEDURE
6+
[id="network-observability-alert-dashboard_{context}"]
7+
= Viewing health information
8+
9+
You can access metrics about the health and resource usage of the Network Observability Operator from the *Dashboards* page in the web console. If an alert is triggered, a health alert banner that directs you to the dashboard can appear on the *Network Traffic* and *Home* pages. Alerts are generated in the following cases:
10+
11+
* The *NetObservLokiError* alert occurs if the `flowlogs-pipeline` workload is dropping flows because of Loki errors, for example when the Loki ingestion rate limit is reached.
12+
* The *NetObservNoFlows* alert occurs if no flows are ingested for a certain amount of time.

.Prerequisites
13+
14+
* You have the Network Observability Operator installed.
15+
* You have access to the cluster as a user with the `cluster-admin` role or with view permissions for all projects.
16+
17+
.Procedure
18+
19+
. From the *Administrator* perspective in the web console, navigate to *Observe* -> *Dashboards*.
20+
. From the *Dashboards* dropdown, select *Netobserv/Health*.
21+
Metrics about the health of the Operator are displayed on the page.
Lines changed: 13 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,13 @@
1+
:_content-type: ASSEMBLY
2+
[id="network-observability-operator-monitoring"]
3+
= Monitoring the Network Observability Operator
4+
include::_attributes/common-attributes.adoc[]
5+
:context: network_observability
6+
7+
toc::[]
8+
9+
You can use the web console to monitor alerts related to the health of the Network Observability Operator.
10+
11+
12+
include::modules/network-observability-viewing-alerts.adoc[leveloffset=+1]
13+
include::modules/network-observability-disabling-health-alerts.adoc[leveloffset=+2]
