
Commit b8ec077

Add alert + kpi for datasource status conditions
Parent: 627cfa8

9 files changed, 437 additions and 4 deletions

helm/bundles/cortex-cinder/alerts/cinder.alerts.yaml

Lines changed: 24 additions & 0 deletions
@@ -19,6 +19,7 @@ groups:
         The Cortex scheduling service is down. Scheduling requests from Cinder will
         not be served. This is no immediate problem, since Cinder will continue
         placing new VMs. However, the placement will be less desirable.
+
   - alert: CortexCinderKnowledgeDown
     expr: |
       up{pod=~"cortex-cinder-knowledge-.*"} != 1 or
@@ -37,6 +38,7 @@ groups:
         The Cortex Knowledge service is down. This is no immediate problem,
         since cortex is still able to process requests,
         but the quality of the responses may be affected.
+
   - alert: CortexCinderHttpRequest400sTooHigh
     expr: rate(cortex_scheduler_api_request_duration_seconds_count{service="cortex-cinder-metrics", status=~"4.+"}[5m]) > 0.1
     for: 5m
@@ -53,6 +55,7 @@ groups:
         errors. This is expected when the scheduling request cannot be served
         by Cortex. However, it could also indicate that the request format has
         changed and Cortex is unable to parse it.
+
   - alert: CortexCinderSchedulingHttpRequest500sTooHigh
     expr: rate(cortex_scheduler_api_request_duration_seconds_count{service="cortex-cinder-metrics", status=~"5.+" }[5m]) > 0.1
     for: 5m
@@ -69,6 +72,7 @@ groups:
         This is not expected and indicates that Cortex is having some internal problem.
         Cinder will continue to place new VMs, but the placement will be less desirable.
         Thus, no immediate action is needed.
+
   - alert: CortexCinderHighMemoryUsage
     expr: process_resident_memory_bytes{service="cortex-cinder-metrics"} > 6000 * 1024 * 1024
     for: 5m
@@ -84,6 +88,7 @@ groups:
         `{{$labels.component}}` should not be using more than 6000 MiB of memory. Usually it
         should use much less, so there may be a memory leak or other changes
         that are causing the memory usage to increase significantly.
+
   - alert: CortexCinderHighCPUUsage
     expr: rate(process_cpu_seconds_total{service="cortex-cinder-metrics"}[1m]) > 0.5
     for: 5m
@@ -99,6 +104,7 @@ groups:
         `{{$labels.component}}` should not be using more than 50% of a single CPU core. Usually
         it should use much less, so there may be a CPU leak or other changes
         that are causing the CPU usage to increase significantly.
+
   - alert: CortexCinderTooManyDBConnectionAttempts
     expr: rate(cortex_db_connection_attempts_total{service="cortex-cinder-metrics"}[5m]) > 0.1
     for: 5m
@@ -113,6 +119,7 @@ groups:
       description: >
        `{{$labels.component}}` is trying to connect to the database too often. This may happen
        when the database is down or the connection parameters are misconfigured.
+
   - alert: CortexCinderSyncNotSuccessful
     expr: cortex_sync_request_processed_total{service="cortex-cinder-metrics"} - cortex_sync_request_duration_seconds_count{service="cortex-cinder-metrics"} > 0
     for: 5m
@@ -131,6 +138,7 @@ groups:
        the sync module will retry the sync operation and the currently synced
        data will be kept. However, when this problem persists for a longer
        time the service will have a less recent view of the datacenter.
+
   - alert: CortexCinderSyncObjectsDroppedToZero
     expr: cortex_sync_objects{service="cortex-cinder-metrics"} == 0
     for: 60m
@@ -149,3 +157,19 @@ groups:
        module will retry the sync operation and the currently synced data will
        be kept. However, when this problem persists for a longer time the
        service will have a less recent view of the datacenter.
+
+  - alert: CortexCinderDatasourceUnready
+    expr: cortex_datasource_state{operator="cortex-cinder",state=~"waiting|error|unknown"} != 0
+    for: 60m
+    labels:
+      context: datasources
+      dashboard: cortex/cortex
+      service: cortex
+      severity: warning
+      support_group: workload-management
+    annotations:
+      summary: "Datasource `{{$labels.datasource}}` is in `{{$labels.state}}` state"
+      description: >
+        This may indicate issues with the datasource
+        connectivity or configuration. It is recommended to investigate the
+        datasource status and logs for more details.
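
The new CortexCinderDatasourceUnready alert (and its Manila and Nova counterparts below) assumes that cortex_datasource_state is exported with operator, datasource, and state labels, and that a non-zero value marks the state a datasource is currently in. That metric shape is not part of this diff; the following Go sketch, built on prometheus/client_golang, is only a hypothetical illustration of a gauge that would satisfy the alert expression. The setState helper, the "ready" state, and the "openstack_volumes" datasource name are assumptions for the example.

// Hypothetical sketch only — not the actual cortex exporter. It shows a metric
// shape that the alert expression
//   cortex_datasource_state{operator="cortex-cinder",state=~"waiting|error|unknown"} != 0
// could fire on: one gauge series per (operator, datasource, state), where the
// currently active state is set to 1 and all other states to 0.
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var datasourceState = prometheus.NewGaugeVec(prometheus.GaugeOpts{
	Name: "cortex_datasource_state",
	Help: "Current state of a datasource (1 for the active state, 0 otherwise).",
}, []string{"operator", "datasource", "state"})

// setState marks `active` as the current state of a datasource. The state set
// ("ready", "waiting", "error", "unknown") is assumed from the regex in the
// alert expression plus a presumed healthy state.
func setState(operator, datasource, active string) {
	for _, s := range []string{"ready", "waiting", "error", "unknown"} {
		value := 0.0
		if s == active {
			value = 1.0
		}
		datasourceState.WithLabelValues(operator, datasource, s).Set(value)
	}
}

func main() {
	prometheus.MustRegister(datasourceState)
	// A datasource stuck in "waiting" would trip the alert after the 60m `for` window.
	setState("cortex-cinder", "openstack_volumes", "waiting")
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":9090", nil))
}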
Lines changed: 11 additions & 0 deletions
@@ -1 +1,12 @@
 ---
+apiVersion: cortex.cloud/v1alpha1
+kind: KPI
+metadata:
+  name: cortex-cinder-datasource-state-kpi
+spec:
+  operator: cortex-cinder
+  impl: datasource_state_kpi
+  opts:
+    datasourceOperator: cortex-cinder
+  description: |
+    This KPI tracks the state of datasource resources managed by cortex.
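
The KPI resources added in this commit (see also the Manila and Nova files below) all reference the same datasource_state_kpi implementation and differ only in the datasourceOperator option. How that implementation consumes the option is not shown here; as a rough, assumed sketch, it could decode the opts block and restrict itself to state series carrying the matching operator label. The Options struct and the selector string below mirror the YAML but are otherwise hypothetical.

// Hypothetical sketch — the real cortex KPI plugin interface is not part of
// this commit. It only illustrates how the `opts` mapping of the KPI resources
// above could be decoded and turned into a label selector.
package main

import (
	"fmt"

	"gopkg.in/yaml.v3"
)

// Options mirrors the `opts` mapping of the KPI resources added in this commit.
type Options struct {
	DatasourceOperator string `yaml:"datasourceOperator"`
}

func main() {
	raw := []byte("datasourceOperator: cortex-cinder\n")

	var opts Options
	if err := yaml.Unmarshal(raw, &opts); err != nil {
		panic(err)
	}

	// A datasource_state_kpi implementation would then only consider state
	// series belonging to its own operator, for example:
	selector := fmt.Sprintf(`cortex_datasource_state{operator=%q}`, opts.DatasourceOperator)
	fmt.Println("KPI tracks:", selector)
}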

helm/bundles/cortex-manila/alerts/manila.alerts.yaml

Lines changed: 25 additions & 1 deletion
@@ -19,6 +19,7 @@ groups:
         The Cortex scheduling service is down. Scheduling requests from Manila will
         not be served. This is no immediate problem, since Manila will continue
         placing new VMs. However, the placement will be less desirable.
+
   - alert: CortexManilaKnowledgeDown
     expr: |
       up{pod=~"cortex-manila-knowledge-.*"} != 1 or
@@ -37,6 +38,7 @@ groups:
         The Cortex Knowledge service is down. This is no immediate problem,
         since cortex is still able to process requests,
         but the quality of the responses may be affected.
+
   - alert: CortexManilaHttpRequest400sTooHigh
     expr: rate(cortex_scheduler_api_request_duration_seconds_count{service="cortex-manila-metrics", status=~"4.+"}[5m]) > 0.1
     for: 5m
@@ -53,6 +55,7 @@ groups:
         errors. This is expected when the scheduling request cannot be served
         by Cortex. However, it could also indicate that the request format has
         changed and Cortex is unable to parse it.
+
   - alert: CortexManilaSchedulingHttpRequest500sTooHigh
     expr: rate(cortex_scheduler_api_request_duration_seconds_count{service="cortex-manila-metrics", status=~"5.+" }[5m]) > 0.1
     for: 5m
@@ -69,6 +72,7 @@ groups:
         This is not expected and indicates that Cortex is having some internal problem.
         Manila will continue to place new VMs, but the placement will be less desirable.
         Thus, no immediate action is needed.
+
   - alert: CortexManilaHighMemoryUsage
     expr: process_resident_memory_bytes{service="cortex-manila-metrics"} > 6000 * 1024 * 1024
     for: 5m
@@ -84,6 +88,7 @@ groups:
         `{{$labels.component}}` should not be using more than 6000 MiB of memory. Usually it
         should use much less, so there may be a memory leak or other changes
         that are causing the memory usage to increase significantly.
+
   - alert: CortexManilaHighCPUUsage
     expr: rate(process_cpu_seconds_total{service="cortex-manila-metrics"}[1m]) > 0.5
     for: 5m
@@ -99,6 +104,7 @@ groups:
         `{{$labels.component}}` should not be using more than 50% of a single CPU core. Usually
         it should use much less, so there may be a CPU leak or other changes
         that are causing the CPU usage to increase significantly.
+
   - alert: CortexManilaTooManyDBConnectionAttempts
     expr: rate(cortex_db_connection_attempts_total{service="cortex-manila-metrics"}[5m]) > 0.1
     for: 5m
@@ -113,6 +119,7 @@ groups:
       description: >
        `{{$labels.component}}` is trying to connect to the database too often. This may happen
        when the database is down or the connection parameters are misconfigured.
+
   - alert: CortexManilaSyncNotSuccessful
     expr: cortex_sync_request_processed_total{service="cortex-manila-metrics"} - cortex_sync_request_duration_seconds_count{service="cortex-manila-metrics"} > 0
     for: 5m
@@ -131,6 +138,7 @@ groups:
        the sync module will retry the sync operation and the currently synced
        data will be kept. However, when this problem persists for a longer
        time the service will have a less recent view of the datacenter.
+
   - alert: CortexManilaSyncObjectsDroppedToZero
     expr: cortex_sync_objects{service="cortex-manila-metrics"} == 0
     for: 60m
@@ -148,4 +156,20 @@ groups:
        module is misconfigured. No immediate action is needed, since the sync
        module will retry the sync operation and the currently synced data will
        be kept. However, when this problem persists for a longer time the
-       service will have a less recent view of the datacenter.
+       service will have a less recent view of the datacenter.
+
+  - alert: CortexManilaDatasourceUnready
+    expr: cortex_datasource_state{operator="cortex-manila",state=~"waiting|error|unknown"} != 0
+    for: 60m
+    labels:
+      context: datasources
+      dashboard: cortex/cortex
+      service: cortex
+      severity: warning
+      support_group: workload-management
+    annotations:
+      summary: "Datasource `{{$labels.datasource}}` is in `{{$labels.state}}` state"
+      description: >
+        This may indicate issues with the datasource
+        connectivity or configuration. It is recommended to investigate the
+        datasource status and logs for more details.

helm/bundles/cortex-manila/templates/kpis.yaml

Lines changed: 12 additions & 0 deletions
@@ -11,3 +11,15 @@ spec:
   - name: netapp-storage-pool-cpu-usage-manila
   description: |
     This KPI tracks the CPU usage of Manila NetApp storage pools.
+---
+apiVersion: cortex.cloud/v1alpha1
+kind: KPI
+metadata:
+  name: cortex-manila-datasource-state-kpi
+spec:
+  operator: cortex-manila
+  impl: datasource_state_kpi
+  opts:
+    datasourceOperator: cortex-manila
+  description: |
+    This KPI tracks the state of datasource resources managed by cortex.

helm/bundles/cortex-nova/alerts/nova.alerts.yaml

Lines changed: 27 additions & 2 deletions
@@ -19,6 +19,7 @@ groups:
         The Cortex scheduling service is down. Scheduling requests from Nova will
         not be served. This is no immediate problem, since Nova will continue
         placing new VMs. However, the placement will be less desirable.
+
   - alert: CortexNovaKnowledgeDown
     expr: |
       up{pod=~"cortex-nova-knowledge-.*"} != 1 or
@@ -37,6 +38,7 @@ groups:
         The Cortex Knowledge service is down. This is no immediate problem,
         since cortex is still able to process requests,
         but the quality of the responses may be affected.
+
   - alert: CortexNovaDeschedulerPipelineErroring
     expr: delta(cortex_descheduler_pipeline_vm_descheduling_duration_seconds_count{component="nova-scheduling", error="true"}[2m]) > 0
     for: 5m
@@ -52,6 +54,7 @@ groups:
         The Cortex descheduler pipeline is encountering errors during its execution.
         This may indicate issues with the descheduling logic or the underlying infrastructure.
         It is recommended to investigate the descheduler logs and the state of the VMs being processed.
+
   - alert: CortexNovaHttpRequest400sTooHigh
     expr: rate(cortex_scheduler_api_request_duration_seconds_count{service="cortex-nova-metrics", status=~"4.+"}[5m]) > 0.1
     for: 5m
@@ -68,6 +71,7 @@ groups:
         errors. This is expected when the scheduling request cannot be served
         by Cortex. However, it could also indicate that the request format has
         changed and Cortex is unable to parse it.
+
   - alert: CortexNovaSchedulingHttpRequest500sTooHigh
     expr: rate(cortex_scheduler_api_request_duration_seconds_count{service="cortex-nova-metrics", status=~"5.+" }[5m]) > 0.1
     for: 5m
@@ -84,6 +88,7 @@ groups:
         This is not expected and indicates that Cortex is having some internal problem.
         Nova will continue to place new VMs, but the placement will be less desirable.
         Thus, no immediate action is needed.
+
   - alert: CortexNovaHighMemoryUsage
     expr: process_resident_memory_bytes{service="cortex-nova-metrics"} > 6000 * 1024 * 1024
     for: 5m
@@ -99,6 +104,7 @@ groups:
         `{{$labels.component}}` should not be using more than 6000 MiB of memory. Usually it
         should use much less, so there may be a memory leak or other changes
         that are causing the memory usage to increase significantly.
+
   - alert: CortexNovaHighCPUUsage
     expr: rate(process_cpu_seconds_total{service="cortex-nova-metrics"}[1m]) > 0.5
     for: 5m
@@ -114,6 +120,7 @@ groups:
         `{{$labels.component}}` should not be using more than 50% of a single CPU core. Usually
         it should use much less, so there may be a CPU leak or other changes
         that are causing the CPU usage to increase significantly.
+
   - alert: CortexNovaTooManyDBConnectionAttempts
     expr: rate(cortex_db_connection_attempts_total{service="cortex-nova-metrics"}[5m]) > 0.1
     for: 5m
@@ -128,6 +135,7 @@ groups:
       description: >
        `{{$labels.component}}` is trying to connect to the database too often. This may happen
        when the database is down or the connection parameters are misconfigured.
+
   - alert: CortexNovaSyncNotSuccessful
     expr: cortex_sync_request_processed_total{service="cortex-nova-metrics"} - cortex_sync_request_duration_seconds_count{service="cortex-nova-metrics"} > 0
     for: 5m
@@ -146,8 +154,9 @@ groups:
        the sync module will retry the sync operation and the currently synced
        data will be kept. However, when this problem persists for a longer
        time the service will have a less recent view of the datacenter.
+
   - alert: CortexNovaSyncObjectsDroppedToZero
-    expr: cortex_sync_objects{service="cortex-nova-metrics", datasource!="openstack_migrations" } == 0
+    expr: cortex_sync_objects{service="cortex-nova-metrics", datasource!="openstack_migrations"} == 0
     for: 60m
     labels:
       context: syncobjects
@@ -163,4 +172,20 @@ groups:
        module is misconfigured. No immediate action is needed, since the sync
        module will retry the sync operation and the currently synced data will
        be kept. However, when this problem persists for a longer time the
-       service will have a less recent view of the datacenter.
+       service will have a less recent view of the datacenter.
+
+  - alert: CortexNovaDatasourceUnready
+    expr: cortex_datasource_state{operator="cortex-nova",state=~"waiting|error|unknown"} != 0
+    for: 60m
+    labels:
+      context: datasources
+      dashboard: cortex/cortex
+      service: cortex
+      severity: warning
+      support_group: workload-management
+    annotations:
+      summary: "Datasource `{{$labels.datasource}}` is in `{{$labels.state}}` state"
+      description: >
+        This may indicate issues with the datasource
+        connectivity or configuration. It is recommended to investigate the
+        datasource status and logs for more details.

helm/bundles/cortex-nova/templates/kpis.yaml

Lines changed: 13 additions & 1 deletion
@@ -134,4 +134,16 @@ spec:
   - name: host-details
   - name: host-utilization
   description: |
-    This KPI tracks the total, utilized, reserved and failover capacity of KVM hosts.
+    This KPI tracks the total, utilized, reserved and failover capacity of KVM hosts.
+---
+apiVersion: cortex.cloud/v1alpha1
+kind: KPI
+metadata:
+  name: cortex-nova-datasource-state-kpi
+spec:
+  operator: cortex-nova
+  impl: datasource_state_kpi
+  opts:
+    datasourceOperator: cortex-nova
+  description: |
+    This KPI tracks the state of datasource resources managed by cortex.
