
Commit b8ec077

Add alert + kpi for datasource status conditions
Parent: 627cfa8

9 files changed, 437 additions and 4 deletions

helm/bundles/cortex-cinder/alerts/cinder.alerts.yaml

Lines changed: 24 additions & 0 deletions
@@ -19,6 +19,7 @@ groups:
         The Cortex scheduling service is down. Scheduling requests from Cinder will
         not be served. This is no immediate problem, since Cinder will continue
         placing new VMs. However, the placement will be less desirable.
+
   - alert: CortexCinderKnowledgeDown
     expr: |
       up{pod=~"cortex-cinder-knowledge-.*"} != 1 or
@@ -37,6 +38,7 @@ groups:
         The Cortex Knowledge service is down. This is no immediate problem,
         since cortex is still able to process requests,
         but the quality of the responses may be affected.
+
   - alert: CortexCinderHttpRequest400sTooHigh
     expr: rate(cortex_scheduler_api_request_duration_seconds_count{service="cortex-cinder-metrics", status=~"4.+"}[5m]) > 0.1
     for: 5m
@@ -53,6 +55,7 @@ groups:
         errors. This is expected when the scheduling request cannot be served
         by Cortex. However, it could also indicate that the request format has
         changed and Cortex is unable to parse it.
+
   - alert: CortexCinderSchedulingHttpRequest500sTooHigh
     expr: rate(cortex_scheduler_api_request_duration_seconds_count{service="cortex-cinder-metrics", status=~"5.+" }[5m]) > 0.1
     for: 5m
@@ -69,6 +72,7 @@ groups:
         This is not expected and indicates that Cortex is having some internal problem.
         Cinder will continue to place new VMs, but the placement will be less desirable.
         Thus, no immediate action is needed.
+
   - alert: CortexCinderHighMemoryUsage
     expr: process_resident_memory_bytes{service="cortex-cinder-metrics"} > 6000 * 1024 * 1024
     for: 5m
@@ -84,6 +88,7 @@ groups:
         `{{$labels.component}}` should not be using more than 6000 MiB of memory. Usually it
         should use much less, so there may be a memory leak or other changes
         that are causing the memory usage to increase significantly.
+
   - alert: CortexCinderHighCPUUsage
     expr: rate(process_cpu_seconds_total{service="cortex-cinder-metrics"}[1m]) > 0.5
     for: 5m
@@ -99,6 +104,7 @@ groups:
         `{{$labels.component}}` should not be using more than 50% of a single CPU core. Usually
         it should use much less, so there may be a CPU leak or other changes
         that are causing the CPU usage to increase significantly.
+
   - alert: CortexCinderTooManyDBConnectionAttempts
     expr: rate(cortex_db_connection_attempts_total{service="cortex-cinder-metrics"}[5m]) > 0.1
     for: 5m
@@ -113,6 +119,7 @@ groups:
       description: >
        `{{$labels.component}}` is trying to connect to the database too often. This may happen
        when the database is down or the connection parameters are misconfigured.
+
   - alert: CortexCinderSyncNotSuccessful
     expr: cortex_sync_request_processed_total{service="cortex-cinder-metrics"} - cortex_sync_request_duration_seconds_count{service="cortex-cinder-metrics"} > 0
     for: 5m
@@ -131,6 +138,7 @@ groups:
        the sync module will retry the sync operation and the currently synced
        data will be kept. However, when this problem persists for a longer
        time the service will have a less recent view of the datacenter.
+
   - alert: CortexCinderSyncObjectsDroppedToZero
     expr: cortex_sync_objects{service="cortex-cinder-metrics"} == 0
     for: 60m
@@ -149,3 +157,19 @@ groups:
        module will retry the sync operation and the currently synced data will
        be kept. However, when this problem persists for a longer time the
        service will have a less recent view of the datacenter.
+
+  - alert: CortexCinderDatasourceUnready
+    expr: cortex_datasource_state{operator="cortex-cinder",state=~"waiting|error|unknown"} != 0
+    for: 60m
+    labels:
+      context: datasources
+      dashboard: cortex/cortex
+      service: cortex
+      severity: warning
+      support_group: workload-management
+    annotations:
+      summary: "Datasource `{{$labels.datasource}}` is in `{{$labels.state}}` state"
+      description: >
+        This may indicate issues with the datasource
+        connectivity or configuration. It is recommended to investigate the
+        datasource status and logs for more details.
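
The new CortexCinderDatasourceUnready alert (and its Manila and Nova counterparts below) assumes that cortex_datasource_state is exported with operator, datasource, and state labels, and that a non-zero value marks the state a datasource is currently in. That metric shape is not part of this diff; the following Go sketch, built on prometheus/client_golang, is only a hypothetical illustration of a gauge that would satisfy the alert expression. The setState helper, the "ready" state, and the "openstack_volumes" datasource name are assumptions for the example.

// Hypothetical sketch only — not the actual cortex exporter. It shows a metric
// shape that the alert expression
//   cortex_datasource_state{operator="cortex-cinder",state=~"waiting|error|unknown"} != 0
// could fire on: one gauge series per (operator, datasource, state), where the
// currently active state is set to 1 and all other states to 0.
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var datasourceState = prometheus.NewGaugeVec(prometheus.GaugeOpts{
	Name: "cortex_datasource_state",
	Help: "Current state of a datasource (1 for the active state, 0 otherwise).",
}, []string{"operator", "datasource", "state"})

// setState marks `active` as the current state of a datasource. The state set
// ("ready", "waiting", "error", "unknown") is assumed from the regex in the
// alert expression plus a presumed healthy state.
func setState(operator, datasource, active string) {
	for _, s := range []string{"ready", "waiting", "error", "unknown"} {
		value := 0.0
		if s == active {
			value = 1.0
		}
		datasourceState.WithLabelValues(operator, datasource, s).Set(value)
	}
}

func main() {
	prometheus.MustRegister(datasourceState)
	// A datasource stuck in "waiting" would trip the alert after the 60m `for` window.
	setState("cortex-cinder", "openstack_volumes", "waiting")
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":9090", nil))
}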
Lines changed: 11 additions & 0 deletions
@@ -1 +1,12 @@
 ---
+apiVersion: cortex.cloud/v1alpha1
+kind: KPI
+metadata:
+  name: cortex-cinder-datasource-state-kpi
+spec:
+  operator: cortex-cinder
+  impl: datasource_state_kpi
+  opts:
+    datasourceOperator: cortex-cinder
+  description: |
+    This KPI tracks the state of datasource resources managed by cortex.
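
The KPI resources added in this commit (see also the Manila and Nova files below) all reference the same datasource_state_kpi implementation and differ only in the datasourceOperator option. How that implementation consumes the option is not shown here; as a rough, assumed sketch, it could decode the opts block and restrict itself to state series carrying the matching operator label. The Options struct and the selector string below mirror the YAML but are otherwise hypothetical.

// Hypothetical sketch — the real cortex KPI plugin interface is not part of
// this commit. It only illustrates how the `opts` mapping of the KPI resources
// above could be decoded and turned into a label selector.
package main

import (
	"fmt"

	"gopkg.in/yaml.v3"
)

// Options mirrors the `opts` mapping of the KPI resources added in this commit.
type Options struct {
	DatasourceOperator string `yaml:"datasourceOperator"`
}

func main() {
	raw := []byte("datasourceOperator: cortex-cinder\n")

	var opts Options
	if err := yaml.Unmarshal(raw, &opts); err != nil {
		panic(err)
	}

	// A datasource_state_kpi implementation would then only consider state
	// series belonging to its own operator, for example:
	selector := fmt.Sprintf(`cortex_datasource_state{operator=%q}`, opts.DatasourceOperator)
	fmt.Println("KPI tracks:", selector)
}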

helm/bundles/cortex-manila/alerts/manila.alerts.yaml

Lines changed: 25 additions & 1 deletion
@@ -19,6 +19,7 @@ groups:
         The Cortex scheduling service is down. Scheduling requests from Manila will
         not be served. This is no immediate problem, since Manila will continue
         placing new VMs. However, the placement will be less desirable.
+
   - alert: CortexManilaKnowledgeDown
     expr: |
       up{pod=~"cortex-manila-knowledge-.*"} != 1 or
@@ -37,6 +38,7 @@ groups:
         The Cortex Knowledge service is down. This is no immediate problem,
         since cortex is still able to process requests,
         but the quality of the responses may be affected.
+
   - alert: CortexManilaHttpRequest400sTooHigh
     expr: rate(cortex_scheduler_api_request_duration_seconds_count{service="cortex-manila-metrics", status=~"4.+"}[5m]) > 0.1
     for: 5m
@@ -53,6 +55,7 @@ groups:
         errors. This is expected when the scheduling request cannot be served
         by Cortex. However, it could also indicate that the request format has
         changed and Cortex is unable to parse it.
+
   - alert: CortexManilaSchedulingHttpRequest500sTooHigh
     expr: rate(cortex_scheduler_api_request_duration_seconds_count{service="cortex-manila-metrics", status=~"5.+" }[5m]) > 0.1
     for: 5m
@@ -69,6 +72,7 @@ groups:
         This is not expected and indicates that Cortex is having some internal problem.
         Manila will continue to place new VMs, but the placement will be less desirable.
         Thus, no immediate action is needed.
+
   - alert: CortexManilaHighMemoryUsage
     expr: process_resident_memory_bytes{service="cortex-manila-metrics"} > 6000 * 1024 * 1024
     for: 5m
@@ -84,6 +88,7 @@ groups:
         `{{$labels.component}}` should not be using more than 6000 MiB of memory. Usually it
         should use much less, so there may be a memory leak or other changes
         that are causing the memory usage to increase significantly.
+
   - alert: CortexManilaHighCPUUsage
     expr: rate(process_cpu_seconds_total{service="cortex-manila-metrics"}[1m]) > 0.5
     for: 5m
@@ -99,6 +104,7 @@ groups:
         `{{$labels.component}}` should not be using more than 50% of a single CPU core. Usually
         it should use much less, so there may be a CPU leak or other changes
         that are causing the CPU usage to increase significantly.
+
   - alert: CortexManilaTooManyDBConnectionAttempts
     expr: rate(cortex_db_connection_attempts_total{service="cortex-manila-metrics"}[5m]) > 0.1
     for: 5m
@@ -113,6 +119,7 @@ groups:
       description: >
        `{{$labels.component}}` is trying to connect to the database too often. This may happen
        when the database is down or the connection parameters are misconfigured.
+
   - alert: CortexManilaSyncNotSuccessful
     expr: cortex_sync_request_processed_total{service="cortex-manila-metrics"} - cortex_sync_request_duration_seconds_count{service="cortex-manila-metrics"} > 0
     for: 5m
@@ -131,6 +138,7 @@ groups:
        the sync module will retry the sync operation and the currently synced
        data will be kept. However, when this problem persists for a longer
        time the service will have a less recent view of the datacenter.
+
   - alert: CortexManilaSyncObjectsDroppedToZero
     expr: cortex_sync_objects{service="cortex-manila-metrics"} == 0
     for: 60m
@@ -148,4 +156,20 @@ groups:
        module is misconfigured. No immediate action is needed, since the sync
        module will retry the sync operation and the currently synced data will
        be kept. However, when this problem persists for a longer time the
-       service will have a less recent view of the datacenter.
+       service will have a less recent view of the datacenter.
+
+  - alert: CortexManilaDatasourceUnready
+    expr: cortex_datasource_state{operator="cortex-manila",state=~"waiting|error|unknown"} != 0
+    for: 60m
+    labels:
+      context: datasources
+      dashboard: cortex/cortex
+      service: cortex
+      severity: warning
+      support_group: workload-management
+    annotations:
+      summary: "Datasource `{{$labels.datasource}}` is in `{{$labels.state}}` state"
+      description: >
+        This may indicate issues with the datasource
+        connectivity or configuration. It is recommended to investigate the
+        datasource status and logs for more details.

helm/bundles/cortex-manila/templates/kpis.yaml

Lines changed: 12 additions & 0 deletions
@@ -11,3 +11,15 @@ spec:
   - name: netapp-storage-pool-cpu-usage-manila
   description: |
     This KPI tracks the CPU usage of Manila NetApp storage pools.
+---
+apiVersion: cortex.cloud/v1alpha1
+kind: KPI
+metadata:
+  name: cortex-manila-datasource-state-kpi
+spec:
+  operator: cortex-manila
+  impl: datasource_state_kpi
+  opts:
+    datasourceOperator: cortex-manila
+  description: |
+    This KPI tracks the state of datasource resources managed by cortex.

helm/bundles/cortex-nova/alerts/nova.alerts.yaml

Lines changed: 27 additions & 2 deletions
@@ -19,6 +19,7 @@ groups:
         The Cortex scheduling service is down. Scheduling requests from Nova will
         not be served. This is no immediate problem, since Nova will continue
         placing new VMs. However, the placement will be less desirable.
+
   - alert: CortexNovaKnowledgeDown
     expr: |
       up{pod=~"cortex-nova-knowledge-.*"} != 1 or
@@ -37,6 +38,7 @@ groups:
         The Cortex Knowledge service is down. This is no immediate problem,
         since cortex is still able to process requests,
         but the quality of the responses may be affected.
+
   - alert: CortexNovaDeschedulerPipelineErroring
     expr: delta(cortex_descheduler_pipeline_vm_descheduling_duration_seconds_count{component="nova-scheduling", error="true"}[2m]) > 0
     for: 5m
@@ -52,6 +54,7 @@ groups:
         The Cortex descheduler pipeline is encountering errors during its execution.
         This may indicate issues with the descheduling logic or the underlying infrastructure.
         It is recommended to investigate the descheduler logs and the state of the VMs being processed.
+
   - alert: CortexNovaHttpRequest400sTooHigh
     expr: rate(cortex_scheduler_api_request_duration_seconds_count{service="cortex-nova-metrics", status=~"4.+"}[5m]) > 0.1
     for: 5m
@@ -68,6 +71,7 @@ groups:
         errors. This is expected when the scheduling request cannot be served
         by Cortex. However, it could also indicate that the request format has
         changed and Cortex is unable to parse it.
+
   - alert: CortexNovaSchedulingHttpRequest500sTooHigh
     expr: rate(cortex_scheduler_api_request_duration_seconds_count{service="cortex-nova-metrics", status=~"5.+" }[5m]) > 0.1
     for: 5m
@@ -84,6 +88,7 @@ groups:
         This is not expected and indicates that Cortex is having some internal problem.
         Nova will continue to place new VMs, but the placement will be less desirable.
         Thus, no immediate action is needed.
+
   - alert: CortexNovaHighMemoryUsage
     expr: process_resident_memory_bytes{service="cortex-nova-metrics"} > 6000 * 1024 * 1024
     for: 5m
@@ -99,6 +104,7 @@ groups:
         `{{$labels.component}}` should not be using more than 6000 MiB of memory. Usually it
         should use much less, so there may be a memory leak or other changes
         that are causing the memory usage to increase significantly.
+
   - alert: CortexNovaHighCPUUsage
     expr: rate(process_cpu_seconds_total{service="cortex-nova-metrics"}[1m]) > 0.5
     for: 5m
@@ -114,6 +120,7 @@ groups:
         `{{$labels.component}}` should not be using more than 50% of a single CPU core. Usually
         it should use much less, so there may be a CPU leak or other changes
         that are causing the CPU usage to increase significantly.
+
   - alert: CortexNovaTooManyDBConnectionAttempts
     expr: rate(cortex_db_connection_attempts_total{service="cortex-nova-metrics"}[5m]) > 0.1
     for: 5m
@@ -128,6 +135,7 @@ groups:
       description: >
        `{{$labels.component}}` is trying to connect to the database too often. This may happen
        when the database is down or the connection parameters are misconfigured.
+
   - alert: CortexNovaSyncNotSuccessful
     expr: cortex_sync_request_processed_total{service="cortex-nova-metrics"} - cortex_sync_request_duration_seconds_count{service="cortex-nova-metrics"} > 0
     for: 5m
@@ -146,8 +154,9 @@ groups:
        the sync module will retry the sync operation and the currently synced
        data will be kept. However, when this problem persists for a longer
        time the service will have a less recent view of the datacenter.
+
   - alert: CortexNovaSyncObjectsDroppedToZero
-    expr: cortex_sync_objects{service="cortex-nova-metrics", datasource!="openstack_migrations" } == 0
+    expr: cortex_sync_objects{service="cortex-nova-metrics", datasource!="openstack_migrations"} == 0
     for: 60m
     labels:
       context: syncobjects
@@ -163,4 +172,20 @@ groups:
        module is misconfigured. No immediate action is needed, since the sync
        module will retry the sync operation and the currently synced data will
        be kept. However, when this problem persists for a longer time the
-       service will have a less recent view of the datacenter.
+       service will have a less recent view of the datacenter.
+
+  - alert: CortexNovaDatasourceUnready
+    expr: cortex_datasource_state{operator="cortex-nova",state=~"waiting|error|unknown"} != 0
+    for: 60m
+    labels:
+      context: datasources
+      dashboard: cortex/cortex
+      service: cortex
+      severity: warning
+      support_group: workload-management
+    annotations:
+      summary: "Datasource `{{$labels.datasource}}` is in `{{$labels.state}}` state"
+      description: >
+        This may indicate issues with the datasource
+        connectivity or configuration. It is recommended to investigate the
+        datasource status and logs for more details.

helm/bundles/cortex-nova/templates/kpis.yaml

Lines changed: 13 additions & 1 deletion
@@ -134,4 +134,16 @@ spec:
   - name: host-details
   - name: host-utilization
   description: |
-    This KPI tracks the total, utilized, reserved and failover capacity of KVM hosts.
+    This KPI tracks the total, utilized, reserved and failover capacity of KVM hosts.
+---
+apiVersion: cortex.cloud/v1alpha1
+kind: KPI
+metadata:
+  name: cortex-nova-datasource-state-kpi
+spec:
+  operator: cortex-nova
+  impl: datasource_state_kpi
+  opts:
+    datasourceOperator: cortex-nova
+  description: |
+    This KPI tracks the state of datasource resources managed by cortex.
