9 changes: 4 additions & 5 deletions rules/cre-2024-0007/rabbitmq-mnesia-overloaded.yaml
@@ -3,7 +3,8 @@ rules:
id: CRE-2024-0007
severity: 0
title: RabbitMQ Mnesia overloaded recovering persistent queues
- category: message-queue-problem
+ pillar: Infrastructure
+ category: resource-exhaustion-cpu
author: Prequel
description: |
The RabbitMQ cluster is processing a large number of persistent mirrored queues at boot. The underlying Erlang process, Mnesia, is overloaded (`** WARNING ** Mnesia is overloaded`).
@@ -13,9 +14,7 @@ rules:
RabbitMQ is unable to process any new messages, which can lead to outages in consumers and producers.
impactScore: 9
tags:
- - known-problem
- - rabbitmq
- - public
+ - high-cpu-usage
mitigation: |
- Increase the size of the cluster
- Increase the Kubernetes CPU limits for the RabbitMQ brokers
@@ -46,4 +45,4 @@ rules:
slide: 30s
- value: "SIGTERM received - shutting down"
anchor: 1
- window: 10s
+ window: 10s
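As an illustration of the CPU-limit mitigation in CRE-2024-0007 above, the sketch below shows one way to raise CPU for RabbitMQ broker pods on Kubernetes. The StatefulSet layout, names, and values are assumptions to adapt to the actual deployment (the RabbitMQ Cluster Operator exposes equivalent settings through its cluster resource).

```
# Hypothetical resource patch for a RabbitMQ StatefulSet; names and sizes are assumptions.
# Example application: kubectl -n messaging patch statefulset rabbitmq --patch-file rabbitmq-cpu.yaml
spec:
  template:
    spec:
      containers:
        - name: rabbitmq              # container name assumed
          resources:
            requests:
              cpu: "2"                # extra headroom for Mnesia/queue recovery at boot
              memory: 4Gi
            limits:
              cpu: "4"
              memory: 4Gi
```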
10 changes: 5 additions & 5 deletions rules/cre-2024-0008/rabbitmq-memory-alarm.yaml
@@ -3,7 +3,8 @@ rules:
id: CRE-2024-0008
severity: 1
title: RabbitMQ memory alarm
- category: message-queue-problem
+ pillar: Infrastructure
+ category: resource-exhaustion-memory
author: Prequel
description: |
A RabbitMQ node has entered the “memory alarm” state because the total memory used by the Erlang VM (plus allocated binaries, ETS tables,
@@ -22,9 +23,8 @@ rules:
- Application components that rely on timely message delivery may experience delays or timeouts, degrading user-facing services.
impactScore: 9
tags:
- - known-problem
- - rabbitmq
- - public
+ - oom-error
+ - back-pressure
mitigation: |
- Inspect memory usage to identify queues or processes consuming RAM:
`rabbitmq-diagnostics memory_breakdown -n <node>`
@@ -56,4 +56,4 @@ rules:
negate:
- value: memory resource limit alarm cleared
anchor: 1
- window: 15s
+ window: 15s
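For CRE-2024-0008, once `rabbitmq-diagnostics memory_breakdown` shows where the memory is going, a common follow-up is to tune the memory high watermark together with the container limit. A minimal sketch, assuming `rabbitmq.conf` is mounted from a ConfigMap (names and the 0.6 ratio are assumptions):

```
# Hypothetical ConfigMap fragment; vm_memory_high_watermark is a standard rabbitmq.conf
# setting, but the chosen ratio must match real container headroom.
apiVersion: v1
kind: ConfigMap
metadata:
  name: rabbitmq-config              # name assumed
  namespace: messaging               # namespace assumed
data:
  rabbitmq.conf: |
    vm_memory_high_watermark.relative = 0.6
    # or pin an absolute threshold instead:
    # vm_memory_high_watermark.absolute = 6GB
```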
12 changes: 6 additions & 6 deletions rules/cre-2024-0014/rabbitmq-busy-dist-port.yaml
@@ -1,9 +1,10 @@
rules:
- cre:
- id: CRE-2024-0008
+ id: CRE-2024-0014
severity: 1
title: RabbitMQ busy distribution port performance issue
- category: message-queue-performance
+ pillar: Infrastructure
+ category: resource-exhaustion-connections
author: Prequel
description: |
The Erlang VM has reported a **`busy_dist_port`** condition, meaning the send buffer of a distribution port (used for inter-node traffic inside a
@@ -24,9 +25,8 @@ rules:
- Severe cases can drop inter-node links, triggering partition-handling logic and service outages.
impactScore: 8
tags:
- - known-problem
- - rabbitmq
- - public
+ - high-latency
+ - invalid-payload
mitigation: |
- **Diagnose** – run `rabbitmq-diagnostics busy_dist_port` (3.13+) or inspect warnings to identify affected nodes.
- **Raise the buffer limit** – set `RABBITMQ_DISTRIBUTION_BUFFER_SIZE=512000` (≈ 512 MB) or pass `-zdbbl 512000` to the Erlang VM; restart the node. Ensure the pod / host memory limit is increased accordingly (≥ 512 MB × node count).
@@ -51,4 +51,4 @@ rules:
event:
source: cre.log.rabbitmq
match:
- - regex: "[warning](.+)rabbit_sysmon_handler busy_dist_port"
+ - regex: "[warning](.+)rabbit_sysmon_handler busy_dist_port"
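For CRE-2024-0014, a minimal sketch of the buffer-limit mitigation on Kubernetes; the StatefulSet fragment and memory figure are assumptions, and `RABBITMQ_DISTRIBUTION_BUFFER_SIZE` is given in kilobytes (512000 is roughly 512 MB).

```
# Hypothetical StatefulSet fragment; container name and limits are assumptions.
spec:
  template:
    spec:
      containers:
        - name: rabbitmq
          env:
            - name: RABBITMQ_DISTRIBUTION_BUFFER_SIZE
              value: "512000"          # kilobytes, ~512 MB
          resources:
            limits:
              memory: 8Gi              # raise alongside the buffer; the rule suggests >= 512 MB x node count
```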
9 changes: 4 additions & 5 deletions rules/cre-2024-0016/gke-metrics-export-failed.yaml
@@ -3,15 +3,13 @@ rules:
id: CRE-2024-0016
severity: 3
title: Google Kubernetes Engine metrics agent failing to export metrics
- category: observability-problem
+ category: persistence-failure
author: Prequel
description: The Google Kubernetes Engine metrics agent is failing to export metrics.
cause: |
The GKE team is aware of this issue and is working on a fix.
tags:
- - known-problem
- - gke
- - public
+ - upstream-failure
mitigation: |
`gcloud logging sinks update _Default --add-exclusion=name=exclude-unimportant-gke-metadata-server-logs,filter=' resource.type = "k8s_container" resource.labels.namespace_name = "kube-system" resource.labels.pod_name =~ "gke-metadata-server-.*" resource.labels.container_name = "gke-metadata-server" severity <= "INFO" '`
mitigationScore: 2
@@ -30,6 +28,7 @@ rules:
version: "1.30.x"
- name: "Google Kubernetes Engine (GKE)"
version: "1.31.x"
+ pillar: Data
metadata:
kind: prequel
id: rBj7HEGesPj8suW6G3DvrJ
@@ -39,4 +38,4 @@ rules:
event:
source: cre.log.gke-metrics-agent
match:
- - regex: Exporting failed(.+)Please retry(.+)If internal errors persist, contact support at https://cloud.google.com/support/
+ - regex: Exporting failed(.+)Please retry(.+)If internal errors persist, contact support at https://cloud.google.com/support/
10 changes: 5 additions & 5 deletions rules/cre-2024-0018/ovn-high-cpu-usage.yaml
@@ -3,7 +3,7 @@ rules:
id: CRE-2024-0018
severity: 2
title: Neutron Open Virtual Network (OVN) high CPU usage
- category: networking-problem
+ category: resource-exhaustion-cpu
author: Prequel
description: |
OVN daemons (e.g., ovn-controller) are stuck in a tight poll loop, driving CPU to 100 %. Logs show “Dropped … due to excessive rate” or
@@ -13,10 +13,10 @@ rules:
- Burst of logical-flow updates (security-groups, LB changes)
- Poll-loop bug in OVN ≤ 20.2.0
- CPU contention with GPU workloads; no offload/DPDK
- tags:
- - known-problem
- - ovn
- - public
+ tags:
+ - high-cpu-usage
+ - bug
+ pillar: Infrastructure
mitigation: |
Increase the OVN remote probe interval to 30 seconds:
```
7 changes: 3 additions & 4 deletions rules/cre-2024-0043/nginx-upstream-failure.yaml
@@ -3,7 +3,8 @@ rules:
id: CRE-2024-0043
severity: 2
title: NGINX Upstream DNS Failure
- category: proxy-problems
+ pillar: Networking
+ category: connectivity-dns-failure
author: Prequel
description: |
When a NGINX upstream becomes unreachable or its DNS entry disappears, NGINX requests begin to fail.
@@ -13,9 +14,7 @@ rules:
Clients experience partial or total service interruptions until the upstream is restored or reconfigured.
impactScore: 6
tags:
- - kafka
- - known-problem
- - public
+ - upstream-failure
mitigation: |
Provide a stable or redundant upstream configuration so NGINX can gracefully handle DNS resolution failures.
mitigationScore: 5
10 changes: 4 additions & 6 deletions rules/cre-2025-0021/keda-nil-pointer.yaml
@@ -3,7 +3,8 @@ rules:
id: CRE-2024-0021
severity: 1
title: KEDA operator reconciler ScaledObject panic
- category: operator-problem
+ pillar: Application
+ category: process-crash
author: Prequel
description: |
KEDA allows for fine-grained autoscaling (including to/from zero) for event driven Kubernetes workloads. KEDA serves as a Kubernetes Metrics Server and allows users to define autoscaling rules using a dedicated Kubernetes custom resource definition.
@@ -13,10 +14,7 @@ rules:
Until the ScaledObject is deleted or KEDA is upgraded, the KEDA operator will continue to crash when reconciling ScaledObjects.
impactScore: 4
tags:
- - keda
- - crash
- - known-problem
- - public
+ - bug
mitigation: |
- Upgrade to KEDA 2.16.1 or newer
- Deleting the ScaledObjects on the failing cluster will also allow KEDA to recover
@@ -44,4 +42,4 @@ rules:
match:
- value: "ResolveScaleTargetPodSpec"
- value: "scale_resolvers.go"
- - value: "performGetScalersCache"
+ - value: "performGetScalersCache"
8 changes: 4 additions & 4 deletions rules/cre-2025-0025/kafka-broker-replication-mismatch.yaml
@@ -3,7 +3,8 @@ rules:
id: CRE-2025-0025
severity: 2
title: Kafka broker replication mismatch
- category: message-queue-problem
+ pillar: Data
+ category: data-replication-failure
author: Prequel
description: |
When the configured replication factor for a Kafka topic is greater than the actual number of brokers in the cluster, Kafka repeatedly fails to assign partitions and logs replication-related errors. This results in persistent warnings or an `InvalidReplicationFactorException` when the broker tries to create internal or user-defined topics.
@@ -13,9 +14,7 @@ rules:
Exceeding the available brokers with a higher replication factor can lead to failed topic creations, continuous log errors, and possible service disruption if critical internal topics (like consumer offsets or transaction state) cannot be replicated.
impactScore: 6
tags:
- - kafka
- - known-problem
- - public
+ - misconfiguration
mitigation: |
Match or lower the replication factor to the actual broker count, or scale up the Kafka cluster to accommodate the higher replication factor.
mitigationScore: 5
@@ -28,6 +27,7 @@ rules:
version: 3.x
- name: Kafka
version: 4.x

metadata:
kind: prequel
id: LikPvDPTX5kEKiR3EoBEM7
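For CRE-2025-0025, the sketch below keeps replication factors at or below the broker count so that internal topics (consumer offsets, transaction state) can always be placed. It assumes a Strimzi-managed three-broker cluster and omits listener/storage sections; the same `*.replication.factor` keys can be set in `server.properties` on non-operator deployments.

```
# Hypothetical Strimzi Kafka fragment; resource name and broker count are assumptions.
apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
  name: my-cluster
spec:
  kafka:
    replicas: 3
    config:
      default.replication.factor: 3
      min.insync.replicas: 2
      offsets.topic.replication.factor: 3            # __consumer_offsets
      transaction.state.log.replication.factor: 3    # __transaction_state
      transaction.state.log.min.isr: 2
```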
9 changes: 3 additions & 6 deletions rules/cre-2025-0026/aws-ebs-csi-driver-fails-to.yaml
@@ -3,18 +3,15 @@ rules:
id: CRE-2025-0026
severity: 3
title: AWS EBS CSI Driver fails to detach volume when VolumeAttachment has empty nodeName
- category: storage-problem
+ pillar: Data
+ category: persistence-failure
author: Prequel
description: |
In clusters using the AWS EBS CSI driver, the controller may fail to detach a volume if the associated VolumeAttachment resource has an empty `spec.nodeName`. This results in a log error and skipped detachment, which may block PVC reuse or node cleanup.
cause: |
The controller attempts to locate the node based on `VolumeAttachment.spec.nodeName`. If this field is empty, the controller's logic skips processing, leading to a failure in detachment flow. This commonly happens when a VolumeAttachment is deleted before node assignment completes.
tags:
- - ebs
- - csi
- - aws
- - storage
- - public
+ - bug
mitigation: |
- Upgrade to aws-ebs-csi-driver v1.26.1 or later.
- Avoid deleting PVCs or terminating pods immediately after volume provisioning.
12 changes: 4 additions & 8 deletions rules/cre-2025-0027/neutron-ovn-allows-port.yaml
@@ -3,20 +3,16 @@ rules:
id: CRE-2025-0027
severity: 3
title: Neutron Open Virtual Network (OVN) and Virtual Interface (VIF) allows port binding to dead agents, causing VIF plug timeouts
- category: networking-problem
+ pillar: Networking
+ category: connectivity-timeout
author: Prequel
description: |
In OpenStack deployments using Neutron with the OVN ML2 driver, ports could be bound to agents that were not alive. This behavior led to virtual machines experiencing network interface plug timeouts during provisioning, as the port binding would not complete successfully.
cause: |
The OVN mechanism driver did not verify the liveness of agents before binding ports. Consequently, ports could be bound to non-responsive agents, resulting in failures during the virtual interface (VIF) plug process.
tags:
- - neutron
- - ovn
- - timeout
- - networking
- - openstack
- - known-issue
- - public
+ - node-unresponsive
+ - bug
mitigation: |
- Upgrade Neutron to a version that includes the fix for this issue:
- Master branch: commit `8a55f091925fd5e6742fb92783c524450843f5a0`\n
13 changes: 4 additions & 9 deletions rules/cre-2025-0029/loki-fails-to-retrieve-aws.yaml
@@ -3,21 +3,16 @@ rules:
id: CRE-2025-0029
severity: 3
title: Loki fails to retrieve AWS credentials when specifying S3 endpoint with IRSA
- category: storage-problem
+ pillar: Security
+ category: authorization-violation
author: Prequel
description: |
- When deploying Grafana Loki with AWS S3 as the storage backend and specifying a custom S3 endpoint (e.g., for FIPS compliance or GovCloud regions), Loki may fail to retrieve AWS credentials via IAM Roles for Service Accounts (IRSA). This results in errors during startup or when attempting to upload index tables, preventing Loki from functioning correctly.
cause: |
- The issue arises when the Loki configuration includes a custom `endpoint` for S3 and relies on IRSA for authentication. In such cases, Loki encounters a `WebIdentityErr` and `SerializationError` due to improper handling of credential retrieval with the specified endpoint.
tags:
- - loki
- - s3
- - aws
- - irsa
- - storage
- - authentication
- - helm
- - public
+ - permission-denied
+ - misconfiguration
mitigation: |
- In your Helm chart values, explicitly set `accessKeyId` and `secretAccessKey` to `null` to prevent default values from interfering with IRSA authentication.
references:
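For CRE-2025-0029, a minimal Helm values sketch of the mitigation above. Key paths differ between Loki chart versions, so treat the layout as an assumption; the essential points are the explicit `null` credentials (so static keys do not shadow IRSA) and the IRSA-annotated service account next to the custom endpoint.

```
# Hypothetical values.yaml fragment; key paths, endpoint, and role ARN are assumptions.
loki:
  storage:
    type: s3
    s3:
      region: us-gov-west-1
      endpoint: s3-fips.us-gov-west-1.amazonaws.com   # custom / FIPS endpoint
      accessKeyId: null          # explicit null so the AWS SDK falls back to IRSA
      secretAccessKey: null
serviceAccount:
  create: true
  annotations:
    eks.amazonaws.com/role-arn: arn:aws-us-gov:iam::111122223333:role/loki-s3
```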
12 changes: 3 additions & 9 deletions rules/cre-2025-0034/datadog-agent-disabled-due-to.yaml
@@ -3,21 +3,15 @@ rules:
id: CRE-2025-0034
severity: 2
title: Datadog agent disabled due to missing API key
- category: observability-problem
+ pillar: Application
+ category: configuration-error
author: Prequel
description: |
If the Datadog agent or client libraries do not detect a configured API key, they will skip sending metrics, logs, and events. This results in a silent failure of observability reporting, often visible only through startup log messages.
cause: |
The environment variable `DD_API_KEY` was not set, or was set to an empty value in the container or application environment. As a result, the Datadog agent or SDK initializes in a no-op mode and skips exporting telemetry.
tags:
- - datadog
- - configuration
- - api-key
- - observability
- - environment
- - telemetry
- - known-issue
- - public
+ - misconfiguration
mitigation: |
- Ensure that the `DD_API_KEY` environment variable is present and correctly populated in the container or deployment spec.
- Use Kubernetes secrets or external secret stores to safely inject credentials into the runtime environment.
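For CRE-2025-0034, a minimal sketch of injecting `DD_API_KEY` from a Kubernetes Secret, as suggested above; the secret name, key, and image tag are assumptions.

```
# Hypothetical Secret plus container env wiring; names are placeholders.
apiVersion: v1
kind: Secret
metadata:
  name: datadog-secret
type: Opaque
stringData:
  api-key: "<datadog-api-key>"       # placeholder, never commit a real key
---
# Pod spec fragment for the agent (or any instrumented workload):
containers:
  - name: datadog-agent
    image: gcr.io/datadoghq/agent:7  # image tag assumed
    env:
      - name: DD_API_KEY
        valueFrom:
          secretKeyRef:
            name: datadog-secret
            key: api-key
```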
13 changes: 3 additions & 10 deletions rules/cre-2025-0036/opentelemetry-collector-drops.yaml
@@ -3,22 +3,15 @@ rules:
id: CRE-2025-0036
severity: 3
title: OpenTelemetry Collector drops data due to 413 Payload Too Large from exporter target
- category: observability-problem
+ pillar: Application
+ category: configuration-error
author: Prequel
description: |
The OpenTelemetry Collector may drop telemetry data when an exporter backend responds with a 413 Payload Too Large error. This typically happens when large batches of metrics, logs, or traces exceed the maximum payload size accepted by the backend. By default, the collector drops these payloads unless retry behavior is explicitly enabled.
cause: |
The backend server (e.g., metrics platform, observability vendor) returned an HTTP 413 status, rejecting the payload as too large. The exporter component does not automatically retry unless `retry_on_failure` is enabled. As a result, the data is dropped permanently.
tags:
- - otel-collector
- - exporter
- - payload
- - batch
- - drop
- - observability
- - telemetry
- - known-issue
- - public
+ - invalid-payload
mitigation: |
- Enable `retry_on_failure` in the relevant exporter config to allow retry logic.
- Reduce batch size via `sending_queue` settings or exporter-specific `timeout`/`flush_interval` configurations.
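For CRE-2025-0036, a minimal collector-config sketch of the two mitigations above: cap batch sizes in the `batch` processor and enable `retry_on_failure` on the exporter. The `otlphttp` exporter, endpoint, and sizes are assumptions; the retry and queue knobs apply to most exporters built on `exporterhelper`.

```
# Hypothetical collector config; backend endpoint and batch sizes are assumptions.
receivers:
  otlp:
    protocols:
      http: {}
processors:
  batch:
    send_batch_size: 512
    send_batch_max_size: 1024        # hard cap so serialized payloads stay under the backend limit
    timeout: 5s
exporters:
  otlphttp:
    endpoint: https://otel-backend.example.com:4318
    retry_on_failure:
      enabled: true
      initial_interval: 5s
      max_elapsed_time: 300s
    sending_queue:
      enabled: true
      queue_size: 1000
service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlphttp]
```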
15 changes: 4 additions & 11 deletions rules/cre-2025-0037/opentelemetry-collector-panic.yaml
@@ -3,23 +3,16 @@ rules:
id: CRE-2025-0037
severity: 3
title: OpenTelemetry Collector panics on nil attribute value in Prometheus Remote Write translator
- category: observability-problem
+ pillar: Application
+ category: process-crash
author: Prequel
description: |
The OpenTelemetry Collector can panic due to a nil pointer dereference in the Prometheus Remote Write exporter. The issue occurs when attribute values are assumed to be strings, but the internal representation is nil or incompatible, leading to a runtime `SIGSEGV` segmentation fault and crashing the collector.
cause: |
The Prometheus Remote Write translator (`createAttributes`) iterates over attribute maps using `.Range` and directly calls `.AsString()` on a `pcommon.Value` without checking its type or for nil values. If the internal protobuf-backed `AnyValue` is unset or incompatible, it triggers a Go panic.
tags:
- - crash
- - prometheus
- - otel-collector
- - exporter
- - panic
- - translation
- - attribute
- - nil-pointer
- - known-issue
- - public
+ - nil-pointer-dereference
+ - bug
mitigation: |
- Upgrade to a release of `opentelemetry-collector-contrib` after v0.115.0 if available.
- Patch your local copy of `createAttributes()` to check `value.Type()` before calling `.AsString()`.