9 changes: 4 additions & 5 deletions rules/cre-2024-0007/rabbitmq-mnesia-overloaded.yaml
@@ -3,7 +3,8 @@ rules:
id: CRE-2024-0007
severity: 0
title: RabbitMQ Mnesia overloaded recovering persistent queues
- category: message-queue-problem
+ pillar: Infrastructure
+ category: resource-exhaustion-cpu
author: Prequel
description: |
The RabbitMQ cluster is processing a large number of persistent mirrored queues at boot. The underlying Erlang process, Mnesia, is overloaded (`** WARNING ** Mnesia is overloaded`).
@@ -13,9 +14,7 @@ rules:
RabbitMQ is unable to process any new messages, which can lead to outages in consumers and producers.
impactScore: 9
tags:
- - known-problem
- - rabbitmq
- - public
+ - high-cpu-usage
mitigation: |
- Increase the size of the cluster
- Increase the Kubernetes CPU limits for the RabbitMQ brokers
@@ -46,4 +45,4 @@ rules:
slide: 30s
- value: "SIGTERM received - shutting down"
anchor: 1
- window: 10s
+ window: 10s
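As an illustration of the CPU-limit mitigation in CRE-2024-0007 above, the sketch below shows one way to raise CPU for RabbitMQ broker pods on Kubernetes. The StatefulSet layout, names, and values are assumptions to adapt to the actual deployment (the RabbitMQ Cluster Operator exposes equivalent settings through its cluster resource).

```
# Hypothetical resource patch for a RabbitMQ StatefulSet; names and sizes are assumptions.
# Example application: kubectl -n messaging patch statefulset rabbitmq --patch-file rabbitmq-cpu.yaml
spec:
  template:
    spec:
      containers:
        - name: rabbitmq              # container name assumed
          resources:
            requests:
              cpu: "2"                # extra headroom for Mnesia/queue recovery at boot
              memory: 4Gi
            limits:
              cpu: "4"
              memory: 4Gi
```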
10 changes: 5 additions & 5 deletions rules/cre-2024-0008/rabbitmq-memory-alarm.yaml
@@ -3,7 +3,8 @@ rules:
id: CRE-2024-0008
severity: 1
title: RabbitMQ memory alarm
- category: message-queue-problem
+ pillar: Infrastructure
+ category: resource-exhaustion-memory
author: Prequel
description: |
A RabbitMQ node has entered the “memory alarm” state because the total memory used by the Erlang VM (plus allocated binaries, ETS tables,
@@ -22,9 +23,8 @@ rules:
- Application components that rely on timely message delivery may experience delays or timeouts, degrading user-facing services.
impactScore: 9
tags:
- - known-problem
- - rabbitmq
- - public
+ - oom-error
+ - back-pressure
mitigation: |
- Inspect memory usage to identify queues or processes consuming RAM:
`rabbitmq-diagnostics memory_breakdown -n <node>`
@@ -56,4 +56,4 @@ rules:
negate:
- value: memory resource limit alarm cleared
anchor: 1
- window: 15s
+ window: 15s
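For CRE-2024-0008, once `rabbitmq-diagnostics memory_breakdown` shows where the memory is going, a common follow-up is to tune the memory high watermark together with the container limit. A minimal sketch, assuming `rabbitmq.conf` is mounted from a ConfigMap (names and the 0.6 ratio are assumptions):

```
# Hypothetical ConfigMap fragment; vm_memory_high_watermark is a standard rabbitmq.conf
# setting, but the chosen ratio must match real container headroom.
apiVersion: v1
kind: ConfigMap
metadata:
  name: rabbitmq-config              # name assumed
  namespace: messaging               # namespace assumed
data:
  rabbitmq.conf: |
    vm_memory_high_watermark.relative = 0.6
    # or pin an absolute threshold instead:
    # vm_memory_high_watermark.absolute = 6GB
```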
12 changes: 6 additions & 6 deletions rules/cre-2024-0014/rabbitmq-busy-dist-port.yaml
@@ -1,9 +1,10 @@
rules:
- cre:
- id: CRE-2024-0008
+ id: CRE-2024-0014
severity: 1
title: RabbitMQ busy distribution port performance issue
- category: message-queue-performance
+ pillar: Infrastructure
+ category: resource-exhaustion-connections
author: Prequel
description: |
The Erlang VM has reported a **`busy_dist_port`** condition, meaning the send buffer of a distribution port (used for inter-node traffic inside a
@@ -24,9 +25,8 @@ rules:
- Severe cases can drop inter-node links, triggering partition-handling logic and service outages.
impactScore: 8
tags:
- - known-problem
- - rabbitmq
- - public
+ - high-latency
+ - invalid-payload
mitigation: |
- **Diagnose** – run `rabbitmq-diagnostics busy_dist_port` (3.13+) or inspect warnings to identify affected nodes.
- **Raise the buffer limit** – set `RABBITMQ_DISTRIBUTION_BUFFER_SIZE=512000` (≈ 512 MB) or pass `-zdbbl 512000` to the Erlang VM; restart the node. Ensure the pod / host memory limit is increased accordingly (≥ 512 MB × node count).
@@ -51,4 +51,4 @@ rules:
event:
source: cre.log.rabbitmq
match:
- - regex: "[warning](.+)rabbit_sysmon_handler busy_dist_port"
+ - regex: "[warning](.+)rabbit_sysmon_handler busy_dist_port"
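For CRE-2024-0014, a minimal sketch of the buffer-limit mitigation on Kubernetes; the StatefulSet fragment and memory figure are assumptions, and `RABBITMQ_DISTRIBUTION_BUFFER_SIZE` is given in kilobytes (512000 is roughly 512 MB).

```
# Hypothetical StatefulSet fragment; container name and limits are assumptions.
spec:
  template:
    spec:
      containers:
        - name: rabbitmq
          env:
            - name: RABBITMQ_DISTRIBUTION_BUFFER_SIZE
              value: "512000"          # kilobytes, ~512 MB
          resources:
            limits:
              memory: 8Gi              # raise alongside the buffer; the rule suggests >= 512 MB x node count
```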
9 changes: 4 additions & 5 deletions rules/cre-2024-0016/gke-metrics-export-failed.yaml
@@ -3,15 +3,13 @@ rules:
id: CRE-2024-0016
severity: 3
title: Google Kubernetes Engine metrics agent failing to export metrics
- category: observability-problem
+ category: persistence-failure
author: Prequel
description: The Google Kubernetes Engine metrics agent is failing to export metrics.
cause: |
The GKE team is aware of this issue and is working on a fix.
tags:
- - known-problem
- - gke
- - public
+ - upstream-failure
mitigation: |
`gcloud logging sinks update _Default --add-exclusion=name=exclude-unimportant-gke-metadata-server-logs,filter=' resource.type = "k8s_container" resource.labels.namespace_name = "kube-system" resource.labels.pod_name =~ "gke-metadata-server-.*" resource.labels.container_name = "gke-metadata-server" severity <= "INFO" '`
mitigationScore: 2
@@ -30,6 +28,7 @@ rules:
version: "1.30.x"
- name: "Google Kubernetes Engine (GKE)"
version: "1.31.x"
+ pillar: Data
metadata:
kind: prequel
id: rBj7HEGesPj8suW6G3DvrJ
@@ -39,4 +38,4 @@ rules:
event:
source: cre.log.gke-metrics-agent
match:
- - regex: Exporting failed(.+)Please retry(.+)If internal errors persist, contact support at https://cloud.google.com/support/
+ - regex: Exporting failed(.+)Please retry(.+)If internal errors persist, contact support at https://cloud.google.com/support/
10 changes: 5 additions & 5 deletions rules/cre-2024-0018/ovn-high-cpu-usage.yaml
@@ -3,7 +3,7 @@ rules:
id: CRE-2024-0018
severity: 2
title: Neutron Open Virtual Network (OVN) high CPU usage
- category: networking-problem
+ category: resource-exhaustion-cpu
author: Prequel
description: |
OVN daemons (e.g., ovn-controller) are stuck in a tight poll loop, driving CPU to 100 %. Logs show “Dropped … due to excessive rate” or
@@ -13,10 +13,10 @@ rules:
- Burst of logical-flow updates (security-groups, LB changes)
- Poll-loop bug in OVN ≤ 20.2.0
- CPU contention with GPU workloads; no offload/DPDK
- tags:
- - known-problem
- - ovn
- - public
+ tags:
+ - high-cpu-usage
+ - bug
+ pillar: Infrastructure
mitigation: |
Increase the OVN remote probe interval to 30 seconds:
```
7 changes: 3 additions & 4 deletions rules/cre-2024-0043/nginx-upstream-failure.yaml
@@ -3,7 +3,8 @@ rules:
id: CRE-2024-0043
severity: 2
title: NGINX Upstream DNS Failure
- category: proxy-problems
+ pillar: Networking
+ category: connectivity-dns-failure
author: Prequel
description: |
When a NGINX upstream becomes unreachable or its DNS entry disappears, NGINX requests begin to fail.
@@ -13,9 +14,7 @@ rules:
Clients experience partial or total service interruptions until the upstream is restored or reconfigured.
impactScore: 6
tags:
- - kafka
- - known-problem
- - public
+ - upstream-failure
mitigation: |
Provide a stable or redundant upstream configuration so NGINX can gracefully handle DNS resolution failures.
mitigationScore: 5
10 changes: 4 additions & 6 deletions rules/cre-2025-0021/keda-nil-pointer.yaml
@@ -3,7 +3,8 @@ rules:
id: CRE-2024-0021
severity: 1
title: KEDA operator reconciler ScaledObject panic
- category: operator-problem
+ pillar: Application
+ category: process-crash
author: Prequel
description: |
KEDA allows for fine-grained autoscaling (including to/from zero) for event driven Kubernetes workloads. KEDA serves as a Kubernetes Metrics Server and allows users to define autoscaling rules using a dedicated Kubernetes custom resource definition.
@@ -13,10 +14,7 @@ rules:
Until the ScaledObject is deleted or KEDA is upgraded, the KEDA operator will continue to crash when reconciling ScaledObjects.
impactScore: 4
tags:
- - keda
- - crash
- - known-problem
- - public
+ - bug
mitigation: |
- Upgrade to KEDA 2.16.1 or newer
- Deleting the ScaledObjects on the failing cluster will also allow KEDA to recover
@@ -44,4 +42,4 @@ rules:
match:
- value: "ResolveScaleTargetPodSpec"
- value: "scale_resolvers.go"
- - value: "performGetScalersCache"
+ - value: "performGetScalersCache"
8 changes: 4 additions & 4 deletions rules/cre-2025-0025/kafka-broker-replication-mismatch.yaml
@@ -3,7 +3,8 @@ rules:
id: CRE-2025-0025
severity: 2
title: Kafka broker replication mismatch
- category: message-queue-problem
+ pillar: Data
+ category: data-replication-failure
author: Prequel
description: |
When the configured replication factor for a Kafka topic is greater than the actual number of brokers in the cluster, Kafka repeatedly fails to assign partitions and logs replication-related errors. This results in persistent warnings or an `InvalidReplicationFactorException` when the broker tries to create internal or user-defined topics.
@@ -13,9 +14,7 @@ rules:
Exceeding the available brokers with a higher replication factor can lead to failed topic creations, continuous log errors, and possible service disruption if critical internal topics (like consumer offsets or transaction state) cannot be replicated.
impactScore: 6
tags:
- - kafka
- - known-problem
- - public
+ - misconfiguration
mitigation: |
Match or lower the replication factor to the actual broker count, or scale up the Kafka cluster to accommodate the higher replication factor.
mitigationScore: 5
@@ -28,6 +27,7 @@ rules:
version: 3.x
- name: Kafka
version: 4.x

metadata:
kind: prequel
id: LikPvDPTX5kEKiR3EoBEM7
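For CRE-2025-0025, the sketch below keeps replication factors at or below the broker count so that internal topics (consumer offsets, transaction state) can always be placed. It assumes a Strimzi-managed three-broker cluster and omits listener/storage sections; the same `*.replication.factor` keys can be set in `server.properties` on non-operator deployments.

```
# Hypothetical Strimzi Kafka fragment; resource name and broker count are assumptions.
apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
  name: my-cluster
spec:
  kafka:
    replicas: 3
    config:
      default.replication.factor: 3
      min.insync.replicas: 2
      offsets.topic.replication.factor: 3            # __consumer_offsets
      transaction.state.log.replication.factor: 3    # __transaction_state
      transaction.state.log.min.isr: 2
```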
9 changes: 3 additions & 6 deletions rules/cre-2025-0026/aws-ebs-csi-driver-fails-to.yaml
@@ -3,18 +3,15 @@ rules:
id: CRE-2025-0026
severity: 3
title: AWS EBS CSI Driver fails to detach volume when VolumeAttachment has empty nodeName
- category: storage-problem
+ pillar: Data
+ category: persistence-failure
author: Prequel
description: |
In clusters using the AWS EBS CSI driver, the controller may fail to detach a volume if the associated VolumeAttachment resource has an empty `spec.nodeName`. This results in a log error and skipped detachment, which may block PVC reuse or node cleanup.
cause: |
The controller attempts to locate the node based on `VolumeAttachment.spec.nodeName`. If this field is empty, the controller's logic skips processing, leading to a failure in detachment flow. This commonly happens when a VolumeAttachment is deleted before node assignment completes.
tags:
- - ebs
- - csi
- - aws
- - storage
- - public
+ - bug
mitigation: |
- Upgrade to aws-ebs-csi-driver v1.26.1 or later.
- Avoid deleting PVCs or terminating pods immediately after volume provisioning.
12 changes: 4 additions & 8 deletions rules/cre-2025-0027/neutron-ovn-allows-port.yaml
@@ -3,20 +3,16 @@ rules:
id: CRE-2025-0027
severity: 3
title: Neutron Open Virtual Network (OVN) and Virtual Interface (VIF) allows port binding to dead agents, causing VIF plug timeouts
- category: networking-problem
+ pillar: Networking
+ category: connectivity-timeout
author: Prequel
description: |
In OpenStack deployments using Neutron with the OVN ML2 driver, ports could be bound to agents that were not alive. This behavior led to virtual machines experiencing network interface plug timeouts during provisioning, as the port binding would not complete successfully.
cause: |
The OVN mechanism driver did not verify the liveness of agents before binding ports. Consequently, ports could be bound to non-responsive agents, resulting in failures during the virtual interface (VIF) plug process.
tags:
- - neutron
- - ovn
- - timeout
- - networking
- - openstack
- - known-issue
- - public
+ - node-unresponsive
+ - bug
mitigation: |
- Upgrade Neutron to a version that includes the fix for this issue:
- Master branch: commit `8a55f091925fd5e6742fb92783c524450843f5a0`\n
13 changes: 4 additions & 9 deletions rules/cre-2025-0029/loki-fails-to-retrieve-aws.yaml
@@ -3,21 +3,16 @@ rules:
id: CRE-2025-0029
severity: 3
title: Loki fails to retrieve AWS credentials when specifying S3 endpoint with IRSA
- category: storage-problem
+ pillar: Security
+ category: authorization-violation
author: Prequel
description: |
- When deploying Grafana Loki with AWS S3 as the storage backend and specifying a custom S3 endpoint (e.g., for FIPS compliance or GovCloud regions), Loki may fail to retrieve AWS credentials via IAM Roles for Service Accounts (IRSA). This results in errors during startup or when attempting to upload index tables, preventing Loki from functioning correctly.
cause: |
- The issue arises when the Loki configuration includes a custom `endpoint` for S3 and relies on IRSA for authentication. In such cases, Loki encounters a `WebIdentityErr` and `SerializationError` due to improper handling of credential retrieval with the specified endpoint.
tags:
- - loki
- - s3
- - aws
- - irsa
- - storage
- - authentication
- - helm
- - public
+ - permission-denied
+ - misconfiguration
mitigation: |
- In your Helm chart values, explicitly set `accessKeyId` and `secretAccessKey` to `null` to prevent default values from interfering with IRSA authentication.
references:
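For CRE-2025-0029, a minimal Helm values sketch of the mitigation above. Key paths differ between Loki chart versions, so treat the layout as an assumption; the essential points are the explicit `null` credentials (so static keys do not shadow IRSA) and the IRSA-annotated service account next to the custom endpoint.

```
# Hypothetical values.yaml fragment; key paths, endpoint, and role ARN are assumptions.
loki:
  storage:
    type: s3
    s3:
      region: us-gov-west-1
      endpoint: s3-fips.us-gov-west-1.amazonaws.com   # custom / FIPS endpoint
      accessKeyId: null          # explicit null so the AWS SDK falls back to IRSA
      secretAccessKey: null
serviceAccount:
  create: true
  annotations:
    eks.amazonaws.com/role-arn: arn:aws-us-gov:iam::111122223333:role/loki-s3
```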
12 changes: 3 additions & 9 deletions rules/cre-2025-0034/datadog-agent-disabled-due-to.yaml
@@ -3,21 +3,15 @@ rules:
id: CRE-2025-0034
severity: 2
title: Datadog agent disabled due to missing API key
- category: observability-problem
+ pillar: Application
+ category: configuration-error
author: Prequel
description: |
If the Datadog agent or client libraries do not detect a configured API key, they will skip sending metrics, logs, and events. This results in a silent failure of observability reporting, often visible only through startup log messages.
cause: |
The environment variable `DD_API_KEY` was not set, or was set to an empty value in the container or application environment. As a result, the Datadog agent or SDK initializes in a no-op mode and skips exporting telemetry.
tags:
- - datadog
- - configuration
- - api-key
- - observability
- - environment
- - telemetry
- - known-issue
- - public
+ - misconfiguration
mitigation: |
- Ensure that the `DD_API_KEY` environment variable is present and correctly populated in the container or deployment spec.
- Use Kubernetes secrets or external secret stores to safely inject credentials into the runtime environment.
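For CRE-2025-0034, a minimal sketch of injecting `DD_API_KEY` from a Kubernetes Secret, as suggested above; the secret name, key, and image tag are assumptions.

```
# Hypothetical Secret plus container env wiring; names are placeholders.
apiVersion: v1
kind: Secret
metadata:
  name: datadog-secret
type: Opaque
stringData:
  api-key: "<datadog-api-key>"       # placeholder, never commit a real key
---
# Pod spec fragment for the agent (or any instrumented workload):
containers:
  - name: datadog-agent
    image: gcr.io/datadoghq/agent:7  # image tag assumed
    env:
      - name: DD_API_KEY
        valueFrom:
          secretKeyRef:
            name: datadog-secret
            key: api-key
```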
13 changes: 3 additions & 10 deletions rules/cre-2025-0036/opentelemetry-collector-drops.yaml
@@ -3,22 +3,15 @@ rules:
id: CRE-2025-0036
severity: 3
title: OpenTelemetry Collector drops data due to 413 Payload Too Large from exporter target
- category: observability-problem
+ pillar: Application
+ category: configuration-error
author: Prequel
description: |
The OpenTelemetry Collector may drop telemetry data when an exporter backend responds with a 413 Payload Too Large error. This typically happens when large batches of metrics, logs, or traces exceed the maximum payload size accepted by the backend. By default, the collector drops these payloads unless retry behavior is explicitly enabled.
cause: |
The backend server (e.g., metrics platform, observability vendor) returned an HTTP 413 status, rejecting the payload as too large. The exporter component does not automatically retry unless `retry_on_failure` is enabled. As a result, the data is dropped permanently.
tags:
- - otel-collector
- - exporter
- - payload
- - batch
- - drop
- - observability
- - telemetry
- - known-issue
- - public
+ - invalid-payload
mitigation: |
- Enable `retry_on_failure` in the relevant exporter config to allow retry logic.
- Reduce batch size via `sending_queue` settings or exporter-specific `timeout`/`flush_interval` configurations.
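For CRE-2025-0036, a minimal collector-config sketch of the two mitigations above: cap batch sizes in the `batch` processor and enable `retry_on_failure` on the exporter. The `otlphttp` exporter, endpoint, and sizes are assumptions; the retry and queue knobs apply to most exporters built on `exporterhelper`.

```
# Hypothetical collector config; backend endpoint and batch sizes are assumptions.
receivers:
  otlp:
    protocols:
      http: {}
processors:
  batch:
    send_batch_size: 512
    send_batch_max_size: 1024        # hard cap so serialized payloads stay under the backend limit
    timeout: 5s
exporters:
  otlphttp:
    endpoint: https://otel-backend.example.com:4318
    retry_on_failure:
      enabled: true
      initial_interval: 5s
      max_elapsed_time: 300s
    sending_queue:
      enabled: true
      queue_size: 1000
service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlphttp]
```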
15 changes: 4 additions & 11 deletions rules/cre-2025-0037/opentelemetry-collector-panic.yaml
@@ -3,23 +3,16 @@ rules:
id: CRE-2025-0037
severity: 3
title: OpenTelemetry Collector panics on nil attribute value in Prometheus Remote Write translator
- category: observability-problem
+ pillar: Application
+ category: process-crash
author: Prequel
description: |
The OpenTelemetry Collector can panic due to a nil pointer dereference in the Prometheus Remote Write exporter. The issue occurs when attribute values are assumed to be strings, but the internal representation is nil or incompatible, leading to a runtime `SIGSEGV` segmentation fault and crashing the collector.
cause: |
The Prometheus Remote Write translator (`createAttributes`) iterates over attribute maps using `.Range` and directly calls `.AsString()` on a `pcommon.Value` without checking its type or for nil values. If the internal protobuf-backed `AnyValue` is unset or incompatible, it triggers a Go panic.
tags:
- - crash
- - prometheus
- - otel-collector
- - exporter
- - panic
- - translation
- - attribute
- - nil-pointer
- - known-issue
- - public
+ - nil-pointer-dereference
+ - bug
mitigation: |
- Upgrade to a release of `opentelemetry-collector-contrib` after v0.115.0 if available.
- Patch your local copy of `createAttributes()` to check `value.Type()` before calling `.AsString()`.