diff --git a/rules/cre-2024-0007/rabbitmq-mnesia-overloaded.yaml b/rules/cre-2024-0007/rabbitmq-mnesia-overloaded.yaml index 216389b..99da472 100644 --- a/rules/cre-2024-0007/rabbitmq-mnesia-overloaded.yaml +++ b/rules/cre-2024-0007/rabbitmq-mnesia-overloaded.yaml @@ -3,7 +3,8 @@ rules: id: CRE-2024-0007 severity: 0 title: RabbitMQ Mnesia overloaded recovering persistent queues - category: message-queue-problem + pillar: Infrastructure + category: resource-exhaustion-cpu author: Prequel description: | The RabbitMQ cluster is processing a large number of persistent mirrored queues at boot. The underlying Erlang process, Mnesia, is overloaded (`** WARNING ** Mnesia is overloaded`). @@ -13,9 +14,7 @@ rules: RabbitMQ is unable to process any new messages, which can lead to outages in consumers and producers. impactScore: 9 tags: - - known-problem - - rabbitmq - - public + - high-cpu-usage mitigation: | - Increase the size of the cluster - Increase the Kubernetes CPU limits for the RabbitMQ brokers @@ -46,4 +45,4 @@ rules: slide: 30s - value: "SIGTERM received - shutting down" anchor: 1 - window: 10s \ No newline at end of file + window: 10s diff --git a/rules/cre-2024-0008/rabbitmq-memory-alarm.yaml b/rules/cre-2024-0008/rabbitmq-memory-alarm.yaml index 1ad4872..f7c078e 100644 --- a/rules/cre-2024-0008/rabbitmq-memory-alarm.yaml +++ b/rules/cre-2024-0008/rabbitmq-memory-alarm.yaml @@ -3,7 +3,8 @@ rules: id: CRE-2024-0008 severity: 1 title: RabbitMQ memory alarm - category: message-queue-problem + pillar: Infrastructure + category: resource-exhaustion-memory author: Prequel description: | A RabbitMQ node has entered the “memory alarm” state because the total memory used by the Erlang VM (plus allocated binaries, ETS tables, @@ -22,9 +23,8 @@ rules: - Application components that rely on timely message delivery may experience delays or timeouts, degrading user-facing services. 
impactScore: 9 tags: - - known-problem - - rabbitmq - - public + - oom-error + - back-pressure mitigation: | - Inspect memory usage to identify queues or processes consuming RAM: `rabbitmq-diagnostics memory_breakdown -n ` @@ -56,4 +56,4 @@ rules: negate: - value: memory resource limit alarm cleared anchor: 1 - window: 15s \ No newline at end of file + window: 15s diff --git a/rules/cre-2024-0014/rabbitmq-busy-dist-port.yaml b/rules/cre-2024-0014/rabbitmq-busy-dist-port.yaml index 1f42b43..aaecece 100644 --- a/rules/cre-2024-0014/rabbitmq-busy-dist-port.yaml +++ b/rules/cre-2024-0014/rabbitmq-busy-dist-port.yaml @@ -1,9 +1,10 @@ rules: - cre: - id: CRE-2024-0008 + id: CRE-2024-0014 severity: 1 title: RabbitMQ busy distribution port performance issue - category: message-queue-performance + pillar: Infrastructure + category: resource-exhaustion-connections author: Prequel description: | The Erlang VM has reported a **`busy_dist_port`** condition, meaning the send buffer of a distribution port (used for inter-node traffic inside a @@ -24,9 +25,8 @@ rules: - Severe cases can drop inter-node links, triggering partition-handling logic and service outages. impactScore: 8 tags: - - known-problem - - rabbitmq - - public + - high-latency + - invalid-payload mitigation: | - **Diagnose** – run `rabbitmq-diagnostics busy_dist_port` (3.13+) or inspect warnings to identify affected nodes. - **Raise the buffer limit** – set `RABBITMQ_DISTRIBUTION_BUFFER_SIZE=512000` (≈ 512 MB) or pass `-zdbbl 512000` to the Erlang VM; restart the node. Ensure the pod / host memory limit is increased accordingly (≥ 512 MB × node count). 
@@ -51,4 +51,4 @@ rules: event: source: cre.log.rabbitmq match: - - regex: "[warning](.+)rabbit_sysmon_handler busy_dist_port" \ No newline at end of file + - regex: "[warning](.+)rabbit_sysmon_handler busy_dist_port" diff --git a/rules/cre-2024-0016/gke-metrics-export-failed.yaml b/rules/cre-2024-0016/gke-metrics-export-failed.yaml index cc9c5d6..68b61e7 100644 --- a/rules/cre-2024-0016/gke-metrics-export-failed.yaml +++ b/rules/cre-2024-0016/gke-metrics-export-failed.yaml @@ -3,15 +3,13 @@ rules: id: CRE-2024-0016 severity: 3 title: Google Kubernetes Engine metrics agent failing to export metrics - category: observability-problem + category: persistence-failure author: Prequel description: The Google Kubernetes Engine metrics agent is failing to export metrics. cause: | The GKE team is aware of this issue and is working on a fix. tags: - - known-problem - - gke - - public + - upstream-failure mitigation: | `gcloud logging sinks update _Default --add-exclusion=name=exclude-unimportant-gke-metadata-server-logs,filter=' resource.type = "k8s_container" resource.labels.namespace_name = "kube-system" resource.labels.pod_name =~ "gke-metadata-server-.*" resource.labels.container_name = "gke-metadata-server" severity <= "INFO" '` mitigationScore: 2 @@ -30,6 +28,7 @@ rules: version: "1.30.x" - name: "Google Kubernetes Engine (GKE)" version: "1.31.x" + pillar: Data metadata: kind: prequel id: rBj7HEGesPj8suW6G3DvrJ @@ -39,4 +38,4 @@ rules: event: source: cre.log.gke-metrics-agent match: - - regex: Exporting failed(.+)Please retry(.+)If internal errors persist, contact support at https://cloud.google.com/support/ \ No newline at end of file + - regex: Exporting failed(.+)Please retry(.+)If internal errors persist, contact support at https://cloud.google.com/support/ diff --git a/rules/cre-2024-0018/ovn-high-cpu-usage.yaml b/rules/cre-2024-0018/ovn-high-cpu-usage.yaml index 01a43bc..402048c 100644 --- a/rules/cre-2024-0018/ovn-high-cpu-usage.yaml +++ 
b/rules/cre-2024-0018/ovn-high-cpu-usage.yaml @@ -3,7 +3,7 @@ rules: id: CRE-2024-0018 severity: 2 title: Neutron Open Virtual Network (OVN) high CPU usage - category: networking-problem + category: resource-exhaustion-cpu author: Prequel description: | OVN daemons (e.g., ovn-controller) are stuck in a tight poll loop, driving CPU to 100 %. Logs show “Dropped … due to excessive rate” or @@ -13,10 +13,10 @@ rules: - Burst of logical-flow updates (security-groups, LB changes) - Poll-loop bug in OVN ≤ 20.2.0 - CPU contention with GPU workloads; no offload/D PDK - tags: - - known-problem - - ovn - - public + tags: + - high-cpu-usage + - bug + pillar: Infrastructure mitigation: | Increase the OVN remote probe interval to 30 seconds: ``` diff --git a/rules/cre-2024-0043/nginx-upstream-failure.yaml b/rules/cre-2024-0043/nginx-upstream-failure.yaml index c61abd4..945d59a 100644 --- a/rules/cre-2024-0043/nginx-upstream-failure.yaml +++ b/rules/cre-2024-0043/nginx-upstream-failure.yaml @@ -3,7 +3,8 @@ rules: id: CRE-2024-0043 severity: 2 title: NGINX Upstream DNS Failure - category: proxy-problems + pillar: Networking + category: connectivity-dns-failure author: Prequel description: | When a NGINX upstream becomes unreachable or its DNS entry disappears, NGINX requests begin to fail. @@ -13,9 +14,7 @@ rules: Clients experience partial or total service interruptions until the upstream is restored or reconfigured. impactScore: 6 tags: - - kafka - - known-problem - - public + - upstream-failure mitigation: | Provide a stable or redundant upstream configuration so NGINX can gracefully handle DNS resolution failures. 
mitigationScore: 5 diff --git a/rules/cre-2025-0021/keda-nil-pointer.yaml b/rules/cre-2025-0021/keda-nil-pointer.yaml index 51346d1..58e11cc 100644 --- a/rules/cre-2025-0021/keda-nil-pointer.yaml +++ b/rules/cre-2025-0021/keda-nil-pointer.yaml @@ -3,7 +3,8 @@ rules: id: CRE-2024-0021 severity: 1 title: KEDA operator reconciler ScaledObject panic - category: operator-problem + pillar: Application + category: process-crash author: Prequel description: | KEDA allows for fine-grained autoscaling (including to/from zero) for event driven Kubernetes workloads. KEDA serves as a Kubernetes Metrics Server and allows users to define autoscaling rules using a dedicated Kubernetes custom resource definition. @@ -13,10 +14,7 @@ rules: Until the ScaledObject is deleted or KEDA is upgraded, the KEDA operator will continue to crash when reconciling ScaledObjects. impactScore: 4 tags: - - keda - - crash - - known-problem - - public + - bug mitigation: | - Upgrade to KEDA 2.16.1 or newer - Deleting the ScaledObjects on the failing cluster will also allow KEDA recovered @@ -44,4 +42,4 @@ rules: match: - value: "ResolveScaleTargetPodSpec" - value: "scale_resolvers.go" - - value: "performGetScalersCache" \ No newline at end of file + - value: "performGetScalersCache" diff --git a/rules/cre-2025-0025/kafka-broker-replication-mismatch.yaml b/rules/cre-2025-0025/kafka-broker-replication-mismatch.yaml index 5ac96ba..4f8bf1e 100644 --- a/rules/cre-2025-0025/kafka-broker-replication-mismatch.yaml +++ b/rules/cre-2025-0025/kafka-broker-replication-mismatch.yaml @@ -3,7 +3,8 @@ rules: id: CRE-2025-0025 severity: 2 title: Kafka broker replication mismatch - category: message-queue-problem + pillar: Data + category: data-replication-failure author: Prequel description: | When the configured replication factor for a Kafka topic is greater than the actual number of brokers in the cluster, Kafka repeatedly fails to assign partitions and logs replication-related errors. 
This results in persistent warnings or an `InvalidReplicationFactorException` when the broker tries to create internal or user-defined topics. @@ -13,9 +14,7 @@ rules: Exceeding the available brokers with a higher replication factor can lead to failed topic creations, continuous log errors, and possible service disruption if critical internal topics (like consumer offsets or transaction state) cannot be replicated. impactScore: 6 tags: - - kafka - - known-problem - - public + - misconfiguration mitigation: | Match or lower the replication factor to the actual broker count, or scale up the Kafka cluster to accommodate the higher replication factor. mitigationScore: 5 @@ -28,6 +27,7 @@ rules: version: 3.x - name: Kafka version: 4.x + metadata: kind: prequel id: LikPvDPTX5kEKiR3EoBEM7 diff --git a/rules/cre-2025-0026/aws-ebs-csi-driver-fails-to.yaml b/rules/cre-2025-0026/aws-ebs-csi-driver-fails-to.yaml index 509cb8f..6b4d385 100644 --- a/rules/cre-2025-0026/aws-ebs-csi-driver-fails-to.yaml +++ b/rules/cre-2025-0026/aws-ebs-csi-driver-fails-to.yaml @@ -3,18 +3,15 @@ rules: id: CRE-2025-0026 severity: 3 title: AWS EBS CSI Driver fails to detach volume when VolumeAttachment has empty nodeName - category: storage-problem + pillar: Data + category: persistence-failure author: Prequel description: | In clusters using the AWS EBS CSI driver, the controller may fail to detach a volume if the associated VolumeAttachment resource has an empty `spec.nodeName`. This results in a log error and skipped detachment, which may block PVC reuse or node cleanup. cause: | The controller attempts to locate the node based on `VolumeAttachment.spec.nodeName`. If this field is empty, the controller's logic skips processing, leading to a failure in detachment flow. This commonly happens when a VolumeAttachment is deleted before node assignment completes. tags: - - ebs - - csi - - aws - - storage - - public + - bug mitigation: | - Upgrade to aws-ebs-csi-driver v1.26.1 or later. 
- Avoid deleting PVCs or terminating pods immediately after volume provisioning. diff --git a/rules/cre-2025-0027/neutron-ovn-allows-port.yaml b/rules/cre-2025-0027/neutron-ovn-allows-port.yaml index 52c887a..2f65593 100644 --- a/rules/cre-2025-0027/neutron-ovn-allows-port.yaml +++ b/rules/cre-2025-0027/neutron-ovn-allows-port.yaml @@ -3,20 +3,16 @@ rules: id: CRE-2025-0027 severity: 3 title: Neutron Open Virtual Network (OVN) and Virtual Interface (VIF) allows port binding to dead agents, causing VIF plug timeouts - category: networking-problem + pillar: Networking + category: connectivity-timeout author: Prequel description: | In OpenStack deployments using Neutron with the OVN ML2 driver, ports could be bound to agents that were not alive. This behavior led to virtual machines experiencing network interface plug timeouts during provisioning, as the port binding would not complete successfully. cause: | The OVN mechanism driver did not verify the liveness of agents before binding ports. Consequently, ports could be bound to non-responsive agents, resulting in failures during the virtual interface (VIF) plug process. 
tags: - - neutron - - ovn - - timeout - - networking - - openstack - - known-issue - - public + - node-unresponsive + - bug mitigation: | - Upgrade Neutron to a version that includes the fix for this issue: - Master branch: commit `8a55f091925fd5e6742fb92783c524450843f5a0`\n diff --git a/rules/cre-2025-0029/loki-fails-to-retrieve-aws.yaml b/rules/cre-2025-0029/loki-fails-to-retrieve-aws.yaml index 9335c52..5268f87 100644 --- a/rules/cre-2025-0029/loki-fails-to-retrieve-aws.yaml +++ b/rules/cre-2025-0029/loki-fails-to-retrieve-aws.yaml @@ -3,21 +3,16 @@ rules: id: CRE-2025-0029 severity: 3 title: Loki fails to retrieve AWS credentials when specifying S3 endpoint with IRSA - category: storage-problem + pillar: Security + category: authorization-violation author: Prequel description: | - When deploying Grafana Loki with AWS S3 as the storage backend and specifying a custom S3 endpoint (e.g., for FIPS compliance or GovCloud regions), Loki may fail to retrieve AWS credentials via IAM Roles for Service Accounts (IRSA). This results in errors during startup or when attempting to upload index tables, preventing Loki from functioning correctly. cause: | - The issue arises when the Loki configuration includes a custom `endpoint` for S3 and relies on IRSA for authentication. In such cases, Loki encounters a `WebIdentityErr` and `SerializationError` due to improper handling of credential retrieval with the specified endpoint. tags: - - loki - - s3 - - aws - - irsa - - storage - - authentication - - helm - - public + - permission-denied + - misconfiguration mitigation: | - In your Helm chart values, explicitly set `accessKeyId` and `secretAccessKey` to `null` to prevent default values from interfering with IRSA authentication. 
references: diff --git a/rules/cre-2025-0034/datadog-agent-disabled-due-to.yaml b/rules/cre-2025-0034/datadog-agent-disabled-due-to.yaml index 85b00b3..895565c 100644 --- a/rules/cre-2025-0034/datadog-agent-disabled-due-to.yaml +++ b/rules/cre-2025-0034/datadog-agent-disabled-due-to.yaml @@ -3,21 +3,15 @@ rules: id: CRE-2025-0034 severity: 2 title: Datadog agent disabled due to missing API key - category: observability-problem + pillar: Application + category: configuration-error author: Prequel description: | If the Datadog agent or client libraries do not detect a configured API key, they will skip sending metrics, logs, and events. This results in a silent failure of observability reporting, often visible only through startup log messages. cause: | The environment variable `DD_API_KEY` was not set, or was set to an empty value in the container or application environment. As a result, the Datadog agent or SDK initializes in a no-op mode and skips exporting telemetry. tags: - - datadog - - configuration - - api-key - - observability - - environment - - telemetry - - known-issue - - public + - misconfiguration mitigation: | - Ensure that the `DD_API_KEY` environment variable is present and correctly populated in the container or deployment spec. - Use Kubernetes secrets or external secret stores to safely inject credentials into the runtime environment. 
diff --git a/rules/cre-2025-0036/opentelemetry-collector-drops.yaml b/rules/cre-2025-0036/opentelemetry-collector-drops.yaml index 25292eb..fe7ee9f 100644 --- a/rules/cre-2025-0036/opentelemetry-collector-drops.yaml +++ b/rules/cre-2025-0036/opentelemetry-collector-drops.yaml @@ -3,22 +3,15 @@ rules: id: CRE-2025-0036 severity: 3 title: OpenTelemetry Collector drops data due to 413 Payload Too Large from exporter target - category: observability-problem + pillar: Application + category: configuration-error author: Prequel description: | The OpenTelemetry Collector may drop telemetry data when an exporter backend responds with a 413 Payload Too Large error. This typically happens when large batches of metrics, logs, or traces exceed the maximum payload size accepted by the backend. By default, the collector drops these payloads unless retry behavior is explicitly enabled. cause: | The backend server (e.g., metrics platform, observability vendor) returned an HTTP 413 status, rejecting the payload as too large. The exporter component does not automatically retry unless `retry_on_failure` is enabled. As a result, the data is dropped permanently. tags: - - otel-collector - - exporter - - payload - - batch - - drop - - observability - - telemetry - - known-issue - - public + - invalid-payload mitigation: | - Enable `retry_on_failure` in the relevant exporter config to allow retry logic. - Reduce batch size via `sending_queue` settings or exporter-specific `timeout`/`flush_interval` configurations. 
diff --git a/rules/cre-2025-0037/opentelemetry-collector-panic.yaml b/rules/cre-2025-0037/opentelemetry-collector-panic.yaml index 3250f53..9f75bb7 100644 --- a/rules/cre-2025-0037/opentelemetry-collector-panic.yaml +++ b/rules/cre-2025-0037/opentelemetry-collector-panic.yaml @@ -3,23 +3,16 @@ rules: id: CRE-2025-0037 severity: 3 title: OpenTelemetry Collector panics on nil attribute value in Prometheus Remote Write translator - category: observability-problem + pillar: Application + category: process-crash author: Prequel description: | The OpenTelemetry Collector can panic due to a nil pointer dereference in the Prometheus Remote Write exporter. The issue occurs when attribute values are assumed to be strings, but the internal representation is nil or incompatible, leading to a runtime `SIGSEGV` segmentation fault and crashing the collector. cause: | The Prometheus Remote Write translator (`createAttributes`) iterates over attribute maps using `.Range` and directly calls `.AsString()` on a `pcommon.Value` without checking its type or for nil values. If the internal protobuf-backed `AnyValue` is unset or incompatible, it triggers a Go panic. tags: - - crash - - prometheus - - otel-collector - - exporter - - panic - - translation - - attribute - - nil-pointer - - known-issue - - public + - nil-pointer-dereference + - bug mitigation: | - Upgrade to a release of `opentelemetry-collector-contrib` after v0.115.0 if available. - Patch your local copy of `createAttributes()` to check `value.Type()` before calling `.AsString()`. 
diff --git a/rules/tags/categories.yaml b/rules/tags/categories.yaml index d352f8e..e3f41be 100644 --- a/rules/tags/categories.yaml +++ b/rules/tags/categories.yaml @@ -3,192 +3,80 @@ metadata: id: xxtdmttg55Pg844dRw8Ti6 gen: 1 categories: - - name: api-problem - displayName: API Problems - description: Problems related to well-known external APIs - - name: message-queue-problem - displayName: Message Queue Problems - description: Problems related to message queues, like Kafka, RabbitMQ, NATS and others - - name: asynchronous-task-problem - displayName: Asynchronous Task Problems - description: Problems related to asynchronous tasks, like Celery, Sidekiq, and others - - name: database-problem - displayName: Database Problems - description: Problems related to databases, like MySQL, PostgreSQL, MongoDB, and others - - name: proxy-problems - displayName: Proxy Problems - description: Problems related to proxies, like NGINX, HAProxy, and others - - name: disaster-recovery-problems - displayName: Disaster Recovery Problems - description: Problems related to disaster recovery, like backup and restore, and failover - - name: memory-problem - displayName: Memory Problems - description: Problems related to memory, like out-of-memory crashes, memory leaks, and memory pressure - - name: scaling-problem - displayName: Scaling Problem - description: Problems related to scaling, like autoscaling, scale-up, and scale-down events - - name: noisy-neighbor-problem - displayName: Noisy Neighbor Problems - description: Problems related to noisy neighbors in shared infrastructure - - name: library-problem - displayName: Library Problems - description: Problems related to libraries - - name: cloud-provider-problem - displayName: Cloud Provider Problems - description: Problems related to cloud providers, like AWS, GCP, Azure, and others - - name: unhealthy-event - displayName: Unhealthy Kubernetes Events - description: Problems related to unhealthy events, like probe failures and others 
- - name: observability-problem - displayName: Observability Problems - description: Problems related to observability, like monitoring, logging, and tracing - - name: operator-problem - displayName: Operator Problems - description: Problems related to operators - - name: networking-problem - displayName: Networking Problems - description: Connectivity, DNS, or routing issues affecting system communication. - - name: runtime-problem - displayName: Runtime - description: Problems within runtime environments such as Python, Java, or Node.js. - - name: storage-problem - displayName: Storage - description: Disk, object storage, or volume-related issues that impact data availability. - - name: orm-problem - displayName: Orm - description: Object Relational Mapper issue that impacts data availability. - - name: cache-problem - displayName: Cache Problems - description: Cache related problems - - name: framework-problem - displayName: Framework Problems - description: Problems in frameworks such as Django - - name: distributed-messaging-connectivity - displayName: Distributed Messaging Connectivity Issues - description: Failures in distributed messaging systems where message brokers, consumers, or producers lose connectivity or coordination, leading to processing halts or data loss - - name: workflow-orchestration-connectivity - displayName: Workflow Orchestration Connectivity - description: Connection failures between workflow orchestration components like Temporal workers and servers - - name: authorization-problem - displayName: Authorization Problems - description: Problems related to authorization - - name: insecure-configuration - displayName: Insecure Configuration - description: Problems related to insecure configuration - - name: message-queue-performance - displayName: Message Queue Performance - description: Problems related to message queue performance - - name: kubernetes-api-problems - displayName: Kubernetes API Problem - description: Problems related to 
the Kubernetes API - - name: web-server-problems - displayName: Web Server Problem - description: Problems related to web servers - - name: kubernetes-problem - displayName: Kubernetes Problems - description: Problems related to Kubernetes - - name: load-balancer-problems - displayName: Load Balancer Problems - description: Problems related to load balancers - - name: proxy-timeout-problem - displayName: Proxy Timeout Problems - description: Problems related to proxy timeouts - - name: web-server-problem - displayName: Web Server Problems - description: Problems related to web servers - - name: configuration-problem - displayName: Configuration Problem - description: Problems related to system or application configurations - - name: monitoring-problem - displayName: Monitoring Problem - description: Problems related to system or application monitoring - - name: noise-problem - displayName: Noise Problem - description: Problems related to unwanted or irrelevant logs and alerts - - name: service-mesh-problem - displayName: Service Mesh Problem - description: Problems related to monitoring within a service mesh architecture - - name: incompatibility-problem - displayName: Incompatibility Problem - description: Problems due to incompatible components or versions. - - name: instability-problem - displayName: Instability Problem - description: Issues causing system instability or crashes. - - name: service-mesh-problems - displayName: Service Mesh Problems - description: Problems specific to monitoring within a service mesh. - - name: logging-problem - displayName: Logging Problems - description: Issues related to logging mechanisms and processes. - - name: provisioning-problem - displayName: Provisioning Problems - description: Issues related to the provisioning of resources and infrastructure. - - name: stability-problem - displayName: Stability Problems - description: Issues that affect the stability and uptime of systems and services. 
- - name: task-management-problem - displayName: Task Management Problems - description: Issues related to the management and execution of tasks and workflows. - - name: ubuntu-desktop-problem - displayName: Ubuntu Desktop Problems - description: "Problems related to Ubuntu Desktop" - - name: hpc-database-problem - displayName: HPC Database Problems - description: Database issues specific to high-performance computing systems like SLURM - - name: in-memory-database-problem - displayName: In-Memory Database Problems - description: Problems specific to in-memory data stores (e.g. Redis, Memcached) - - name: postgres-ha - displayName: PostgreSQL High Availability - description: High-severity problems related to PostgreSQL in high-availability (HA) clusters, including replication, failover, WAL streaming, and HA controller outages. - - name: kubernetes-storage-problems - displayName: Kubernetes Storage Problems - description: Problems related to container storage in Kubernetes - - name: demo-problems - displayName: Demo Problems - description: This is a category for demos - - name: redpanda-high-availability - displayName: Redpanda High Availability - description: High-severity issues related to quorum, leader election, Raft consensus, and node isolation in Redpanda clusters - - name: distributed-worker-connectivity - displayName: Distributed Worker Connectivity Issues - description: Failures where a distributed systems worker fails to reach or stay connected to the orchestration backend (e.g., Temporal, Celery). - - - name: redpanda-problems - displayName: Redpanda Problems - description: Problems related to Redpanda cluster failures, including node loss, quorum loss, and data availability impact. - - - name: authorization-systems - displayName: Authorization Systems - description: | - Failures in systems that manage access control, identity, or permissions. 
- This includes tools like SpiceDB, OPA, or Auth0 where schema, policy, or - integration issues can block authentication or authorization flows. + # --- Pillar: Application --- + - name: process-crash + pillar: Application + displayName: Process Crash + description: A process has terminated unexpectedly due to an unhandled exception, panic, or segmentation fault. + - name: task-processing-stall + pillar: Application + displayName: Task Processing Stall + description: A worker or task-based system stops processing new items from its queue, often without crashing. + - name: configuration-error + pillar: Application + displayName: Configuration Error + description: The system fails due to an invalid, missing, or improperly structured setting in its configuration. + - name: version-incompatibility + pillar: Application + displayName: Version Incompatibility + description: The system fails because two or more of its components have conflicting or unsupported versions. + # --- Pillar: Infrastructure --- + - name: resource-exhaustion-memory + pillar: Infrastructure + displayName: Memory Exhaustion + description: The system fails because it has run out of available memory (RAM), triggering OOM errors or high-watermark alarms. + - name: resource-exhaustion-disk + pillar: Infrastructure + displayName: Disk Exhaustion + description: A storage volume has no space left, preventing write operations, logging, or persistence. + - name: resource-exhaustion-cpu + pillar: Infrastructure + displayName: CPU Starvation + description: The system's performance degrades or fails due to one or more processes consuming all available CPU resources. + - name: resource-exhaustion-connections + pillar: Infrastructure + displayName: Connection Pool Exhaustion + description: The system cannot accept new connections because it has reached the maximum limit of concurrent connections or file handles. 
- - name: SpiceDB-datastore-failure - displayName: SpiceDB Datastore Failure - description: Failures in the datastore used by SpiceDB, which can lead to authentication or authorization issues. + # --- Pillar: Data --- + - name: data-corruption + pillar: Data + displayName: Data Corruption + description: Stored data has been damaged, is unreadable, or violates its expected format or schema. + - name: data-replication-failure + pillar: Data + displayName: Data Replication Failure + description: The process of copying data between nodes in a distributed system has failed, leading to state inconsistency. + - name: consensus-failure + pillar: Data + displayName: Consensus Failure + description: A distributed system has lost quorum or cannot agree on state, preventing new operations and causing leader election failures. + - name: persistence-failure + pillar: Data + displayName: Persistence Failure + description: The system fails to correctly save its state to durable storage, such as a failed database snapshot or transaction log write error. 
- - name: authorization-system-problem - displayName: Authorization System Problems - description: Problems related to authorization and permission systems like SpiceDB, OPA, and similar fine-grained access control systems - - name: database-corruption-problem - displayName: Database Corruption Problems - description: Problems related to database corruption, including missing tables, schema issues, and data integrity failures - - name: access-control-problem - displayName: Access Control Problems - description: Problems related to access control systems that prevent legitimate access due to system failures rather than policy violations - - name: temporal-problem - displayName: Temporal Server Failure - description: Temporal Server Failure Temporal Server Fails Persistence on Read-Only Database - - - - name: data-streaming-platforms - displayName: Data Streaming Platforms - description: | - Failures in distributed streaming data platforms used for real-time event processing. - This includes platforms like Redpanda, Apache Kafka, Pulsar, and compatible systems - where startup, configuration, or operational issues can disrupt data streaming pipelines - and impact downstream applications relying on event-driven architectures. + # --- Pillar: Networking --- + - name: connectivity-dns-failure + pillar: Networking + displayName: DNS Resolution Failure + description: The system fails because it cannot resolve a hostname to an IP address. + - name: connectivity-refused + pillar: Networking + displayName: Connection Refused + description: A network connection attempt was actively rejected, often because no process was listening on the target port. + - name: connectivity-timeout + pillar: Networking + displayName: Connection Timeout + description: A network operation failed because a response was not received within a configured time limit. 
+ # --- Pillar: Security --- + - name: authorization-violation + pillar: Security + displayName: Authorization Violation + description: An action is blocked because the client lacks the necessary permissions, credentials, or access rights. + - name: insecure-configuration + pillar: Security + displayName: Insecure Configuration + description: The system is configured in a way that exposes it to security risks, such as using weak ciphers or allowing untrusted hosts. diff --git a/rules/tags/tags.yaml b/rules/tags/tags.yaml index b66204a..a066b8a 100644 --- a/rules/tags/tags.yaml +++ b/rules/tags/tags.yaml @@ -3,702 +3,51 @@ metadata: id: 9av5xggdtmVKNLUcPyGmN8 gen: 1 tags: - - name: aws - displayName: AWS - description: Amazon Web Services - - name: gcp - displayName: GCP - description: Google Cloud Platform - - name: azure - displayName: Azure - description: Microsoft Azure - - name: k8s - displayName: K8s - description: Kubernetes - - name: ecr - displayName: ECR - description: Elastic Container Registry - - name: threshold-exceeded - displayName: Threshold Exceeded - description: An external API limit has been exceeded and API requests are currently failing or being throttled - - name: known-problem - displayName: Known Problem - description: This is a documented known problem with known mitigations - - name: gunicorn - displayName: Gunicorn - description: Problems with Python Gunicorn - - name: kafka - displayName: Kafka - description: Problems with Apache Kafka - - name: known-anti-pattern - displayName: Known Anti-Pattern - description: This is a known anti-pattern that should be avoided - - name: postgres - displayName: PostgreSQL - description: Problems with PostgreSQL - - name: gke - displayName: GKE - description: Google Kubernetes Engine - - name: velero - displayName: Velero - description: Problems with Velero - - name: vmware - displayName: VMware - description: Problems with VMware - - name: karpenter - displayName: Karpenter - description: Problems 
with Karpenter - - name: eks - displayName: EKS - description: Amazon Elastic Kubernetes Service - - name: beta - displayName: Beta - description: Beta rules are experimental and may change in the future - - name: crash - displayName: Crash - description: Problems with applications crashing - - name: rabbitmq - displayName: RabbitMQ - description: Problems with RabbitMQ - - name: segfault - displayName: Segfault - description: Problems with applications segfaulting - - name: celery - displayName: Celery - description: Problems with Celery - - name: errors - displayName: Errors - description: Problems with application errors - - name: loki - displayName: Loki - description: Problems with Grafana Loki - - name: misconfiguration - displayName: Misconfiguration - description: Problems with misconfigurations - - name: keda - displayName: KEDA - description: Problems with KEDA Operator - - name: openstack - displayName: Openstack - description: Problems specific to OpenStack infrastructure components and deployments. - - name: opentelemetry - displayName: Opentelemetry - description: Errors or gaps in tracing and metrics collection using OpenTelemetry libraries. - - name: operational error - displayName: Operational error - description: A runtime issue caused by system-level factors like resource limits or connectivity. - - name: otel-collector - displayName: Otel Collector - description: Failures in OpenTelemetry Collector pipelines or exporters. - - name: ovn - displayName: Ovn - description: Issues in Open Virtual Network components used with SDN setups. - - name: ovsdb - displayName: Ovsdb - description: Failures involving the OVSDB (Open vSwitch Database) protocol or schema. - - name: panic - displayName: Panic - description: Crashes due to unrecoverable errors, especially in Go or Rust applications. - - name: password - displayName: Password - description: Problems with password policies, validation, or storage. 
- - name: plugin - displayName: Plugin - description: Failures or misbehavior in third-party or custom plugin systems. - - name: port-binding - displayName: Port Binding - description: Conflicts or failures when applications attempt to bind to ports. - - name: prometheus - displayName: Prometheus - description: Problems with scraping, rule evaluation, or querying Prometheus data. - - name: connection-refused-startup - displayName: Connection Refused on Startup - description: Failures that occur when a service (e.g., Temporal worker) tries to connect to a backend and receives a connection refused error. - - name: redpanda - displayName: Redpanda - description: Issues related to Redpanda streaming data platform - - name: consumer-groups - displayName: Consumer Groups - description: Kafka/Redpanda consumer group coordination failures - - name: coordinator-failure - displayName: Coordinator Failure - description: Failures in distributed system coordinators (group coordinators, cluster coordinators) - - name: mass-disconnect - displayName: Mass Disconnect - description: Scenarios where many clients/consumers disconnect simultaneously - - name: kafka-compatibility - displayName: Kafka Compatibility - description: Issues related to Kafka protocol compatibility in Redpanda - - name: message-processing-halt - displayName: Message Processing Halt - description: Complete stoppage of message processing in streaming systems - - name: psycopg2 - displayName: Psycopg2 - description: Python client errors related to connecting or querying PostgreSQL using psycopg2. - - name: python - displayName: Python - description: General Python runtime errors or stack traces. - - name: redis - displayName: Redis - description: Issues involving Redis availability, eviction policies, or timeouts. - - name: redis-cli - displayName: Redis CLI - description: Problems with the Redis command-line interface, such as connection issues, command errors or rejections. 
- - name: redis-py - displayName: Redis Py - description: Errors with the `redis-py` client library in Python. - - name: retry - displayName: Retry - description: Logic or policy failures when retrying failed operations. - - name: rownotfound - displayName: Rownotfound - description: Database query errors indicating expected data was not found. - - name: s3 - displayName: S3 - description: Errors related to object access, buckets, or permissions in Amazon S3. - - name: security - displayName: Security - description: Misconfigurations or vulnerabilities in authentication, authorization, or encryption. - - name: service - displayName: Service - description: Failures at the service or API layer of an application. - - name: signature - displayName: Signature - description: Problems with signing or verifying cryptographic signatures. - - name: sqlalchemy - displayName: Sqlalchemy - description: Errors in SQLAlchemy ORM usage, session handling, or migrations. - - name: ssl - displayName: Ssl - description: SSL/TLS handshake errors or expired/invalid certificates. - - name: storage - displayName: Storage - description: Failures in block, object, or ephemeral storage backends. - - name: telemetry - displayName: Telemetry - description: Issues with emitting, collecting, or transforming observability data. - - name: threads - displayName: Threads - description: Race conditions, deadlocks, or errors in multithreaded environments. - - name: timeout - displayName: Timeout - description: Operations that exceeded their allotted execution window. - - name: transaction - displayName: Transaction - description: Database or service transaction failures due to commits or rollbacks. - - name: translation - displayName: Translation - description: Errors in i18n/l10n string resolution or missing language assets. - - name: uri - displayName: Uri - description: Malformed or invalid Uniform Resource Identifier usage. 
- - name: validation - displayName: Validation - description: Input or schema validation failures in form submissions or APIs. - - name: vif - displayName: Vif - description: Virtual interface (VIF) creation or binding errors in cloud networking. - - name: web - displayName: Web - description: Browser-facing issues in HTTP, HTML, or frontend integration layers. - - name: ebs - displayName: ebs - description: Problems with Amazon EBS (Elastic Block Store). - - name: csi - displayName: csi - description: Container Storage Interface (CSI) - - name: api-key - displayName: Api Key - description: Problems related to API keys, such as missing, invalid, or expired credentials - - name: async - displayName: Async - description: Problems related to asynchronous execution, such as hung tasks, race conditions, or callback errors - - name: attribute - displayName: Attribute - description: Problems related to missing or unexpected object attributes, causing attribute access failures - - name: attributeerror - displayName: Attributeerror - description: Problems where code fails due to attribute lookup errors, such as missing attributes on objects - - name: authentication - displayName: Authentication - description: Problems related to user or service authentication, such as invalid tokens or failed logins - - name: backpressure - displayName: Backpressure - description: Problems where producers overwhelm consumers, causing resource exhaustion or unhandled pressure - - name: batch - displayName: Batch - description: Problems related to batch processing, such as job failures, incorrect batch sizing, or order issues - - name: cache - displayName: Cache - description: Problems related to caching mechanisms, including stale data, cache misses, or eviction faults - - name: configuration - displayName: Configuration - description: Problems caused by incorrect or missing configuration settings - - name: connection - displayName: Connection - description: Problems related to network 
connections, such as timeouts, refusals, or resets - - name: context - displayName: Context - description: Problems related to context propagation, such as lost, overwritten, or mismatched context values - - name: contextvars - displayName: Contextvars - description: Problems specifically with Python context variables, such as improper isolation or missing context - - name: data-loss - displayName: Data Loss - description: Problems where data is lost or dropped due to system failures or processing errors - - name: datadog - displayName: Datadog - description: Problems related to Datadog integration, such as missing metrics, reporting failures, or misconfigurations + - name: 404-not-found + displayName: 404 Not Found + - name: 502-bad-gateway + displayName: 502 Bad Gateway + - name: back-pressure + displayName: Back Pressure + - name: data-unavailable + displayName: Data Unavailable - name: deadlock displayName: Deadlock - description: Problems where threads or processes enter deadlock, preventing further progress - - name: disallowedhost - displayName: Disallowedhost - description: Problems where incoming requests are blocked due to disallowed Host header settings - - name: django - displayName: Django - description: Problems related to the Django framework, such as view errors, middleware faults, or misconfigurations - - name: drop - displayName: Drop - description: Problems where messages or data are unexpectedly dropped or discarded - - name: environment - displayName: Environment - description: Problems related to environment variables or runtime environment settings - - name: escaping - displayName: Escaping - description: Problems related to improper escaping of strings or data, leading to injection or parsing issues - - name: exporter - displayName: Exporter - description: Problems related to metric exporters, such as missing, malformed, or unreported metric data - - name: fork - displayName: Fork - description: Problems related to process forking, such as 
unsafe forks or resource duplication - - name: grafana - displayName: Grafana - description: Problems related to Grafana services, that may impact performance, or telemetry collection and storage - - name: helm - displayName: Helm - description: Problems related to Helm deployments, such as chart rendering failures or template errors - - name: host-header - displayName: Host Header - description: Problems due to incorrect or malicious Host header values - - name: infrastructure - displayName: Infrastructure - description: Problems at the infrastructure level, such as resource outages or provisioning failures - - name: instrumentation - displayName: Instrumentation - description: Problems related to instrumentation code, such as missing spans, broken traces, or metric gaps - - name: irsa - displayName: Irsa - description: Problems related to IAM Roles for Service Accounts (IRSA), such as permission denials or misbindings - - name: known-issue - displayName: Known Issue - description: Problems already identified and documented as known issues - - name: kubernetes - displayName: Kubernetes - description: Problems related to Kubernetes, such as pod failures, API errors, or scheduling issues - - name: load-balancer - displayName: Load Balancer - description: Problems related to load balancers, such as misrouting, unhealthy backends, or configuration faults - - name: logical-switch - displayName: Logical Switch - description: Problems related to logical switch configurations in virtual networking - - name: memcached - displayName: Memcached - description: Problems related to Memcached, such as cache evictions, connection errors, or stale entries - - name: memory - displayName: Memory - description: Problems related to memory usage, such as leaks, pressure, or out-of-memory crashes - - name: memory-pressure - displayName: Memory Pressure - description: Problems where applications or services experience high memory usage, leading to performance degradation or crashes - - 
name: metrics - displayName: Metrics - description: Problems related to metrics collection or reporting, such as missing, delayed, or incorrect data - - name: multiprocessing - displayName: Multiprocessing - description: Problems related to multiprocessing, such as process spawning failures or inter-process communication issues - - name: network - displayName: Network - description: Problems related to network communication, such as packet loss, latency spikes, or unreachable hosts - - name: networking - displayName: Networking - description: Problems within networking components, such as interface misconfigurations or routing errors - - name: neutron - displayName: Neutron - description: Problems related to OpenStack Neutron, such as network provisioning or connectivity failures - - name: nil-pointer - displayName: Nil Pointer - description: Problems where code dereferences nil pointers, causing runtime crashes - - name: observability - displayName: Observability - description: Problems in observability tooling, such as unintended performance impact or missing telemetry - - name: payload - displayName: Payload - description: Problems related to message payloads, such as malformed data or size limit violations - - name: nats - displayName: NATS - description: Problems related to NATS, such as authorization failures, message loss, or configuration issues - - name: authorization - displayName: Authorization - description: Problems related to authorization, such as missing or invalid credentials, or misconfigurations - - name: nginx - displayName: Nginx - description: Problems related to Nginx, such as weak ciphers, configuration errors, or performance issues - - name: tls - displayName: TLS - description: Problems related to TLS, such as weak ciphers, configuration errors, or performance issues - - name: weak-ciphers - displayName: Weak Ciphers - description: Problems related to weak ciphers, such as RC4, DES, or MD5 - - name: alloy - displayName: Alloy - 
description: Problems related to Grafana alloy, such as Loki fanout crashes, or entries too far behind. - - name: public - displayName: Public - description: Open source CREs contributed by the problem detection community - - name: kubelet - displayName: Kubelet - description: Problems related to Kubelet, such as node not ready, or pod failures - - name: dns - displayName: DNS - description: Problems related to DNS, such as hostname resolution failures, or DNS server misconfigurations - - name: api-server - displayName: API Server - description: Problems involving the Kubernetes API server, such as request failures, unavailability, or authentication errors - - name: throttling - displayName: Throttling - description: Problems where requests are delayed or dropped due to client-side or server-side rate limits or resource contention - - name: performance - displayName: Performance - description: Issues that impact system responsiveness or efficiency, such as latency, CPU/memory bottlenecks, or slow processing - - name: rate-limiting - displayName: Rate Limiting - description: Problems where systems enforce limits on request rates, often resulting in 429 errors or degraded service behavior - - name: upstream-failure - displayName: Upstream Failure - description: Problems where Nginx cannot successfully forward requests to backend services - - name: connection-refused - displayName: Connection Refused - description: Problems where a connection attempt is rejected by the target server - - name: buffer - displayName: Buffer - description: Problems related to buffering - - name: capacity-issue - displayName: Capacity Issue - description: Problems related to system capacity - - name: connectivity - displayName: Connectivity - description: Problems related to network connectivity - - name: header-size - displayName: Header Size - description: Problems related to the size of headers - - name: upload-limits - displayName: Upload Limits - description: Problems related to 
upload size limits - - name: proxy - displayName: Proxy - description: Problems related to proxy configurations or usage - - name: request-size - displayName: Request Size - description: Problems related to the size of requests - - name: web-server - displayName: Web Server - description: Problems related to web server configurations or issues - - name: admission-controller - displayName: Admission Controller - description: Problems related to Kubernetes admission controllers - - name: aws-cni - displayName: AWS CNI - description: Problems related to Amazon Web Services VPC Container Network Interface (CNI) plugin - - name: cni - displayName: CNI - description: Problems related to Container Network Interface - - name: disk-monitor - displayName: Disk Monitor - description: Problems related to disk monitoring - - name: disk-full - displayName: Disk Full - description: Problems related to disk full errors, such as insufficient space for writes or data storage - - name: istio - displayName: Istio - description: Problems related to the Istio service mesh - - name: kiali - displayName: Kiali - description: Problems related to the Kiali service mesh monitoring tool - - name: kombu - displayName: Kombu - description: Problems related to the Kombu messaging library - - name: backend-issue - displayName: Backend Issue - description: Problems related to the backend systems or services - - name: cws - displayName: CWS - description: Problems related to Cloud Workload Security + - name: high-cpu-usage + displayName: High CPU Usage + - name: high-latency + displayName: High Latency + - name: invalid-payload + displayName: Invalid Payload - name: log-noise displayName: Log Noise - description: Problems related to unwanted or irrelevant log data - - name: memory-leak - displayName: Memory Leak - description: Problems caused by memory leaks in applications - - name: monitoring - displayName: Monitoring - description: Problems related to system or application monitoring - - name: 
npa - displayName: NPA - description: Problems related to Network Policy Administration - - name: permissions - displayName: Permissions - description: Problems related to user or system permissions - - name: log‑noise - displayName: Log Noise - description: Problems related to excessive or irrelevant log entries that obscure meaningful information. - - name: postgresql - displayName: PostgreSQL - description: Problems related to the PostgreSQL database system. - - name: rds - displayName: RDS - description: Problems related to Amazon Relational Database Service (RDS). - - name: service-mesh - displayName: Service Mesh - description: Problems related to service mesh technologies and implementations. - - name: silent‑failure - displayName: Silent Failure - description: Problems that do not produce visible errors or logs, making them hard to detect. - - name: terraform - displayName: Terraform - description: Problems related to the Terraform infrastructure as code tool. - - name: version‑incompatibility - displayName: Version Incompatibility - description: Problems arising from incompatible versions of software components or libraries. - - name: vpc-cni - displayName: VPC CNI - description: Problems related to the VPC CNI (Container Network Interface) plugin. - - name: webhook - displayName: webhook - description: Problems related to webhooks. 
- - name: ubuntu - displayName: Ubuntu - description: Problems related to Ubuntu, such as package updates, or desktop issues - - name: gnome - displayName: Gnome - description: Problems related to Gnome, such as input lag, or performance issues - - name: nvidia - displayName: Nvidia - description: Problems related to Nvidia, such as driver issues, or performance issues - - name: xorg - displayName: Xorg - description: Problems related to Xorg, such as input lag, or performance issues - - name: slurm - displayName: SLURM - description: Problems related to SLURM workload manager - - name: slurmdbd - displayName: SlurmDBD - description: Problems related to SLURM Database Daemon - - name: mysql - displayName: MySQL - description: Problems related to MySQL database - - name: high-availability - displayName: High Availability - description: Problems related to high-availability systems and failover - - name: write-failure - displayName: Write Failure - description: Problems where writes to a database or storage system fail due to insufficient space or other issues - - name: out-of-memory - displayName: Out of Memory - description: Errors due to Redis (or other) exhausting its configured RAM. - - name: persistence - displayName: Persistence - description: Issues around writing data to disk (RDB/AOF) or failing to persist. - - name: rdb - displayName: RDB - description: Redis RDB snapshot errors (e.g. BGSAVE failures). - - name: misconf - displayName: MISCONF - description: Redis "MISCONF" errors (stop-writes due to snapshot or AOF failures). - - name: readonly - displayName: READONLY - description: Errors when writing to a read-only Redis replica. - - name: acl - displayName: ACL - description: Redis ACL (NOPERM) permission-denied events. 
- - name: nfs - displayName: NFS - description: Problems related to NFS (network file systems) - - name: securityContext - displayName: securityContext - description: Problems related to Kubernetes securityContext - - name: broker-failure - displayName: Broker Failure - description: Problems related to Kakfa broker failures - - name: cluster-degradation - displayName: Cluster Degradation - description: Problems related to cluster availability - - name: etcd - displayName: Etcd - description: Issues involving etcd clusters or consensus, especially in HA setups. - - name: patroni - displayName: Patroni - description: Issues related to Patroni high-availability controller for PostgreSQL. - - name: zalando - displayName: Zalando - description: Issues related to the Zalando Postgres Operator for HA Postgres. - - name: ha - displayName: High Availability - description: Problems or incidents involving high-availability clusters, failover, or consensus. - - name: replication - displayName: Replication - description: Replication failures, lag, or divergence in stateful systems. - - name: wal - displayName: WAL - description: Issues with Write-Ahead Logging in databases. - - name: quorum - displayName: Quorum - description: Loss or degradation of cluster quorum in distributed systems. 
- - name: load-balancer-problem - displayName: Load Balancer Problem - description: Problems related to load balancers, such as misrouting, unhealthy backends, or configuration faults - - name: reverse-proxy - displayName: Reverse Proxy - description: Problems related to reverse proxy configurations or issues - - name: service-outage - displayName: Service Outage - description: Problems related to service outages, such as complete service unavailability or critical failures - - name: cascading-failure - displayName: Cascading Failure - description: Problems related to cascading failures, where one failure leads to multiple dependent failures - - name: demo-problem - displayName: Demo Problem - description: This is a tag for demo problems - - name: workflow-orchestration - displayName: Workflow Orchestration - description: Problems related to workfow orchestration - - name: temporal - displayName: Temporal - description: Problems related to Temporal - - name: worker - displayName: Worker Problems - description: Problems related to process workers - - name: raft - displayName: Raft - description: Issues related to Raft consensus protocol, elections, and leader changes - - name: leader-election - displayName: Leader Election - description: Events and errors related to leader election processes in distributed systems - - name: grpc - displayName: gRPC - description: Problems related to gRPC - - name: streaming-data - displayName: Streaming Data - description: Problems related to streaming data platforms and systems - - name: cluster-failure - displayName: Cluster Failure - description: Problems related to cluster failures, including node loss, quorum loss, and data availability impact - - name: node-down - displayName: Node Down - description: Problems related to nodes going down in a cluster, impacting availability and performance + - name: nil-pointer-dereference + displayName: Nil Pointer Dereference + - name: node-unresponsive + displayName: Node Unresponsive + - 
name: oom-error + displayName: Out-of-Memory Error + - name: permission-denied + displayName: Permission Denied - name: quorum-loss displayName: Quorum Loss - description: Problems related to loss of quorum in distributed systems, impacting consensus and availability - - name: data-availability - displayName: Data Availability - description: Problems related to data availability in distributed systems, such as loss of access to critical data - - name: rpc - displayName: RPC - description: Remote Procedure Call errors or connectivity issues (includes timeouts, client-request failures, handler-not-found, etc.). - - name: migration-failure - displayName: Migration Failure - description: Errors caused by failed or skipped database migrations - name: schema-error displayName: Schema Error - description: Missing or corrupted database schema elements such as tables or columns - - name: spicedb - displayName: SpiceDB - description: Problems related to SpiceDB authorization service, including schema corruption, permission failures, and database connectivity issues - - name: fine-grained-access-control - displayName: Fine-Grained Access Control - description: Problems related to fine-grained authorization and permission systems that manage detailed access policies - - name: database-corruption - displayName: Database Corruption - description: Problems where database tables, schemas, or data become corrupted, leading to missing relations or inaccessible data - - name: permission-failure - displayName: Permission Failure - description: Problems where authorization checks fail due to system issues rather than legitimate access denials - - name: schema-corruption - displayName: Schema Corruption - description: Problems where database schemas become corrupted or missing, preventing normal operations - - name: relation-missing - displayName: Relation Missing - description: Database errors where expected tables or relations do not exist, typically due to corruption or migration 
failures - - name: logs - displayName: Logs - description: Problems with log processing - - name: redpanda-startup - displayName: Redpanda Startup - description: Problems related to Redpanda startup, such as configuration issues or missing state - - name: redpanda-state-missing - displayName: Redpanda State Missing - description: Problems related to missing state in Redpanda, such as missing offsets or partitions - - name: snapshot - displayName: Snapshot - description: Problems related to snapshotting in Redpanda, such as missing or incomplete snapshots - - name: distributed-system - displayName: Distributed System - description: Problems specific to distributed systems, including coordination, consistency, and network partition issues - - name: datastore - displayName: Datastore - description: Problems with data storage systems, such as databases or object stores - - name: startup-failure - displayName: Startup Failure - description: Problems related to application or service startup failures, such as missing dependencies or configuration errors - - name: container-crash - displayName: Container Crash - description: Failures causing container crashes or unexpected terminations. - - name: memory-exhaustion - displayName: Memory Exhaustion - description: Failures due to running out of memory or excessive memory consumption. - - name: configuration-failure - displayName: Configuration Failure - description: Problems caused by incorrect or invalid configuration settings. - - name: streaming-platform - displayName: Streaming Platform - description: Issues related to distributed streaming platforms and their operations. - - name: kafka-compatible - displayName: Kafka Compatible - description: Problems affecting Kafka-compatible systems or APIs, impacting interoperability. - - name: permission-denied - displayName: Permission Denied - description: Failures caused by insufficient access rights or permission errors. 
- - name: sigkill - displayName: SIGKILL - description: Failures caused by processes being terminated with a SIGKILL signal. - - name: jetstream - displayName: JetStream - description: NATS JetStream persistence & streaming subsystem issues. - - name: ack-deadlock - displayName: Ack Deadlock - description: Deadlocks caused by unacknowledged messages or backpressure in JetStream acks. - - name: unsynced-replica - displayName: Unsynced Replica - description: JetStream replicas that fail to synchronize state with the leader after restart or failover. - - name: connection-exhaustion - displayName: Connection Exhaustion - description: Problems where systems reach their maximum connection limits, preventing new connections and causing service degradation - - name: connection-limit - displayName: Connection Limit - description: Issues related to connection limits being reached or exceeded in messaging systems, databases, or network services - - name: messaging-failure - displayName: Messaging Failure - description: Failures in messaging infrastructure that prevent or disrupt message delivery between services - - name: infrastructure-failure - displayName: Infrastructure Failure - description: Critical failures in core infrastructure components that can cause cascading system outages - - name: scalability-issue - displayName: Scalability Issue - description: Problems that occur when systems cannot scale to meet demand, often exposing resource or design limitations - - name: microservices - displayName: Microservices - description: Problems specific to microservices architectures, including inter-service communication and distributed system challenges - - name: critical-infrastructure - displayName: Critical Infrastructure - description: Issues affecting mission-critical infrastructure components that require immediate attention to prevent widespread outages + - name: silent-failure + displayName: Silent Failure + - name: slow-consumer + displayName: Slow Consumer + - name: 
upstream-failure
+    displayName: Upstream Failure
+  - name: weak-ciphers
+    displayName: Weak Ciphers
+  - name: bug
+    displayName: Software Bug
+  - name: hardware-failure
+    displayName: Hardware Failure
+  - name: misconfiguration
+    displayName: Misconfiguration
+  - name: network-partition
+    displayName: Network Partition
+  - name: race-condition
+    displayName: Race Condition