Commit 1e34924

Merge branch 'main' into k8s-exit-code

2 parents 2937d73 + c414539

File tree: 11 files changed (+329 −50 lines)

README.md

Lines changed: 9 additions & 48 deletions

@@ -60,58 +60,19 @@ New contributors are encouraged to join the problem detection community add new
 
 ## Rule Coverage
 
-### Tags
+### Tags & Categories
 
-* [Tags](rules/tags/tags.yaml)
-* [Categories](rules/tags/categories.yaml)
+CREs are organized by tags and categories.
+
+* [Tags](https://docs.prequel.dev/cres/public?view=tags)
+* [Categories](https://docs.prequel.dev/cres/public?view=categories)
 
 ### Technology Coverage
 
-The table below lists the technologies targeted by the existing CRE rules and the number of rules that describe each technology.
-
-<!-- BEGIN TECHNOLOGY TABLE -->
-| Technology | CRE Count | Documentation |
-|-----------|----------:|---------------|
-| [nginx](https://nginx.org/en/docs/) | 8 | https://nginx.org/en/docs/ |
-| [loki](https://grafana.com/docs/loki/latest/) | 6 | https://grafana.com/docs/loki/latest/ |
-| [otel-collector](https://opentelemetry.io/docs/collector/) | 4 | https://opentelemetry.io/docs/collector/ |
-| [kubernetes](https://kubernetes.io/docs/home/) | 4 | https://kubernetes.io/docs/home/ |
-| [aws](https://aws.amazon.com/) | 4 | https://aws.amazon.com/ |
-| [rabbitmq](https://www.rabbitmq.com/documentation.html) | 4 | https://www.rabbitmq.com/documentation.html |
-| [redis](https://redis.io/docs/) | 4 | https://redis.io/docs/ |
-| [grafana](https://grafana.com/docs/) | 4 | https://grafana.com/docs/ |
-| [ovn](https://www.ovn.org/docs/) | 3 | https://www.ovn.org/docs/ |
-| [datadog](https://docs.datadoghq.com/) | 3 | https://docs.datadoghq.com/ |
-| [neutron](https://docs.openstack.org/neutron/latest/) | 2 | https://docs.openstack.org/neutron/latest/ |
-| [openstack](https://docs.openstack.org/) | 2 | https://docs.openstack.org/ |
-| [keda](https://keda.sh/docs/) | 2 | https://keda.sh/docs/ |
-| [opentelemetry](https://opentelemetry.io/docs/) | 2 | https://opentelemetry.io/docs/ |
-| [postgres](https://www.postgresql.org/docs/) | 2 | https://www.postgresql.org/docs/ |
-| [dns](https://en.wikipedia.org/wiki/Domain_Name_System) | 2 | https://en.wikipedia.org/wiki/Domain_Name_System |
-| [memcached](https://memcached.org/) | 2 | https://memcached.org/ |
-| [prometheus](https://prometheus.io/docs/) | 2 | https://prometheus.io/docs/ |
-| [karpenter](https://karpenter.sh/docs/) | 2 | https://karpenter.sh/docs/ |
-| [cws](https://docs.datadoghq.com/cloud_workload_security/) | 1 | https://docs.datadoghq.com/cloud_workload_security/ |
-| [postgresql](https://www.postgresql.org/docs/) | 1 | https://www.postgresql.org/docs/ |
-| [nfs](https://wiki.linux-nfs.org/wiki/) | 1 | https://wiki.linux-nfs.org/wiki/ |
-| [nvidia](https://docs.nvidia.com/) | 1 | https://docs.nvidia.com/ |
-| [helm](https://helm.sh/docs/) | 1 | https://helm.sh/docs/ |
-| [temporal](https://docs.temporal.io/) | 1 | https://docs.temporal.io/ |
-| [slurm](https://slurm.schedmd.com/documentation.html) | 1 | https://slurm.schedmd.com/documentation.html |
-| [slurmdbd](https://slurm.schedmd.com/slurmdbd.html) | 1 | https://slurm.schedmd.com/slurmdbd.html |
-| [mysql](https://dev.mysql.com/doc/) | 1 | https://dev.mysql.com/doc/ |
-| [redis-cli](https://redis.io/docs/ui/cli/) | 1 | https://redis.io/docs/ui/cli/ |
-| [kubelet](https://kubernetes.io/docs/concepts/architecture/nodes/#kubelet) | 1 | https://kubernetes.io/docs/concepts/architecture/nodes/#kubelet |
-| [redis-py](https://redis-py.readthedocs.io/en/stable/) | 1 | https://redis-py.readthedocs.io/en/stable/ |
-| [spicedb](https://spicedb.dev/) | 1 | https://spicedb.dev/ |
-| [celery](https://docs.celeryq.dev/en/stable/) | 1 | https://docs.celeryq.dev/en/stable/ |
-| [kombu](https://docs.celeryq.dev/projects/kombu/en/stable/) | 1 | https://docs.celeryq.dev/projects/kombu/en/stable/ |
-| [vpc-cni](https://docs.aws.amazon.com/eks/latest/userguide/pod-networking.html) | 1 | https://docs.aws.amazon.com/eks/latest/userguide/pod-networking.html |
-| [csi](https://kubernetes-csi.github.io/docs/) | 1 | https://kubernetes-csi.github.io/docs/ |
-| [terraform](https://developer.hashicorp.com/terraform/docs) | 1 | https://developer.hashicorp.com/terraform/docs |
-| [ovsdb](https://docs.openvswitch.org/en/latest/ref/ovsdb/) | 1 | https://docs.openvswitch.org/en/latest/ref/ovsdb/ |
-| [eks](https://docs.aws.amazon.com/eks/) | 1 | https://docs.aws.amazon.com/eks/ |
-| [gke](https://cloud.google.com/kubernetes-engine/docs/) | 1 | https://cloud.google.com/kubernetes-engine/docs/ |
+CREs exist for both popular and obscure projects.
+
+* [CREs by Technology](https://docs.prequel.dev/cres/public?view=technologies)
 
 ## Join the community!

rules/cre-2025-0102/redpanda-test-error.yaml renamed to rules/cre-2025-0102/redpanda-quorum-error.yaml

Lines changed: 3 additions & 1 deletion

@@ -67,7 +67,9 @@ rules:
       reports: 1
     rule:
       set:
+        window: 10s
        event:
          source: cre.log.redpanda
        match:
-          - regex: 'failure|leaving all raft groups|down|CRITICAL|Multiple nodes unresponsive|Low available memory|health degraded'
+          - value: 'Marking node as down'
+          - value: 'Not enough live replicas to form quorum'
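The change above swaps one broad alternation regex for two exact phrases. A minimal sketch of why this narrows matching (assuming `value` denotes a literal substring match; the sample log lines below are invented for illustration, not taken from Redpanda):

```python
import re

# Invented sample log lines: one benign, two genuine failure signals.
lines = [
    "INFO  rpc - Shutting down connection, node stepping down",
    "WARN  cluster - Marking node as down",
    "ERROR raft - Not enough live replicas to form quorum",
]

# Old pattern: broad alternation; the bare word "down" matches almost anywhere.
old = re.compile(
    r"failure|leaving all raft groups|down|CRITICAL|"
    r"Multiple nodes unresponsive|Low available memory|health degraded"
)

# New patterns: exact phrases, assuming `value` means literal substring match.
new = ["Marking node as down", "Not enough live replicas to form quorum"]

old_hits = [l for l in lines if old.search(l)]
new_hits = [l for l in lines if any(v in l for v in new)]

# The broad regex fires on the benign "stepping down" line but misses the
# quorum error; the exact phrases hit only the two real failure signals.
print(len(old_hits), len(new_hits))
```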

rules/cre-2025-0126/mongodb-primary-election-failure.yaml

Lines changed: 1 addition & 1 deletion

@@ -4,7 +4,7 @@ rules:
       id: 5UD1RZxGC5LJQnVmAkV11B
       gen: 1
     cre:
-      id: CRE-2025-0108
+      id: CRE-2025-0126
      severity: 1
      title: "MongoDB Replica Set Primary Election Failure"
      category: "database-problem"
Lines changed: 70 additions & 0 deletions

@@ -0,0 +1,70 @@
+rules:
+  - metadata:
+      kind: prequel
+      id: SD8xK2mN9pQzYvWr3aLfJ7
+      hash: XpQ9Lm4Zk8TnVb2Ry6HwGs
+    cre:
+      id: CRE-2025-0162
+      severity: 1
+      title: "Stable Diffusion WebUI CUDA Out of Memory Crash"
+      category: "memory-problem"
+      author: Prequel Community
+      description: |
+        Detects critical CUDA out of memory errors in Stable Diffusion WebUI that cause image generation failures and application crashes. This occurs when GPU VRAM is exhausted during model loading or image generation, resulting in complete task failure and potential WebUI instability.
+      cause: |
+        - Insufficient GPU VRAM for requested image resolution or batch size
+        - Memory fragmentation preventing large contiguous allocations
+        - Model loading exceeding available VRAM capacity
+        - Concurrent GPU processes consuming memory
+        - High-resolution image generation without memory optimization flags
+      impact: |
+        - Complete image generation failure
+        - WebUI crash requiring restart
+        - Loss of in-progress generation work
+        - Potential GPU driver instability
+        - Service unavailability for users
+      tags:
+        - memory
+        - nvidia
+        - crash
+        - out-of-memory
+        - configuration
+      mitigation: |
+        IMMEDIATE ACTIONS:
+        - Restart Stable Diffusion WebUI
+        - Clear GPU memory: nvidia-smi --gpu-reset
+        - Add memory optimization flags: --medvram or --lowvram
+        CONFIGURATION FIXES:
+        - For 4-6GB VRAM: Add --medvram to webui-user.bat
+        - For 2-4GB VRAM: Add --lowvram to webui-user.bat
+        - Enable xformers: --xformers for memory efficiency
+        - Add --always-batch-cond-uncond for batch processing
+        RUNTIME ADJUSTMENTS:
+        - Reduce image resolution (512x512 instead of 1024x1024)
+        - Decrease batch size to 1
+        - Lower batch count for multiple generations
+        - Set PYTORCH_CUDA_ALLOC_CONF=garbage_collection_threshold:0.9,max_split_size_mb:512
+        PREVENTION:
+        - Monitor GPU memory usage with nvidia-smi
+        - Implement gradual resolution scaling
+        - Use cloud services for high-resolution generation
+        - Upgrade to GPU with minimum 8GB VRAM
+      references:
+        - https://github.com/AUTOMATIC1111/stable-diffusion-webui/issues/12992
+        - https://github.com/AUTOMATIC1111/stable-diffusion-webui/issues/9770
+        - https://github.com/CompVis/stable-diffusion/issues/39
+      applications:
+        - name: stable-diffusion-webui
+          version: ">=1.0.0"
+      impactScore: 8
+      mitigationScore: 7
+      reports: 15
+    rule:
+      set:
+        window: 120s
+        event:
+          source: cre.log.stable-diffusion
+        match:
+          - regex: 'OutOfMemoryError.*CUDA out of memory'
+          - regex: 'CUDA out of memory.*Tried to allocate'
+          - regex: 'model failed to load.*OutOfMemoryError'
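The rule's three regexes can be sanity-checked against lines modeled on the accompanying test.log. A minimal sketch (sample lines paraphrased from the test log; a line is considered a hit if any of the three patterns matches):

```python
import re

# The three detection patterns from the rule above.
patterns = [
    r"OutOfMemoryError.*CUDA out of memory",
    r"CUDA out of memory.*Tried to allocate",
    r"model failed to load.*OutOfMemoryError",
]

# Sample lines modeled on the accompanying test.log.
lines = [
    "torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.00 GiB",
    "Stable Diffusion model failed to load: OutOfMemoryError",
    "Attempting to clear cache...",
]

# A line is a hit if any pattern matches; the benign cache line should not fire.
matched = [l for l in lines if any(re.search(p, l) for p in patterns)]
print(len(matched))  # 2
```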

rules/cre-2025-0162/test.log

Lines changed: 14 additions & 0 deletions

@@ -0,0 +1,14 @@
+2025-08-29 14:23:45.123 [ERROR] Loading model stable-diffusion-v1.5
+2025-08-29 14:23:47.456 [INFO] Model weights: 4.27 GB
+2025-08-29 14:23:48.789 [INFO] Allocating GPU memory...
+2025-08-29 14:23:49.012 [ERROR] torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.00 GiB (GPU 0; 6.00 GiB total capacity; 4.50 GiB already allocated; 1.20 GiB free; 4.80 GiB reserved in total by PyTorch)
+2025-08-29 14:23:49.013 [ERROR] RuntimeError: CUDA out of memory. Tried to allocate 2.00 GiB. GPU 0 has a total capacity of 6.00 GiB of which 1.20 GiB is free. Process 12345 has 4.50 GiB memory in use.
+2025-08-29 14:23:49.014 [CRITICAL] Stable Diffusion model failed to load: OutOfMemoryError
+2025-08-29 14:23:49.015 [ERROR] CUDA error: out of memory
+2025-08-29 14:23:49.016 [ERROR] GPU 0 has a total capacity of 6.00 GiB of which 1.20 GiB is free. Allocation failed.
+2025-08-29 14:23:49.017 [ERROR] Failed to generate image: CUDA out of memory
+2025-08-29 14:23:49.018 [INFO] Attempting to clear cache...
+2025-08-29 14:23:50.123 [INFO] Cache cleared, retrying...
+2025-08-29 14:23:51.456 [ERROR] torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.50 GiB
+2025-08-29 14:23:51.457 [CRITICAL] Image generation failed after retry
+2025-08-29 14:23:51.458 [ERROR] WebUI shutting down due to memory error
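The "Tried to allocate N GiB" figures in lines like these are machine-extractable, which is useful when tuning `--medvram`/`--lowvram` thresholds. A small sketch over two lines copied from the test.log above:

```python
import re

# Two torch OOM lines copied from the test.log above.
log = (
    "2025-08-29 14:23:49.012 [ERROR] torch.cuda.OutOfMemoryError: CUDA out of memory. "
    "Tried to allocate 2.00 GiB (GPU 0; 6.00 GiB total capacity; 4.50 GiB already allocated; "
    "1.20 GiB free; 4.80 GiB reserved in total by PyTorch)\n"
    "2025-08-29 14:23:51.456 [ERROR] torch.cuda.OutOfMemoryError: CUDA out of memory. "
    "Tried to allocate 1.50 GiB\n"
)

# Pull out each failed allocation size in GiB.
sizes = [float(m) for m in re.findall(r"Tried to allocate ([0-9.]+) GiB", log)]
print(sizes)  # [2.0, 1.5]
```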
Lines changed: 74 additions & 0 deletions

@@ -0,0 +1,74 @@
+rules:
+  - cre:
+      id: CRE-2025-0179
+      severity: 0
+      title: N8N Workflow Silent Data Loss During Execution
+      category: workflow-automation-problem
+      author: Claude Code Assistant
+      description: |
+        N8N workflow automation platform experiences critical silent data loss where items
+        disappear between workflow nodes without generating error messages. This high-severity
+        issue affects long-running workflows (60-115+ minutes) and can cause workflows to
+        randomly cancel mid-execution, leading to incomplete processing and data integrity
+        problems. Items silently vanish between nodes, with different item counts across
+        the workflow pipeline, making the issue particularly dangerous for production systems
+        that rely on complete data processing.
+      cause: |
+        * Workflow execution engine fails to properly track items between nodes in long-running workflows
+        * Memory management issues during extended workflow processing causing item references to be lost
+        * Race conditions in the worker queue system when handling multiple concurrent items
+        * Node-to-node data transfer mechanisms failing silently under certain load conditions
+        * Queue worker timeout or resource contention causing partial item processing without error reporting
+        * Database transaction issues where some items fail to persist between workflow stages
+      tags:
+        - n8n
+        - workflow-automation
+        - data-loss
+        - silent-failure
+        - production-critical
+        - data-integrity
+        - public
+      mitigation: |
+        - **Implement workflow item counting checks** - Add validation nodes between critical
+          processing steps to verify item counts match expected values
+        - **Enable comprehensive execution logging** - Set N8N_LOG_LEVEL to debug and
+          EXECUTIONS_DATA_SAVE_ON_SUCCESS to 'all' to capture detailed execution data
+        - **Add workflow timeout monitoring** - Monitor executions that cancel around the 21-23
+          minute mark and implement retry mechanisms for failed workflows
+        - **Implement data integrity validation** - Add checksum or validation steps at
+          workflow start/end to detect silent data loss
+        - **Use error handling workflows** - Configure error workflows to capture and log
+          execution failures, even when the main workflow fails silently
+        - **Monitor execution metrics** - Set up alerting on workflow completion rates and
+          item processing inconsistencies
+        - **Consider workflow segmentation** - Break long workflows into smaller, more
+          manageable chunks to reduce exposure to the data loss issue
+      references:
+        - https://github.com/n8n-io/n8n/issues/14909
+        - https://docs.n8n.io/flow-logic/error-handling/
+        - https://community.n8n.io/t/workflow-randomly-cancels-mid-execution-without-error-data-items-silently-dropped-between-nodes/51141
+      applications:
+        - name: n8n
+          version: ">= 1.90.0"
+          processName: n8n
+          containerName: n8n
+      impact: |
+        Silent data loss in workflow automation can cause critical business processes to fail
+        without detection, leading to incomplete data processing, missing business transactions,
+        failed integrations, and potential compliance violations. The silent nature makes it
+        extremely difficult to detect and troubleshoot, potentially causing weeks or months
+        of data integrity issues before discovery.
+      impactScore: 9
+      mitigationScore: 7
+    metadata:
+      kind: prequel
+      id: N8nSilentDataLossDetection919
+      gen: 1
+    rule:
+      sequence:
+        window: 120s
+        event:
+          source: cre.log.n8n
+        order:
+          - regex: "(cancelled mid-execution|execution terminated unexpectedly|workflow.*cancelled|Execution.*cancelled)"
+          - regex: "(silent data loss detected|data.*loss|itemsLost|dataIntegrityIssue.*true|Items processed inconsistently|Data integrity check failed|Expected [0-9]+ items, found [0-9]+ items)"
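Unlike the `set` rules above, this is a `sequence` rule: its two patterns must fire in order. A minimal sketch of that ordering constraint (assuming `sequence`/`order` means "second pattern matches only after the first has"; the window check is omitted, and the sample lines are taken from the accompanying test.log):

```python
import re

# The two ordered patterns from the rule above.
order = [
    re.compile(r"(cancelled mid-execution|execution terminated unexpectedly"
               r"|workflow.*cancelled|Execution.*cancelled)"),
    re.compile(r"(silent data loss detected|data.*loss|itemsLost"
               r"|dataIntegrityIssue.*true|Items processed inconsistently"
               r"|Data integrity check failed"
               r"|Expected [0-9]+ items, found [0-9]+ items)"),
]

# Sample lines from the accompanying test.log.
lines = [
    "WARN: Execution exec_384574 cancelled mid-execution after 55 minutes",
    "ERROR: Data integrity check failed - Items processed inconsistently across nodes",
    "ERROR: Expected 150 items, found 127 items at completion",
]

# Walk the log once, advancing to the next pattern only after the current
# one fires; the rule triggers when every pattern has matched in order.
idx = 0
for line in lines:
    if idx < len(order) and order[idx].search(line):
        idx += 1

print(idx == len(order))  # True: both patterns matched, in order
```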

rules/cre-2025-0179/test.log

Lines changed: 10 additions & 0 deletions

@@ -0,0 +1,10 @@
+Aug 27 18:30:29 n8n[1234]: INFO: Starting workflow execution exec_384574 for workflow workflow_9084
+Aug 27 18:35:29 n8n[1234]: DEBUG: Node processing started - HTTP Request node
+Aug 27 18:45:29 n8n[1234]: INFO: Processing 150 items through workflow pipeline
+Aug 27 18:53:29 n8n[1234]: DEBUG: Node completed with 142 items (expected 150)
+Aug 27 19:05:29 n8n[1234]: DEBUG: Transform node processing remaining items
+Aug 27 19:25:29 n8n[1234]: WARN: Execution exec_384574 cancelled mid-execution after 55 minutes
+Aug 27 19:25:44 n8n[1234]: ERROR: Data integrity check failed - Items processed inconsistently across nodes
+Aug 27 19:25:49 n8n[1234]: ERROR: Expected 150 items, found 127 items at completion
+Aug 27 19:26:15 n8n[1234]: CRITICAL: Massive data loss detected - Expected 500 items, found 75 items
+Aug 27 19:26:20 n8n[1234]: ERROR: Critical workflow failure detected - 85% data loss in processing pipeline
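The "Expected N items, found M items" lines carry enough structure to compute the loss percentage directly, which is how the log's final "85% data loss" figure checks out. A small sketch over two lines copied from the test.log above:

```python
import re

# Two item-count lines copied from the test.log above.
lines = [
    "Aug 27 19:25:49 n8n[1234]: ERROR: Expected 150 items, found 127 items at completion",
    "Aug 27 19:26:15 n8n[1234]: CRITICAL: Massive data loss detected - Expected 500 items, found 75 items",
]

# Extract expected/found counts and compute the percentage of items lost.
losses = []
for line in lines:
    m = re.search(r"Expected (\d+) items, found (\d+) items", line)
    if m:
        expected, found = int(m.group(1)), int(m.group(2))
        losses.append(round(100 * (expected - found) / expected, 1))

print(losses)  # [15.3, 85.0] — the 500 -> 75 drop matches the log's "85% data loss"
```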
Lines changed: 68 additions & 0 deletions

@@ -0,0 +1,68 @@
+rules:
+  - cre:
+      id: CRE-2025-0200
+      severity: 0
+      title: AutoGPT Recursive Self-Analysis Loop Leading to Token Exhaustion and System Crash
+      category: infinite-loop-problem
+      author: prequel
+      description: |
+        - AutoGPT enters an infinite recursive loop when attempting to analyze and fix its own execution errors
+        - The agent repeatedly tries to debug its own code, spawning new analysis tasks for each failure
+        - Each iteration consumes API tokens and memory, eventually exhausting resources
+        - The loop accelerates as error messages grow longer, consuming tokens exponentially
+        - System becomes unresponsive and crashes with out-of-memory errors or API rate limit failures
+      cause: |
+        - AutoGPT's autonomous reasoning incorrectly identifies its own execution as a problem to solve
+        - Lack of loop detection mechanisms allows unlimited recursive task spawning
+        - Error context accumulation causes exponential growth in prompt size
+        - Missing safeguards for self-referential task creation
+        - Insufficient resource monitoring and circuit breakers for runaway processes
+      tags:
+        - autogpt
+        - infinite-loop
+        - token-exhaustion
+        - autonomous-agents
+        - llm
+        - openai
+        - recursive-analysis
+        - critical-failure
+        - memory-exhaustion
+        - crash-loop
+        - rate-limiting
+      mitigation: |
+        - Implement loop detection to identify and break recursive self-analysis patterns
+        - Add resource consumption thresholds (tokens, memory, API calls) with automatic shutdown
+        - Create task depth limits to prevent unlimited recursion
+        - Implement circuit breakers that trigger after repeated similar failures
+        - Add explicit blacklist for self-referential task creation
+        - Monitor token usage rate and implement exponential backoff
+        - Use separate monitoring process to detect and kill runaway AutoGPT instances
+        - Implement task deduplication to prevent identical recursive operations
+      references:
+        - https://github.com/Significant-Gravitas/AutoGPT/issues/1994
+        - https://github.com/Significant-Gravitas/AutoGPT/issues/3766
+        - https://github.com/Significant-Gravitas/AutoGPT/issues/1543
+        - https://jina.ai/news/auto-gpt-unmasked-hype-hard-truths-production-pitfalls/
+      applications:
+        - name: autogpt
+          version: ">=0.3.0"
+        - name: openai
+          version: ">=0.27.0"
+      impact: Complete system failure with resource exhaustion, potential financial losses from API overconsumption
+      impactScore: 9
+      mitigationScore: 3
+      reports: 15
+    metadata:
+      kind: prequel
+      id: 8qy5Et9NbNGgGxhBP7umKa
+      gen: 1
+    rule:
+      set:
+        window: 30s
+        event:
+          source: cre.log.autogpt
+        match:
+          - value: 'Entering recursive analysis loop'
+          - value: 'COMMAND = analyze_code'
+          - value: 'recursion depth'
+          - value: 'RecursionError: maximum recursion depth exceeded'
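A minimal sketch of how this rule's four `value` terms would behave against a log window (assuming `value` is a literal substring match and a `set` rule fires when all of its terms are observed inside the window; the log lines below are hypothetical, not from AutoGPT):

```python
# The four match terms from the rule above.
values = [
    "Entering recursive analysis loop",
    "COMMAND = analyze_code",
    "recursion depth",
    "RecursionError: maximum recursion depth exceeded",
]

# Hypothetical AutoGPT log excerpt (illustrative only).
log_lines = [
    "AUTO-GPT: Entering recursive analysis loop",
    "NEXT ACTION: COMMAND = analyze_code ARGUMENTS = {...}",
    "WARNING: recursion depth 47 and climbing",
    "Traceback: RecursionError: maximum recursion depth exceeded",
]

# Collect every match term that appears somewhere in the window.
seen = {v for v in values for line in log_lines if v in line}
print(len(seen) == len(values))  # True: all four terms observed, so the set fires
```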
