
Commit 3c1738a

Merge branch 'main' into supabase-cre-rules
2 parents: 396d2aa + c414539

8 files changed (+273, -24 lines)

Lines changed: 70 additions & 0 deletions
@@ -0,0 +1,70 @@
rules:
  - metadata:
      kind: prequel
      id: SD8xK2mN9pQzYvWr3aLfJ7
      hash: XpQ9Lm4Zk8TnVb2Ry6HwGs
    cre:
      id: CRE-2025-0162
      severity: 1
      title: "Stable Diffusion WebUI CUDA Out of Memory Crash"
      category: "memory-problem"
      author: Prequel Community
      description: |
        Detects critical CUDA out of memory errors in Stable Diffusion WebUI that cause image generation failures and application crashes. This occurs when GPU VRAM is exhausted during model loading or image generation, resulting in complete task failure and potential WebUI instability.
      cause: |
        - Insufficient GPU VRAM for requested image resolution or batch size
        - Memory fragmentation preventing large contiguous allocations
        - Model loading exceeding available VRAM capacity
        - Concurrent GPU processes consuming memory
        - High-resolution image generation without memory optimization flags
      impact: |
        - Complete image generation failure
        - WebUI crash requiring restart
        - Loss of in-progress generation work
        - Potential GPU driver instability
        - Service unavailability for users
      tags:
        - memory
        - nvidia
        - crash
        - out-of-memory
        - configuration
      mitigation: |
        IMMEDIATE ACTIONS:
        - Restart Stable Diffusion WebUI
        - Clear GPU memory: nvidia-smi --gpu-reset
        - Add memory optimization flags: --medvram or --lowvram
        CONFIGURATION FIXES:
        - For 4-6GB VRAM: Add --medvram to webui-user.bat
        - For 2-4GB VRAM: Add --lowvram to webui-user.bat
        - Enable xformers: --xformers for memory efficiency
        - Add --always-batch-cond-uncond for batch processing
        RUNTIME ADJUSTMENTS:
        - Reduce image resolution (512x512 instead of 1024x1024)
        - Decrease batch size to 1
        - Lower batch count for multiple generations
        - Set PYTORCH_CUDA_ALLOC_CONF=garbage_collection_threshold:0.9,max_split_size_mb:512
        PREVENTION:
        - Monitor GPU memory usage with nvidia-smi
        - Implement gradual resolution scaling
        - Use cloud services for high-resolution generation
        - Upgrade to a GPU with at least 8GB VRAM
      references:
        - https://github.com/AUTOMATIC1111/stable-diffusion-webui/issues/12992
        - https://github.com/AUTOMATIC1111/stable-diffusion-webui/issues/9770
        - https://github.com/CompVis/stable-diffusion/issues/39
      applications:
        - name: stable-diffusion-webui
          version: ">=1.0.0"
      impactScore: 8
      mitigationScore: 7
      reports: 15
    rule:
      set:
        window: 120s
        event:
          source: cre.log.stable-diffusion
        match:
          - regex: 'OutOfMemoryError.*CUDA out of memory'
          - regex: 'CUDA out of memory.*Tried to allocate'
          - regex: 'model failed to load.*OutOfMemoryError'
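
As a quick, hedged illustration of the detection logic (this is a standalone Python check, not the Prequel engine), the sketch below runs the rule's three match patterns against representative lines from the accompanying test log:

import re

# The three match patterns from the CRE-2025-0162 rule above.
PATTERNS = [
    r"OutOfMemoryError.*CUDA out of memory",
    r"CUDA out of memory.*Tried to allocate",
    r"model failed to load.*OutOfMemoryError",
]

# Representative lines taken from rules/cre-2025-0162/test.log below.
SAMPLE_LINES = [
    "torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.00 GiB",
    "Stable Diffusion model failed to load: OutOfMemoryError",
    "Cache cleared, retrying...",
]

for line in SAMPLE_LINES:
    hits = [p for p in PATTERNS if re.search(p, line)]
    status = "MATCH" if hits else "no match"
    print(f"{status:8s} {line}")

The first two sample lines trip at least one pattern each; the third (an ordinary INFO message) does not, which is the behavior the rule relies on.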

rules/cre-2025-0162/test.log

Lines changed: 14 additions & 0 deletions
@@ -0,0 +1,14 @@
2025-08-29 14:23:45.123 [ERROR] Loading model stable-diffusion-v1.5
2025-08-29 14:23:47.456 [INFO] Model weights: 4.27 GB
2025-08-29 14:23:48.789 [INFO] Allocating GPU memory...
2025-08-29 14:23:49.012 [ERROR] torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.00 GiB (GPU 0; 6.00 GiB total capacity; 4.50 GiB already allocated; 1.20 GiB free; 4.80 GiB reserved in total by PyTorch)
2025-08-29 14:23:49.013 [ERROR] RuntimeError: CUDA out of memory. Tried to allocate 2.00 GiB. GPU 0 has a total capacity of 6.00 GiB of which 1.20 GiB is free. Process 12345 has 4.50 GiB memory in use.
2025-08-29 14:23:49.014 [CRITICAL] Stable Diffusion model failed to load: OutOfMemoryError
2025-08-29 14:23:49.015 [ERROR] CUDA error: out of memory
2025-08-29 14:23:49.016 [ERROR] GPU 0 has a total capacity of 6.00 GiB of which 1.20 GiB is free. Allocation failed.
2025-08-29 14:23:49.017 [ERROR] Failed to generate image: CUDA out of memory
2025-08-29 14:23:49.018 [INFO] Attempting to clear cache...
2025-08-29 14:23:50.123 [INFO] Cache cleared, retrying...
2025-08-29 14:23:51.456 [ERROR] torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.50 GiB
2025-08-29 14:23:51.457 [CRITICAL] Image generation failed after retry
2025-08-29 14:23:51.458 [ERROR] WebUI shutting down due to memory error
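
The PYTORCH_CUDA_ALLOC_CONF tuning listed under RUNTIME ADJUSTMENTS must be in the environment before PyTorch initializes its CUDA allocator. A minimal Python sketch, assuming a wrapper script where the variable can be set before torch is imported (illustrative only, not part of the WebUI):

import os

# Apply the allocator tuning from the mitigation section *before* importing torch;
# PyTorch reads PYTORCH_CUDA_ALLOC_CONF when the CUDA caching allocator initializes.
os.environ.setdefault(
    "PYTORCH_CUDA_ALLOC_CONF",
    "garbage_collection_threshold:0.9,max_split_size_mb:512",
)

import torch  # noqa: E402 -- deliberately imported after the env var is set

if torch.cuda.is_available():
    free, total = torch.cuda.mem_get_info()
    print(f"GPU 0: {free / 2**30:.2f} GiB free of {total / 2**30:.2f} GiB")
else:
    print("CUDA not available; nothing to tune")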
Lines changed: 74 additions & 0 deletions
@@ -0,0 +1,74 @@
rules:
  - cre:
      id: CRE-2025-0179
      severity: 0
      title: N8N Workflow Silent Data Loss During Execution
      category: workflow-automation-problem
      author: Claude Code Assistant
      description: |
        N8N workflow automation platform experiences critical silent data loss where items
        disappear between workflow nodes without generating error messages. This high-severity
        issue affects long-running workflows (60-115+ minutes) and can cause workflows to
        randomly cancel mid-execution, leading to incomplete processing and data integrity
        problems. Items silently vanish between nodes, with different item counts across
        the workflow pipeline, making the issue particularly dangerous for production systems
        that rely on complete data processing.
      cause: |
        * Workflow execution engine fails to properly track items between nodes in long-running workflows
        * Memory management issues during extended workflow processing causing item references to be lost
        * Race conditions in the worker queue system when handling multiple concurrent items
        * Node-to-node data transfer mechanisms failing silently under certain load conditions
        * Queue worker timeout or resource contention causing partial item processing without error reporting
        * Database transaction issues where some items fail to persist between workflow stages
      tags:
        - n8n
        - workflow-automation
        - data-loss
        - silent-failure
        - production-critical
        - data-integrity
        - public
      mitigation: |
        - **Implement workflow item counting checks** - Add validation nodes between critical
          processing steps to verify item counts match expected values
        - **Enable comprehensive execution logging** - Set N8N_LOG_LEVEL to debug and
          EXECUTIONS_DATA_SAVE_ON_SUCCESS to 'all' to capture detailed execution data
        - **Add workflow timeout monitoring** - Monitor executions that cancel around the 21-23
          minute mark and implement retry mechanisms for failed workflows
        - **Implement data integrity validation** - Add checksum or validation steps at
          workflow start/end to detect silent data loss
        - **Use error handling workflows** - Configure error workflows to capture and log
          execution failures, even when the main workflow fails silently
        - **Monitor execution metrics** - Set up alerting on workflow completion rates and
          item processing inconsistencies
        - **Consider workflow segmentation** - Break long workflows into smaller, more
          manageable chunks to reduce exposure to the data loss issue
      references:
        - https://github.com/n8n-io/n8n/issues/14909
        - https://docs.n8n.io/flow-logic/error-handling/
        - https://community.n8n.io/t/workflow-randomly-cancels-mid-execution-without-error-data-items-silently-dropped-between-nodes/51141
      applications:
        - name: n8n
          version: ">= 1.90.0"
          processName: n8n
          containerName: n8n
      impact: |
        Silent data loss in workflow automation can cause critical business processes to fail
        without detection, leading to incomplete data processing, missing business transactions,
        failed integrations, and potential compliance violations. The silent nature makes it
        extremely difficult to detect and troubleshoot, potentially causing weeks or months
        of data integrity issues before discovery.
      impactScore: 9
      mitigationScore: 7
    metadata:
      kind: prequel
      id: N8nSilentDataLossDetection919
      gen: 1
    rule:
      sequence:
        window: 120s
        event:
          source: cre.log.n8n
        order:
          - regex: "(cancelled mid-execution|execution terminated unexpectedly|workflow.*cancelled|Execution.*cancelled)"
          - regex: "(silent data loss detected|data.*loss|itemsLost|dataIntegrityIssue.*true|Items processed inconsistently|Data integrity check failed|Expected [0-9]+ items, found [0-9]+ items)"

rules/cre-2025-0179/test.log

Lines changed: 10 additions & 0 deletions
@@ -0,0 +1,10 @@
Aug 27 18:30:29 n8n[1234]: INFO: Starting workflow execution exec_384574 for workflow workflow_9084
Aug 27 18:35:29 n8n[1234]: DEBUG: Node processing started - HTTP Request node
Aug 27 18:45:29 n8n[1234]: INFO: Processing 150 items through workflow pipeline
Aug 27 18:53:29 n8n[1234]: DEBUG: Node completed with 142 items (expected 150)
Aug 27 19:05:29 n8n[1234]: DEBUG: Transform node processing remaining items
Aug 27 19:25:29 n8n[1234]: WARN: Execution exec_384574 cancelled mid-execution after 55 minutes
Aug 27 19:25:44 n8n[1234]: ERROR: Data integrity check failed - Items processed inconsistently across nodes
Aug 27 19:25:49 n8n[1234]: ERROR: Expected 150 items, found 127 items at completion
Aug 27 19:26:15 n8n[1234]: CRITICAL: Massive data loss detected - Expected 500 items, found 75 items
Aug 27 19:26:20 n8n[1234]: ERROR: Critical workflow failure detected - 85% data loss in processing pipeline
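
Because this rule uses a sequence (ordered events inside a 120s window) rather than a set, one rough way to sanity-check it against the log above is to confirm the cancellation line precedes the integrity-failure line within the window. A hedged Python sketch, not the Prequel engine, with the second pattern shortened to a subset of its alternatives:

import re
from datetime import datetime

FIRST = re.compile(r"cancelled mid-execution|Execution.*cancelled")
SECOND = re.compile(r"Data integrity check failed|Expected [0-9]+ items, found [0-9]+ items")

# Two lines taken from rules/cre-2025-0179/test.log above.
LINES = [
    "Aug 27 19:25:29 n8n[1234]: WARN: Execution exec_384574 cancelled mid-execution after 55 minutes",
    "Aug 27 19:25:44 n8n[1234]: ERROR: Data integrity check failed - Items processed inconsistently across nodes",
]

def ts(line: str) -> datetime:
    # Syslog-style "Aug 27 19:25:29" prefix; the year is assumed for the sketch.
    return datetime.strptime("2025 " + line[:15], "%Y %b %d %H:%M:%S")

first = next(l for l in LINES if FIRST.search(l))
second = next(l for l in LINES if SECOND.search(l))
delta = (ts(second) - ts(first)).total_seconds()
print(f"events in order, {delta:.0f}s apart -> within 120s window: {0 <= delta <= 120}")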
Lines changed: 68 additions & 0 deletions
@@ -0,0 +1,68 @@
rules:
  - cre:
      id: CRE-2025-0200
      severity: 0
      title: AutoGPT Recursive Self-Analysis Loop Leading to Token Exhaustion and System Crash
      category: infinite-loop-problem
      author: prequel
      description: |
        - AutoGPT enters an infinite recursive loop when attempting to analyze and fix its own execution errors
        - The agent repeatedly tries to debug its own code, spawning new analysis tasks for each failure
        - Each iteration consumes API tokens and memory, eventually exhausting resources
        - The loop accelerates as error messages grow longer, consuming tokens exponentially
        - System becomes unresponsive and crashes with out-of-memory errors or API rate limit failures
      cause: |
        - AutoGPT's autonomous reasoning incorrectly identifies its own execution as a problem to solve
        - Lack of loop detection mechanisms allows unlimited recursive task spawning
        - Error context accumulation causes exponential growth in prompt size
        - Missing safeguards for self-referential task creation
        - Insufficient resource monitoring and circuit breakers for runaway processes
      tags:
        - autogpt
        - infinite-loop
        - token-exhaustion
        - autonomous-agents
        - llm
        - openai
        - recursive-analysis
        - critical-failure
        - memory-exhaustion
        - crash-loop
        - rate-limiting
      mitigation: |
        - Implement loop detection to identify and break recursive self-analysis patterns
        - Add resource consumption thresholds (tokens, memory, API calls) with automatic shutdown
        - Create task depth limits to prevent unlimited recursion
        - Implement circuit breakers that trigger after repeated similar failures
        - Add explicit blacklist for self-referential task creation
        - Monitor token usage rate and implement exponential backoff
        - Use separate monitoring process to detect and kill runaway AutoGPT instances
        - Implement task deduplication to prevent identical recursive operations
      references:
        - https://github.com/Significant-Gravitas/AutoGPT/issues/1994
        - https://github.com/Significant-Gravitas/AutoGPT/issues/3766
        - https://github.com/Significant-Gravitas/AutoGPT/issues/1543
        - https://jina.ai/news/auto-gpt-unmasked-hype-hard-truths-production-pitfalls/
      applications:
        - name: autogpt
          version: ">=0.3.0"
        - name: openai
          version: ">=0.27.0"
      impact: Complete system failure with resource exhaustion, potential financial losses from API overconsumption
      impactScore: 9
      mitigationScore: 3
      reports: 15
    metadata:
      kind: prequel
      id: 8qy5Et9NbNGgGxhBP7umKa
      gen: 1
    rule:
      set:
        window: 30s
        event:
          source: cre.log.autogpt
        match:
          - value: 'Entering recursive analysis loop'
          - value: 'COMMAND = analyze_code'
          - value: 'recursion depth'
          - value: 'RecursionError: maximum recursion depth exceeded'
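
Two of the mitigations above, task depth limits and task deduplication, can be sketched as a small guard in front of an agent's task queue. This is hypothetical Python, not AutoGPT's API; TaskGuard, allow(), and MAX_DEPTH are invented names for the sketch:

# Reject tasks that recurse too deep or duplicate an already-queued task.
MAX_DEPTH = 3

class TaskGuard:
    def __init__(self, max_depth: int = MAX_DEPTH):
        self.max_depth = max_depth
        self.seen: set[tuple[str, str]] = set()  # (command, args) pairs already queued

    def allow(self, command: str, args: str, depth: int) -> bool:
        if depth > self.max_depth:
            print(f"blocked: depth {depth} exceeds limit of {self.max_depth} for {command}")
            return False
        key = (command, args)
        if key in self.seen:
            print(f"blocked: duplicate task {command}({args!r})")
            return False
        self.seen.add(key)
        return True

guard = TaskGuard()
print(guard.allow("analyze_code", "autogpt error handling module", depth=2))  # True
print(guard.allow("analyze_code", "autogpt error handling module", depth=3))  # False: duplicate
print(guard.allow("analyze_code", "entire autogpt error stack", depth=5))     # False: too deep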

rules/cre-2025-0200/test.log

Lines changed: 35 additions & 0 deletions
@@ -0,0 +1,35 @@
2025-08-31 14:23:45.234 [INFO] [autogpt.main] Starting AutoGPT v0.5.1 with goal: "Optimize my Python code for better performance"
2025-08-31 14:23:45.567 [INFO] [autogpt.llm] Initializing OpenAI API client with model gpt-4
2025-08-31 14:23:46.102 [INFO] [autogpt.agent] Agent initialized with memory backend: LocalCache
2025-08-31 14:23:47.234 [INFO] [autogpt.agent] COMMAND = analyze_code args: {"code": "def slow_function():\\n result = []\\n for i in range(1000000):\\n result.append(i**2)\\n return result"}
2025-08-31 14:23:48.567 [ERROR] [autogpt.commands] Error executing analyze_code: JSONDecodeError in response
2025-08-31 14:23:48.890 [WARN] [autogpt.agent] Entering recursive analysis loop to debug previous error
2025-08-31 14:23:49.234 [INFO] [autogpt.agent] THOUGHTS: Previous command failed, need to analyze what went wrong
2025-08-31 14:23:49.567 [INFO] [autogpt.agent] NEXT ACTION: COMMAND = analyze_code args: {"code": "analyze_code function from autogpt/commands/analyze_code.py", "recursion depth": 1}
2025-08-31 14:23:50.123 [DEBUG] [autogpt.memory] Storing error context, current size: 2.3MB
2025-08-31 14:23:50.890 [ERROR] [autogpt.commands] Error executing analyze_code: Cannot analyze own execution context
2025-08-31 14:23:51.234 [WARN] [autogpt.agent] Thinking... need to fix my own error handling
2025-08-31 14:23:51.678 [INFO] [autogpt.agent] COMMAND = analyze_code args: {"code": "autogpt error handling module", "recursion depth": 2}
2025-08-31 14:23:52.345 [DEBUG] [autogpt.memory] Memory usage increasing: 5.7MB, token count: 8234
2025-08-31 14:23:52.890 [ERROR] [autogpt.llm] API request failed: context length exceeded
2025-08-31 14:23:53.234 [INFO] [autogpt.agent] SYSTEM: Potential loop detected but continuing to resolve errors
2025-08-31 14:23:53.567 [INFO] [autogpt.agent] THOUGHTS: Error analyzing previous attempt, need deeper analysis
2025-08-31 14:23:54.012 [INFO] [autogpt.agent] NEXT ACTION: COMMAND = analyze_code args: {"code": "full autogpt execution trace", "recursion depth": 3}
2025-08-31 14:23:54.678 [WARN] [autogpt.monitor] Task queue growing: 12 pending tasks
2025-08-31 14:23:55.234 [DEBUG] [autogpt.memory] Memory usage: 12.4MB, token count: 15672
2025-08-31 14:23:55.890 [ERROR] [autogpt.commands] RecursionError: maximum recursion depth exceeded while calling analyze_code
2025-08-31 14:23:56.345 [CRITICAL] [autogpt.agent] Task queue overflow: 47 pending recursive tasks
2025-08-31 14:23:56.789 [INFO] [autogpt.agent] COMMAND = analyze_code args: {"code": "recursion error in analyze_code", "recursion depth": 4}
2025-08-31 14:23:57.234 [ERROR] [autogpt.llm] openai.error.RateLimitError: Rate limit reached for gpt-4 in organization
2025-08-31 14:23:57.567 [WARN] [autogpt.monitor] Token consumption rate: 2341 tokens/second
2025-08-31 14:23:58.012 [DEBUG] [autogpt.memory] Memory usage critical: 45.8MB, token count: 42318
2025-08-31 14:23:58.456 [ERROR] [autogpt.agent] Too many pending tasks: 89 in queue
2025-08-31 14:23:58.890 [INFO] [autogpt.agent] THOUGHTS: Still analyzing previous errors, must understand the recursion
2025-08-31 14:23:59.234 [INFO] [autogpt.agent] NEXT ACTION: COMMAND = analyze_code args: {"code": "entire autogpt error stack", "recursion depth": 5}
2025-08-31 14:23:59.678 [CRITICAL] [autogpt.monitor] JavaScript heap out of memory
2025-08-31 14:24:00.123 [ERROR] [autogpt.memory] MemoryError: Cannot allocate memory for context storage
2025-08-31 14:24:00.456 [CRITICAL] [autogpt.agent] Task buffer exceeded: 156 recursive analyze_code calls pending
2025-08-31 14:24:00.789 [ERROR] [autogpt.llm] API rate limit exceeded: 429 Too Many Requests
2025-08-31 14:24:01.123 [FATAL] [autogpt.main] AutoGPT crashed: Unrecoverable recursive loop detected
2025-08-31 14:24:01.234 [INFO] [autogpt.cleanup] Emergency shutdown initiated
2025-08-31 14:24:01.345 [ERROR] [autogpt.cleanup] Failed to save state: Out of memory

rules/tags/categories.yaml

Lines changed: 1 addition & 11 deletions
@@ -243,14 +243,4 @@ categories:
     displayName: MongoDB Startup Failure
     description: |
       Failures that prevent MongoDB from starting successfully due to corrupted metadata, invalid configurations,
-      or unrecoverable internal errors (e.g., WiredTiger metadata corruption). These failures often require manual repair or backup restoration.
-  - name: supabase-problem
-    displayName: Supabase Problems
-    description: |
-      Problems specific to Supabase self-hosted deployments including authentication failures, database connectivity issues,
-      storage misconfigurations, realtime service crashes, and infrastructure-related failures that affect the entire Supabase stack.
-  - name: realtime-problem
-    displayName: Realtime Problems
-    description: |
-      Failures in real-time communication systems including WebSocket connection issues, real-time subscription failures,
-      and problems with live data streaming that affect user experience in interactive applications.
+      or unrecoverable internal errors (e.g., WiredTiger metadata corruption). These failures often require manual repair or backup restoration.

rules/tags/tags.yaml

Lines changed: 1 addition & 13 deletions
@@ -847,16 +847,4 @@ tags:
     description: Issues with Kubernetes pod scheduling due to resource constraints or networking problems
   - name: cluster-scaling
     displayName: Cluster Scaling
-    description: Problems related to Kubernetes cluster scaling operations and capacity management
-  - name: supabase
-    displayName: Supabase
-    description: Problems related to Supabase self-hosted deployments and services
-  - name: gotrue
-    displayName: GoTrue
-    description: Problems related to Supabase's GoTrue authentication service
-  - name: realtime
-    displayName: Realtime
-    description: Problems related to Supabase's realtime service and WebSocket connections
-  - name: self-hosted
-    displayName: Self-Hosted
-    description: Problems specific to self-hosted deployments and infrastructure
+    description: Problems related to Kubernetes cluster scaling operations and capacity management
