84 changes: 84 additions & 0 deletions rules/cre-2025-0130/stable-diffusion-cuda-oom.yaml
@@ -0,0 +1,84 @@
rules:
  - metadata:
      kind: prequel
      id: StableDiffusionCUDAOOMDetector
      generation: 1
    cre:
      id: CRE-2025-0130
      severity: 0
      title: Stable Diffusion WebUI Critical CUDA Out of Memory Failure
      category: memory-problem
      author: CRE Community
      description: |
        - The Stable Diffusion WebUI (AUTOMATIC1111) experiences critical CUDA out-of-memory errors during image generation.
        - This typically occurs when generating high-resolution images or large batches whose memory requirements exceed the available GPU VRAM.
        - The failure cascades from initial memory allocation errors to complete WebUI unresponsiveness and service failure.
        - This is one of the most common and disruptive failures affecting Stable Diffusion deployments.
      cause: |
        - GPU VRAM exhaustion due to insufficient memory for model loading and tensor operations.
        - High-resolution image generation (e.g., 1024x1024 or larger) requiring more memory than is available.
        - Large batch sizes that multiply memory requirements beyond GPU capacity.
        - Memory fragmentation from previous operations that prevents allocation of the required contiguous memory blocks.
        - Inefficient model loading or caching that consumes excessive VRAM.
        - Running multiple concurrent generation processes without proper memory management.
      impact: |
        - Complete service interruption - the WebUI becomes unresponsive and requires a manual restart.
        - Loss of current generation progress and any queued generation tasks.
        - Potential CUDA context corruption requiring a process restart to recover.
        - Degraded user experience from failed image generations and error messages.
        - System instability in multi-user deployments, where one user's OOM can affect others.
        - Cascading failures where recovery attempts also fail due to memory constraints.
      tags:
        - memory-exhaustion
        - crash
        - errors
        - service
        - python
        - memory
        - oom-kill
        - critical-failure
        - cuda
        - pytorch
      mitigation: |
        - **Immediate Response:**
          - Restart the Stable Diffusion WebUI process to clear the CUDA context and reset memory state.
          - Check GPU memory usage with `nvidia-smi` to verify memory is properly released after the restart.
        - **Configuration Adjustments:**
          - Add command-line arguments: `--medvram` (moderate memory reduction) or `--lowvram` (aggressive memory reduction).
          - Use `--xformers` or `--opt-sdp-attention` to enable memory-efficient attention mechanisms.
          - Set the environment variable `PYTORCH_CUDA_ALLOC_CONF=garbage_collection_threshold:0.6,max_split_size_mb:128`.
        - **Generation Parameter Tuning:**
          - Reduce image resolution (e.g., from 1024x1024 to 768x768 or 512x512).
          - Decrease the batch size from 4+ to 1-2 images per generation.
          - Enable the "Tiled VAE" extension for high-resolution images to reduce VRAM usage during decoding.
        - **System-Level Solutions:**
          - Upgrade to a GPU with more VRAM (12GB+ recommended for high-resolution work).
          - Monitor GPU memory usage proactively and set alerts before usage reaches 90% of capacity.
          - Implement resource limits in multi-user deployments to prevent memory monopolization.
        - **Preventative Measures:**
          - Install memory-monitoring extensions like VRAM-ESTIMATOR to track usage in real time.
          - Educate users on appropriate generation parameters for their hardware.
          - Implement automatic parameter adjustment based on available VRAM.
      references:
        - "https://github.com/AUTOMATIC1111/stable-diffusion-webui/issues/16114"
        - "https://github.com/AUTOMATIC1111/stable-diffusion-webui/issues/12992"
        - "https://github.com/AUTOMATIC1111/stable-diffusion-webui/issues/13878"
        - "https://www.aiarty.com/stable-diffusion-guide/fix-cuda-out-of-memory-stable-diffusion.htm"
      applications:
        - name: stable-diffusion-webui
          processName: python
          repoUrl: "https://github.com/AUTOMATIC1111/stable-diffusion-webui"
          version: "*"
        - name: pytorch
          version: "*"
      impactScore: 9
      mitigationScore: 3
      reports: 1
    rule:
      sequence:
        window: 30s
        event:
          source: cre.log.stable-diffusion
        order:
          - regex: "torch\\.cuda\\.OutOfMemoryError: CUDA out of memory"
          - regex: "Fatal error during image generation|Complete service failure"
44 changes: 44 additions & 0 deletions rules/cre-2025-0130/test.log
@@ -0,0 +1,44 @@
[2025-08-27 12:50:17,000] INFO [WebUI] Initializing Stable Diffusion WebUI
Initializing Stable Diffusion WebUI
[2025-08-27 12:50:17,000] INFO [ModelLoader] Loading checkpoints/v1-5-pruned.safetensors
Loading checkpoints/v1-5-pruned.safetensors
[2025-08-27 12:50:17,000] INFO [TorchCUDA] CUDA available: True
CUDA available: True
[2025-08-27 12:50:17,000] INFO [TorchCUDA] CUDA device count: 1
CUDA device count: 1
[2025-08-27 12:50:17,000] INFO [TorchCUDA] CUDA device: GeForce RTX 3060 (6GB)
CUDA device: GeForce RTX 3060 (6GB)
[2025-08-27 12:50:19,000] INFO [WebUI] Starting image generation: 1024x1024, batch_size=4
Starting image generation: 1024x1024, batch_size=4
[2025-08-27 12:50:19,000] INFO [ModelLoader] Loading model to CUDA device
Loading model to CUDA device
[2025-08-27 12:50:20,000] WARN [TorchCUDA] GPU memory usage: 5.8GB/6.0GB (97%)
GPU memory usage: 5.8GB/6.0GB (97%)
[2025-08-27 12:50:20,000] WARN [ModelLoader] High memory usage detected during model loading
High memory usage detected during model loading
[2025-08-27 12:50:21,000] ERROR [TorchCUDA] torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 512.00 MiB (GPU 0; 6.00 GiB total capacity; 5.63 GiB already allocated; 156.19 MiB free; 5.74 GiB reserved in total by PyTorch)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 512.00 MiB (GPU 0; 6.00 GiB total capacity; 5.63 GiB already allocated; 156.19 MiB free; 5.74 GiB reserved in total by PyTorch)
[2025-08-27 12:50:21,000] CRITICAL [WebUI] Fatal error during image generation
Fatal error during image generation
[2025-08-27 12:50:21,000] ERROR [ModelLoader] Failed to allocate tensor on device
Failed to allocate tensor on device
[2025-08-27 12:50:21,000] ERROR [TorchCUDA] RuntimeError: CUDA out of memory
RuntimeError: CUDA out of memory
[2025-08-27 12:50:22,000] ERROR [WebUI] Generation process crashed
Generation process crashed
[2025-08-27 12:50:22,000] ERROR [WebUI] Gradio interface becoming unresponsive
Gradio interface becoming unresponsive
[2025-08-27 12:50:22,000] WARN [WebUI] Multiple failed generation attempts detected
Multiple failed generation attempts detected
[2025-08-27 12:50:22,000] ERROR [TorchCUDA] CUDA context may be corrupted
CUDA context may be corrupted
[2025-08-27 12:50:23,000] INFO [WebUI] Attempting to recover from OOM error
Attempting to recover from OOM error
[2025-08-27 12:50:23,000] WARN [TorchCUDA] Clearing CUDA cache
Clearing CUDA cache
[2025-08-27 12:50:23,000] ERROR [WebUI] Recovery failed - WebUI requires restart
Recovery failed - WebUI requires restart
[2025-08-27 12:50:24,000] ERROR [TorchCUDA] torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.25 GiB
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.25 GiB
[2025-08-27 12:50:24,000] CRITICAL [WebUI] Complete service failure - manual intervention required
Complete service failure - manual intervention required
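The mitigation guidance above boils down to allocator tuning plus graceful degradation when an OOM does occur. The sketch below is a minimal illustration, not WebUI code: `generate` is a hypothetical callable standing in for whatever function actually runs the pipeline. It sets `PYTORCH_CUDA_ALLOC_CONF`, catches `torch.cuda.OutOfMemoryError`, empties the CUDA cache, and retries at a smaller batch size before giving up.

```python
import os
import torch

# Allocator settings from the mitigation notes; must be in the environment
# before the process makes its first CUDA allocation.
os.environ.setdefault(
    "PYTORCH_CUDA_ALLOC_CONF",
    "garbage_collection_threshold:0.6,max_split_size_mb:128",
)

def generate_with_fallback(generate, prompt, batch_size=4):
    """Retry generation at halved batch sizes after a CUDA OOM instead of crashing."""
    # `generate` is a hypothetical callable standing in for the WebUI's txt2img pipeline.
    while batch_size >= 1:
        try:
            return generate(prompt, batch_size=batch_size)
        except torch.cuda.OutOfMemoryError:
            torch.cuda.empty_cache()  # release cached blocks held by PyTorch's allocator
            batch_size //= 2          # e.g. 4 -> 2 -> 1 -> give up
    raise RuntimeError("CUDA OOM persisted even at batch_size=1")
```

Note that `empty_cache()` only returns memory the caching allocator is holding in reserve; it cannot make a genuinely oversized request fit, which is why the retry also reduces the batch size.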
8 changes: 7 additions & 1 deletion rules/tags/tags.yaml
@@ -845,6 +845,12 @@ tags:
  - name: cluster-scaling
    displayName: Cluster Scaling
    description: Problems related to Kubernetes cluster scaling operations and capacity management
  - name: cuda
    displayName: CUDA
    description: Problems related to the NVIDIA CUDA GPU computing platform and memory management
  - name: pytorch
    displayName: PyTorch
    description: Problems related to the PyTorch deep learning framework and tensor operations
  - name: autogpt
    displayName: AutoGPT
    description: Problems related to AutoGPT autonomous AI agent framework
@@ -865,7 +871,7 @@
    description: Problems related to OpenAI API services including GPT models
  - name: recursive-analysis
    displayName: Recursive Analysis
    description: Problems where systems enter recursive self-analysis loops leading to resource exhaustion
    description: Problems where systems enter recursive self-analysis loops leading to resource exhaustion
  - name: n8n
    displayName: N8N
    description: Problems related to n8n workflow automation platform