diff --git a/rules/cre-2025-0130/stable-diffusion-cuda-oom.yaml b/rules/cre-2025-0130/stable-diffusion-cuda-oom.yaml
new file mode 100644
index 0000000..297cde6
--- /dev/null
+++ b/rules/cre-2025-0130/stable-diffusion-cuda-oom.yaml
@@ -0,0 +1,84 @@
+rules:
+  - metadata:
+      kind: prequel
+      id: StableDiffusionCUDAOOMDetector
+      generation: 1
+    cre:
+      id: CRE-2025-0130
+      severity: 0
+      title: Stable Diffusion WebUI Critical CUDA Out of Memory Failure
+      category: memory-problem
+      author: CRE Community
+      description: |
+        - The Stable Diffusion WebUI (AUTOMATIC1111) is experiencing critical CUDA out of memory errors during image generation.
+        - This typically occurs when generating high-resolution images or large batches that exceed available GPU VRAM.
+        - The failure cascades from initial memory allocation errors to complete WebUI unresponsiveness and service failure.
+        - This is one of the most common and disruptive failures affecting Stable Diffusion deployments.
+      cause: |
+        - GPU VRAM exhaustion due to insufficient memory for model loading and tensor operations.
+        - High-resolution image generation (e.g., 1024x1024 or larger) requiring more memory than available.
+        - Large batch sizes that multiply memory requirements beyond GPU capacity.
+        - Memory fragmentation from previous operations that prevents allocation of required contiguous memory blocks.
+        - Inefficient model loading or caching that consumes excessive VRAM.
+        - Running multiple concurrent generation processes without proper memory management.
+      impact: |
+        - Complete service interruption - the WebUI becomes unresponsive and requires manual restart.
+        - Loss of current generation progress and any queued generation tasks.
+        - Potential CUDA context corruption requiring process restart to recover.
+        - User experience degradation with failed image generations and error messages.
+        - System instability in multi-user deployments where one user's OOM can affect others.
+        - Cascading failures where recovery attempts also fail due to memory constraints.
+      tags:
+        - memory-exhaustion
+        - crash
+        - errors
+        - service
+        - python
+        - memory
+        - oom-kill
+        - critical-failure
+        - cuda
+        - pytorch
+      mitigation: |
+        - **Immediate Response:**
+          - Restart the Stable Diffusion WebUI process to clear the CUDA context and reset memory state.
+          - Check GPU memory usage with `nvidia-smi` to verify memory is properly released after restart.
+        - **Configuration Adjustments:**
+          - Add command line arguments: `--medvram` (moderate memory reduction) or `--lowvram` (aggressive memory reduction).
+          - Use `--opt-sdp-no-mem-attention` or `--xformers` to enable memory-efficient attention mechanisms.
+          - Set environment variable: `PYTORCH_CUDA_ALLOC_CONF=garbage_collection_threshold:0.6,max_split_size_mb:128`.
+        - **Generation Parameter Tuning:**
+          - Reduce image resolution (e.g., from 1024x1024 to 768x768 or 512x512).
+          - Decrease batch size from 4+ to 1-2 images per generation.
+          - Enable the "Tiled VAE" extension for high-resolution images to reduce VRAM usage during decoding.
+        - **System-Level Solutions:**
+          - Upgrade to a GPU with more VRAM (12GB+ recommended for high-resolution work).
+          - Monitor GPU memory usage proactively and set alerts before reaching 90% capacity.
+          - Implement resource limits in multi-user deployments to prevent memory monopolization.
+        - **Preventative Measures:**
+          - Install memory monitoring extensions like VRAM-ESTIMATOR to track usage in real time.
+          - Educate users on appropriate generation parameters for their hardware.
+          - Implement automatic parameter adjustment based on available VRAM.
+      references:
+        - "https://github.com/AUTOMATIC1111/stable-diffusion-webui/issues/16114"
+        - "https://github.com/AUTOMATIC1111/stable-diffusion-webui/issues/12992"
+        - "https://github.com/AUTOMATIC1111/stable-diffusion-webui/issues/13878"
+        - "https://www.aiarty.com/stable-diffusion-guide/fix-cuda-out-of-memory-stable-diffusion.htm"
+      applications:
+        - name: stable-diffusion-webui
+          processName: python
+          repoUrl: "https://github.com/AUTOMATIC1111/stable-diffusion-webui"
+          version: "*"
+        - name: pytorch
+          version: "*"
+      impactScore: 9
+      mitigationScore: 3
+      reports: 1
+    rule:
+      sequence:
+        window: 30s
+        event:
+          source: cre.log.stable-diffusion
+        order:
+          - regex: "torch\\.cuda\\.OutOfMemoryError: CUDA out of memory"
+          - regex: "Fatal error during image generation|Complete service failure"
\ No newline at end of file
diff --git a/rules/cre-2025-0130/test.log b/rules/cre-2025-0130/test.log
new file mode 100644
index 0000000..7030f96
--- /dev/null
+++ b/rules/cre-2025-0130/test.log
@@ -0,0 +1,44 @@
+[2025-08-27 12:50:17,000] INFO [WebUI] Initializing Stable Diffusion WebUI
+Initializing Stable Diffusion WebUI
+[2025-08-27 12:50:17,000] INFO [ModelLoader] Loading checkpoints/v1-5-pruned.safetensors
+Loading checkpoints/v1-5-pruned.safetensors
+[2025-08-27 12:50:17,000] INFO [TorchCUDA] CUDA available: True
+CUDA available: True
+[2025-08-27 12:50:17,000] INFO [TorchCUDA] CUDA device count: 1
+CUDA device count: 1
+[2025-08-27 12:50:17,000] INFO [TorchCUDA] CUDA device: GeForce RTX 3060 (6GB)
+CUDA device: GeForce RTX 3060 (6GB)
+[2025-08-27 12:50:19,000] INFO [WebUI] Starting image generation: 1024x1024, batch_size=4
+Starting image generation: 1024x1024, batch_size=4
+[2025-08-27 12:50:19,000] INFO [ModelLoader] Loading model to CUDA device
+Loading model to CUDA device
+[2025-08-27 12:50:20,000] WARN [TorchCUDA] GPU memory usage: 5.8GB/6.0GB (97%)
+GPU memory usage: 5.8GB/6.0GB (97%)
+[2025-08-27 12:50:20,000] WARN [ModelLoader] High memory usage detected during model loading
+High memory usage detected during model loading
+[2025-08-27 12:50:21,000] ERROR [TorchCUDA] torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 512.00 MiB (GPU 0; 6.00 GiB total capacity; 5.63 GiB already allocated; 156.19 MiB free; 5.74 GiB reserved in total by PyTorch)
+torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 512.00 MiB (GPU 0; 6.00 GiB total capacity; 5.63 GiB already allocated; 156.19 MiB free; 5.74 GiB reserved in total by PyTorch)
+[2025-08-27 12:50:21,000] CRITICAL [WebUI] Fatal error during image generation
+Fatal error during image generation
+[2025-08-27 12:50:21,000] ERROR [ModelLoader] Failed to allocate tensor on device
+Failed to allocate tensor on device
+[2025-08-27 12:50:21,000] ERROR [TorchCUDA] RuntimeError: CUDA out of memory
+RuntimeError: CUDA out of memory
+[2025-08-27 12:50:22,000] ERROR [WebUI] Generation process crashed
+Generation process crashed
+[2025-08-27 12:50:22,000] ERROR [WebUI] Gradio interface becoming unresponsive
+Gradio interface becoming unresponsive
+[2025-08-27 12:50:22,000] WARN [WebUI] Multiple failed generation attempts detected
+Multiple failed generation attempts detected
+[2025-08-27 12:50:22,000] ERROR [TorchCUDA] CUDA context may be corrupted
+CUDA context may be corrupted
+[2025-08-27 12:50:23,000] INFO [WebUI] Attempting to recover from OOM error
+Attempting to recover from OOM error
+[2025-08-27 12:50:23,000] WARN [TorchCUDA] Clearing CUDA cache
+Clearing CUDA cache
+[2025-08-27 12:50:23,000] ERROR [WebUI] Recovery failed - WebUI requires restart
+Recovery failed - WebUI requires restart
+[2025-08-27 12:50:24,000] ERROR [TorchCUDA] torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.25 GiB
+torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.25 GiB
+[2025-08-27 12:50:24,000] CRITICAL [WebUI] Complete service failure - manual intervention required
+Complete service failure - manual intervention required
diff --git a/rules/tags/tags.yaml b/rules/tags/tags.yaml
index 270f330..e8bec8d 100644
--- a/rules/tags/tags.yaml
+++ b/rules/tags/tags.yaml
@@ -845,6 +845,12 @@ tags:
   - name: cluster-scaling
     displayName: Cluster Scaling
     description: Problems related to Kubernetes cluster scaling operations and capacity management
+  - name: cuda
+    displayName: CUDA
+    description: Problems related to NVIDIA CUDA GPU computing platform and memory management
+  - name: pytorch
+    displayName: PyTorch
+    description: Problems related to PyTorch deep learning framework and tensor operations
   - name: autogpt
     displayName: AutoGPT
     description: Problems related to AutoGPT autonomous AI agent framework
@@ -865,7 +871,7 @@ tags:
     description: Problems related to OpenAI API services including GPT models
   - name: recursive-analysis
     displayName: Recursive Analysis
-    description: Problems where systems enter recursive self-analysis loops leading to resource exhaustion
+    description: Problems where systems enter recursive self-analysis loops leading to resource exhaustion
   - name: n8n
     displayName: N8N
     description: Problems related to n8n workflow automation platform
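The mitigation section of the rule combines an allocator setting (`PYTORCH_CUDA_ALLOC_CONF`) with reduced generation parameters. The sketch below shows how those two pieces could fit together in plain PyTorch; it is illustrative only and not part of this PR's files. `run_pipeline` is a hypothetical stand-in for the WebUI's actual generation call, while the `torch.cuda` APIs used are real ones.

```python
# Minimal sketch of the OOM-recovery pattern described in the mitigation text.
# Not part of the rule files; `run_pipeline` is a hypothetical callable.
import os

# The allocator setting must be in the environment before CUDA is initialized.
os.environ.setdefault(
    "PYTORCH_CUDA_ALLOC_CONF",
    "garbage_collection_threshold:0.6,max_split_size_mb:128",
)

import torch


def generate_with_fallback(run_pipeline, width=1024, height=1024, batch_size=4):
    """Retry generation with smaller parameters after a CUDA OOM (PyTorch >= 1.13)."""
    attempts = [
        (width, height, batch_size),
        (width, height, 1),   # first fallback: drop the batch size
        (768, 768, 1),        # second fallback: drop the resolution as well
    ]
    for w, h, b in attempts:
        try:
            return run_pipeline(width=w, height=h, batch_size=b)
        except torch.cuda.OutOfMemoryError:
            torch.cuda.empty_cache()  # release cached blocks before retrying
            free, total = torch.cuda.mem_get_info()
            print(f"OOM at {w}x{h} batch={b}: {free / 2**20:.0f} MiB free of {total / 2**20:.0f} MiB")
    raise RuntimeError("all fallback attempts failed; restart the WebUI process")
```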
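For readers unfamiliar with the `sequence` block in the rule, it fires when both regexes match, in order, within the 30-second window. The snippet below is only a rough approximation of that behaviour, written against the timestamp format of `test.log`; it is not how the rule engine itself is implemented.

```python
# Rough approximation of the rule's sequence semantics, for illustration only.
import re
from datetime import datetime

OOM = re.compile(r"torch\.cuda\.OutOfMemoryError: CUDA out of memory")
FAILURE = re.compile(r"Fatal error during image generation|Complete service failure")
TIMESTAMP = re.compile(r"\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d{3}\]")
WINDOW_SECONDS = 30


def sequence_fires(lines):
    oom_times, failure_times = [], []
    for line in lines:
        m = TIMESTAMP.match(line)
        if not m:
            continue  # skip the untimestamped duplicate lines in test.log
        ts = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S")
        if OOM.search(line):
            oom_times.append(ts)
        elif FAILURE.search(line):
            failure_times.append(ts)
    # Fires if some failure event follows an OOM event within the window.
    return any(
        0 <= (f - o).total_seconds() <= WINDOW_SECONDS
        for o in oom_times for f in failure_times
    )


with open("rules/cre-2025-0130/test.log") as fh:
    print(sequence_fires(fh))  # expected: True for the log above
```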