84 changes: 84 additions & 0 deletions rules/cre-2025-0130/stable-diffusion-cuda-oom.yaml
@@ -0,0 +1,84 @@
rules:
  - metadata:
      kind: prequel
      id: StableDiffusionCUDAOOMDetector
      generation: 1
    cre:
      id: CRE-2025-0130
      severity: 0
      title: Stable Diffusion WebUI Critical CUDA Out of Memory Failure
      category: memory-problem
      author: CRE Community
      description: |
        - The Stable Diffusion WebUI (AUTOMATIC1111) experiences critical CUDA out-of-memory errors during image generation.
        - This typically occurs when generating high-resolution images or large batches whose memory requirements exceed the available GPU VRAM.
        - The failure cascades from initial memory allocation errors to complete WebUI unresponsiveness and service failure.
        - This is one of the most common and disruptive failures affecting Stable Diffusion deployments.
      cause: |
        - GPU VRAM exhaustion due to insufficient memory for model loading and tensor operations.
        - High-resolution image generation (e.g., 1024x1024 or larger) requiring more memory than is available.
        - Large batch sizes that multiply memory requirements beyond GPU capacity.
        - Memory fragmentation from previous operations that prevents allocation of the required contiguous memory blocks.
        - Inefficient model loading or caching that consumes excessive VRAM.
        - Running multiple concurrent generation processes without proper memory management.
      impact: |
        - Complete service interruption - the WebUI becomes unresponsive and requires a manual restart.
        - Loss of current generation progress and any queued generation tasks.
        - Potential CUDA context corruption requiring a process restart to recover.
        - Degraded user experience from failed image generations and error messages.
        - System instability in multi-user deployments, where one user's OOM can affect others.
        - Cascading failures where recovery attempts also fail due to memory constraints.
      tags:
        - memory-exhaustion
        - crash
        - errors
        - service
        - python
        - memory
        - oom-kill
        - critical-failure
        - cuda
        - pytorch
      mitigation: |
        - **Immediate Response:**
          - Restart the Stable Diffusion WebUI process to clear the CUDA context and reset memory state.
          - Check GPU memory usage with `nvidia-smi` to verify memory is properly released after the restart.
        - **Configuration Adjustments:**
          - Add command-line arguments: `--medvram` (moderate memory reduction) or `--lowvram` (aggressive memory reduction).
          - Use `--xformers` or `--opt-sdp-attention` to enable memory-efficient attention mechanisms.
          - Set the environment variable `PYTORCH_CUDA_ALLOC_CONF=garbage_collection_threshold:0.6,max_split_size_mb:128`.
        - **Generation Parameter Tuning:**
          - Reduce image resolution (e.g., from 1024x1024 to 768x768 or 512x512).
          - Decrease the batch size from 4+ to 1-2 images per generation.
          - Enable the "Tiled VAE" extension for high-resolution images to reduce VRAM usage during decoding.
        - **System-Level Solutions:**
          - Upgrade to a GPU with more VRAM (12GB+ recommended for high-resolution work).
          - Monitor GPU memory usage proactively and set alerts before usage reaches 90% of capacity.
          - Implement resource limits in multi-user deployments to prevent memory monopolization.
        - **Preventative Measures:**
          - Install memory-monitoring extensions like VRAM-ESTIMATOR to track usage in real time.
          - Educate users on appropriate generation parameters for their hardware.
          - Implement automatic parameter adjustment based on available VRAM.
      references:
        - "https://github.com/AUTOMATIC1111/stable-diffusion-webui/issues/16114"
        - "https://github.com/AUTOMATIC1111/stable-diffusion-webui/issues/12992"
        - "https://github.com/AUTOMATIC1111/stable-diffusion-webui/issues/13878"
        - "https://www.aiarty.com/stable-diffusion-guide/fix-cuda-out-of-memory-stable-diffusion.htm"
      applications:
        - name: stable-diffusion-webui
          processName: python
          repoUrl: "https://github.com/AUTOMATIC1111/stable-diffusion-webui"
          version: "*"
        - name: pytorch
          version: "*"
      impactScore: 9
      mitigationScore: 3
      reports: 1
    rule:
      sequence:
        window: 30s
        event:
          source: cre.log.stable-diffusion
        order:
          - regex: "torch\\.cuda\\.OutOfMemoryError: CUDA out of memory"
          - regex: "Fatal error during image generation|Complete service failure"
44 changes: 44 additions & 0 deletions rules/cre-2025-0130/test.log
@@ -0,0 +1,44 @@
[2025-08-27 12:50:17,000] INFO [WebUI] Initializing Stable Diffusion WebUI
Initializing Stable Diffusion WebUI
[2025-08-27 12:50:17,000] INFO [ModelLoader] Loading checkpoints/v1-5-pruned.safetensors
Loading checkpoints/v1-5-pruned.safetensors
[2025-08-27 12:50:17,000] INFO [TorchCUDA] CUDA available: True
CUDA available: True
[2025-08-27 12:50:17,000] INFO [TorchCUDA] CUDA device count: 1
CUDA device count: 1
[2025-08-27 12:50:17,000] INFO [TorchCUDA] CUDA device: GeForce RTX 3060 (6GB)
CUDA device: GeForce RTX 3060 (6GB)
[2025-08-27 12:50:19,000] INFO [WebUI] Starting image generation: 1024x1024, batch_size=4
Starting image generation: 1024x1024, batch_size=4
[2025-08-27 12:50:19,000] INFO [ModelLoader] Loading model to CUDA device
Loading model to CUDA device
[2025-08-27 12:50:20,000] WARN [TorchCUDA] GPU memory usage: 5.8GB/6.0GB (97%)
GPU memory usage: 5.8GB/6.0GB (97%)
[2025-08-27 12:50:20,000] WARN [ModelLoader] High memory usage detected during model loading
High memory usage detected during model loading
[2025-08-27 12:50:21,000] ERROR [TorchCUDA] torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 512.00 MiB (GPU 0; 6.00 GiB total capacity; 5.63 GiB already allocated; 156.19 MiB free; 5.74 GiB reserved in total by PyTorch)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 512.00 MiB (GPU 0; 6.00 GiB total capacity; 5.63 GiB already allocated; 156.19 MiB free; 5.74 GiB reserved in total by PyTorch)
[2025-08-27 12:50:21,000] CRITICAL [WebUI] Fatal error during image generation
Fatal error during image generation
[2025-08-27 12:50:21,000] ERROR [ModelLoader] Failed to allocate tensor on device
Failed to allocate tensor on device
[2025-08-27 12:50:21,000] ERROR [TorchCUDA] RuntimeError: CUDA out of memory
RuntimeError: CUDA out of memory
[2025-08-27 12:50:22,000] ERROR [WebUI] Generation process crashed
Generation process crashed
[2025-08-27 12:50:22,000] ERROR [WebUI] Gradio interface becoming unresponsive
Gradio interface becoming unresponsive
[2025-08-27 12:50:22,000] WARN [WebUI] Multiple failed generation attempts detected
Multiple failed generation attempts detected
[2025-08-27 12:50:22,000] ERROR [TorchCUDA] CUDA context may be corrupted
CUDA context may be corrupted
[2025-08-27 12:50:23,000] INFO [WebUI] Attempting to recover from OOM error
Attempting to recover from OOM error
[2025-08-27 12:50:23,000] WARN [TorchCUDA] Clearing CUDA cache
Clearing CUDA cache
[2025-08-27 12:50:23,000] ERROR [WebUI] Recovery failed - WebUI requires restart
Recovery failed - WebUI requires restart
[2025-08-27 12:50:24,000] ERROR [TorchCUDA] torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.25 GiB
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.25 GiB
[2025-08-27 12:50:24,000] CRITICAL [WebUI] Complete service failure - manual intervention required
Complete service failure - manual intervention required
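The mitigation guidance above boils down to allocator tuning plus graceful degradation when an OOM does occur. The sketch below is a minimal illustration, not WebUI code: `generate` is a hypothetical callable standing in for whatever function actually runs the pipeline. It sets `PYTORCH_CUDA_ALLOC_CONF`, catches `torch.cuda.OutOfMemoryError`, empties the CUDA cache, and retries at a smaller batch size before giving up.

```python
import os
import torch

# Allocator settings from the mitigation notes; must be in the environment
# before the process makes its first CUDA allocation.
os.environ.setdefault(
    "PYTORCH_CUDA_ALLOC_CONF",
    "garbage_collection_threshold:0.6,max_split_size_mb:128",
)

def generate_with_fallback(generate, prompt, batch_size=4):
    """Retry generation at halved batch sizes after a CUDA OOM instead of crashing."""
    # `generate` is a hypothetical callable standing in for the WebUI's txt2img pipeline.
    while batch_size >= 1:
        try:
            return generate(prompt, batch_size=batch_size)
        except torch.cuda.OutOfMemoryError:
            torch.cuda.empty_cache()  # release cached blocks held by PyTorch's allocator
            batch_size //= 2          # e.g. 4 -> 2 -> 1 -> give up
    raise RuntimeError("CUDA OOM persisted even at batch_size=1")
```

Note that `empty_cache()` only returns memory the caching allocator is holding in reserve; it cannot make a genuinely oversized request fit, which is why the retry also reduces the batch size.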
8 changes: 7 additions & 1 deletion rules/tags/tags.yaml
@@ -845,6 +845,12 @@ tags:
  - name: cluster-scaling
    displayName: Cluster Scaling
    description: Problems related to Kubernetes cluster scaling operations and capacity management
  - name: cuda
    displayName: CUDA
    description: Problems related to the NVIDIA CUDA GPU computing platform and memory management
  - name: pytorch
    displayName: PyTorch
    description: Problems related to the PyTorch deep learning framework and tensor operations
  - name: autogpt
    displayName: AutoGPT
    description: Problems related to AutoGPT autonomous AI agent framework
@@ -865,7 +871,7 @@
    description: Problems related to OpenAI API services including GPT models
  - name: recursive-analysis
    displayName: Recursive Analysis
    description: Problems where systems enter recursive self-analysis loops leading to resource exhaustion
    description: Problems where systems enter recursive self-analysis loops leading to resource exhaustion
  - name: n8n
    displayName: N8N
    description: Problems related to n8n workflow automation platform