
Commit c414539

CRE-2025-0162: Stable Diffusion WebUI CUDA Out of Memory Detection (#146)
* added cre
* init
1 parent c1fe819 commit c414539

2 files changed: +84 -0 lines changed

Lines changed: 70 additions & 0 deletions
rules:
  - metadata:
      kind: prequel
      id: SD8xK2mN9pQzYvWr3aLfJ7
      hash: XpQ9Lm4Zk8TnVb2Ry6HwGs
    cre:
      id: CRE-2025-0162
      severity: 1
      title: "Stable Diffusion WebUI CUDA Out of Memory Crash"
      category: "memory-problem"
      author: Prequel Community
      description: |
        Detects critical CUDA out of memory errors in Stable Diffusion WebUI that cause image generation failures and application crashes. This occurs when GPU VRAM is exhausted during model loading or image generation, resulting in complete task failure and potential WebUI instability.
      cause: |
        - Insufficient GPU VRAM for requested image resolution or batch size
        - Memory fragmentation preventing large contiguous allocations
        - Model loading exceeding available VRAM capacity
        - Concurrent GPU processes consuming memory
        - High-resolution image generation without memory optimization flags
      impact: |
        - Complete image generation failure
        - WebUI crash requiring restart
        - Loss of in-progress generation work
        - Potential GPU driver instability
        - Service unavailability for users
      tags:
        - memory
        - nvidia
        - crash
        - out-of-memory
        - configuration
      mitigation: |
        IMMEDIATE ACTIONS:
        - Restart Stable Diffusion WebUI
        - Clear GPU memory: nvidia-smi --gpu-reset
        - Add memory optimization flags: --medvram or --lowvram
        CONFIGURATION FIXES:
        - For 4-6GB VRAM: Add --medvram to webui-user.bat
        - For 2-4GB VRAM: Add --lowvram to webui-user.bat
        - Enable xformers: --xformers for memory efficiency
        - Add --always-batch-cond-uncond for batch processing
        RUNTIME ADJUSTMENTS:
        - Reduce image resolution (512x512 instead of 1024x1024)
        - Decrease batch size to 1
        - Lower batch count for multiple generations
        - Set PYTORCH_CUDA_ALLOC_CONF=garbage_collection_threshold:0.9,max_split_size_mb:512
        PREVENTION:
        - Monitor GPU memory usage with nvidia-smi
        - Implement gradual resolution scaling
        - Use cloud services for high-resolution generation
        - Upgrade to GPU with minimum 8GB VRAM
      references:
        - https://github.com/AUTOMATIC1111/stable-diffusion-webui/issues/12992
        - https://github.com/AUTOMATIC1111/stable-diffusion-webui/issues/9770
        - https://github.com/CompVis/stable-diffusion/issues/39
      applications:
        - name: stable-diffusion-webui
          version: ">=1.0.0"
      impactScore: 8
      mitigationScore: 7
      reports: 15
    rule:
      set:
        window: 120s
        event:
          source: cre.log.stable-diffusion
        match:
          - regex: 'OutOfMemoryError.*CUDA out of memory'
          - regex: 'CUDA out of memory.*Tried to allocate'
          - regex: 'model failed to load.*OutOfMemoryError'
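
As a quick illustration of what the match block above keys on, here is a minimal Python sketch (not part of the rule) that compiles the three regexes and applies them to two lines taken from the accompanying test.log; only the out-of-memory line is flagged.

import re

# The three patterns from the rule's `match` block.
patterns = [
    re.compile(r"OutOfMemoryError.*CUDA out of memory"),
    re.compile(r"CUDA out of memory.*Tried to allocate"),
    re.compile(r"model failed to load.*OutOfMemoryError"),
]

# Sample lines drawn from rules/cre-2025-0162/test.log.
lines = [
    "2025-08-29 14:23:48.789 [INFO] Allocating GPU memory...",
    "2025-08-29 14:23:49.012 [ERROR] torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.00 GiB",
]

for line in lines:
    hit = any(p.search(line) for p in patterns)
    print("MATCH" if hit else "no match", "->", line)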
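
The mitigation text above also points at the PyTorch allocator setting and the failure mode it works around. A minimal sketch, assuming the allocator config has to be in the environment before torch initializes CUDA; the generate callable below is a hypothetical stand-in for the WebUI's image pipeline, not actual WebUI API.

import os

# Allocator settings from the mitigation section; set them before torch
# initializes the CUDA allocator.
os.environ.setdefault(
    "PYTORCH_CUDA_ALLOC_CONF",
    "garbage_collection_threshold:0.9,max_split_size_mb:512",
)

import torch


def generate_with_fallback(generate, width=1024, height=1024):
    # `generate` is a hypothetical callable standing in for the txt2img pipeline.
    try:
        return generate(width, height)
    except torch.cuda.OutOfMemoryError:
        # Drop cached allocations and retry at the reduced resolution the
        # mitigation section recommends.
        torch.cuda.empty_cache()
        return generate(512, 512)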

rules/cre-2025-0162/test.log

Lines changed: 14 additions & 0 deletions
2025-08-29 14:23:45.123 [ERROR] Loading model stable-diffusion-v1.5
2025-08-29 14:23:47.456 [INFO] Model weights: 4.27 GB
2025-08-29 14:23:48.789 [INFO] Allocating GPU memory...
2025-08-29 14:23:49.012 [ERROR] torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.00 GiB (GPU 0; 6.00 GiB total capacity; 4.50 GiB already allocated; 1.20 GiB free; 4.80 GiB reserved in total by PyTorch)
2025-08-29 14:23:49.013 [ERROR] RuntimeError: CUDA out of memory. Tried to allocate 2.00 GiB. GPU 0 has a total capacity of 6.00 GiB of which 1.20 GiB is free. Process 12345 has 4.50 GiB memory in use.
2025-08-29 14:23:49.014 [CRITICAL] Stable Diffusion model failed to load: OutOfMemoryError
2025-08-29 14:23:49.015 [ERROR] CUDA error: out of memory
2025-08-29 14:23:49.016 [ERROR] GPU 0 has a total capacity of 6.00 GiB of which 1.20 GiB is free. Allocation failed.
2025-08-29 14:23:49.017 [ERROR] Failed to generate image: CUDA out of memory
2025-08-29 14:23:49.018 [INFO] Attempting to clear cache...
2025-08-29 14:23:50.123 [INFO] Cache cleared, retrying...
2025-08-29 14:23:51.456 [ERROR] torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.50 GiB
2025-08-29 14:23:51.457 [CRITICAL] Image generation failed after retry
2025-08-29 14:23:51.458 [ERROR] WebUI shutting down due to memory error
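
A small timing check for the fixture above, assuming it is saved as test.log in the working directory: it collects the timestamps of the out-of-memory lines and confirms they all land inside the rule's 120-second correlation window (a plain substring filter stands in for the rule's regexes to keep the sketch short).

from datetime import datetime

WINDOW_SECONDS = 120  # the rule's `window: 120s`

timestamps = []
with open("test.log") as fh:  # assumed path to the fixture above
    for line in fh:
        # Simplified stand-in for the rule's regexes: any OOM-related line.
        if "CUDA out of memory" in line or "OutOfMemoryError" in line:
            # Timestamps look like "2025-08-29 14:23:49.012" (first 23 chars).
            timestamps.append(datetime.strptime(line[:23], "%Y-%m-%d %H:%M:%S.%f"))

span = (max(timestamps) - min(timestamps)).total_seconds()
print(f"{len(timestamps)} OOM lines spanning {span:.1f}s; "
      f"within window: {span <= WINDOW_SECONDS}")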
