
[Issue]: Severe System Crash on VRAM Overflow with ComfyUI on Windows 11 #1320

@mykite

Problem Description

Environment:
OS: Windows 11
GPU: AMD 9070 (16 GB VRAM)
AMD Driver: AMD Software: Adrenalin Edition (Latest version as of August 2025)
Python Version: 3.12.10
PyTorch/ROCm Package Index: https://d2awnip2yjpvqn.cloudfront.net/v2/gfx120X-all/
Problem:
When using ComfyUI, if VRAM usage exceeds the physical VRAM capacity (16 GB), the entire system crashes immediately. The symptoms are:
The screen goes black (no signal).
The motherboard's VGA fault LED turns on (red light).
The system becomes completely unresponsive, forcing a hard shutdown by holding the power button.
This issue is consistently triggered by VRAM-intensive operations, such as:
Using any upscaler node.
Generating or processing images with resolutions exceeding 720x720.
Workarounds Attempted:
I've experimented with ComfyUI's command-line arguments and observed the following behaviors:
Without any arguments: The system crashes as described above whenever VRAM usage surpasses 16GB during intensive tasks.
With --disable-smart-memory --reserve-vram 4:
This setting prevents the hard crash and the need for a forced shutdown.
However, when VRAM is exhausted, the screen flickers to black once and then recovers.
After recovery, many Windows 11 functions become limited or unresponsive (e.g., the taskbar does not respond to clicks and some applications stop accepting input), which indicates the system is left in an unstable state.
With --disable-smart-memory --reserve-vram 8:
This configuration gave the best results: stability improved significantly and crashes became far less frequent.
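For reference, the three configurations above can be written down as a small Python sketch that builds the launch command. The helper name `comfyui_command` is hypothetical, and `main.py` as the entry point is an assumption about how ComfyUI is started on this machine; only the two flags come from the report.

```python
import sys


def comfyui_command(reserve_vram_gb=None, disable_smart_memory=False,
                    entry_point="main.py"):
    """Build the argv list for one of the configurations described above.

    entry_point is an assumption; adjust it to the local ComfyUI checkout.
    """
    argv = [sys.executable, entry_point]
    if disable_smart_memory:
        argv.append("--disable-smart-memory")
    if reserve_vram_gb is not None:
        argv += ["--reserve-vram", str(reserve_vram_gb)]
    return argv


# The three configurations from the report.
configs = {
    "no arguments": comfyui_command(),
    "reserve 4 GB": comfyui_command(4, disable_smart_memory=True),
    "reserve 8 GB": comfyui_command(8, disable_smart_memory=True),
}

for name, argv in configs.items():
    # Print only the part after the interpreter path.
    print(f"{name}: python {' '.join(argv[1:])}")
```

Passing one of these lists to `subprocess.run` from the ComfyUI directory would start the server with that configuration.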
Core of the Issue:
I suspect a critical bug in VRAM management within the ROCm stack or the AMD graphics driver.
Under Windows, the expected behavior when dedicated VRAM is exhausted is for the driver and the OS (WDDM) to begin utilizing system RAM as Shared GPU Memory to handle the overflow. While this incurs a significant performance penalty, it should prevent a system failure. Eventually, if the combined memory is still insufficient, the application should receive an "Out of Memory" error and terminate gracefully, leaving the OS stable.
However, what appears to happen instead is that once VRAM pressure reaches a critical point, the driver fails while attempting to page to, or otherwise manage, this shared memory. Rather than handling the memory pressure or returning an "Out of Memory" error to the application, the GPU driver itself seems to crash, which in turn hangs the entire system (hence the black screen and the motherboard fault LED).
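A minimal sketch of how the expected behavior could be checked outside ComfyUI: keep allocating fixed-size blocks until the allocator raises, then report how far it got. All function names here are hypothetical. `torch_allocate` assumes a PyTorch ROCm build where the GPU is exposed as the `"cuda"` device; it has not been verified on this setup, and on an affected machine it may trigger the crash described above. The simulated allocator only illustrates the expected contract (a clean error, no hang) and touches no GPU.

```python
def fill_until_oom(allocate, chunk_mib=256, max_chunks=1024):
    """Call allocate(chunk_mib) repeatedly, holding every block, until it fails.

    Returns (MiB successfully allocated, the exception raised or None).
    """
    held = []
    error = None
    for _ in range(max_chunks):
        try:
            held.append(allocate(chunk_mib))
        except (RuntimeError, MemoryError) as exc:
            # torch.cuda.OutOfMemoryError is a RuntimeError subclass.
            error = exc
            break
    allocated = len(held) * chunk_mib
    held.clear()  # release everything before returning
    return allocated, error


def torch_allocate(chunk_mib):
    """Allocate chunk_mib MiB on the GPU (assumes a PyTorch ROCm build)."""
    import torch  # imported lazily so the simulation below runs without PyTorch
    return torch.empty(chunk_mib * 1024 * 1024, dtype=torch.uint8, device="cuda")


def make_simulated_allocator(capacity_mib):
    """Hypothetical stand-in for a device that reports exhaustion cleanly."""
    used = 0

    def allocate(chunk_mib):
        nonlocal used
        if used + chunk_mib > capacity_mib:
            raise MemoryError(f"out of memory after {used} MiB")
        used += chunk_mib
        return object()  # placeholder handle; no real memory is allocated

    return allocate


# Simulated 16 GiB device: the loop ends with an error instead of a hang.
allocated, error = fill_until_oom(make_simulated_allocator(16 * 1024))
print(allocated, type(error).__name__)  # prints: 16384 MemoryError
```

On real hardware the call would be `fill_until_oom(torch_allocate)`. If the driver behaves as expected, it should return with an out-of-memory exception (possibly after spilling into shared GPU memory) rather than taking the system down.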
The fact that the --reserve-vram argument mitigates the problem reinforces this theory. By preventing the application from using the last portion of VRAM, it keeps the driver from entering this high-pressure state where the faulty memory management logic is triggered. This strongly suggests the root cause is a driver-level bug in handling the transition from VRAM to shared system memory, rather than an issue with the application itself.
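As a rough illustration of the headroom involved, the arithmetic for the two `--reserve-vram` values tried can be sketched as follows. This assumes the flag simply subtracts the given number of gigabytes from the 16 GB of physical VRAM; ComfyUI's actual accounting may differ, and the helper name is hypothetical.

```python
TOTAL_VRAM_GB = 16  # physical VRAM reported in this issue


def usable_vram_gb(reserve_gb, total_gb=TOTAL_VRAM_GB):
    """Approximate VRAM left to ComfyUI after the reserved headroom.

    Simplifying assumption: usable = total - reserve, floored at zero.
    """
    return max(total_gb - reserve_gb, 0)


for reserve in (4, 8):
    print(f"--reserve-vram {reserve}: about {usable_vram_gb(reserve)} GB usable")
```

Under that assumption, reserving 4 GB leaves about 12 GB for ComfyUI and reserving 8 GB leaves about 8 GB, which is consistent with the larger reserve keeping the driver further from the pressure point.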

Operating System

Windows 11

CPU

U7 265K

GPU

9070

ROCm Version

7.0.0rc20250821

ROCm Component

No response

Steps to Reproduce

No response

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

No response

Additional Information

No response
