Description
Problem Description
Environment:
OS: Windows 11
GPU: AMD 9070
AMD Driver: AMD Software: Adrenalin Edition (Latest version as of August 2025)
Python Version Tried: 3.12.10
PyTorch/ROCm Package Index: https://d2awnip2yjpvqn.cloudfront.net/v2/gfx120X-all/
Problem:
When using ComfyUI, if the VRAM usage exceeds the physical VRAM capacity (16GB), the entire system crashes instantly and catastrophically. The symptoms are:
The screen goes black (no signal).
The motherboard's VGA fault LED turns on (red light).
The system becomes completely unresponsive, forcing a hard shutdown by holding the power button.
This issue is consistently triggered by VRAM-intensive operations, such as:
Using any upscaler node.
Generating or processing images with resolutions exceeding 720x720.
Workarounds Attempted:
I've experimented with ComfyUI's command-line arguments and observed the following behaviors:
Without any arguments: The system crashes as described above whenever VRAM usage surpasses 16GB during intensive tasks.
With --disable-smart-memory --reserve-vram 4:
This setting prevents the hard crash and the need for a forced shutdown.
However, when VRAM is exhausted, the screen flickers to black once and then recovers.
After recovery, many Windows 11 functions become limited or unresponsive (e.g., the taskbar doesn't respond to clicks and interaction with some applications is broken), indicating that the system is left in an unstable state.
With --disable-smart-memory --reserve-vram 8:
This configuration provides the best results, significantly improving stability and greatly reducing the frequency of crashes.
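For anyone trying to reproduce the three configurations above, here is a minimal sketch of a launcher that builds each command line. The helper name `build_comfyui_command` and the `CONFIGS` table are hypothetical, and the sketch assumes ComfyUI is started via `main.py` from its own directory; only the `--disable-smart-memory` and `--reserve-vram` flags come from the tests described above.

```python
import sys


def build_comfyui_command(reserve_vram_gb=None, disable_smart_memory=False,
                          entry_point="main.py"):
    """Build an argv list for launching ComfyUI (hypothetical helper).

    entry_point is an assumption: it presumes the script is run from the
    ComfyUI directory, where main.py is the usual entry point.
    """
    cmd = [sys.executable, entry_point]
    if disable_smart_memory:
        cmd.append("--disable-smart-memory")
    if reserve_vram_gb is not None:
        # The value is given in GB, as in the tests above.
        cmd += ["--reserve-vram", str(reserve_vram_gb)]
    return cmd


# The three configurations compared in this report.
CONFIGS = {
    "no_arguments": build_comfyui_command(),
    "reserve_4": build_comfyui_command(4, disable_smart_memory=True),
    "reserve_8": build_comfyui_command(8, disable_smart_memory=True),
}

if __name__ == "__main__":
    for name, cmd in CONFIGS.items():
        print(f"{name}: {' '.join(cmd)}")
```

Each list can be passed to `subprocess.run(cmd)` to start ComfyUI with that configuration.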
Core of the Issue:
My core suspicion is that there is a critical issue with VRAM management within the ROCm stack or the AMD graphics driver.
Under Windows, the expected behavior when dedicated VRAM is exhausted is for the driver and the OS (WDDM) to begin utilizing system RAM as Shared GPU Memory to handle the overflow. While this incurs a significant performance penalty, it should prevent a system failure. Eventually, if the combined memory is still insufficient, the application should receive an "Out of Memory" error and terminate gracefully, leaving the OS stable.
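To make the expected graceful failure path concrete, here is a minimal sketch of what it looks like from the application side. `run_with_oom_guard` and `simulated_allocation_failure` are hypothetical names used only for illustration; with PyTorch, one would presumably pass `torch.cuda.OutOfMemoryError` as the error type, but the sketch uses the built-in `MemoryError` so it runs without a GPU.

```python
def run_with_oom_guard(task, oom_errors=(MemoryError,)):
    """Run task() and report an out-of-memory failure instead of crashing.

    Returns (True, result) on success, or (False, message) if one of the
    exception types in oom_errors was raised.
    """
    try:
        return True, task()
    except oom_errors as exc:
        return False, f"Out of memory: {exc}"


def simulated_allocation_failure():
    # Stand-in for a GPU allocation that the driver refuses.
    raise MemoryError("simulated VRAM exhaustion")


ok, outcome = run_with_oom_guard(simulated_allocation_failure)
print(ok, outcome)
```

The point is that the failure stays inside the process: the application sees an exception and can exit cleanly, and the OS is unaffected.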
However, once VRAM pressure reaches a critical point, the driver appears to fail catastrophically while attempting to page to, or otherwise manage, this shared memory. Instead of handling the memory pressure gracefully or returning an "Out of Memory" error to the application, the GPU driver itself seems to crash, which in turn hangs the entire system (leading to the black screen and the motherboard fault light).
The fact that the --reserve-vram argument mitigates the problem reinforces this theory. By preventing the application from using the last portion of VRAM, it keeps the driver from entering this high-pressure state where the faulty memory management logic is triggered. This strongly suggests the root cause is a driver-level bug in handling the transition from VRAM to shared system memory, rather than an issue with the application itself.
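The headroom idea behind `--reserve-vram` can be sketched with simple arithmetic. This is an illustration only, not ComfyUI's actual implementation: the function names are hypothetical, and `query_vram` assumes a PyTorch build with a working GPU backend (on ROCm builds the `torch.cuda` namespace is backed by HIP).

```python
GIB = 1024 ** 3


def usable_vram_bytes(total_bytes, reserve_gb):
    """VRAM left for the application after holding back reserve_gb."""
    return max(0, total_bytes - int(reserve_gb * GIB))


def would_exceed_budget(requested_bytes, in_use_bytes, total_bytes, reserve_gb):
    """True if a new allocation would push usage past the reserved headroom."""
    budget = usable_vram_bytes(total_bytes, reserve_gb)
    return in_use_bytes + requested_bytes > budget


def query_vram():
    """Return (free_bytes, total_bytes) for the current GPU.

    Assumption: PyTorch is installed and a GPU is visible to it.
    """
    import torch  # imported lazily so the arithmetic above works without it
    return torch.cuda.mem_get_info()


# On the 16 GB card from this report, reserving 8 GB leaves an 8 GB budget.
total = 16 * GIB
print(usable_vram_bytes(total, 8) // GIB)
print(would_exceed_budget(3 * GIB, 6 * GIB, total, 8))
```

Under this model, a larger reserve keeps allocations further away from the point where the driver would have to spill into shared system memory, which is consistent with the crashes becoming rarer at `--reserve-vram 8` than at `--reserve-vram 4`.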
Operating System
Windows 11
CPU
U7 265K
GPU
9700
ROCm Version
7.0.0rc20250821
ROCm Component
No response
Steps to Reproduce
No response
(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support
No response
Additional Information
No response