Describe the issue
I am experiencing critical system failures (TDR, black screen, and GPU driver corruption) while running a CRAFT-based text detection model (ch_PP-OCRv4_det_server_infer.onnx) using onnxruntime-directml on an AMD RX 580 8GB.
This has happened three times, and each time recovery required a hard reboot and a full reinstallation of the graphics driver. I plan to distribute this software to a wide range of users, so long-term stability is my top priority. I need to know how to prevent these system-wide crashes on older Polaris-based GPUs.
Further information
Relevant Area: model usage, backend, operators
Is this issue related to a specific model?
Model name: ch_PP-OCRv4_det_server_infer.onnx
Model opset: 17
Notes
Hardware: AMD Radeon RX 580 8GB
Input Size: 640x640 (Fixed)
Batch Size: 1
I am using the following configuration in Python to minimize VRAM usage, but the crashes persist:
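import onnxruntime as ort

opts = ort.SessionOptions()
# Limit CPU-side intra-op parallelism to a single thread.
opts.intra_op_num_threads = 1
# Disable memory pattern planning.
opts.enable_mem_pattern = False
# Disable the CPU memory arena allocator.
opts.enable_cpu_mem_arena = False

# Run on the first DirectML device (the RX 580).
providers = [('DmlExecutionProvider', {'device_id': 0})]
session = ort.InferenceSession("model.onnx", sess_options=opts, providers=providers)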
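For comparison, here is a minimal sketch of the same session setup with an explicit CPU fallback and sequential execution mode set alongside the disabled memory pattern. The helper names `choose_providers` and `make_session` are hypothetical, not part of onnxruntime, and nothing here is confirmed to prevent the TDR; it only makes the configuration explicit for machines where DirectML is unavailable.

```python
def choose_providers(available, device_id=0):
    """Return a provider list: DirectML first if present, CPU as fallback.

    `available` is a list of provider names, e.g. the result of
    onnxruntime.get_available_providers().
    """
    if "DmlExecutionProvider" in available:
        return [
            ("DmlExecutionProvider", {"device_id": device_id}),
            "CPUExecutionProvider",
        ]
    return ["CPUExecutionProvider"]


def make_session(model_path, device_id=0):
    """Create an InferenceSession (hypothetical helper, assumes onnxruntime is installed)."""
    import onnxruntime as ort  # imported lazily so choose_providers works without it

    opts = ort.SessionOptions()
    opts.enable_mem_pattern = False  # same setting as in the configuration above
    opts.execution_mode = ort.ExecutionMode.ORT_SEQUENTIAL  # no parallel operator execution
    providers = choose_providers(ort.get_available_providers(), device_id)
    return ort.InferenceSession(model_path, sess_options=opts, providers=providers)
```

Calling `make_session("model.onnx")` would then replace the last three lines of the configuration above.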
To reproduce
- Use a Windows 10 machine with AMD Radeon RX 580 8GB.
- Install/use onnxruntime-directml and run the app with DmlExecutionProvider enabled.
- Start an OCR workflow that completes text detection/recognition first.
- Immediately start the next inpaint step with the AOTGAN ONNX model.
- In my case, the crash happens right after OCR/prompt generation, at or immediately after AOTGAN starts its first DirectML inference.
- The system may freeze or black-screen, then reboot or require a hard reset.
- After reboot, Windows Event Viewer shows LiveKernelEvent 141 followed by LiveKernelEvent 1b0 / WATCHDOG-related reports.
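The sequence in the steps above can be sketched as a small driver that runs the stages back to back and prints a line before each step, so the last line printed before a freeze shows which stage was active. This is only an illustration: `run_pipeline` and the stage tuple layout are hypothetical, and the session factories and inference callables are assumed to be supplied by the application.

```python
import time


def run_pipeline(stages):
    """Run (name, make_session, infer) stages back to back.

    make_session() is expected to build the session (for example an
    onnxruntime InferenceSession using DmlExecutionProvider), and
    infer(session) is expected to run one inference on it.
    Returns a list of (name, elapsed_seconds) in execution order.
    """
    log = []
    for name, make_session, infer in stages:
        # Flush immediately so the message survives a hard freeze.
        print(f"{name}: creating session", flush=True)
        start = time.perf_counter()
        session = make_session()
        print(f"{name}: starting first inference", flush=True)
        infer(session)
        elapsed = time.perf_counter() - start
        print(f"{name}: finished in {elapsed:.3f}s", flush=True)
        log.append((name, elapsed))
    return log
```

In my case the stages would be the OCR detection model followed immediately by the AOTGAN inpaint model, each with its own session factory.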
Urgency
No response
Platform
Windows
OS Version
10
ONNX Runtime Installation
Built from Source
ONNX Runtime Version or Commit ID
onnxruntime-directml >= 1.24.4
ONNX Runtime API
Python
Architecture
X64
Execution Provider
DirectML
Execution Provider Library Version
No response