[DirectML] System Crash and Driver Corruption on AMD RX 580 during OCR Inference

### Describe the issue

I am experiencing critical system failures (TDR, Black screen, and GPU Driver corruption) while running a CRAFT-based text detection model () using on an AMD RX 580 8GB.ch_PP-OCRv4_det_server_infer.onnxonnxruntime-directml

This issue has occurred 3 times, each requiring a hard reboot and a full reinstallation of the graphics driver to recover. Since my goal is to distribute this software to a wide range of users, ensuring long-term stability is my top priority. I need to know how to prevent these system-wide crashes on older Polaris-based GPUs.

Further information
Relevant Area: model usage, backend, operators
Is this issue related to a specific model?
Model name: ch_PP-OCRv4_det_server_infer.onnx
Model opset: 17



Notes
I am using the following configuration in Python to minimize VRAM usage, but the crashes persist:

# Hardware: AMD Radeon RX 580 8GB
# Input Size: 640x640 (Fixed)
# Batch Size: 1

import onnxruntime as ort

opts = ort.SessionOptions()
opts.intra_op_num_threads = 1
opts.enable_mem_pattern = False
opts.enable_cpu_mem_arena = False

providers = [('DmlExecutionProvider', {'device_id': 0})]
session = ort.InferenceSession("model.onnx", sess_options=opts, providers=providers)

### To reproduce

1. Use a Windows 10 machine with AMD Radeon RX 580 8GB.
2. Install/use onnxruntime-directml and run the app with DmlExecutionProvider enabled.
3. Start an OCR workflow that completes text detection/recognition first.
4. Immediately start the next inpaint step with the AOTGAN ONNX model.
5. In my case, the crash happens right after OCR/prompt generation, at or immediately after AOTGAN starts its first DirectML inference.
6. The system may freeze or black-screen, then reboot or require a hard reset.
7. After reboot, Windows Event Viewer shows LiveKernelEvent 141 followed by LiveKernelEvent 1b0 / WATCHDOG-related reports.


### Urgency

_No response_

### Platform

Windows

### OS Version

10

### ONNX Runtime Installation

Built from Source

### ONNX Runtime Version or Commit ID

onnxruntime-directml >= 1.24.4

### ONNX Runtime API

Python

### Architecture

X64

### Execution Provider

DirectML

### Execution Provider Library Version

_No response_

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[DirectML] System Crash and Driver Corruption on AMD RX 580 during OCR Inference #29263

Describe the issue

Hardware: AMD Radeon RX 580 8GB

Input Size: 640x640 (Fixed)

Batch Size: 1

To reproduce

Urgency

Platform

OS Version

ONNX Runtime Installation

ONNX Runtime Version or Commit ID

ONNX Runtime API

Architecture

Execution Provider

Execution Provider Library Version

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Uh oh!

[DirectML] System Crash and Driver Corruption on AMD RX 580 during OCR Inference #29263

Description

Describe the issue

Hardware: AMD Radeon RX 580 8GB

Input Size: 640x640 (Fixed)

Batch Size: 1

To reproduce

Urgency

Platform

OS Version

ONNX Runtime Installation

ONNX Runtime Version or Commit ID

ONNX Runtime API

Architecture

Execution Provider

Execution Provider Library Version

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions