[DirectML] System Crash and Driver Corruption on AMD RX 580 during OCR Inference #29263

Description

@aksmfakt132-lab

Describe the issue

I am experiencing critical system failures (TDR, black screen, and GPU driver corruption) while running a CRAFT-based text detection model (ch_PP-OCRv4_det_server_infer.onnx) using onnxruntime-directml on an AMD RX 580 8GB.

This issue has occurred 3 times, each requiring a hard reboot and a full reinstallation of the graphics driver to recover. Since my goal is to distribute this software to a wide range of users, ensuring long-term stability is my top priority. I need to know how to prevent these system-wide crashes on older Polaris-based GPUs.

Further information
Relevant Area: model usage, backend, operators
Is this issue related to a specific model?
Model name: ch_PP-OCRv4_det_server_infer.onnx
Model opset: 17

Notes
I am using the following configuration in Python to minimize VRAM usage, but the crashes persist:

Hardware: AMD Radeon RX 580 8GB

Input Size: 640x640 (Fixed)

Batch Size: 1

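import onnxruntime as ort

opts = ort.SessionOptions()
# Use a single CPU thread for intra-op parallelism.
opts.intra_op_num_threads = 1
# Disable memory pattern optimization and the CPU memory arena.
opts.enable_mem_pattern = False
opts.enable_cpu_mem_arena = False

# Run on the first DirectML adapter (device 0).
providers = [('DmlExecutionProvider', {'device_id': 0})]
session = ort.InferenceSession("model.onnx", sess_options=opts, providers=providers)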
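
For comparison, here is a minimal sketch of a slightly more defensive variant of the configuration above. The ONNX Runtime DirectML documentation asks for memory pattern optimization to be disabled and for sequential execution, so the sketch sets `execution_mode` explicitly and lists `CPUExecutionProvider` as a fallback. The helper names `pick_providers` and `make_session` are hypothetical, and nothing here is confirmed to prevent the TDR on the RX 580.

```python
def pick_providers(available, device_id=0, use_dml=True):
    """Return an ordered provider list: DirectML first when present, CPU last."""
    providers = []
    if use_dml and "DmlExecutionProvider" in available:
        providers.append(("DmlExecutionProvider", {"device_id": device_id}))
    # CPU is always listed so the session can still be created without DirectML.
    providers.append("CPUExecutionProvider")
    return providers


def make_session(model_path, use_dml=True):
    """Create an InferenceSession with settings the DirectML EP expects."""
    # Imported lazily so pick_providers can be used without onnxruntime installed.
    import onnxruntime as ort

    opts = ort.SessionOptions()
    opts.enable_mem_pattern = False
    opts.execution_mode = ort.ExecutionMode.ORT_SEQUENTIAL
    providers = pick_providers(ort.get_available_providers(), use_dml=use_dml)
    return ort.InferenceSession(model_path, sess_options=opts, providers=providers)
```

Passing `use_dml=False` gives a CPU-only session, which is one way to check whether the model itself runs cleanly when DirectML is out of the picture.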

To reproduce

  1. Use a Windows 10 machine with AMD Radeon RX 580 8GB.
  2. Install/use onnxruntime-directml and run the app with DmlExecutionProvider enabled.
  3. Start an OCR workflow that completes text detection/recognition first.
  4. Immediately start the next inpaint step with the AOTGAN ONNX model.
  5. In my case, the crash happens right after OCR/prompt generation, at or immediately after AOTGAN starts its first DirectML inference.
  6. The system may freeze or black-screen, then reboot or require a hard reset.
  7. After reboot, Windows Event Viewer shows LiveKernelEvent 141 followed by LiveKernelEvent 1b0 / WATCHDOG-related reports.
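
Since the steps above point at the hand-off from the OCR models to the AOTGAN model, one way to narrow this down might be to run each stage in isolation, releasing one session before the next is created. The following is a rough sketch under that assumption; `run_stages_isolated` and its parameters are hypothetical names, and it is a diagnostic aid rather than a confirmed fix.

```python
import gc
import time


def run_stages_isolated(stages, pause_s=0.0):
    """Run (name, create, run) stages one at a time.

    Each session is created, used, and released before the next stage
    starts, as long as the caller keeps no other reference to it.
    """
    results = {}
    for name, create, run in stages:
        session = create()
        try:
            results[name] = run(session)
        finally:
            del session    # drop this function's reference to the session
            gc.collect()   # encourage prompt release of native resources
            if pause_s:
                time.sleep(pause_s)  # optional gap before the next stage
    return results
```

Each `create` would be a zero-argument callable that builds an `InferenceSession` for one model, and each `run` would perform that stage's inference and return plain NumPy results. If the crash disappears with a non-zero `pause_s` or with the stages separated this way, that would suggest the problem is tied to the back-to-back transition rather than to either model alone.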

Urgency

No response

Platform

Windows

OS Version

10

ONNX Runtime Installation

Built from Source

ONNX Runtime Version or Commit ID

onnxruntime-directml >= 1.24.4

ONNX Runtime API

Python

Architecture

X64

Execution Provider

DirectML

Execution Provider Library Version

No response

Metadata

Labels

ep:DML (issues related to the DirectML execution provider)