An out-of-tree Execution Provider for ONNXRuntime that uses AMD's hipDNN library for accelerated inference on AMD GPUs.
Work in Progress - This is a prototype implementation.
- Conv2D - via hipDNN graph API
- MatMul/Gemm - via hipDNN graph API
- hipBLAS-LT support - Currently disabled. When re-enabled, provides an alternative MatMul/Gemm backend via hipBLAS-LT.
- Torch-MLIR integration - Experimental IR-based compilation pipeline. Enable with `HIPDNN_EP_ENABLE_TORCH_MLIR=ON`.
| Dependency | Commit |
|---|---|
| TheRock | 9639502b |
| IREE | db9d11e4 |
- CMake 3.20+
- Ninja build system
- HIP SDK (from TheRock)
- hipDNN library (from TheRock)
- hipBLAS-LT (optional, from TheRock) - alternative MatMul/Gemm backend (currently disabled)
- ONNXRuntime (source and built library)
- iree-compile (required by hipDNN backend for code generation)
- Python 3 with the `onnx` package (for test model generation)
```bash
export THEROCK_DIST="/path/to/TheRock/build/dist/rocm"
export ONNXRUNTIME_ROOT="/path/to/onnxruntime"

cd hipDNNEP

# Configure
cmake --preset RelWithDebInfo

# Build
cmake --build --preset RelWithDebInfo
```

Tests require `iree-compile` in `PATH`. The recommended approach is to create local test presets in `CMakeUserPresets.json` (git-ignored) that set up the environment.
Example `CMakeUserPresets.json`:

```json
{
  "version": 4,
  "testPresets": [
    {
      "name": "RelWithDebInfo-local",
      "inherits": "RelWithDebInfo",
      "environment": {
        "PATH": "/path/to/iree/build/tools:$penv{PATH}"
      }
    }
  ]
}
```

Then run tests with the local preset:

```bash
ctest --preset RelWithDebInfo-local
```

Alternatively, set `PATH` manually before running tests:

```bash
export PATH="/path/to/iree/build/tools:$PATH"
ctest --preset RelWithDebInfo
```

For the experimental IR-based compilation pipeline:
```bash
# First build torch-mlir (one-time setup, see CLAUDE.md for details)
# Then:
cmake --preset RelWithDebInfo-MLIR
cmake --build --preset RelWithDebInfo-MLIR
ctest --preset RelWithDebInfo-MLIR-local
```

```cpp
#include <onnxruntime_cxx_api.h>

#include <cstring>
#include <vector>

int main() {
  Ort::InitApi(OrtGetApiBase()->GetApi(ORT_API_VERSION));
  Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "example");

  // Register the hipDNN EP library
  OrtStatus* status = Ort::GetApi().RegisterExecutionProviderLibrary(
      env, "HipDNN", "/path/to/libhipdnn_ep.so");
  if (status != nullptr) {
    // Handle error
    Ort::GetApi().ReleaseStatus(status);
    return 1;
  }

  // Get available EP devices
  std::vector<Ort::ConstEpDevice> devices = env.GetEpDevices();

  // Find the HipDNN device
  const OrtEpDevice* hipdnn_device = nullptr;
  for (const auto& device : devices) {
    if (std::strcmp(device.EpName(), "HipDNN") == 0) {
      hipdnn_device = device;
      break;
    }
  }
  if (hipdnn_device == nullptr) {
    return 1;  // HipDNN EP not available
  }

  // Create session options and append the EP
  Ort::SessionOptions session_options;
  Ort::GetApi().SessionOptionsAppendExecutionProvider_V2(
      session_options, env, &hipdnn_device, 1, nullptr, nullptr, 0);

  // Create a session that uses the EP
  Ort::Session session(env, "model.onnx", session_options);

  // Run inference
  // ...
  return 0;
}
```

This EP uses the ONNXRuntime Plugin EP V2 system, which allows:
- Building as a separate shared library
- Dynamic loading at runtime
- No modifications to ONNXRuntime source
- EP Factory (`HipDNNEpFactory`): Creates EP instances and manages device discovery
- EP (`HipDNNEp`): Main execution provider, handles graph partitioning and compilation
- Kernel (`Kernel`): Routes to the appropriate backend (hipDNN, hipBLAS-LT, or Torch-MLIR)
- `HipDNNGraph`: Builds the hipDNN graph from ONNX nodes for Conv2D
- `BlasGraph`: Builds hipBLAS-LT operations for MatMul/Gemm (currently disabled)
- `IRBuilder`: Torch-MLIR IR generation (experimental, when enabled)
- `NodeComputeInfo`: ORT callback interface for the kernel lifecycle
- Allocator (`HipDeviceAllocator`): HIP device memory allocation
- Data Transfer (`HipDataTransfer`): CPU <-> GPU data copies
The `Kernel` class automatically selects the appropriate backend:
- Torch-MLIR path (if enabled): Converts ONNX to Torch-MLIR IR for compilation
- hipDNN graph API: Used for Conv2D and MatMul/Gemm operations
- hipBLAS-LT (currently disabled): Alternative backend for MatMul/Gemm
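The selection logic above can be sketched in simplified form. This is a conceptual Python model, not the actual C++ `Kernel` code; the flags and return strings are illustrative.

```python
# Conceptual model of the Kernel class's backend dispatch.
# Backend names follow the document; flags and constants are illustrative.
TORCH_MLIR_ENABLED = False   # compile-time HIPDNN_EP_ENABLE_TORCH_MLIR
HIPBLASLT_ENABLED = False    # hipBLAS-LT backend is currently disabled

def select_backend(op_type: str) -> str:
    """Pick the backend used to execute a supported node."""
    if TORCH_MLIR_ENABLED:
        return "torch-mlir"            # IR-based compilation pipeline
    if op_type in ("Conv", "MatMul", "Gemm"):
        if HIPBLASLT_ENABLED and op_type in ("MatMul", "Gemm"):
            return "hipblaslt"         # alternative MatMul/Gemm backend
        return "hipdnn-graph"          # hipDNN graph API
    raise ValueError(f"unsupported op: {op_type}")
```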
When enabled, the Torch-MLIR path runs a 9-step compilation pipeline (`buildOffloadPipeline` in `passes.cc`):

1. `onnx-to-torch`: Convert `onnx.*` ops to `torch.aten.*` ops
2. `CSE`: Deduplicate constants and identical list constructs
3. `offload`: Outline supported `aten` ops into `hipdnn.graph` regions
4. `canonicalize` + `CSE`: Clean up dead ops, deduplicate cloned constants
5. `graph-to-executable`: Compile `hipdnn.graph` regions via `iree-compile`, replace with `hipdnn.executable` ops
6. `backend-legalize`: Lower torch types to builtin tensors, convert `hipdnn.executable` to DPS `hipdnn.execute`
7. `empty-tensor-elimination`: Fold `tensor.empty` into DPS destinations
8. `one-shot-bufferize`: Convert the tensor program to a memref program
9. `finalize-memrefs`: Promote returned `memref.alloc`s to function arguments (the caller provides output buffers)
The final output is a function with memref-typed arguments for all inputs and outputs, containing `hipdnn.execute` ops that reference pre-compiled graphs.
hipDNN uses a graph-based execution model:
1. Build an operation graph from ONNX nodes (`conv_fprop`, etc.)
2. Validate the graph and create execution plans
3. Execute with a variant pack (tensor uid -> device pointer mapping)
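The three steps above can be modeled in plain Python. This is a conceptual sketch of the build/validate/execute flow only; every class and method name here is illustrative, not the real hipDNN API.

```python
# Conceptual model of hipDNN's graph-based execution: build an op graph,
# validate it, then execute with a variant pack (uid -> buffer mapping).
# All names are illustrative, not the real hipDNN API.
class OpGraph:
    def __init__(self):
        self.ops = []        # e.g. ("conv_fprop", input_uids, output_uid)

    def add_op(self, kind, inputs, output):
        self.ops.append((kind, tuple(inputs), output))

    def validate(self):
        # A real implementation checks shapes/dtypes and builds execution plans.
        return all(len(op) == 3 for op in self.ops)

    def execute(self, variant_pack):
        # variant_pack maps tensor uid -> device buffer (here: a plain list).
        for kind, inputs, output in self.ops:
            if kind == "identity":   # trivial stand-in for a real kernel
                variant_pack[output][:] = variant_pack[inputs[0]]

g = OpGraph()
g.add_op("identity", ["x"], "y")
assert g.validate()
pack = {"x": [1.0, 2.0], "y": [0.0, 0.0]}
g.execute(pack)
```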
MIT License