
hipDNN Execution Provider for ONNXRuntime

An out-of-tree Execution Provider for ONNXRuntime that uses AMD's hipDNN library for accelerated inference on AMD GPUs.

Status

Work in Progress - This is a prototype implementation.

Supported Operations

  • Conv2D - via hipDNN graph API
  • MatMul/Gemm - via hipDNN graph API

Optional Features

  • hipBLAS-LT support - Currently disabled. When re-enabled, provides an alternative MatMul/Gemm backend via hipBLAS-LT.
  • Torch-MLIR integration - Experimental IR-based compilation pipeline. Enable with HIPDNN_EP_ENABLE_TORCH_MLIR=ON.

Tested Dependency Versions

Dependency   Commit
TheRock      9639502b
IREE         db9d11e4

Prerequisites

  • CMake 3.20+
  • Ninja build system
  • HIP SDK (from TheRock)
  • hipDNN library (from TheRock)
  • hipBLAS-LT (optional, from TheRock) - alternative MatMul/Gemm backend (currently disabled)
  • ONNXRuntime (source and built library)
  • iree-compile (required by hipDNN backend for code generation)
  • Python 3 with onnx package (for test model generation)

Building

1. Set Environment Variables

export THEROCK_DIST="/path/to/TheRock/build/dist/rocm"
export ONNXRUNTIME_ROOT="/path/to/onnxruntime"

2. Configure and Build

cd hipDNNEP

# Configure
cmake --preset RelWithDebInfo

# Build
cmake --build --preset RelWithDebInfo

3. Run Tests

Tests require iree-compile in PATH. The recommended approach is to create local test presets in CMakeUserPresets.json (git-ignored) that set up the environment.

Example CMakeUserPresets.json:

{
  "version": 4,
  "testPresets": [
    {
      "name": "RelWithDebInfo-local",
      "inherits": "RelWithDebInfo",
      "environment": {
        "PATH": "/path/to/iree/build/tools:$penv{PATH}"
      }
    }
  ]
}

Then run tests with the local preset:

ctest --preset RelWithDebInfo-local

Alternatively, set PATH manually before running tests:

export PATH="/path/to/iree/build/tools:$PATH"
ctest --preset RelWithDebInfo

4. Build with Torch-MLIR (Optional)

For the experimental IR-based compilation pipeline:

# First build torch-mlir (one-time setup, see CLAUDE.md for details)
# Then:
cmake --preset RelWithDebInfo-MLIR
cmake --build --preset RelWithDebInfo-MLIR
ctest --preset RelWithDebInfo-MLIR-local

Usage

Loading the EP in ONNXRuntime

#include <onnxruntime_cxx_api.h>

#include <cstring>

int main() {
    Ort::InitApi(OrtGetApiBase()->GetApi(ORT_API_VERSION));
    Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "example");

    // Register the hipDNN EP library
    OrtStatus* status = Ort::GetApi().RegisterExecutionProviderLibrary(
        env, "HipDNN", "/path/to/libhipdnn_ep.so");
    if (status != nullptr) {
        // Handle error
        Ort::GetApi().ReleaseStatus(status);
        return 1;
    }

    // Find the HipDNN device among the available EP devices
    const OrtEpDevice* hipdnn_device = nullptr;
    for (const auto& device : env.GetEpDevices()) {
        if (std::strcmp(device.EpName(), "HipDNN") == 0) {
            hipdnn_device = static_cast<const OrtEpDevice*>(device);
            break;
        }
    }
    if (hipdnn_device == nullptr) {
        return 1;  // HipDNN EP device not found
    }

    // Create session options and append the EP
    Ort::SessionOptions session_options;
    status = Ort::GetApi().SessionOptionsAppendExecutionProvider_V2(
        session_options, env, &hipdnn_device, 1, nullptr, nullptr, 0);
    if (status != nullptr) {
        Ort::GetApi().ReleaseStatus(status);
        return 1;
    }

    // Create session
    Ort::Session session(env, "model.onnx", session_options);

    // Run inference
    // ...

    return 0;
}

Architecture

This EP uses the ONNXRuntime Plugin EP V2 system, which allows:

  • Building as a separate shared library
  • Dynamic loading at runtime
  • No modifications to ONNXRuntime source

Key Components

  1. EP Factory (HipDNNEpFactory): Creates EP instances and manages device discovery
  2. EP (HipDNNEp): Main execution provider, handles graph partitioning and compilation
  3. Kernel (Kernel): Routes to appropriate backend (hipDNN, hipBLAS-LT, or Torch-MLIR)
  4. HipDNNGraph: Builds hipDNN graph from ONNX nodes for Conv2D
  5. BlasGraph: Builds hipBLAS-LT operations for MatMul/Gemm (currently disabled)
  6. IRBuilder: Torch-MLIR IR generation (experimental, when enabled)
  7. NodeComputeInfo: ORT callback interface for kernel lifecycle
  8. Allocator (HipDeviceAllocator): HIP device memory allocation
  9. Data Transfer (HipDataTransfer): CPU <-> GPU data copies

Backend Selection

The Kernel class automatically selects the appropriate backend:

  1. Torch-MLIR path (if enabled): Converts ONNX to Torch-MLIR IR for compilation
  2. hipDNN graph API: Used for Conv2D and MatMul/Gemm operations
  3. hipBLAS-LT (currently disabled): Alternative backend for MatMul/Gemm

Torch-MLIR Offload Pipeline

When enabled, the Torch-MLIR path runs a 9-step compilation pipeline (buildOffloadPipeline in passes.cc):

  1. onnx-to-torch — Convert onnx.* ops to torch.aten.* ops
  2. CSE — Deduplicate constants and identical list constructs
  3. offload — Outline supported aten ops into hipdnn.graph regions
  4. canonicalize + CSE — Clean up dead ops, deduplicate cloned constants
  5. graph-to-executable — Compile hipdnn.graph regions via iree-compile, replace with hipdnn.executable ops
  6. backend-legalize — Lower torch types to builtin tensors, convert hipdnn.executable to DPS hipdnn.execute
  7. empty-tensor-elimination — Fold tensor.empty into DPS destinations
  8. one-shot-bufferize — Convert tensor program to memref program
  9. finalize-memrefs — Promote returned memref.alloc to function arguments (caller provides output buffers)

The final output is a function with memref-typed arguments for all inputs and outputs, containing hipdnn.execute ops that reference pre-compiled graphs.

hipDNN Integration

hipDNN uses a graph-based execution model:

  1. Build operation graph from ONNX nodes (conv_fprop, etc.)
  2. Validate and create execution plans
  3. Execute with variant pack (tensor uid -> device pointer mapping)

License

MIT License
