
██████╗  ██████╗ ██╗  ██╗   ██╗██╗███╗   ██╗███████╗███████╗██████╗
██╔══██╗██╔═══██╗██║  ╚██╗ ██╔╝██║████╗  ██║██╔════╝██╔════╝██╔══██╗
██████╔╝██║   ██║██║   ╚████╔╝ ██║██╔██╗ ██║█████╗  █████╗  ██████╔╝
██╔═══╝ ██║   ██║██║    ╚██╔╝  ██║██║╚██╗██║██╔══╝  ██╔══╝  ██╔══██╗
██║     ╚██████╔╝███████╗██║   ██║██║ ╚████║██║     ███████╗██║  ██║
╚═╝      ╚═════╝ ╚══════╝╚═╝   ╚═╝╚═╝  ╚═══╝╚═╝     ╚══════╝╚═╝  ╚═╝

PolyInfer

Unified ML inference across multiple backends.

Installation

From PyPI (coming soon):

pip install polyinfer[nvidia]   # NVIDIA GPU (CUDA + cuDNN via onnxruntime-gpu)
pip install polyinfer[intel]    # Intel CPU/GPU/NPU
pip install polyinfer[amd]      # AMD GPU (Windows DirectML)
pip install polyinfer[cpu]      # CPU only
pip install polyinfer[all]      # Everything
pip install polyinfer[examples] # Dependencies for running examples (torch, PIL, etc.)

Native TensorRT (optional, for maximum performance):

# Install AFTER polyinfer[nvidia], then reinstall torch
pip install tensorrt-cu12 cuda-python
pip install torch torchvision --force-reinstall

From source (current):

git clone https://github.com/athrva98/polyinfer.git
cd polyinfer
pip install -e ".[nvidia]"      # Or any of the extras above

No manual CUDA/cuDNN installation required. Dependencies are automatically downloaded and configured. Works on Windows, Linux, WSL2, and Google Colab.

Quick Start

import polyinfer as pi

# List available backends and devices
print(pi.list_backends())  # ['onnxruntime', 'openvino']
print(pi.list_devices())   # [cpu, cuda, tensorrt, ...]

# Load model - auto-selects fastest backend
model = pi.load("model.onnx", device="cpu")        # Uses OpenVINO (fastest for CPU)
model = pi.load("model.onnx", device="cuda")       # Uses CUDA
model = pi.load("model.onnx", device="tensorrt")   # Uses TensorRT (450+ FPS on YoloV8n RTX5060!)

# Run inference
import numpy as np
input_data = np.random.rand(1, 3, 224, 224).astype(np.float32)
output = model(input_data)

# Benchmark
results = model.benchmark(input_data, warmup=10, iterations=100)
print(f"{results['mean_ms']:.2f} ms ({results['fps']:.1f} FPS)")

Device Options

# CPU
model = pi.load("model.onnx", device="cpu")

# NVIDIA GPU
model = pi.load("model.onnx", device="cuda")       # CUDA
model = pi.load("model.onnx", device="cuda:0")     # Specific GPU
model = pi.load("model.onnx", device="tensorrt")   # TensorRT (generally the fastest for Nvidia)

# AMD/Intel/Any GPU on Windows
model = pi.load("model.onnx", device="directml")

# Vulkan (cross-platform GPU - AMD, NVIDIA, Intel, ARM, Qualcomm)
model = pi.load("model.onnx", device="vulkan")
model = pi.load("model.onnx", device="vulkan", vulkan_target="rdna3")  # AMD RX 7000
model = pi.load("model.onnx", device="vulkan", vulkan_target="ampere") # NVIDIA RTX 30

Backend Selection

# Auto-select (recommended)
model = pi.load("model.onnx", device="cuda")

# Explicit backend
model = pi.load("model.onnx", backend="onnxruntime", device="cuda")
model = pi.load("model.onnx", backend="openvino", device="cpu")

# TensorRT options:
model = pi.load("model.onnx", device="tensorrt")              # ONNX Runtime TensorRT EP (recommended)
model = pi.load("model.onnx", backend="tensorrt", device="cuda")  # Native TensorRT (requires separate install)

Compare Backends

# Compare all available backends
pi.compare("model.onnx", input_shape=(1, 3, 640, 640))

# Example output for YOLOv8n:
# Backend Comparison for model.onnx
# ============================================================
# onnxruntime-tensorrt    :   2.22 ms (450.0 FPS) <-- FASTEST
# onnxruntime-cuda        :   6.64 ms (150.7 FPS)
# openvino-cpu            :  16.19 ms ( 61.8 FPS)
# onnxruntime-cpu         :  22.56 ms ( 44.3 FPS)

CLI

# Show system info and available backends
polyinfer info

# Benchmark a model
polyinfer benchmark model.onnx --device tensorrt

# Run inference
polyinfer run model.onnx --device cuda

Quantization

Reduce model size and improve inference speed with INT8/FP16 quantization:

import numpy as np
import polyinfer as pi

# Dynamic quantization (no calibration data needed)
pi.quantize("model.onnx", "model_int8.onnx", method="dynamic")

# Static quantization with calibration data
calibration_data = [np.random.rand(1, 3, 224, 224).astype(np.float32) for _ in range(100)]
pi.quantize("model.onnx", "model_int8.onnx",
            method="static",
            calibration_data=calibration_data)

# FP16 conversion
pi.convert_to_fp16("model.onnx", "model_fp16.onnx")

# Load and run quantized model
model = pi.load("model_int8.onnx", device="cpu")
output = model(calibration_data[0])

Supported quantization:

  • ONNX Runtime: Dynamic/Static INT8, UINT8, INT4, FP16
  • OpenVINO (NNCF): Static INT8 with calibration
  • TensorRT: FP16/INT8 (via pi.load(..., fp16=True, int8=True))
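For TensorRT, FP16/INT8 are engine build options rather than a separate ONNX conversion step; a minimal example using the fp16 flag documented in the TensorRT options below:

import polyinfer as pi

# Precision is applied when the TensorRT engine is built from the ONNX model
model = pi.load("model.onnx", device="tensorrt", fp16=True)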

Performance

YOLOv8n @ 640x640 (RTX 5060)

Backend              Latency   FPS   Speedup
TensorRT             2.2 ms    450   10x
CUDA                 6.6 ms    151   3.4x
OpenVINO (CPU)       16.2 ms   62    1.4x
ONNX Runtime (CPU)   22.6 ms   44    1.0x

ResNet18 @ 224x224 (Google Colab T4)

Backend              Latency   FPS   Speedup
TensorRT             1.6 ms    639   2.6x
CUDA                 4.1 ms    245   1.0x
ONNX Runtime (CPU)   43.7 ms   23    0.09x

Supported Backends

Backend        Devices                                             Install
ONNX Runtime   CPU, CUDA, TensorRT, DirectML                       [cpu], [nvidia], [amd]
OpenVINO       CPU, Intel GPU, NPU                                 [cpu], [intel]
IREE           CPU, Vulkan (AMD/NVIDIA/Intel/ARM/Qualcomm), CUDA   [all], [vulkan]

MLIR Export (Custom Hardware)

Export models to MLIR for custom hardware targets, kernel injection, or advanced optimizations:

import polyinfer as pi

# Export ONNX to MLIR
mlir = pi.export_mlir("model.onnx", "model.mlir")

# Compile for specific target
vmfb = pi.compile_mlir("model.mlir", device="vulkan")

# Load and run
backend = pi.get_backend("iree")
model = backend.load_vmfb(vmfb, device="vulkan")

Why PolyInfer?

  • Zero configuration: pip install polyinfer[nvidia] - CUDA, cuDNN, TensorRT all auto-installed
  • Auto backend selection: Picks the fastest backend for your hardware
  • Unified API: Same code works across all backends
  • Real performance: 450 FPS with TensorRT, no manual optimization needed
  • MLIR support: Export to MLIR for custom hardware and kernel development

Development

Installation Options

Extra        What's Included                                           Use Case
[nvidia]     ONNX Runtime GPU, IREE, torch                             NVIDIA GPUs
[intel]      OpenVINO, IREE, torch                                     Intel CPU, iGPU, NPU
[amd]        ONNX Runtime DirectML, IREE, torch                        AMD GPUs on Windows
[cpu]        ONNX Runtime, OpenVINO, IREE, torch                       CPU-only systems
[vulkan]     IREE, torch                                               Cross-platform GPU via Vulkan
[all]        Everything above                                          Maximum compatibility
[tensorrt]   tensorrt-cu12, cuda-python                                Native TensorRT (install separately)
[examples]   PIL, opencv, transformers, diffusers, segment-anything    Running example scripts

Note: Native TensorRT is provided as a separate [tensorrt] extra because tensorrt-cu12-libs depends on cuda-toolkit which overwrites CUDA libraries and breaks PyTorch. Install it separately after [nvidia], then reinstall torch:

pip install polyinfer[nvidia]
pip install tensorrt-cu12 cuda-python  # Or: pip install polyinfer[tensorrt]
pip install torch torchvision --force-reinstall  # Fix torch after TensorRT install

Development Install

# Clone the repository
git clone https://github.com/athrva98/polyinfer.git
cd polyinfer

# Create virtual environment
python -m venv venv
source venv/bin/activate  # Linux/Mac
# or: venv\Scripts\activate  # Windows

# Install in editable mode with dev dependencies
pip install -e ".[nvidia,dev]"

# Run tests
pytest tests/

Platform-Specific Setup

Windows

# Create conda environment (recommended)
conda create -n polyinfer python=3.11
conda activate polyinfer

# Clone and install with NVIDIA support
git clone https://github.com/athrva98/polyinfer.git
cd polyinfer
pip install -e ".[nvidia]"

# Or for AMD GPU
pip install -e ".[amd]"

# Verify installation
python -c "import polyinfer as pi; print(pi.list_devices())"

Linux / WSL2

# Create virtual environment
python3 -m venv ~/polyinfer_venv
source ~/polyinfer_venv/bin/activate

# Clone and install with NVIDIA support
git clone https://github.com/athrva98/polyinfer.git
cd polyinfer
pip install -e ".[nvidia]"

# Verify CUDA works
python -c "import polyinfer as pi; print(pi.list_devices())"

WSL2 GPU Passthrough Requirements:

  • Windows 11 or Windows 10 21H2+
  • WSL2 with Ubuntu (or other distro)
  • NVIDIA GPU driver installed on Windows (not in WSL)
  • No need to install CUDA inside WSL; polyinfer handles it automatically

Google Colab

# Install polyinfer with NVIDIA support
!pip install -q "polyinfer[nvidia] @ git+https://github.com/athrva98/polyinfer.git"

# Verify installation
import polyinfer as pi
print(pi.list_devices())
# Output: [cpu, cuda, tensorrt, vulkan]

# TensorRT works out of the box on Colab!
model = pi.load("model.onnx", device="tensorrt")  # 638 FPS on ResNet18!

Note: TensorRT EP is automatically configured on Colab. The tensorrt-cu12-libs package provides TensorRT libraries, and PolyInfer automatically preloads them via ctypes before ONNX Runtime is imported.

macOS

# Clone repository
git clone https://github.com/athrva98/polyinfer.git
cd polyinfer

# Install CPU-only (no GPU acceleration on macOS yet)
pip install -e ".[cpu]"

# IREE provides some Metal support (experimental)
pip install -e ".[vulkan]"

Architecture

Backend Hierarchy

polyinfer/
├── __init__.py          # Public API: load, list_backends, export_mlir, etc.
├── model.py             # Model class with unified inference interface
├── discovery.py         # Backend/device discovery
├── nvidia_setup.py      # Auto-configures NVIDIA libraries (CUDA, cuDNN, TensorRT)
├── mlir.py              # MLIR export/compile functions
├── compare.py           # Cross-backend comparison utilities
├── config.py            # Configuration classes
└── backends/
    ├── base.py          # Abstract Backend and CompiledModel classes
    ├── registry.py      # Backend registration system
    ├── _autoload.py     # Auto-discovers and registers backends
    ├── onnxruntime/     # ONNX Runtime backend (CPU, CUDA, TensorRT, DirectML)
    ├── openvino/        # OpenVINO backend (Intel CPU, GPU, NPU)
    ├── tensorrt/        # Native TensorRT backend
    └── iree/            # IREE backend (CPU, Vulkan, CUDA) + MLIR emission

Backend Priority System

When you call pi.load("model.onnx", device="cuda"), polyinfer selects the best backend:

Backend             Priority   Devices
OpenVINO            70         cpu, intel-gpu, npu
ONNX Runtime        60         cpu, cuda, tensorrt, directml, rocm, coreml
TensorRT (native)   50         cuda, tensorrt
IREE                40         cpu, vulkan, cuda

Note: ONNX Runtime's TensorRT Execution Provider is preferred over native TensorRT because it works out-of-the-box without dependency conflicts. For device="tensorrt", ONNX Runtime's TensorRT EP is used by default. To use native TensorRT, specify backend="tensorrt" explicitly.
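The sketch below illustrates the priority idea only; it is not PolyInfer's actual selection code, and the available_backends mapping is hypothetical:

# Illustrative sketch of priority-based selection (not the real implementation)
PRIORITIES = {"openvino": 70, "onnxruntime": 60, "tensorrt": 50, "iree": 40}

def pick_backend(device: str, available_backends: dict[str, list[str]]) -> str:
    # available_backends: hypothetical mapping of backend name -> supported devices
    candidates = [name for name, devs in available_backends.items() if device in devs]
    if not candidates:
        raise ValueError(f"No available backend supports device '{device}'")
    return max(candidates, key=lambda name: PRIORITIES.get(name, 0))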

Device Normalization

These aliases are automatically normalized:

Input              Normalized To
gpu, nvidia        cuda
trt                tensorrt
dml                directml
igpu, intel-igpu   intel-gpu
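In practice the aliases can be passed straight to pi.load; for example, these two calls select the same device:

import polyinfer as pi

model_a = pi.load("model.onnx", device="gpu")   # "gpu" normalizes to "cuda"
model_b = pi.load("model.onnx", device="cuda")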

Backend Options Reference

All backends support passing options through pi.load(). Options are passed as keyword arguments.

Native TensorRT Backend

For maximum NVIDIA performance. Supports full TensorRT configuration.

Requires separate installation:

pip install tensorrt-cu12 cuda-python
pip install torch torchvision --force-reinstall  # Fix torch after TensorRT install

model = pi.load("model.onnx", backend="tensorrt", device="cuda",
    # Precision
    fp16=True,                      # FP16 (half precision)
    int8=False,                     # INT8 quantization
    tf32=True,                      # TF32 on Ampere+ (default)
    bf16=False,                     # BF16 on Ada+
    fp8=False,                      # FP8 on Hopper+
    strict_types=False,             # Force specified precision

    # Optimization
    builder_optimization_level=5,   # 0-5, higher = better perf, slower build
    workspace_size=4 << 30,         # 4GB workspace
    avg_timing_iterations=4,        # More iterations = better kernel selection
    sparsity=False,                 # Structured sparsity (Ampere+)

    # Caching
    cache_path="./model.engine",    # Engine cache path
    timing_cache_path="./timing.cache",  # Timing cache for faster rebuilds
    force_rebuild=False,            # Ignore cache, rebuild engine

    # Hardware
    dla_core=-1,                    # DLA core (-1 = GPU only)
    gpu_fallback=True,              # GPU fallback for unsupported DLA ops

    # Profiling
    profiling_verbosity="detailed", # 'none', 'layer_names_only', 'detailed'
    engine_capability="default",    # 'default', 'safe', 'dla_standalone'

    # Dynamic shapes (for models with dynamic batch/resolution)
    min_shapes={"input": (1, 3, 224, 224)},
    opt_shapes={"input": (4, 3, 640, 640)},
    max_shapes={"input": (16, 3, 1024, 1024)},
)

ONNX Runtime Backend

Versatile backend with multiple execution providers.

CUDA Execution Provider

model = pi.load("model.onnx", device="cuda",
    # Session options
    graph_optimization_level=3,     # 0=off, 1=basic, 2=extended, 3=all
    intra_op_num_threads=4,         # Threads for parallelism
    inter_op_num_threads=2,
    enable_mem_pattern=True,
    enable_cpu_mem_arena=True,

    # CUDA-specific
    cuda_mem_limit=4 << 30,         # 4GB GPU memory limit
    arena_extend_strategy="kNextPowerOfTwo",
    cudnn_conv_algo_search="EXHAUSTIVE",  # or 'HEURISTIC', 'DEFAULT'
    do_copy_in_default_stream=True,
)

TensorRT Execution Provider (via ONNX Runtime)

model = pi.load("model.onnx", device="tensorrt",
    # Precision
    fp16=True,
    int8=False,

    # Optimization
    builder_optimization_level=5,   # 0-5
    max_workspace_size=4 << 30,     # 4GB
    timing_cache_path="./timing.cache",

    # Caching
    cache_dir="./trt_cache",        # Engine cache directory

    # Subgraph control
    min_subgraph_size=5,            # Min nodes for TRT subgraph
    max_partition_iterations=1000,

    # DLA (Jetson)
    dla_enable=False,
    dla_core=0,

    # Build options
    force_sequential_engine_build=False,
)

DirectML Execution Provider (Windows AMD/Intel GPU)

model = pi.load("model.onnx", device="directml",
    device_id=0,                    # GPU index
)

OpenVINO Backend

Optimized for Intel hardware.

model = pi.load("model.onnx", backend="openvino", device="cpu",
    optimization_level=2,           # 0=throughput, 1=balanced, 2=latency
    num_threads=8,                  # CPU threads
    enable_caching=True,
    cache_dir="./ov_cache",
)

IREE Backend

Cross-platform with MLIR export capability and comprehensive Vulkan GPU support.

# Basic Vulkan usage
model = pi.load("model.onnx", backend="iree", device="vulkan",
    opt_level=3,                    # 0-3
    cache_dir="./iree_cache",
    force_compile=False,
    save_mlir=True,                 # Save intermediate MLIR
)

# GPU-specific Vulkan targets for optimal performance
model = pi.load("model.onnx", backend="iree", device="vulkan",
    vulkan_target="rtx4090",        # NVIDIA RTX 4090 (uses sm_89)
    # vulkan_target="rdna3",        # AMD RX 7000 series
    # vulkan_target="ampere",       # NVIDIA RTX 30 series
    # vulkan_target="arc",          # Intel Arc
    # vulkan_target="adreno",       # Qualcomm mobile GPUs
    opt_level=3,
    data_tiling=True,               # Cache optimization
    opset_version=17,               # Upgrade ONNX opset if needed
)

# List all available Vulkan GPU targets
from polyinfer.backends.iree import VULKAN_TARGETS
for name, target in VULKAN_TARGETS.items():
    print(f"{name}: {target.description}")

# MLIR export for custom hardware
mlir = pi.export_mlir("model.onnx", "model.mlir", load_content=True)
vmfb = pi.compile_mlir("model.mlir", device="vulkan", vulkan_target="rdna3")

Supported Vulkan GPU Targets:

Vendor     Targets                                      GPUs
AMD        rdna3, rdna2, gfx1100, rx7900xtx, etc.       RX 7000/6000 series
NVIDIA     ada, ampere, turing, sm_89, rtx4090, etc.    RTX 40/30/20 series
Intel      arc, arc_a770, arc_a750                      Arc A-series
ARM        valhall4, valhall, mali_g715                 Mali GPUs
Qualcomm   adreno                                       Snapdragon mobile GPUs

Testing

Running Tests

# Run all tests
pytest tests/

# Run with verbose output
pytest tests/ -v

# Run specific test file
pytest tests/test_yolov8.py

# Run specific test class
pytest tests/test_yolov8.py::TestYOLOv8IREE

# Run tests for a specific device
pytest tests/ -m cuda
pytest tests/ -m vulkan
pytest tests/ -m npu

# Run benchmarks
pytest tests/test_benchmark.py -v

Test Markers

Tests are tagged with pytest markers for selective execution:

Marker                   Description
@pytest.mark.cuda        Requires CUDA GPU
@pytest.mark.tensorrt    Requires TensorRT
@pytest.mark.vulkan      Requires Vulkan GPU
@pytest.mark.directml    Requires DirectML (Windows)
@pytest.mark.openvino    Requires OpenVINO
@pytest.mark.iree        Requires IREE
@pytest.mark.npu         Requires Intel NPU
@pytest.mark.intel_gpu   Requires Intel integrated GPU
@pytest.mark.benchmark   Performance benchmark tests
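A marked test uses the standard pytest decorator; the test below is an illustrative sketch, not taken from the test suite:

import numpy as np
import pytest
import polyinfer as pi

@pytest.mark.cuda
def test_cuda_inference():
    # Selected with `pytest tests/ -m cuda`, deselected with `-m "not cuda"`
    model = pi.load("model.onnx", device="cuda")
    output = model(np.random.rand(1, 3, 224, 224).astype(np.float32))
    assert output is not None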

Test Files

File                      Purpose
test_backends.py          Backend discovery and registration
test_backend_options.py   Backend options passthrough
test_devices.py           Device-specific loading and inference
test_inference.py         Cross-backend consistency
test_benchmark.py         Performance benchmarks
test_yolov8.py            End-to-end YOLOv8 tests on all devices
test_mlir.py              MLIR export and compilation
test_intel_devices.py     Intel GPU and NPU device tests

MLIR & Custom Hardware

PolyInfer can emit MLIR for custom hardware targets, kernel injection, and advanced optimization workflows.

Basic MLIR Workflow

import polyinfer as pi

# 1. Export ONNX to MLIR
mlir = pi.export_mlir("model.onnx", "model.mlir", load_content=True)
print(mlir.content[:500])  # Inspect MLIR

# 2. (Optional) Modify MLIR for custom kernels
# ... your custom transformations ...

# 3. Compile MLIR to executable
vmfb = pi.compile_mlir("model.mlir", device="vulkan")

# 4. Load and run
backend = pi.get_backend("iree")
model = backend.load_vmfb(vmfb, device="vulkan")
output = model(input_data)  # input_data: NumPy array(s) matching the model's input shape

MLIROutput Class

@dataclass
class MLIROutput:
    path: Path              # Path to saved MLIR file
    content: str | None     # MLIR content (if load_content=True)
    source_model: Path      # Original ONNX model path
    dialect: str            # MLIR dialect (e.g., "iree")

    def save(self, output_path) -> Path:
        """Save MLIR to a new location."""

    def __str__(self) -> str:
        """Returns MLIR content."""

Compilation Targets

Device   IREE Target    Use Case
cpu      llvm-cpu       CPU with LLVM optimizations
vulkan   vulkan-spirv   Cross-platform GPU (AMD, NVIDIA, Intel, ARM, Qualcomm)
cuda     cuda           NVIDIA GPU via CUDA

Vulkan GPU-Specific Compilation:

# Compile for specific GPU architecture
vmfb = pi.compile_mlir("model.mlir", device="vulkan",
    vulkan_target="rdna3",      # AMD RX 7000
    # vulkan_target="ampere",   # NVIDIA RTX 30
    # vulkan_target="arc",      # Intel Arc
    opt_level=3,
    data_tiling=True,
)

Troubleshooting

Common Issues

"Backend 'onnxruntime' does not support device 'cuda'"

Cause: onnxruntime-gpu is not installed, or conflicting ONNX Runtime packages.

PolyInfer will automatically detect this conflict and show a warning on import:

⚠️  ONNX Runtime Conflict Detected!
   Both 'onnxruntime-gpu' and 'onnxruntime-directml' are installed,
   but only DirectML is active. CUDA support is disabled.

Solution 1: Use the built-in fix helper

import polyinfer as pi
pi.fix_onnxruntime_conflict(prefer="cuda")  # or prefer="directml"
# Restart Python after running this

Solution 2: Manual fix

# Uninstall all onnxruntime variants
pip uninstall onnxruntime onnxruntime-gpu onnxruntime-directml -y

# Install the one you need
pip install onnxruntime-gpu  # For CUDA
# or
pip install onnxruntime-directml  # For DirectML (AMD on Windows)

Important: On Windows, you can only have ONE onnxruntime variant installed at a time. The packages onnxruntime, onnxruntime-gpu, and onnxruntime-directml share the same module namespace and will overwrite each other.
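A quick way to see which variant is active (standard ONNX Runtime calls):

import onnxruntime as ort

print(ort.__version__)
print(ort.get_available_providers())
# The CUDA build lists 'CUDAExecutionProvider';
# the DirectML build lists 'DmlExecutionProvider'.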

"libcudnn.so.9: cannot open shared object file"

Cause: cuDNN libraries not found by ONNX Runtime.

Solution: This should be automatic with polyinfer[nvidia]. If not:

# Check if libraries are detected
from polyinfer.nvidia_setup import get_nvidia_info
print(get_nvidia_info())

If libraries are found but still failing, the ctypes preload may not work for your setup. Set LD_LIBRARY_PATH manually:

export LD_LIBRARY_PATH=$(python -c "from polyinfer.nvidia_setup import get_nvidia_info; print(':'.join(get_nvidia_info()['library_directories']))")
python your_script.py

PyTorch breaks after installing TensorRT ("undefined symbol: ncclCommWindowRegister")

Cause: tensorrt-cu12-libs depends on cuda-toolkit, which overwrites CUDA libraries (nvidia-cuda-runtime, nvidia-nccl, etc.) with versions incompatible with PyTorch.

Solution: Reinstall PyTorch after installing TensorRT:

pip install torch torchvision --force-reinstall

Prevention: Use ONNX Runtime's TensorRT Execution Provider instead (works with device="tensorrt" by default). It provides similar performance without dependency conflicts. Only install native TensorRT if you need advanced TensorRT features.

TensorRT EP: "RegisterTensorRTPluginsAsCustomOps" error

Cause: ONNX Runtime can't find TensorRT libraries, even though TensorrtExecutionProvider shows as available.

Solution 1: Check if TensorRT libraries are installed

pip install tensorrt-cu12-libs  # Lightweight TensorRT libs (no cuda-python conflict)

Solution 2: Check library detection

import polyinfer as pi
info = pi.get_nvidia_info()
print("TensorRT dirs:", info['tensorrt_setup']['tensorrt_dirs'])
print("Preloaded libs:", info['tensorrt_setup']['preloaded_libs'])

If tensorrt_dirs is empty, PolyInfer couldn't find the TensorRT libraries. This usually means:

  • tensorrt-cu12-libs is not installed
  • The libraries are in an unexpected location

Solution 3: Manual preload (advanced)

import ctypes
from pathlib import Path
import sysconfig

# Find the tensorrt_libs directory inside site-packages
# (sysconfig resolves the correct path for venvs, system installs, Colab, etc.)
tensorrt_libs = Path(sysconfig.get_paths()["purelib"]) / "tensorrt_libs"

# Preload before importing onnxruntime
for lib in ["libnvinfer.so.10", "libnvinfer_plugin.so.10", "libnvonnxparser.so.10"]:
    lib_path = tensorrt_libs / lib
    if lib_path.exists():
        ctypes.CDLL(str(lib_path), mode=ctypes.RTLD_GLOBAL)

# Now import polyinfer
import polyinfer as pi

"iree-import-onnx not found"

Cause: IREE compiler tools not installed.

Solution:

pip install iree-base-compiler[onnx] iree-base-runtime

IREE compilation fails with "failed to legalize operation"

Cause: ONNX operator not supported or opset version issue.

Solution:

# Try upgrading opset version
model = pi.load("model.onnx", backend="iree", device="vulkan",
    opset_version=17,  # Upgrade to opset 17
)

# Or manually upgrade the ONNX file
import onnx
model = onnx.load("model.onnx")
upgraded = onnx.version_converter.convert_version(model, 17)
onnx.save(upgraded, "model_v17.onnx")

Check IREE ONNX Op Support for operator compatibility.

Vulkan tests produce NaN values

Known Issue: IREE's Vulkan backend can produce sporadic NaN values (~0.01% of outputs) on some drivers.

Workaround: Tests are configured to tolerate up to 0.1% NaN values. For production, use CPU or CUDA backends for deterministic results.
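To check whether your own outputs are affected, the NaN fraction can be measured directly; a minimal check assuming a single NumPy output array, using the same 0.1% tolerance as the tests:

import numpy as np

output = model(input_data)              # model loaded with backend="iree", device="vulkan"
nan_fraction = float(np.isnan(output).mean())
print(f"NaN fraction: {nan_fraction:.4%}")
assert nan_fraction < 0.001, "Exceeds the 0.1% tolerance used in the tests"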

WSL2: "CUDA not available" but nvidia-smi works

Cause: Python can't find CUDA libraries.

Solution: Ensure you installed with [nvidia]:

pip install polyinfer[nvidia]

The nvidia_setup.py module automatically configures library paths.

Debugging

Check Available Backends and Devices

import polyinfer as pi

print("Backends:", pi.list_backends())
print("Devices:")
for d in pi.list_devices():
    print(f"  {d.name}: {d.backends}")

Check NVIDIA Library Detection

from polyinfer.nvidia_setup import get_nvidia_info
import json
print(json.dumps(get_nvidia_info(), indent=2))

Check TensorRT Setup

import polyinfer as pi
info = pi.get_nvidia_info()

print("TensorRT Setup:")
print(f"  Configured: {info['tensorrt_setup']['configured']}")
print(f"  TensorRT dirs: {info['tensorrt_setup']['tensorrt_dirs']}")
print(f"  Preloaded libs: {info['tensorrt_setup']['preloaded_libs']}")

# Check if TensorRT EP is available
import onnxruntime as ort
providers = ort.get_available_providers()
print(f"\nONNX Runtime providers: {providers}")
print(f"TensorRT EP available: {'TensorrtExecutionProvider' in providers}")

Verbose ONNX Runtime Logging

import onnxruntime as ort
ort.set_default_logger_severity(0)  # 0=Verbose, 1=Info, 2=Warning, 3=Error

Check IREE Tools

# Check if IREE tools are available
which iree-import-onnx
which iree-compile

# Or in Python
from polyinfer.backends.iree.backend import _get_iree_import_onnx, _get_iree_compile
print("iree-import-onnx:", _get_iree_import_onnx())
print("iree-compile:", _get_iree_compile())

Contributing

Code Style

We use ruff for linting and formatting:

# Check code
ruff check src/ tests/

# Fix auto-fixable issues
ruff check --fix src/ tests/

# Format code
ruff format src/ tests/

Type Checking

mypy src/polyinfer/

Adding a New Backend

  1. Create a new directory under src/polyinfer/backends/:

    backends/
    └── mybackend/
        ├── __init__.py
        └── backend.py
    
  2. Implement the Backend and CompiledModel interfaces:

    from polyinfer.backends.base import Backend, CompiledModel
    
    class MyModel(CompiledModel):
        @property
        def backend_name(self) -> str:
            return "mybackend"
    
        @property
        def device(self) -> str:
            return self._device
    
        def __call__(self, *inputs):
            # Run inference
            pass
    
    class MyBackend(Backend):
        @property
        def name(self) -> str:
            return "mybackend"
    
        @property
        def supported_devices(self) -> list[str]:
            return ["cpu", "custom-device"]
    
        def is_available(self) -> bool:
            # Check if backend can be used
            pass
    
        def load(self, model_path, device, **kwargs) -> MyModel:
            # Load and compile model
            pass
  3. Register the backend in _autoload.py:

    try:
        from polyinfer.backends.mybackend import MyBackend
        register_backend("mybackend", MyBackend)
    except ImportError:
        pass
  4. Add tests in tests/test_mybackend.py

  5. Update pyproject.toml with optional dependencies

Pull Request Checklist

  • Tests pass (pytest tests/)
  • Code is formatted (ruff format)
  • No linting errors (ruff check)
  • Type hints added for public APIs
  • Documentation updated if needed
  • CHANGELOG updated (if applicable)

Performance Tips

Backend Selection

Use Case                                      Recommended Backend
Maximum NVIDIA performance                    TensorRT
Good NVIDIA performance + compatibility       ONNX Runtime CUDA
Intel CPU optimization                        OpenVINO
Intel GPU/NPU                                 OpenVINO
Cross-platform GPU (AMD/NVIDIA/Intel/ARM)     IREE Vulkan with GPU target
AMD GPU (Windows)                             ONNX Runtime DirectML
AMD GPU (Linux)                               IREE Vulkan (vulkan_target="rdna3")
Mobile/Embedded (ARM Mali, Qualcomm Adreno)   IREE Vulkan
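For example, the AMD-on-Linux row maps to the IREE options shown earlier:

import polyinfer as pi

model = pi.load("model.onnx", backend="iree", device="vulkan",
                vulkan_target="rdna3")  # AMD RX 7000 series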

Benchmarking

import polyinfer as pi
import numpy as np

model = pi.load("model.onnx", device="cuda")
input_data = np.random.rand(1, 3, 640, 640).astype(np.float32)

# Warm-up is important for GPU benchmarks
bench = model.benchmark(input_data, warmup=50, iterations=200)

print(f"Mean: {bench['mean_ms']:.2f}ms")
print(f"Std:  {bench['std_ms']:.2f}ms")
print(f"Min:  {bench['min_ms']:.2f}ms")
print(f"Max:  {bench['max_ms']:.2f}ms")
print(f"FPS:  {bench['fps']:.1f}")

Memory Optimization

# Use specific device index to control GPU memory
model = pi.load("model.onnx", device="cuda:0")

# For ONNX Runtime, you can pass session options
model = pi.load("model.onnx", device="cuda",
                cuda_mem_limit=2 * 1024 * 1024 * 1024)  # 2GB

Author

Athrva Pandhare

License

Apache 2.0 - See LICENSE for details.