Add Olive + onnxruntime-genai backend support #744

@sozercan

Summary

Add support for onnxruntime-genai as a backend in AIKit, using Olive for model preparation. This would complement the existing llama-cpp, diffusers, and vllm backends.

Background

Microsoft's Foundry Local uses onnxruntime-genai (MIT-licensed) as its inference engine, with Olive (also MIT-licensed) as the model optimization pipeline. Together, they provide an equivalent pipeline to llama.cpp:

| Step | llama.cpp | Olive + onnxruntime-genai |
| --- | --- | --- |
| Convert model | `convert_hf_to_gguf.py` | `olive auto-opt` |
| Quantize | `llama-quantize` | Olive (GPTQ, AWQ, etc.) |
| Tokenize / KV cache / Inference / Sampling | Built-in | onnxruntime-genai + ONNX Runtime |

Why This Matters

The key advantage is hardware coverage. onnxruntime-genai supports execution providers that neither llama-cpp nor vllm can reach today:

| Execution Provider | Hardware |
| --- | --- |
| CPU | Any |
| CUDA | NVIDIA GPU |
| DirectML | AMD / Intel / NVIDIA on Windows |
| TensorRT-RTX | NVIDIA RTX |
| OpenVINO | Intel CPU / GPU / NPU |
| QNN | Qualcomm Snapdragon NPUs |
| WebGPU | Browser |

This opens AIKit to NPU-based edge deployments (Qualcomm, Intel), AMD GPUs (via DirectML), and Intel acceleration (via OpenVINO) — none of which are currently supported.

onnxruntime-genai already supports model architectures AIKit ships today: Llama, Phi, Qwen, Mistral, Gemma, DeepSeek, Whisper, and more.

Upstream Dependency

This requires a change in LocalAI first: LocalAI currently has no onnxruntime-genai backend, so one would need to be added that wraps onnxruntime-genai behind LocalAI's gRPC backend interface. Once that exists, AIKit can integrate it like the existing backends.

Potential approaches for the LocalAI backend:

  • Python backend: Wrap the onnxruntime-genai Python API behind LocalAI's gRPC Python backend interface (similar to how the diffusers and vllm backends work)
  • Go backend: Use the C/C++ API via CGo bindings (similar to how the whisper and piper backends work)
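As a rough illustration of the first approach, a minimal generate loop over the onnxruntime-genai Python API might look like the sketch below. The `OrtGenaiBackend` class name and the injected `og_module` parameter are illustrative choices, not LocalAI's actual interface; the `Model` / `Tokenizer` / `GeneratorParams` / `Generator` calls follow onnxruntime-genai's published Python API.

```python
class OrtGenaiBackend:
    """Sketch of a LocalAI-style Python backend over onnxruntime-genai.

    og_module is the imported onnxruntime_genai module, injected so the
    wrapper can be exercised without the native package installed.
    """

    def __init__(self, og_module, model_path: str):
        self.og = og_module
        self.model = og_module.Model(model_path)        # loads the ONNX model folder
        self.tokenizer = og_module.Tokenizer(self.model)

    def predict(self, prompt: str, max_length: int = 256) -> str:
        # Configure generation, feed the prompt tokens, then decode greedily
        # token-by-token until the generator signals completion.
        params = self.og.GeneratorParams(self.model)
        params.set_search_options(max_length=max_length)
        generator = self.og.Generator(self.model, params)
        generator.append_tokens(self.tokenizer.encode(prompt))

        out_tokens = []
        while not generator.is_done():
            generator.generate_next_token()
            out_tokens.append(generator.get_next_tokens()[0])
        return self.tokenizer.decode(out_tokens)
```

A real LocalAI backend would expose something like this behind LocalAI's gRPC interface and stream tokens as they are produced rather than collecting them into one string.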

Model Preparation

Olive would be used during AIKit's build phase to convert and optimize HuggingFace models into ONNX format:

```shell
olive auto-opt \
    --model_name_or_path meta-llama/Llama-3.2-1B-Instruct \
    --device cpu \
    --provider CPUExecutionProvider \
    --use_ort_genai \
    --precision int4
```

This could be integrated into AIKit's BuildKit pipeline, or users could provide pre-converted ONNX models.
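If the conversion were wired into AIKit's build phase, one way to drive it is a small helper that shells out to the `olive auto-opt` CLI with the same flags as above. The helper name, output layout, and the `--output_path` flag usage here are assumptions for illustration, not AIKit's actual pipeline.

```python
import subprocess
from pathlib import Path

def convert_to_onnx(model_id: str, output_dir: str, precision: str = "int4") -> Path:
    """Hypothetical build-phase helper: run `olive auto-opt` on a
    HuggingFace model and return the directory containing the ONNX output."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    subprocess.run(
        [
            "olive", "auto-opt",
            "--model_name_or_path", model_id,
            "--device", "cpu",
            "--provider", "CPUExecutionProvider",
            "--use_ort_genai",
            "--precision", precision,
            "--output_path", str(out),
        ],
        check=True,  # fail the build if conversion fails
    )
    return out
```

Caching the converted output as a BuildKit layer would avoid re-running Olive on every image build; alternatively, users could skip this step entirely by pointing AIKit at a pre-converted ONNX model.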
