## Summary
Add support for onnxruntime-genai as a backend in AIKit, using Olive for model preparation. This would complement the existing llama-cpp, diffusers, and vllm backends.
## Background
Microsoft's Foundry Local uses onnxruntime-genai (MIT-licensed) as its inference engine, with Olive (also MIT-licensed) as the model optimization pipeline. Together, they provide a pipeline equivalent to llama.cpp's:
| Step | llama.cpp | Olive + onnxruntime-genai |
|---|---|---|
| Convert model | convert_hf_to_gguf.py | olive auto-opt |
| Quantize | llama-quantize | Olive (GPTQ, AWQ, etc.) |
| Tokenize / KV cache / Inference / Sampling | Built-in | onnxruntime-genai + ONNX Runtime |
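The right-hand column of the table collapses into a single generation loop on the inference side. A minimal sketch of that loop, written against the onnxruntime-genai Generator/Tokenizer surface (the helper function and the wiring shown in comments are illustrative, not from the source):

```python
# Sketch: with onnxruntime-genai, tokenization, KV-cache management,
# inference, and sampling are all driven through one Generator object.
# Method names follow the onnxruntime-genai Python API; the model
# directory in the comment below is a placeholder.

def run_generation(generator, tokenizer, prompt: str) -> str:
    # Works against the Generator/Tokenizer surface:
    # append_tokens / is_done / generate_next_token / get_sequence.
    generator.append_tokens(tokenizer.encode(prompt))
    while not generator.is_done():
        generator.generate_next_token()  # KV cache is reused internally
    return tokenizer.decode(generator.get_sequence(0))

# With a real converted model it would be wired up roughly like this:
#   import onnxruntime_genai as og
#   model = og.Model("path/to/onnx-model-dir")  # placeholder path
#   tokenizer = og.Tokenizer(model)
#   params = og.GeneratorParams(model)
#   params.set_search_options(max_length=128)
#   print(run_generation(og.Generator(model, params), tokenizer, "Hello"))
```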
## Why This Matters
The key advantage is hardware coverage. onnxruntime-genai supports execution providers that neither llama-cpp nor vllm can reach today:
| Execution Provider | Hardware |
|---|---|
| CPU | Any |
| CUDA | NVIDIA GPU |
| DirectML | AMD / Intel / NVIDIA on Windows |
| TensorRT-RTX | NVIDIA RTX |
| OpenVINO | Intel CPU / GPU / NPU |
| QNN | Qualcomm Snapdragon NPUs |
| WebGPU | Browser |
This opens AIKit to NPU-based edge deployments (Qualcomm, Intel), AMD GPUs (via DirectML), and Intel acceleration (via OpenVINO) — none of which are currently supported.
onnxruntime-genai already supports model architectures AIKit ships today: Llama, Phi, Qwen, Mistral, Gemma, DeepSeek, Whisper, and more.
## Upstream Dependency
This requires a change in LocalAI first. LocalAI currently does not have an onnxruntime-genai backend. A new backend would need to be added to LocalAI that wraps onnxruntime-genai behind LocalAI's gRPC backend interface. Once that exists, AIKit can integrate it like the existing backends.
Potential approaches for the LocalAI backend:
- Python backend: wrap the onnxruntime-genai Python API behind LocalAI's gRPC Python backend interface (similar to how the `diffusers` and `vllm` backends work)
- Go backend: use the C/C++ API via CGo bindings (similar to how the `whisper` and `piper` backends work)
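For the Python route, the backend would look broadly like LocalAI's other Python backends: a gRPC servicer whose load/predict calls map onto onnxruntime-genai. A shape-only sketch; the real backend would implement `backend_pb2_grpc.BackendServicer` from LocalAI's `backend.proto`, but here the request types are stand-in dataclasses so the mapping is visible without the generated gRPC code:

```python
# Shape-only sketch of a LocalAI-style Python backend for onnxruntime-genai.
# The request types below are stand-ins for LocalAI's proto messages, and
# the engine_factory indirection stands in for building og.Model/og.Generator
# objects; both are assumptions for illustration.
from dataclasses import dataclass

@dataclass
class LoadModelRequest:   # stand-in for LocalAI's proto message
    model_path: str

@dataclass
class PredictRequest:     # stand-in for LocalAI's proto message
    prompt: str
    tokens: int = 128

class OnnxRuntimeGenAIBackend:
    def __init__(self, engine_factory):
        # engine_factory: model_path -> callable(prompt, max_tokens) -> str
        self._engine_factory = engine_factory
        self._engine = None

    def LoadModel(self, request: LoadModelRequest) -> bool:
        # In the real backend: og.Model(request.model_path), og.Tokenizer(...)
        self._engine = self._engine_factory(request.model_path)
        return True

    def Predict(self, request: PredictRequest) -> str:
        if self._engine is None:
            raise RuntimeError("LoadModel must be called first")
        return self._engine(request.prompt, request.tokens)
```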
## Model Preparation
Olive would be used during AIKit's build phase to convert and optimize HuggingFace models into ONNX format:
```shell
olive auto-opt \
  --model_name_or_path meta-llama/Llama-3.2-1B-Instruct \
  --device cpu \
  --provider CPUExecutionProvider \
  --use_ort_genai \
  --precision int4
```

This could be integrated into AIKit's BuildKit pipeline, or users could provide pre-converted ONNX models.
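One way to fold the conversion into a build step is a thin wrapper that assembles the `olive auto-opt` invocation above and fails the build if conversion fails. A sketch; the `--output_path` flag and the split into a build/run pair are assumptions for illustration:

```python
# Sketch of a build-phase wrapper around the olive CLI shown above.
# Assumes olive is on PATH; --output_path (-o) is assumed to control
# where the converted ONNX model lands.
import subprocess

def convert_model_cmd(model_id: str, output_dir: str,
                      precision: str = "int4") -> list[str]:
    # Returns the argv so the pipeline can log or inspect it before running.
    return [
        "olive", "auto-opt",
        "--model_name_or_path", model_id,
        "--device", "cpu",
        "--provider", "CPUExecutionProvider",
        "--use_ort_genai",
        "--precision", precision,
        "--output_path", output_dir,
    ]

def run_conversion(model_id: str, output_dir: str) -> None:
    # check=True propagates a non-zero exit code as an exception,
    # failing the build if Olive cannot convert the model.
    subprocess.run(convert_model_cmd(model_id, output_dir), check=True)
```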
## References
- onnxruntime-genai (MIT)
- Olive (MIT)
- Foundry Local (uses these under the hood)
- LocalAI backends