| title | emoji | colorFrom | colorTo | sdk | app_port | pinned | license | short_description |
|---|---|---|---|---|---|---|---|---|
| Convert to ONNX | ☯ | indigo | yellow | docker | 8501 | true | apache-2.0 | Convert a Hugging Face model to ONNX format |
Improved v2 based on huggingface/transformers.js#1361
The original ONNX converter does not support the latest Qwen3 and Gemma models. This Streamlit app is an updated version that uses the modern ONNX Runtime GenAI builder instead of the old Transformers.js conversion scripts, which adds support for newer models such as Qwen3, Gemma3, Phi4, SmolLM3, and more.
**Old converter:**
- Used Transformers.js conversion scripts
- Limited to older model architectures (up to Qwen2)
- Required a bundled transformers.js repository

**New converter:**
- Uses the ONNX Runtime GenAI builder from Microsoft
- Supports modern architectures, including:
  - Qwen3 ✨
  - Gemma3 (text and multimodal)
  - Phi4
  - SmolLM3
  - And many more (see the full list below)
- Integrates directly with the onnxruntime-genai package
- Offers more advanced quantization options
The new converter supports the following architectures:
| Architecture | Model Family |
|---|---|
| Qwen3ForCausalLM | Qwen3 ✨ |
| Qwen2ForCausalLM | Qwen2 |
| Qwen2_5_VLForConditionalGeneration | Qwen2.5-VL |
| Gemma3ForCausalLM | Gemma 3 |
| Gemma2ForCausalLM | Gemma 2 |
| GemmaForCausalLM | Gemma |
| Phi4MMForCausalLM | Phi-4 |
| Phi3ForCausalLM | Phi-3 |
| Phi3VForCausalLM | Phi-3 Vision |
| PhiMoEForCausalLM | Phi-3 MoE |
| SmolLM3ForCausalLM | SmolLM3 |
| LlamaForCausalLM | Llama |
| MistralForCausalLM | Mistral |
| ChatGLMForConditionalGeneration | ChatGLM |
| GraniteForCausalLM | Granite |
| NemotronForCausalLM | Nemotron |
| OlmoForCausalLM | OLMo |
| Ernie4_5_ForCausalLM | Ernie |
| GptOssForCausalLM | GPT-OSS |
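A quick way to check whether a model is covered is to compare the `architectures` field of its config.json against this table. A minimal stdlib sketch — the supported set is copied from the table above, and the inline config payload is illustrative rather than fetched from the Hub:

```python
import json

# Architectures supported by the new builder (from the table above)
SUPPORTED = {
    "Qwen3ForCausalLM", "Qwen2ForCausalLM", "Qwen2_5_VLForConditionalGeneration",
    "Gemma3ForCausalLM", "Gemma2ForCausalLM", "GemmaForCausalLM",
    "Phi4MMForCausalLM", "Phi3ForCausalLM", "Phi3VForCausalLM", "PhiMoEForCausalLM",
    "SmolLM3ForCausalLM", "LlamaForCausalLM", "MistralForCausalLM",
    "ChatGLMForConditionalGeneration", "GraniteForCausalLM", "NemotronForCausalLM",
    "OlmoForCausalLM", "Ernie4_5_ForCausalLM", "GptOssForCausalLM",
}

# Example config.json contents, as found in a model repo on Hugging Face
config = json.loads('{"architectures": ["Qwen3ForCausalLM"], "model_type": "qwen3"}')

arch = config["architectures"][0]
print(f"{arch}: {'supported' if arch in SUPPORTED else 'not supported'}")
```

For a real model, download its config.json from the Hub (or open it in the browser) and run the same check.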
```bash
pip install -r requirements.txt
```

If you need the latest builder from the ONNX Runtime GenAI repository:

```bash
# Install base requirements
pip install huggingface_hub streamlit PyYAML torch transformers onnx

# Clone and install onnxruntime-genai
git clone https://github.com/microsoft/onnxruntime-genai.git
cd onnxruntime-genai/src/python
pip install -e .
```

Run the app:

```bash
streamlit run app.py
```

Then:
- Enter a Hugging Face model ID (e.g., `Qwen/Qwen3-0.5B-Instruct`)
- Select a conversion preset (or choose Custom for manual configuration)
- Select an execution provider (cuda, cpu, dml, webgpu)
- Configure advanced options (optional, when using the Custom preset)
- Click "Start Conversion"
The app includes 9 presets for common conversion scenarios:
- FP16 - Recommended (GPU) - Default, best for most models
- FP32 - Full Precision (CPU) - Maximum quality for CPU inference
- BF16 - Brain Float (Gemma/Phi) - Optimized for Gemma and Phi models
- INT4 - 4-bit Quantized - Smallest size for mobile/edge
- INT4 + INT8 Activations - Balanced quantization
- INT4 + BF16 Activations - Quantized Gemma/Phi optimization
- INT4 + FP16 Activations - Standard GPU quantization
- UINT4 - Asymmetric Quantization - Alternative quantization method
- Custom - Manual Configuration - Full control over all settings
See PRESETS_GUIDE.md for detailed information about each preset.
If you have the builder installed, you can also use it directly:

```bash
python -m onnxruntime_genai.models.builder \
  -m Qwen/Qwen3-0.5B-Instruct \
  -o ./output \
  -p fp16 \
  -e cuda \
  -c ./cache_dir
```

The app includes preset configurations that automatically set the optimal parameters:
| Preset | Output Format | Best For |
|---|---|---|
| FP16 - Recommended | model.onnx (FP16) | Most models, NVIDIA GPUs |
| FP32 - Full Precision | model.onnx (FP32) | CPU inference, maximum accuracy |
| BF16 - Brain Float | model.onnx (BF16) | Gemma, Phi, modern GPUs |
| INT4 - 4-bit Quantized | model.onnx (INT4) | Mobile, edge devices |
| INT4 + INT8 Activations | model.onnx (INT4/INT8) | Balanced quantization |
| INT4 + BF16 Activations | model.onnx (INT4/BF16) | Gemma/Phi with quantization |
| INT4 + FP16 Activations | model.onnx (INT4/FP16) | Standard GPU quantization |
| UINT4 - Asymmetric | model.onnx (UINT4) | Alternative quantization |
Note: All presets output to model.onnx (or decoder_model.onnx). The precision/quantization format is embedded in the file, not in the filename.
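To illustrate how a preset boils down to builder arguments, here is a hypothetical mapping sketch — the preset names mirror the table above, but the mapping and the `preset_to_cli` helper are illustrative, not the app's actual source:

```python
# Hypothetical preset-to-builder mapping (illustrative, not the app's source).
# Accuracy level 2 corresponds to fp16 activations per the builder's scheme.
PRESETS = {
    "FP16 - Recommended": {"precision": "fp16", "extra_options": {}},
    "FP32 - Full Precision": {"precision": "fp32", "extra_options": {}},
    "BF16 - Brain Float": {"precision": "bf16", "extra_options": {}},
    "INT4 - 4-bit Quantized": {"precision": "int4", "extra_options": {}},
    "INT4 + FP16 Activations": {
        "precision": "int4",
        "extra_options": {"int4_accuracy_level": 2},
    },
}

def preset_to_cli(name: str) -> list:
    """Translate a preset into builder command-line arguments."""
    cfg = PRESETS[name]
    args = ["-p", cfg["precision"]]
    for key, value in cfg["extra_options"].items():
        args += ["--extra_options", f"{key}={value}"]
    return args

print(preset_to_cli("INT4 + FP16 Activations"))
# ['-p', 'int4', '--extra_options', 'int4_accuracy_level=2']
```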
When selecting "Custom - Manual Configuration", you can configure:
**Precision:**
- fp16: Half precision (recommended for most GPUs)
- fp32: Full precision (CPU or older GPUs)
- bf16: BFloat16 (newer GPUs, better for Gemma models)
- int4: 4-bit quantization (smallest size, faster inference)

**Execution provider:**
- cuda: NVIDIA GPUs
- cpu: CPU inference
- dml: DirectML (Windows GPU)
- webgpu: Web browsers with WebGPU support

**Extra options:**
- int4_block_size: Block size for quantization (16, 32, 64, 128, 256)
- int4_is_symmetric: Symmetric (int4) vs asymmetric (uint4) quantization
- int4_accuracy_level: Accuracy level (0-4, where 4=int8, 3=bf16, 2=fp16, 1=fp32)
- exclude_embeds: Remove the embedding layer (use inputs_embeds instead)
- exclude_lm_head: Remove the LM head (output hidden_states instead)
- enable_cuda_graph: Enable CUDA graph capture (CUDA only)
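Putting a few of these options together, a custom INT4 conversion might look like the following on the command line. This is a sketch: the option names come from the list above and are passed via the builder's `--extra_options` flag, but exact spellings may vary between onnxruntime-genai versions.

```shell
# Sketch: custom INT4 conversion with explicit quantization options
# (option names from the list above; verify against your builder version)
python -m onnxruntime_genai.models.builder \
  -m Qwen/Qwen3-0.5B-Instruct \
  -o ./output_int4 \
  -p int4 \
  -e cuda \
  --extra_options int4_block_size=32 int4_accuracy_level=2
```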
```python
# This is what the Streamlit app does internally:
from onnxruntime_genai.models.builder import create_model

create_model(
    model_name="Qwen/Qwen3-0.5B-Instruct",
    input_path="",  # empty string: download from Hugging Face
    output_dir="./qwen3_onnx",
    precision="fp16",
    execution_provider="cuda",
    cache_dir="./cache",
    hf_token="your_token_here",
)
```

After conversion, use the model with ONNX Runtime GenAI:
```python
import onnxruntime_genai as og

# Load the model
model = og.Model("./qwen3_onnx")
tokenizer = og.Tokenizer(model)

# Generate text
prompt = "What is the capital of France?"
tokens = tokenizer.encode(prompt)

params = og.GeneratorParams(model)
params.set_search_options(max_length=200)
params.input_ids = tokens

generator = og.Generator(model, params)
while not generator.is_done():
    generator.compute_logits()
    generator.generate_next_token()

output_tokens = generator.get_sequence(0)
text = tokenizer.decode(output_tokens)
print(text)
```

Solution: Install onnxruntime-genai:
```bash
pip install onnxruntime-genai
```

Solution: Check whether your model's architecture appears in the supported list above. You can find the architecture in the model's `config.json` file on Hugging Face.

Solution: Try INT4 quantization or a smaller batch size:

- Use `--precision int4`
- Reduce model size by excluding components with `exclude_embeds` or `exclude_lm_head`

Solution: Gemma models work better with BF16 precision:

```bash
--precision bf16
```

Or, for INT4 quantization with BF16 I/O:

```bash
--precision int4 --extra_options use_cuda_bf16=true
```

If you're migrating from the old app.py:
| Old (Transformers.js) | New (ONNX Runtime GenAI) |
|---|---|
| `--quantize` | `--precision int4` |
| `--task <task>` | Auto-detected from model config |
| `--trust_remote_code` | `--extra_options hf_remote=true` (default) |
| `--output_attentions` | Not applicable (different architecture) |
| Output: `models/<model_id>/` | Output: specified output directory |
This converter uses the ONNX Runtime GenAI builder, which is licensed under the MIT License.