| title | emoji | colorFrom | colorTo | sdk | app_port | pinned | license | short_description |
|---|---|---|---|---|---|---|---|---|
| Convert to ONNX | ☯ | indigo | yellow | docker | 8501 | true | apache-2.0 | Convert a Hugging Face model to ONNX format |
Improved v2 based on huggingface/transformers.js#1361
The original ONNX converter does not support the latest Qwen3 and Gemma models. This Streamlit app is an updated version that uses the modern ONNX Runtime GenAI builder instead of the old Transformers.js conversion scripts, which adds support for newer models such as Qwen3, Gemma3, Phi4, SmolLM3, and more.
**Old converter:**
- Used Transformers.js conversion scripts
- Limited to older model architectures (up to Qwen2)
- Required a bundled transformers.js repository

**New converter:**
- Uses the ONNX Runtime GenAI builder from Microsoft
- Supports modern architectures, including:
  - Qwen3 ✨
  - Gemma3 (text and multimodal)
  - Phi4
  - SmolLM3
  - And many more (see the full list below)
- Integrates directly with the onnxruntime-genai package
- Offers more advanced quantization options
The new converter supports the following architectures:
| Architecture | Model Family |
|---|---|
| Qwen3ForCausalLM | Qwen3 ✨ |
| Qwen2ForCausalLM | Qwen2 |
| Qwen2_5_VLForConditionalGeneration | Qwen2.5-VL |
| Gemma3ForCausalLM | Gemma 3 |
| Gemma2ForCausalLM | Gemma 2 |
| GemmaForCausalLM | Gemma |
| Phi4MMForCausalLM | Phi-4 |
| Phi3ForCausalLM | Phi-3 |
| Phi3VForCausalLM | Phi-3 Vision |
| PhiMoEForCausalLM | Phi-3 MoE |
| SmolLM3ForCausalLM | SmolLM3 |
| LlamaForCausalLM | Llama |
| MistralForCausalLM | Mistral |
| ChatGLMForConditionalGeneration | ChatGLM |
| GraniteForCausalLM | Granite |
| NemotronForCausalLM | Nemotron |
| OlmoForCausalLM | OLMo |
| Ernie4_5_ForCausalLM | Ernie |
| GptOssForCausalLM | GPT-OSS |
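A quick way to check whether a model is covered is to compare the `architectures` field of its config.json against this table. A minimal stdlib sketch — the supported set is copied from the table above, and the inline config payload is illustrative rather than fetched from the Hub:

```python
import json

# Architectures supported by the new builder (from the table above)
SUPPORTED = {
    "Qwen3ForCausalLM", "Qwen2ForCausalLM", "Qwen2_5_VLForConditionalGeneration",
    "Gemma3ForCausalLM", "Gemma2ForCausalLM", "GemmaForCausalLM",
    "Phi4MMForCausalLM", "Phi3ForCausalLM", "Phi3VForCausalLM", "PhiMoEForCausalLM",
    "SmolLM3ForCausalLM", "LlamaForCausalLM", "MistralForCausalLM",
    "ChatGLMForConditionalGeneration", "GraniteForCausalLM", "NemotronForCausalLM",
    "OlmoForCausalLM", "Ernie4_5_ForCausalLM", "GptOssForCausalLM",
}

# Example config.json contents, as found in a model repo on Hugging Face
config = json.loads('{"architectures": ["Qwen3ForCausalLM"], "model_type": "qwen3"}')

arch = config["architectures"][0]
print(f"{arch}: {'supported' if arch in SUPPORTED else 'not supported'}")
```

For a real model, download its config.json from the Hub (or open it in the browser) and run the same check.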
```bash
pip install -r requirements.txt
```

If you need the latest builder from the ONNX Runtime GenAI repository:

```bash
# Install base requirements
pip install huggingface_hub streamlit PyYAML torch transformers onnx

# Clone and install onnxruntime-genai
git clone https://github.com/microsoft/onnxruntime-genai.git
cd onnxruntime-genai/src/python
pip install -e .
```

Run the app:

```bash
streamlit run app.py
```

Then:
- Enter a Hugging Face model ID (e.g., `Qwen/Qwen3-0.5B-Instruct`)
- Select a conversion preset (or choose Custom for manual configuration)
- Select an execution provider (cuda, cpu, dml, webgpu)
- Configure advanced options (optional, when using the Custom preset)
- Click "Start Conversion"
The app includes 9 presets for common conversion scenarios:
- FP16 - Recommended (GPU) - Default, best for most models
- FP32 - Full Precision (CPU) - Maximum quality for CPU inference
- BF16 - Brain Float (Gemma/Phi) - Optimized for Gemma and Phi models
- INT4 - 4-bit Quantized - Smallest size for mobile/edge
- INT4 + INT8 Activations - Balanced quantization
- INT4 + BF16 Activations - Quantized Gemma/Phi optimization
- INT4 + FP16 Activations - Standard GPU quantization
- UINT4 - Asymmetric Quantization - Alternative quantization method
- Custom - Manual Configuration - Full control over all settings
See PRESETS_GUIDE.md for detailed information about each preset.
If you have the builder installed, you can also use it directly:

```bash
python -m onnxruntime_genai.models.builder \
  -m Qwen/Qwen3-0.5B-Instruct \
  -o ./output \
  -p fp16 \
  -e cuda \
  -c ./cache_dir
```

The app includes preset configurations that automatically set the optimal parameters:
| Preset | Output Format | Best For |
|---|---|---|
| FP16 - Recommended | model.onnx (FP16) | Most models, NVIDIA GPUs |
| FP32 - Full Precision | model.onnx (FP32) | CPU inference, maximum accuracy |
| BF16 - Brain Float | model.onnx (BF16) | Gemma, Phi, modern GPUs |
| INT4 - 4-bit Quantized | model.onnx (INT4) | Mobile, edge devices |
| INT4 + INT8 Activations | model.onnx (INT4/INT8) | Balanced quantization |
| INT4 + BF16 Activations | model.onnx (INT4/BF16) | Gemma/Phi with quantization |
| INT4 + FP16 Activations | model.onnx (INT4/FP16) | Standard GPU quantization |
| UINT4 - Asymmetric | model.onnx (UINT4) | Alternative quantization |
Note: All presets output to model.onnx (or decoder_model.onnx). The precision/quantization format is embedded in the file, not in the filename.
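To illustrate how a preset boils down to builder arguments, here is a hypothetical mapping sketch — the preset names mirror the table above, but the mapping and the `preset_to_cli` helper are illustrative, not the app's actual source:

```python
# Hypothetical preset-to-builder mapping (illustrative, not the app's source).
# Accuracy level 2 corresponds to fp16 activations per the builder's scheme.
PRESETS = {
    "FP16 - Recommended": {"precision": "fp16", "extra_options": {}},
    "FP32 - Full Precision": {"precision": "fp32", "extra_options": {}},
    "BF16 - Brain Float": {"precision": "bf16", "extra_options": {}},
    "INT4 - 4-bit Quantized": {"precision": "int4", "extra_options": {}},
    "INT4 + FP16 Activations": {
        "precision": "int4",
        "extra_options": {"int4_accuracy_level": 2},
    },
}

def preset_to_cli(name: str) -> list:
    """Translate a preset into builder command-line arguments."""
    cfg = PRESETS[name]
    args = ["-p", cfg["precision"]]
    for key, value in cfg["extra_options"].items():
        args += ["--extra_options", f"{key}={value}"]
    return args

print(preset_to_cli("INT4 + FP16 Activations"))
# ['-p', 'int4', '--extra_options', 'int4_accuracy_level=2']
```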
When selecting "Custom - Manual Configuration", you can configure:
**Precision:**
- fp16: Half precision (recommended for most GPUs)
- fp32: Full precision (CPU or older GPUs)
- bf16: BFloat16 (newer GPUs, better for Gemma models)
- int4: 4-bit quantization (smallest size, faster inference)

**Execution provider:**
- cuda: NVIDIA GPUs
- cpu: CPU inference
- dml: DirectML (Windows GPU)
- webgpu: Web browsers with WebGPU support

**Extra options:**
- int4_block_size: Block size for quantization (16, 32, 64, 128, 256)
- int4_is_symmetric: Symmetric (int4) vs asymmetric (uint4) quantization
- int4_accuracy_level: Accuracy level (0-4, where 4=int8, 3=bf16, 2=fp16, 1=fp32)
- exclude_embeds: Remove the embedding layer (use inputs_embeds instead)
- exclude_lm_head: Remove the LM head (output hidden_states instead)
- enable_cuda_graph: Enable CUDA graph capture (CUDA only)
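Putting a few of these options together, a custom INT4 conversion might look like the following on the command line. This is a sketch: the option names come from the list above and are passed via the builder's `--extra_options` flag, but exact spellings may vary between onnxruntime-genai versions.

```shell
# Sketch: custom INT4 conversion with explicit quantization options
# (option names from the list above; verify against your builder version)
python -m onnxruntime_genai.models.builder \
  -m Qwen/Qwen3-0.5B-Instruct \
  -o ./output_int4 \
  -p int4 \
  -e cuda \
  --extra_options int4_block_size=32 int4_accuracy_level=2
```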
```python
# This is what the Streamlit app does internally:
from onnxruntime_genai.models.builder import create_model

create_model(
    model_name="Qwen/Qwen3-0.5B-Instruct",
    input_path="",  # empty string: download from Hugging Face
    output_dir="./qwen3_onnx",
    precision="fp16",
    execution_provider="cuda",
    cache_dir="./cache",
    hf_token="your_token_here",
)
```

After conversion, use the model with ONNX Runtime GenAI:
```python
import onnxruntime_genai as og

# Load the model
model = og.Model("./qwen3_onnx")
tokenizer = og.Tokenizer(model)

# Generate text
prompt = "What is the capital of France?"
tokens = tokenizer.encode(prompt)

params = og.GeneratorParams(model)
params.set_search_options(max_length=200)
params.input_ids = tokens

generator = og.Generator(model, params)
while not generator.is_done():
    generator.compute_logits()
    generator.generate_next_token()

output_tokens = generator.get_sequence(0)
text = tokenizer.decode(output_tokens)
print(text)
```

Solution: Install onnxruntime-genai:
```bash
pip install onnxruntime-genai
```

Solution: Check whether your model's architecture appears in the supported list above. You can find the architecture in the model's `config.json` file on Hugging Face.

Solution: Try INT4 quantization or a smaller batch size:

- Use `--precision int4`
- Reduce model size by excluding components with `exclude_embeds` or `exclude_lm_head`

Solution: Gemma models work better with BF16 precision:

```bash
--precision bf16
```

Or, for INT4 quantization with BF16 I/O:

```bash
--precision int4 --extra_options use_cuda_bf16=true
```

If you're migrating from the old app.py:
| Old (Transformers.js) | New (ONNX Runtime GenAI) |
|---|---|
| `--quantize` | `--precision int4` |
| `--task <task>` | Auto-detected from model config |
| `--trust_remote_code` | `--extra_options hf_remote=true` (default) |
| `--output_attentions` | Not applicable (different architecture) |
| Output: `models/<model_id>/` | Output: specified output directory |
This converter uses the ONNX Runtime GenAI builder, which is licensed under the MIT License.