This tool evaluates the perplexity of ONNX Runtime GenAI models and HuggingFace models using the WikiText-2 dataset. Perplexity is a standard metric for language models: lower values indicate better predictive performance.
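Concretely, perplexity is the exponential of the average per-token negative log-likelihood. A toy illustration in Python (the NLL values are made up):

```python
import math

# Per-token negative log-likelihoods (in nats) -- toy values.
token_nlls = [2.1, 1.7, 1.9, 2.4]

# Perplexity = exp(mean NLL); lower means the model assigns
# higher probability to the observed text.
perplexity = math.exp(sum(token_nlls) / len(token_nlls))
print(f"{perplexity:.2f}")  # ~7.58
```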
This script is originally based on `perplexity_metrics.py` from the Microsoft ONNX Runtime GenAI repository. It has been modified to handle:
- Multiple context lengths
- Configurable chunk sizes
- Enhanced prefill chunking handling
- HuggingFace model evaluation support
- `perplexity_metrics.py`: Core evaluation logic for computing perplexity.
- `run_perplexity.py`: Command-line utility for evaluating one or more models and saving results to CSV.
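The core evaluation follows the standard strided sliding-window scheme: the tokenized corpus is scored in windows of `max_length`, advancing by `stride`, with overlapping tokens masked out so each token is scored exactly once. A minimal sketch of that scheme using HuggingFace Transformers (illustrative only, not the exact code in `perplexity_metrics.py`):

```python
import torch

def strided_perplexity(model, input_ids, max_length=1024, stride=512):
    """Score input_ids in overlapping windows; tokens already scored in a
    previous window are masked with -100 so they don't contribute twice."""
    nll_sum, n_tokens, prev_end = 0.0, 0, 0
    seq_len = input_ids.size(1)
    for begin in range(0, seq_len, stride):
        end = min(begin + max_length, seq_len)
        window = input_ids[:, begin:end]
        labels = window.clone()
        labels[:, : prev_end - begin] = -100  # mask the overlap
        with torch.no_grad():
            loss = model(window, labels=labels).loss  # mean NLL of new tokens
        new_tokens = end - prev_end
        nll_sum += loss.item() * new_tokens
        n_tokens += new_tokens
        prev_end = end
        if end == seq_len:
            break
    return float(torch.exp(torch.tensor(nll_sum / n_tokens)))
```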
- Python 3.8+
- CUDA 12.x (if using GPU acceleration)
- Install dependencies (for CUDA 12.x; recommended for CUDA 12.1-12.9):

  ```bash
  pip install -r requirements.txt
  ```

- Install ONNX Runtime GenAI (required for ONNX model evaluation):

  ```bash
  pip install onnxruntime-genai
  ```

- HuggingFace CLI login is required to access the WikiText-2 dataset:

  ```bash
  huggingface-cli login
  ```
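If you prefer to authenticate from Python, the `huggingface_hub` and `datasets` packages can be used directly; a small sketch (the script presumably loads the raw WikiText-2 variant):

```python
from huggingface_hub import login
from datasets import load_dataset

login()  # prompts for a token; equivalent to `huggingface-cli login`

# WikiText-2 test split, the corpus used for evaluation.
test = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")
print(len(test), "rows")
```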
- Any ONNX Runtime GenAI model exported with a compatible `genai_config.json` and tokenizer.
- Supported architectures include: Gemma, Llama, Mistral, Phi (language + vision), Qwen.
- Supported execution providers: CPU, DirectML, CUDA, NvTensorRtRtx.
- Any HuggingFace causal language model (e.g., `meta-llama/Llama-2-7b-hf`, `gpt2`, `mistralai/Mistral-7B-v0.1`).
- Models are automatically downloaded from the HuggingFace Hub if not cached locally.
- Supports custom data types (float16, bfloat16, float32) for efficient inference.
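For reference, loading a HuggingFace model at a given dtype looks roughly like this (a sketch, not the script's exact code):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(name)

# torch_dtype selects the weight precision (here float16 for GPU inference).
model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.float16)
model.to("cuda").eval()
```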
Evaluate a single model:

```bash
python run_perplexity.py --models /path/to/model
```

Evaluate multiple models:

```bash
python run_perplexity.py --models /path/to/model1 /path/to/model2
```

You can specify the input sequence length(s) to evaluate using the `--i` argument:

```bash
python run_perplexity.py --models /path/to/model --i 1024,2048,4096,8192,12288
```

You can specify the prefill chunk size to evaluate using the `--chunk_size` argument:

```bash
python run_perplexity.py --models /path/to/model --i 1024,2048,4096,8192,12288 --chunk_size=1024
```

Evaluate a HuggingFace model:

```bash
python run_perplexity.py --hf_model meta-llama/Llama-2-7b-hf --i 1024
```

Evaluate with a specific data type:

```bash
python run_perplexity.py --hf_model meta-llama/Llama-2-7b-hf --hf_dtype float16 --i 1024
```

Evaluate at multiple input lengths:

```bash
python run_perplexity.py --hf_model meta-llama/Llama-2-7b-hf --hf_dtype float16 --i 1024,2048,4096
```

Evaluate on CPU:

```bash
python run_perplexity.py --hf_model gpt2 --hf_device cpu --i 1024
```

Compare ONNX and HuggingFace models side-by-side:
```bash
python run_perplexity.py \
    --models /path/to/onnx_model \
    --hf_model meta-llama/Llama-2-7b-hf \
    --hf_dtype float16 \
    --i 1024 \
    --output comparison_results.csv
```

HuggingFace-specific options:

- `--hf_model`: HuggingFace model name or local path (e.g., `meta-llama/Llama-2-7b-hf`)
- `--hf_device`: Device to run on (`cuda`, `cpu`, `cuda:0`, etc.). Default: `cuda`
- `--hf_dtype`: Data type for model weights. Options: `float16`, `bfloat16`, `float32`, `fp16`, `bf16`, `fp32`. Default: the model's default (usually float32)
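The long and short dtype aliases presumably resolve to the same torch dtypes; a hypothetical mapping:

```python
import torch

# Hypothetical alias table for --hf_dtype; the script's actual
# mapping may differ in detail.
DTYPE_ALIASES = {
    "float16": torch.float16, "fp16": torch.float16,
    "bfloat16": torch.bfloat16, "bf16": torch.bfloat16,
    "float32": torch.float32, "fp32": torch.float32,
}
```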
Save results to a CSV file:

```bash
python run_perplexity.py --models /path/to/model --output results.csv
```

Reported perplexity scores typically fall between 2 and 1000; lower is better. See the interpretation ranges below.
- If kv_chunking is enabled in the model configuration (i.e., `"chunk_size"` is present in the `"search"` section of `genai_config.json`), then (see the sketch after this list):
  - `max_input_seq_length` is set to 8192
  - `stride` is set to the value of `chunk_size`
- If kv_chunking is not enabled (default):
  - `max_input_seq_length` is 1024
  - `stride` is 512
- Default `max_length` is 1024
- Default `stride` is 512 (or `chunk_size` if specified)
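A minimal sketch of how these defaults could be derived from `genai_config.json` (the helper name is hypothetical):

```python
import json
import os

def resolve_window(model_dir):
    """Return (max_input_seq_length, stride) per the rules above:
    kv_chunking is enabled when "chunk_size" appears in the
    "search" section of genai_config.json."""
    with open(os.path.join(model_dir, "genai_config.json")) as f:
        config = json.load(f)
    chunk_size = config.get("search", {}).get("chunk_size")
    if chunk_size is not None:
        return 8192, chunk_size
    return 1024, 512
```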
Example output for a HuggingFace model:

```
============================================================
Evaluating HuggingFace model: meta-llama/Llama-2-7b-hf
============================================================
[INFO] Loading Wikitext-2 'test' split ...
[TOKENIZER] Tokenizing ...
[RESULT] Perplexity of meta-llama/Llama-2-7b-hf: 5.47
HuggingFace perplexity evaluation completed
```

Example output for an ONNX model:

```
============================================================
Evaluating perplexity for: /path/to/onnx_model
============================================================
[INFO] Loading Wikitext-2 'test' split ...
[TOKENIZER] Tokenizing ...
[RESULT] Perplexity of /path/to/onnx_model: 5.48
Perplexity evaluation completed successfully
```
The generated CSV file contains:
- Model Path (model directory or HuggingFace model name)
- Model Type (ONNX or HuggingFace)
- Input Length
- Perplexity score
- Status (Success/Failed)
- Error details (if any)
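Assuming the headers match the column list above (the exact names are an assumption), the CSV can be post-processed like so:

```python
import csv

with open("results.csv", newline="") as f:
    for row in csv.DictReader(f):
        if row["Status"] == "Success":
            print(row["Model Path"], row["Input Length"], row["Perplexity"])
```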
Set `DEBUG = True` in `perplexity_metrics.py` for detailed logs.
- Excellent: 2-20
- Good: 20-40
- OK: 40-80
- Poor: 100+
Verify that your exported ONNX model has similar perplexity to the original HuggingFace model:

```bash
python run_perplexity.py \
    --models /path/to/exported_onnx_model \
    --hf_model meta-llama/Llama-2-7b-hf \
    --hf_dtype float16 \
    --i 1024 \
    --output validation_results.csv
```

Run a quick sanity check with a small model:

```bash
python run_perplexity.py --hf_model gpt2 --hf_dtype float16 --i 1024
```

Compare quantized variants against the original model:

```bash
python run_perplexity.py \
    --models /path/to/fp16_model /path/to/int8_model /path/to/int4_model \
    --hf_model original/model-name \
    --hf_dtype float16 \
    --i 2048 \
    --output quantization_comparison.csv
```