This tool evaluates the perplexity of ONNX Runtime GenAI models and HuggingFace models using the WikiText-2 dataset. Perplexity is a standard metric for language models: lower values indicate better predictive performance.
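Concretely, perplexity is the exponential of the average per-token negative log-likelihood. A toy illustration in Python (the NLL values are made up):

```python
import math

# Per-token negative log-likelihoods (in nats) -- toy values.
token_nlls = [2.1, 1.7, 1.9, 2.4]

# Perplexity = exp(mean NLL); lower means the model assigns
# higher probability to the observed text.
perplexity = math.exp(sum(token_nlls) / len(token_nlls))
print(f"{perplexity:.2f}")  # ~7.58
```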
This script is originally based on `perplexity_metrics.py` from the Microsoft ONNX Runtime GenAI repository. It has been modified to handle:
- Multiple context lengths
- Configurable chunk sizes
- Enhanced prefill chunking handling
- HuggingFace model evaluation support
- `perplexity_metrics.py`: Core evaluation logic for computing perplexity.
- `run_perplexity.py`: Command-line utility for evaluating one or more models and saving results to CSV.
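The core evaluation follows the standard strided sliding-window scheme: the tokenized corpus is scored in windows of `max_length`, advancing by `stride`, with overlapping tokens masked out so each token is scored exactly once. A minimal sketch of that scheme using HuggingFace Transformers (illustrative only, not the exact code in `perplexity_metrics.py`):

```python
import torch

def strided_perplexity(model, input_ids, max_length=1024, stride=512):
    """Score input_ids in overlapping windows; tokens already scored in a
    previous window are masked with -100 so they don't contribute twice."""
    nll_sum, n_tokens, prev_end = 0.0, 0, 0
    seq_len = input_ids.size(1)
    for begin in range(0, seq_len, stride):
        end = min(begin + max_length, seq_len)
        window = input_ids[:, begin:end]
        labels = window.clone()
        labels[:, : prev_end - begin] = -100  # mask the overlap
        with torch.no_grad():
            loss = model(window, labels=labels).loss  # mean NLL of new tokens
        new_tokens = end - prev_end
        nll_sum += loss.item() * new_tokens
        n_tokens += new_tokens
        prev_end = end
        if end == seq_len:
            break
    return float(torch.exp(torch.tensor(nll_sum / n_tokens)))
```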
- Python 3.8+
- CUDA 12.x (if using GPU acceleration)
- Install dependencies (for CUDA 12.x; recommended for CUDA 12.1-12.9):

  ```bash
  pip install -r requirements.txt
  ```

- Install ONNX Runtime GenAI (required for ONNX model evaluation):

  ```bash
  pip install onnxruntime-genai
  ```

- HuggingFace CLI login is required to access the WikiText-2 dataset:

  ```bash
  huggingface-cli login
  ```
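If you prefer to authenticate from Python, the `huggingface_hub` and `datasets` packages can be used directly; a small sketch (the script presumably loads the raw WikiText-2 variant):

```python
from huggingface_hub import login
from datasets import load_dataset

login()  # prompts for a token; equivalent to `huggingface-cli login`

# WikiText-2 test split, the corpus used for evaluation.
test = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")
print(len(test), "rows")
```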
- Any ONNX Runtime GenAI model exported with a compatible `genai_config.json` and tokenizer.
- Supported architectures include: Gemma, Llama, Mistral, Phi (language + vision), Qwen.
- Supported execution providers: CPU, DirectML, CUDA, NvTensorRtRtx.
- Any HuggingFace causal language model (e.g., `meta-llama/Llama-2-7b-hf`, `gpt2`, `mistralai/Mistral-7B-v0.1`).
- Models are automatically downloaded from the HuggingFace Hub if not cached locally.
- Supports custom data types (float16, bfloat16, float32) for efficient inference.
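For reference, loading a HuggingFace model at a given dtype looks roughly like this (a sketch, not the script's exact code):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(name)

# torch_dtype selects the weight precision (here float16 for GPU inference).
model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.float16)
model.to("cuda").eval()
```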
Evaluate a single model:

```bash
python run_perplexity.py --models /path/to/model
```

Evaluate multiple models:

```bash
python run_perplexity.py --models /path/to/model1 /path/to/model2
```

You can specify the input sequence length(s) to evaluate using the `--i` argument:

```bash
python run_perplexity.py --models /path/to/model --i 1024,2048,4096,8192,12288
```

You can specify the prefill chunk size to evaluate using the `--chunk_size` argument:

```bash
python run_perplexity.py --models /path/to/model --i 1024,2048,4096,8192,12288 --chunk_size=1024
```

Evaluate a HuggingFace model:

```bash
python run_perplexity.py --hf_model meta-llama/Llama-2-7b-hf --i 1024
```

Evaluate with a specific data type:

```bash
python run_perplexity.py --hf_model meta-llama/Llama-2-7b-hf --hf_dtype float16 --i 1024
```

Evaluate at multiple input lengths:

```bash
python run_perplexity.py --hf_model meta-llama/Llama-2-7b-hf --hf_dtype float16 --i 1024,2048,4096
```

Evaluate on CPU:

```bash
python run_perplexity.py --hf_model gpt2 --hf_device cpu --i 1024
```

Compare ONNX and HuggingFace models side-by-side:
```bash
python run_perplexity.py \
    --models /path/to/onnx_model \
    --hf_model meta-llama/Llama-2-7b-hf \
    --hf_dtype float16 \
    --i 1024 \
    --output comparison_results.csv
```

HuggingFace-specific options:

- `--hf_model`: HuggingFace model name or local path (e.g., `meta-llama/Llama-2-7b-hf`)
- `--hf_device`: Device to run on (`cuda`, `cpu`, `cuda:0`, etc.). Default: `cuda`
- `--hf_dtype`: Data type for model weights. Options: `float16`, `bfloat16`, `float32`, `fp16`, `bf16`, `fp32`. Default: the model's default (usually float32)
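The long and short dtype aliases presumably resolve to the same torch dtypes; a hypothetical mapping:

```python
import torch

# Hypothetical alias table for --hf_dtype; the script's actual
# mapping may differ in detail.
DTYPE_ALIASES = {
    "float16": torch.float16, "fp16": torch.float16,
    "bfloat16": torch.bfloat16, "bf16": torch.bfloat16,
    "float32": torch.float32, "fp32": torch.float32,
}
```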
Save results to a CSV file:

```bash
python run_perplexity.py --models /path/to/model --output results.csv
```

Reported perplexity scores typically fall between 2 and 1000; lower is better. See the interpretation ranges below.
- If kv_chunking is enabled in the model configuration (i.e., `"chunk_size"` is present in the `"search"` section of `genai_config.json`), then (see the sketch after this list):
  - `max_input_seq_length` is set to 8192
  - `stride` is set to the value of `chunk_size`
- If kv_chunking is not enabled (default):
  - `max_input_seq_length` is 1024
  - `stride` is 512
- Default `max_length` is 1024
- Default `stride` is 512 (or `chunk_size` if specified)
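A minimal sketch of how these defaults could be derived from `genai_config.json` (the helper name is hypothetical):

```python
import json
import os

def resolve_window(model_dir):
    """Return (max_input_seq_length, stride) per the rules above:
    kv_chunking is enabled when "chunk_size" appears in the
    "search" section of genai_config.json."""
    with open(os.path.join(model_dir, "genai_config.json")) as f:
        config = json.load(f)
    chunk_size = config.get("search", {}).get("chunk_size")
    if chunk_size is not None:
        return 8192, chunk_size
    return 1024, 512
```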
Example output for a HuggingFace model:

```
============================================================
Evaluating HuggingFace model: meta-llama/Llama-2-7b-hf
============================================================
[INFO] Loading Wikitext-2 'test' split ...
[TOKENIZER] Tokenizing ...
[RESULT] Perplexity of meta-llama/Llama-2-7b-hf: 5.47
HuggingFace perplexity evaluation completed
```

Example output for an ONNX model:

```
============================================================
Evaluating perplexity for: /path/to/onnx_model
============================================================
[INFO] Loading Wikitext-2 'test' split ...
[TOKENIZER] Tokenizing ...
[RESULT] Perplexity of /path/to/onnx_model: 5.48
Perplexity evaluation completed successfully
```
The generated CSV file contains:
- Model Path (model directory or HuggingFace model name)
- Model Type (ONNX or HuggingFace)
- Input Length
- Perplexity score
- Status (Success/Failed)
- Error details (if any)
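Assuming the headers match the column list above (the exact names are an assumption), the CSV can be post-processed like so:

```python
import csv

with open("results.csv", newline="") as f:
    for row in csv.DictReader(f):
        if row["Status"] == "Success":
            print(row["Model Path"], row["Input Length"], row["Perplexity"])
```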
Set `DEBUG = True` in `perplexity_metrics.py` for detailed logs.
- Excellent: 2-20
- Good: 20-40
- OK: 40-80
- Poor: 100+
Verify that your exported ONNX model has similar perplexity to the original HuggingFace model:

```bash
python run_perplexity.py \
    --models /path/to/exported_onnx_model \
    --hf_model meta-llama/Llama-2-7b-hf \
    --hf_dtype float16 \
    --i 1024 \
    --output validation_results.csv
```

Run a quick sanity check with a small model:

```bash
python run_perplexity.py --hf_model gpt2 --hf_dtype float16 --i 1024
```

Compare quantized variants against the original model:

```bash
python run_perplexity.py \
    --models /path/to/fp16_model /path/to/int8_model /path/to/int4_model \
    --hf_model original/model-name \
    --hf_dtype float16 \
    --i 2048 \
    --output quantization_comparison.csv
```