Commit 0a85401

[4975376] Add support for HF model perplexity calculation
Signed-off-by: unknown <[email protected]>
1 parent 5a4a41e commit 0a85401

4 files changed: +484 -45 lines changed

examples/windows/accuracy_benchmark/perplexity_metrics/README.md

Lines changed: 114 additions & 22 deletions

@@ -2,7 +2,7 @@

## Overview

This tool evaluates the perplexity of ONNX Runtime GenAI models and HuggingFace models using the [WikiText-2](https://huggingface.co/datasets/wikitext) dataset. Perplexity is a standard metric for language models: lower values indicate better predictive performance.
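
For reference, the perplexity reported here is the exponentiated average negative log-likelihood the model assigns to the evaluation tokens:

$$
\mathrm{PPL} = \exp\left(-\frac{1}{N}\sum_{i=1}^{N}\log p_\theta\left(x_i \mid x_{<i}\right)\right)
$$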

## Attribution

@@ -11,6 +11,7 @@ This script is originally based on [perplexity_metrics.py](https://github.com/mi

- Multiple context lengths
- Configurable chunk sizes
- Enhanced prefill chunking handling
- HuggingFace model evaluation support

## Scripts

@@ -20,8 +21,10 @@ This script is originally based on [perplexity_metrics.py](https://github.com/mi

## Requirements

- Python 3.8+
- CUDA 12.x (if using GPU acceleration)
- Install dependencies:

**For CUDA 12.x (recommended for CUDA 12.1-12.9):**
```bash
pip install -r requirements.txt
```
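
If you plan to use GPU acceleration, a quick sanity check (assuming `requirements.txt` pulls in PyTorch) is to confirm that a CUDA device is visible:

```python
# Optional sanity check: confirm PyTorch can see a CUDA device.
import torch

print(torch.cuda.is_available())  # True if a usable GPU was found
print(torch.version.cuda)         # CUDA version PyTorch was built against
```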

@@ -34,53 +37,96 @@ This script is originally based on [perplexity_metrics.py](https://github.com/mi

## Supported Models

### ONNX Runtime GenAI Models
- Any ONNX Runtime GenAI model exported with a compatible `genai_config.json` and tokenizer.
- Supported architectures include: Gemma, Llama, Mistral, Phi (language + vision), Qwen.
- Supported execution providers: CPU, DirectML, CUDA, NvTensorRtRtx.

### HuggingFace Models
- Any HuggingFace causal language model (e.g., `meta-llama/Llama-2-7b-hf`, `gpt2`, `mistralai/Mistral-7B-v0.1`).
- Models are automatically downloaded from the HuggingFace Hub if not cached locally.
- Supports custom data types (float16, bfloat16, float32) for efficient inference.

## How to Run

### Evaluate ONNX Models

#### Single Model
```bash
python run_perplexity.py --models /path/to/model
```

#### Multiple Models
```bash
python run_perplexity.py --models /path/to/model1 /path/to/model2
```

#### Custom Input Sequence Length(s)
You can specify the input sequence length(s) to evaluate using the `--i` argument:

```bash
python run_perplexity.py --models /path/to/model --i 1024,2048,4096,8192,12288
```

#### Custom Prefill Chunk Size
You can specify the prefill chunk size to evaluate using the `--chunk_size` argument:

```bash
python run_perplexity.py --models /path/to/model --i 1024,2048,4096,8192,12288 --chunk_size=1024
```

### Evaluate HuggingFace Models

#### Basic HuggingFace Model Evaluation
```bash
python run_perplexity.py --hf_model meta-llama/Llama-2-7b-hf --i 1024
```

#### With Custom Data Type (Recommended for Performance)
```bash
python run_perplexity.py --hf_model meta-llama/Llama-2-7b-hf --hf_dtype float16 --i 1024
```

#### With Multiple Input Lengths
```bash
python run_perplexity.py --hf_model meta-llama/Llama-2-7b-hf --hf_dtype float16 --i 1024,2048,4096
```

#### On CPU (if no GPU available)
```bash
python run_perplexity.py --hf_model gpt2 --hf_device cpu --i 1024
```

### Evaluate Both ONNX and HuggingFace Models Together

Compare ONNX and HuggingFace models side-by-side:

```bash
python run_perplexity.py \
  --models /path/to/onnx_model \
  --hf_model meta-llama/Llama-2-7b-hf \
  --hf_dtype float16 \
  --i 1024 \
  --output comparison_results.csv
```

### HuggingFace Model Arguments

- `--hf_model`: HuggingFace model name or local path (e.g., `meta-llama/Llama-2-7b-hf`)
- `--hf_device`: Device to run on (`cuda`, `cpu`, `cuda:0`, etc.) - default: `cuda`
- `--hf_dtype`: Data type for model weights - options: `float16`, `bfloat16`, `float32`, `fp16`, `bf16`, `fp32` - default: model default (usually float32)
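
For intuition, these options map onto a standard `transformers` load roughly as sketched below; this is illustrative only, not the script's actual code, and the model name is just an example:

```python
# Illustrative sketch of what --hf_model / --hf_dtype / --hf_device imply:
# a standard transformers causal-LM load (not the script's actual code).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # --hf_model: Hub name or local path
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,  # --hf_dtype float16 / fp16
)
model.to("cuda")  # --hf_device, e.g. "cuda", "cuda:0", or "cpu"
model.eval()
```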

### Custom Output File

```bash
python run_perplexity.py --models /path/to/model --output results.csv
```

## Expected Output

Expected scores often fall between 2 and 1000; lower is better. See ranges below.

### Perplexity Configuration Setting (for ONNX models)

- If **kv_chunking** is enabled in the model configuration (i.e., `"chunk_size"` is present in the `"search"` section of `genai_config.json`), then:
  - `max_input_seq_length` is set to **8192**

@@ -89,36 +135,82 @@ Expected scores often fall between 2 and 1000; lower is better. See ranges below
  - `max_input_seq_length` is **1024**
  - `stride` is **512**
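
A rough sketch of this selection rule (illustrative only; not the script's actual code, and the stride used in the kv_chunking case is not shown in this excerpt):

```python
# Illustrative only: choose max_input_seq_length from genai_config.json,
# mirroring the kv_chunking rule described above.
import json
from pathlib import Path

def max_input_seq_length(model_dir: str) -> int:
    config = json.loads((Path(model_dir) / "genai_config.json").read_text())
    if "chunk_size" in config.get("search", {}):  # kv_chunking enabled
        return 8192
    return 1024  # otherwise the default, evaluated with stride 512
```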

### For HuggingFace Models

- Default `max_length` is **1024**
- Default `stride` is **512** (or `chunk_size` if specified)
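
These defaults correspond to the usual sliding-window perplexity recipe for causal LMs. A minimal sketch, assuming `torch`, `transformers`, and `datasets` are available; the dataset config name and loop details are assumptions, not the script's exact code:

```python
# Minimal sliding-window perplexity sketch for an HF causal LM (illustrative).
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name, max_length, stride = "gpt2", 1024, 512  # example model, README defaults
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

# WikiText-2 test split; the exact dataset config name here is an assumption.
text = "\n\n".join(load_dataset("wikitext", "wikitext-2-raw-v1", split="test")["text"])
input_ids = tokenizer(text, return_tensors="pt").input_ids

nlls, prev_end = [], 0
for begin in range(0, input_ids.size(1), stride):
    end = min(begin + max_length, input_ids.size(1))
    target_len = end - prev_end              # only score tokens not seen before
    window = input_ids[:, begin:end]
    targets = window.clone()
    targets[:, :-target_len] = -100          # mask the overlapping prefix
    with torch.no_grad():
        nll = model(window, labels=targets).loss * target_len
    nlls.append(nll)
    prev_end = end
    if end == input_ids.size(1):
        break

print("Perplexity:", torch.exp(torch.stack(nlls).sum() / prev_end).item())
```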

### Console Output

```text
============================================================
Evaluating HuggingFace model: meta-llama/Llama-2-7b-hf
============================================================
[INFO] Loading Wikitext-2 'test' split ...
[TOKENIZER] Tokenizing ...

[RESULT] Perplexity of meta-llama/Llama-2-7b-hf: 5.47

HuggingFace perplexity evaluation completed

============================================================
Evaluating perplexity for: /path/to/onnx_model
============================================================
[INFO] Loading Wikitext-2 'test' split ...
[TOKENIZER] Tokenizing ...

[RESULT] Perplexity of /path/to/onnx_model: 5.48

Perplexity evaluation completed successfully
```

### CSV Output

Generated file contains:

- Model Path (model directory or HuggingFace model name)
- Model Type (ONNX or HuggingFace)
- Input Length
- Perplexity score
- Status (Success/Failed)
- Error details (if any)
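
To compare runs programmatically, you can load the CSV with pandas; the column names below are assumptions based on the fields listed above, so adjust them to match the generated header:

```python
# Illustrative: load the results CSV and rank successful runs by perplexity.
# Column names are assumptions; check the actual CSV header first.
import pandas as pd

df = pd.read_csv("comparison_results.csv")
ok = df[df["Status"] == "Success"]
print(ok.sort_values("Perplexity")[["Model Path", "Model Type", "Input Length", "Perplexity"]])
```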

## Debug Mode

Set `DEBUG = True` in `perplexity_metrics.py` for detailed logs.

## Typical Perplexity Ranges

- Excellent: 2-20
- Good: 20-40
- OK: 40-80
- Poor: 100+

## Common Use Cases

### Compare ONNX vs. HuggingFace Model
Verify that your exported ONNX model has similar perplexity to the original HuggingFace model:

```bash
python run_perplexity.py \
  --models /path/to/exported_onnx_model \
  --hf_model meta-llama/Llama-2-7b-hf \
  --hf_dtype float16 \
  --i 1024 \
  --output validation_results.csv
```

### Evaluate Small Models (for quick testing)
```bash
python run_perplexity.py --hf_model gpt2 --hf_dtype float16 --i 1024
```

### Benchmark Multiple Quantization Variants
```bash
python run_perplexity.py \
  --models /path/to/fp16_model /path/to/int8_model /path/to/int4_model \
  --hf_model original/model-name \
  --hf_dtype float16 \
  --i 2048 \
  --output quantization_comparison.csv
```
