Commit 091116d

committed
[4975376] Add support for HF model perplexity calculation
Signed-off-by: unknown <[email protected]>
1 parent 71608d8 commit 091116d

File tree: 4 files changed, +36 -10 lines changed

- examples/windows/accuracy_benchmark/perplexity_metrics/README.md
- examples/windows/accuracy_benchmark/perplexity_metrics/perplexity_metrics.py
- examples/windows/accuracy_benchmark/perplexity_metrics/requirements.txt
- examples/windows/accuracy_benchmark/perplexity_metrics/run_perplexity.py

examples/windows/accuracy_benchmark/perplexity_metrics/README.md

Lines changed: 14 additions & 0 deletions
@@ -25,6 +25,7 @@ This script is originally based on [perplexity_metrics.py](https://github.com/mi
 - Install dependencies:
 
 **For CUDA 12.x (recommended for CUDA 12.1-12.9):**
+
 ```bash
 pip install -r requirements.txt
 ```
@@ -38,11 +39,13 @@ This script is originally based on [perplexity_metrics.py](https://github.com/mi
 ## Supported Models
 
 ### ONNX Runtime GenAI Models
+
 - Any ONNX Runtime GenAI model exported with a compatible `genai_config.json` and tokenizer.
 - Supported architectures include: Gemma, Llama, Mistral, Phi (language + vision), Qwen.
 - Supported execution providers: CPU, DirectML, CUDA, NvTensorRtRtx.
 
 ### HuggingFace Models
+
 - Any HuggingFace causal language model (e.g., `meta-llama/Llama-2-7b-hf`, `gpt2`, `mistralai/Mistral-7B-v0.1`).
 - Models are automatically downloaded from the HuggingFace Hub if not cached locally.
 - Supports custom data types (float16, bfloat16, float32) for efficient inference.
@@ -52,23 +55,27 @@ This script is originally based on [perplexity_metrics.py](https://github.com/mi
 ### Evaluate ONNX Models
 
 #### Single Model
+
 ```bash
 python run_perplexity.py --models /path/to/model
 ```
 
 #### Multiple Models
+
 ```bash
 python run_perplexity.py --models /path/to/model1 /path/to/model2
 ```
 
 #### Custom Input Sequence Length(s)
+
 You can specify the input sequence length(s) to evaluate using the `--i` argument:
 
 ```bash
 python run_perplexity.py --models /path/to/model --i 1024,2048,4096,8192,12288
 ```
 
 #### Custom Prefill Chunk Size
+
 You can specify the prefill chunk size to evaluate using the `--chunk_size` argument:
 
 ```bash
@@ -78,21 +85,25 @@ python run_perplexity.py --models /path/to/model --i 1024,2048,4096,8192,12288 -
 ### Evaluate HuggingFace Models
 
 #### Basic HuggingFace Model Evaluation
+
 ```bash
 python run_perplexity.py --hf_model meta-llama/Llama-2-7b-hf --i 1024
 ```
 
 #### With Custom Data Type (Recommended for Performance)
+
 ```bash
 python run_perplexity.py --hf_model meta-llama/Llama-2-7b-hf --hf_dtype float16 --i 1024
 ```
 
 #### With Multiple Input Lengths
+
 ```bash
 python run_perplexity.py --hf_model meta-llama/Llama-2-7b-hf --hf_dtype float16 --i 1024,2048,4096
 ```
 
 #### On CPU (if no GPU available)
+
 ```bash
 python run_perplexity.py --hf_model gpt2 --hf_device cpu --i 1024
 ```
@@ -189,6 +200,7 @@ Set `DEBUG = True` in `perplexity_metrics.py` for detailed logs.
 ## Common Use Cases
 
 ### Compare ONNX vs. HuggingFace Model
+
 Verify that your ONNX exported model has similar perplexity to the original HuggingFace model:
 
 ```bash
@@ -201,11 +213,13 @@ python run_perplexity.py \
 ```
 
 ### Evaluate Small Models (for quick testing)
+
 ```bash
 python run_perplexity.py --hf_model gpt2 --hf_dtype float16 --i 1024
 ```
 
 ### Benchmark Multiple Quantization Variants
+
 ```bash
 python run_perplexity.py \
   --models /path/to/fp16_model /path/to/int8_model /path/to/int4_model \
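As a refresher on the metric these commands report: perplexity is the exponential of the average negative log-probability the model assigns to each reference token, so lower is better. A minimal sketch with made-up numbers (not part of this commit):

```python
import numpy as np

# Hypothetical per-token log-probabilities (natural log) a model assigned
# to the reference tokens; values are invented for illustration only.
token_log_probs = np.array([-2.1, -0.7, -3.4, -1.2, -0.9])

# Perplexity = exp(-(average log-probability per token)); lower is better.
perplexity = np.exp(-token_log_probs.mean())
print(f"perplexity = {perplexity:.2f}")
```

Comparing this number between an ONNX export and the original HuggingFace checkpoint is what the "Compare ONNX vs. HuggingFace Model" recipe above automates.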

examples/windows/accuracy_benchmark/perplexity_metrics/perplexity_metrics.py

Lines changed: 18 additions & 7 deletions
@@ -114,9 +114,9 @@ def calculate_perplexity_hf(
     print(f"[INFO] Full input length: {seq_len}")
     print(f"[INFO] max_length: {max_length}, stride: {stride}")
 
-        max_eval_length = seq_len
+    max_eval_length = seq_len
 
-    # Initialize accumulators for log probabilities (same as ONNX version)
+    # Initialize accumulators for log probabilities
     total_log_probs = 0.0
     total_token_count = 0
     prev_end_loc = 0
@@ -127,14 +127,15 @@ def calculate_perplexity_hf(
         trg_len = end_loc - prev_end_loc
 
         if DEBUG:
-            print(f"\n[LOOP] chunk_idx={chunk_idx} [begin={begin_loc} end={end_loc}] trg_len={trg_len}")
+            print(
+                f"\n[LOOP] chunk_idx={chunk_idx} [begin={begin_loc} end={end_loc}] trg_len={trg_len}"
+            )
 
         # Extract the current chunk of input tokens (keep on CPU until needed)
         input_ids_chunk = input_ids[:, begin_loc:end_loc].to(device)
         target_ids = input_ids_chunk.clone()
 
         # Mask context tokens: only predict for last trg_len tokens in chunk
-        # This matches the ONNX version logic
         mask = np.ones(target_ids.shape, dtype=bool)
         mask[:, :-trg_len] = False
         target_ids_masked = target_ids.clone()
@@ -155,7 +156,7 @@ def calculate_perplexity_hf(
         if DEBUG:
             print(f"[LOGITS] Shape: {logits.shape}, dtype: {logits.dtype}")
 
-        # Compute log probabilities over vocabulary for each position (same as ONNX)
+        # Compute log probabilities over vocabulary for each position
         log_probs = torch.nn.functional.log_softmax(logits, dim=2).cpu().numpy()
         chunk_seq_len = log_probs.shape[1]
 
@@ -197,12 +198,22 @@ def calculate_perplexity_hf(
         total_token_count += int(valid_log_probs.size)
 
         if DEBUG:
-            print(f"[LOOP] This chunk: valid tokens={valid_log_probs.size}, sum={np.sum(valid_log_probs)}")
+            print(
+                f"[LOOP] This chunk: valid tokens={valid_log_probs.size}, sum={np.sum(valid_log_probs)}"
+            )
             print(f"[TALLY] total_log_probs: {total_log_probs}")
             print(f"[TALLY] total_token_count: {total_token_count}")
 
         # Clear GPU cache to prevent OOM
-        del outputs, logits, log_probs, pred_log_probs, input_ids_chunk, target_ids, target_ids_masked
+        del (
+            outputs,
+            logits,
+            log_probs,
+            pred_log_probs,
+            input_ids_chunk,
+            target_ids,
+            target_ids_masked,
+        )
         if device == "cuda":
             torch.cuda.empty_cache()
 
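For context on what `calculate_perplexity_hf` is doing in the hunks above: the text is walked in strides, each chunk re-feeds up to `max_length` tokens of context, and only the last `trg_len` positions of a chunk (the tokens no earlier chunk has scored) contribute to the running totals, which are finally turned into `exp(-mean log-prob)`. The sketch below shows that scheme in a self-contained form; names, signatures, and the gather-based token scoring are illustrative, not the repository's exact code:

```python
import math

import torch
import torch.nn.functional as F


def sliding_window_perplexity(model, input_ids, max_length=1024, stride=512, device="cpu"):
    """Strided perplexity: overlap chunks, but score only each chunk's new tokens."""
    seq_len = input_ids.size(1)
    total_log_prob, total_tokens, prev_end_loc = 0.0, 0, 0

    for begin_loc in range(0, seq_len, stride):
        end_loc = min(begin_loc + max_length, seq_len)
        trg_len = end_loc - prev_end_loc  # tokens not yet scored by a previous chunk
        chunk = input_ids[:, begin_loc:end_loc].to(device)

        with torch.no_grad():
            logits = model(chunk).logits  # (1, chunk_len, vocab_size)

        # Log-probability of each actual next token, for every position in the chunk.
        log_probs = F.log_softmax(logits[:, :-1, :], dim=-1)
        targets = chunk[:, 1:]
        picked = log_probs.gather(2, targets.unsqueeze(-1)).squeeze(-1)

        # Keep only the last trg_len predictions: the part that is new to this chunk.
        picked = picked[:, -trg_len:]
        total_log_prob += picked.sum().item()
        total_tokens += picked.numel()

        prev_end_loc = end_loc
        if end_loc == seq_len:
            break

    return math.exp(-total_log_prob / total_tokens)
```

The real function keeps the same two tallies (`total_log_probs`, `total_token_count`), masks context tokens explicitly, and frees the per-chunk tensors, which is what the reformatted `del (...)` block above is doing.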

examples/windows/accuracy_benchmark/perplexity_metrics/requirements.txt

Lines changed: 2 additions & 2 deletions
@@ -1,5 +1,6 @@
 # PyTorch with CUDA 12.x support (compatible with CUDA 12.1-12.9)
 --extra-index-url https://download.pytorch.org/whl/cu129
+accelerate
 
 coloredlogs
 datasets
@@ -15,9 +16,8 @@ pytest
 sentencepiece
 sympy
 torch>=2.0.0
-torchvision
 torchaudio
+torchvision
 transformers
-accelerate
 
 
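One note on the `accelerate` entry that moves to the top of the file: it is the package `transformers` relies on for automatic device placement when loading large checkpoints. A minimal sketch of loading a HuggingFace causal LM with a custom dtype, roughly the `--hf_model`/`--hf_dtype` path; the exact `from_pretrained` arguments the script uses are an assumption here, not taken from the diff:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "gpt2"  # any HF causal LM; a small model keeps the example cheap

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # corresponds to --hf_dtype float16
    device_map="auto",          # this option is what requires accelerate
)
model.eval()
```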

examples/windows/accuracy_benchmark/perplexity_metrics/run_perplexity.py

Lines changed: 2 additions & 1 deletion
@@ -116,7 +116,7 @@ def run_perplexity_on_models(
                     "Error": "None",
                 }
             )
-        except Exception as e:
+        except Exception as e:  # noqa: PERF203
            print(f" Error for input length {input_len}: {e!s}")
            results.append(
                {
@@ -134,6 +134,7 @@ def run_perplexity_on_models(
     # Unload HuggingFace model from GPU memory before ONNX evaluation
     print("[CLEANUP] Unloading HuggingFace model from GPU memory...")
     import gc
+
     import torch
 
     if torch.cuda.is_available():
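The cleanup code touched by the second hunk (releasing the HuggingFace model before the ONNX evaluations start) follows the usual PyTorch pattern: drop the references, run the garbage collector, then ask CUDA to release its cached blocks. A minimal sketch, with the variable name `hf_model` assumed rather than taken from the script:

```python
import gc

import torch

# Drop the last Python reference so the model's tensors become collectable.
del hf_model  # hypothetical name for the loaded HuggingFace model

# Reclaim Python-level objects, then release CUDA's cached allocations.
gc.collect()
if torch.cuda.is_available():
    torch.cuda.empty_cache()
```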
