Commit 0a85401

[4975376] Add support for HF model perplexity calculation
Signed-off-by: unknown <[email protected]>
1 parent 5a4a41e commit 0a85401

4 files changed: +484 -45 lines changed

examples/windows/accuracy_benchmark/perplexity_metrics/README.md

Lines changed: 114 additions & 22 deletions

@@ -2,7 +2,7 @@

## Overview

This tool evaluates the perplexity of ONNX Runtime GenAI models and HuggingFace models using the [WikiText-2](https://huggingface.co/datasets/wikitext) dataset. Perplexity is a standard metric for language models: lower values indicate better predictive performance.
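
For reference, the perplexity reported here is the exponentiated average negative log-likelihood the model assigns to the evaluation tokens:

$$
\mathrm{PPL} = \exp\left(-\frac{1}{N}\sum_{i=1}^{N}\log p_\theta\left(x_i \mid x_{<i}\right)\right)
$$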

## Attribution

@@ -11,6 +11,7 @@ This script is originally based on [perplexity_metrics.py](https://github.com/mi

- Multiple context lengths
- Configurable chunk sizes
- Enhanced prefill chunking handling
- HuggingFace model evaluation support

## Scripts

@@ -20,8 +21,10 @@ This script is originally based on [perplexity_metrics.py](https://github.com/mi

## Requirements

- Python 3.8+
- CUDA 12.x (if using GPU acceleration)
- Install dependencies:

**For CUDA 12.x (recommended for CUDA 12.1-12.9):**
```bash
pip install -r requirements.txt
```
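
If you plan to use GPU acceleration, a quick sanity check (assuming `requirements.txt` pulls in PyTorch) is to confirm that a CUDA device is visible:

```python
# Optional sanity check: confirm PyTorch can see a CUDA device.
import torch

print(torch.cuda.is_available())  # True if a usable GPU was found
print(torch.version.cuda)         # CUDA version PyTorch was built against
```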

@@ -34,53 +37,96 @@ This script is originally based on [perplexity_metrics.py](https://github.com/mi

## Supported Models

### ONNX Runtime GenAI Models
- Any ONNX Runtime GenAI model exported with a compatible `genai_config.json` and tokenizer.
- Supported architectures include: Gemma, Llama, Mistral, Phi (language + vision), Qwen.
- Supported execution providers: CPU, DirectML, CUDA, NvTensorRtRtx.

### HuggingFace Models
- Any HuggingFace causal language model (e.g., `meta-llama/Llama-2-7b-hf`, `gpt2`, `mistralai/Mistral-7B-v0.1`).
- Models are automatically downloaded from the HuggingFace Hub if not cached locally.
- Supports custom data types (float16, bfloat16, float32) for efficient inference.

## How to Run

### Evaluate ONNX Models

#### Single Model
```bash
python run_perplexity.py --models /path/to/model
```

#### Multiple Models
```bash
python run_perplexity.py --models /path/to/model1 /path/to/model2
```

#### Custom Input Sequence Length(s)
You can specify the input sequence length(s) to evaluate using the `--i` argument:

```bash
python run_perplexity.py --models /path/to/model --i 1024,2048,4096,8192,12288
```

#### Custom Prefill Chunk Size
You can specify the prefill chunk size to evaluate using the `--chunk_size` argument:

```bash
python run_perplexity.py --models /path/to/model --i 1024,2048,4096,8192,12288 --chunk_size=1024
```

### Evaluate HuggingFace Models

#### Basic HuggingFace Model Evaluation
```bash
python run_perplexity.py --hf_model meta-llama/Llama-2-7b-hf --i 1024
```

#### With Custom Data Type (Recommended for Performance)
```bash
python run_perplexity.py --hf_model meta-llama/Llama-2-7b-hf --hf_dtype float16 --i 1024
```

#### With Multiple Input Lengths
```bash
python run_perplexity.py --hf_model meta-llama/Llama-2-7b-hf --hf_dtype float16 --i 1024,2048,4096
```

#### On CPU (if no GPU available)
```bash
python run_perplexity.py --hf_model gpt2 --hf_device cpu --i 1024
```

### Evaluate Both ONNX and HuggingFace Models Together

Compare ONNX and HuggingFace models side-by-side:

```bash
python run_perplexity.py \
  --models /path/to/onnx_model \
  --hf_model meta-llama/Llama-2-7b-hf \
  --hf_dtype float16 \
  --i 1024 \
  --output comparison_results.csv
```

### HuggingFace Model Arguments

- `--hf_model`: HuggingFace model name or local path (e.g., `meta-llama/Llama-2-7b-hf`)
- `--hf_device`: Device to run on (`cuda`, `cpu`, `cuda:0`, etc.) - default: `cuda`
- `--hf_dtype`: Data type for model weights - options: `float16`, `bfloat16`, `float32`, `fp16`, `bf16`, `fp32` - default: model default (usually float32)
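
For intuition, these options map onto a standard `transformers` load roughly as sketched below; this is illustrative only, not the script's actual code, and the model name is just an example:

```python
# Illustrative sketch of what --hf_model / --hf_dtype / --hf_device imply:
# a standard transformers causal-LM load (not the script's actual code).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # --hf_model: Hub name or local path
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,  # --hf_dtype float16 / fp16
)
model.to("cuda")  # --hf_device, e.g. "cuda", "cuda:0", or "cpu"
model.eval()
```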

### Custom Output File

```bash
python run_perplexity.py --models /path/to/model --output results.csv
```

## Expected Output

Expected scores often fall between 2 and 1000; lower is better. See ranges below.

### Perplexity Configuration Setting (for ONNX models)

- If **kv_chunking** is enabled in the model configuration (i.e., `"chunk_size"` is present in the `"search"` section of `genai_config.json`), then:
  - `max_input_seq_length` is set to **8192**

@@ -89,36 +135,82 @@ Expected scores often fall between 2 and 1000; lower is better. See ranges below
  - `max_input_seq_length` is **1024**
  - `stride` is **512**
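
A rough sketch of this selection rule (illustrative only; not the script's actual code, and the stride used in the kv_chunking case is not shown in this excerpt):

```python
# Illustrative only: choose max_input_seq_length from genai_config.json,
# mirroring the kv_chunking rule described above.
import json
from pathlib import Path

def max_input_seq_length(model_dir: str) -> int:
    config = json.loads((Path(model_dir) / "genai_config.json").read_text())
    if "chunk_size" in config.get("search", {}):  # kv_chunking enabled
        return 8192
    return 1024  # otherwise the default, evaluated with stride 512
```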

### For HuggingFace Models

- Default `max_length` is **1024**
- Default `stride` is **512** (or `chunk_size` if specified)
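
These defaults correspond to the usual sliding-window perplexity recipe for causal LMs. A minimal sketch, assuming `torch`, `transformers`, and `datasets` are available; the dataset config name and loop details are assumptions, not the script's exact code:

```python
# Minimal sliding-window perplexity sketch for an HF causal LM (illustrative).
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name, max_length, stride = "gpt2", 1024, 512  # example model, README defaults
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

# WikiText-2 test split; the exact dataset config name here is an assumption.
text = "\n\n".join(load_dataset("wikitext", "wikitext-2-raw-v1", split="test")["text"])
input_ids = tokenizer(text, return_tensors="pt").input_ids

nlls, prev_end = [], 0
for begin in range(0, input_ids.size(1), stride):
    end = min(begin + max_length, input_ids.size(1))
    target_len = end - prev_end              # only score tokens not seen before
    window = input_ids[:, begin:end]
    targets = window.clone()
    targets[:, :-target_len] = -100          # mask the overlapping prefix
    with torch.no_grad():
        nll = model(window, labels=targets).loss * target_len
    nlls.append(nll)
    prev_end = end
    if end == input_ids.size(1):
        break

print("Perplexity:", torch.exp(torch.stack(nlls).sum() / prev_end).item())
```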

### Console Output

```text
============================================================
Evaluating HuggingFace model: meta-llama/Llama-2-7b-hf
============================================================
[INFO] Loading Wikitext-2 'test' split ...
[TOKENIZER] Tokenizing ...

[RESULT] Perplexity of meta-llama/Llama-2-7b-hf: 5.47

HuggingFace perplexity evaluation completed

============================================================
Evaluating perplexity for: /path/to/onnx_model
============================================================
[INFO] Loading Wikitext-2 'test' split ...
[TOKENIZER] Tokenizing ...

[RESULT] Perplexity of /path/to/onnx_model: 5.48

Perplexity evaluation completed successfully
```

### CSV Output

Generated file contains:

- Model Path (model directory or HuggingFace model name)
- Model Type (ONNX or HuggingFace)
- Input Length
- Perplexity score
- Status (Success/Failed)
- Error details (if any)
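
To compare runs programmatically, you can load the CSV with pandas; the column names below are assumptions based on the fields listed above, so adjust them to match the generated header:

```python
# Illustrative: load the results CSV and rank successful runs by perplexity.
# Column names are assumptions; check the actual CSV header first.
import pandas as pd

df = pd.read_csv("comparison_results.csv")
ok = df[df["Status"] == "Success"]
print(ok.sort_values("Perplexity")[["Model Path", "Model Type", "Input Length", "Perplexity"]])
```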

## Debug Mode

Set `DEBUG = True` in `perplexity_metrics.py` for detailed logs.

## Typical Perplexity Ranges

- Excellent: 2-20
- Good: 20-40
- OK: 40-80
- Poor: 100+

## Common Use Cases

### Compare ONNX vs. HuggingFace Model
Verify that your exported ONNX model has similar perplexity to the original HuggingFace model:

```bash
python run_perplexity.py \
  --models /path/to/exported_onnx_model \
  --hf_model meta-llama/Llama-2-7b-hf \
  --hf_dtype float16 \
  --i 1024 \
  --output validation_results.csv
```

### Evaluate Small Models (for quick testing)
```bash
python run_perplexity.py --hf_model gpt2 --hf_dtype float16 --i 1024
```

### Benchmark Multiple Quantization Variants
```bash
python run_perplexity.py \
  --models /path/to/fp16_model /path/to/int8_model /path/to/int4_model \
  --hf_model original/model-name \
  --hf_dtype float16 \
  --i 2048 \
  --output quantization_comparison.csv
```
