|
## Table of Contents

- [Overview](#overview)
- [Prerequisites](#prerequisites)
- [Accuracy Benchmarks](#accuracy-benchmarks)
  - [MMLU (Massive Multitask Language Understanding)](#mmlu-massive-multitask-language-understanding)
    - [MMLU Setup](#mmlu-setup)
    - [Evaluation Methods](#evaluation-methods)
      - [1. Evaluate with ORT-DML using GenAI APIs](#1-evaluate-with-ort-dml-using-genai-apis)
      - [2. Evaluate with ORT-DML, CUDA, and CPU using Native ORT Path](#2-evaluate-with-ort-dml-cuda-and-cpu-using-native-ort-path)
      - [3. Evaluate the PyTorch Model of HF Weights](#3-evaluate-the-pytorch-model-of-hf-weights)
      - [4. Evaluate the TensorRT-LLM](#4-evaluate-the-tensorrt-llm)
      - [5. Evaluate the PyTorch Model Quantized with AutoAWQ](#5-evaluate-the-pytorch-model-quantized-with-autoawq)

## Overview

This repository provides scripts, popular third-party benchmarks, and instructions for evaluating the accuracy of Large Language Models (LLMs). It demonstrates how to use a ModelOpt-quantized LLM with various established benchmarks, including deployment options using DirectML and TensorRT-LLM in a Windows environment.

## Prerequisites

| **Category** | **Details** |
|:--------------------------|-------------------------------------------------------------------------------------------------------------|
| **Operating System** | Windows 10 or later |
| **Python** | - For ORT-DML GenAI, use Python 3.11. <br> - For TensorRT-LLM, use Python 3.10. <br> - All other backends are compatible with both Python 3.10 and 3.11. |
| **Package Manager** | pip |
| **Compatible Hardware and Drivers** | Ensure the necessary hardware (e.g., a CUDA-compatible GPU) and drivers are installed for the chosen evaluation method: <br> - DirectML for DirectML-based evaluation <br> - CUDA for TensorRT |
| **Additional Tools** | - **PowerShell**: Recommended for running the provided commands. <br> - **Tar Utility**: Included in Windows 10 and later. <br> - **Curl**: Included in Windows 10 and later. |
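
Before installing anything, a quick, optional sanity check of the environment can save troubleshooting later. The commands below are illustrative; the `nvidia-smi` line applies only to the CUDA/TensorRT-LLM paths and assumes an NVIDIA driver is installed.

```powershell
python --version   # expect 3.10 or 3.11, depending on the backend
pip --version
nvidia-smi         # NVIDIA driver check; not needed for DirectML-only evaluation
```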

## Accuracy Benchmarks

### MMLU (Massive Multitask Language Understanding)

The MMLU benchmark assesses LLM performance across a wide range of tasks, producing a score between 0 and 1, where a higher score indicates better accuracy. Refer to the [MMLU Paper](https://arxiv.org/abs/2009.03300) for more details.
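
For intuition, the sketch below shows how an MMLU-style score lands between 0 and 1: per-subject multiple-choice results roll up into an overall fraction of correctly answered questions. The subject counts are illustrative examples only; the actual aggregation is performed inside `mmlu_benchmark.py`.

```powershell
# Illustrative only: per-subject (correct, total) counts combine into one 0-1 score.
$subjects = @{ abstract_algebra = 57,100; anatomy = 90,135; college_mathematics = 41,100 }
$correct = 0; $total = 0
foreach ($pair in $subjects.Values) { $correct += $pair[0]; $total += $pair[1] }
"Overall accuracy: {0:N4}" -f ($correct / $total)   # 188 / 335 = 0.5612
```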

#### MMLU Setup

The table below lists the setup steps to prepare your environment for evaluating LLMs using the MMLU benchmark.

| **Step** | **Command** or **Description** |
|----------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| **Open PowerShell as Administrator** | - |
| **Create and Activate a Virtual Environment** <br> _(Optional but Recommended)_ | `python -m venv llm_env` <br> `.\llm_env\Scripts\Activate.ps1` |
| **Install PyTorch and Related Packages** | `pip install torch==2.4.0 torchvision==0.19.0 torchaudio==2.4.0 --index-url https://download.pytorch.org/whl/cu124` |
| **Install ONNX Runtime Packages** | `pip install onnxruntime-directml==1.20` <br> `pip install onnxruntime-genai-directml==0.4.0` |
| **Install Benchmark Requirements** | `pip install -r requirements.txt` |
| **Download MMLU Data** | `mkdir data` <br> `curl.exe -o .\data\mmlu.tar https://people.eecs.berkeley.edu/~hendrycks/data.tar` <br> `tar -xf .\data\mmlu.tar -C .\data` <br> `Move-Item .\data\data .\data\mmlu` |
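
After the last step, you can quickly confirm that the data landed where the benchmark expects it. The paths follow the defaults used above; the expected folder contents come from the downloaded archive.

```powershell
Get-ChildItem .\data\mmlu                                     # expect dev, test, val (and auxiliary_train)
(Get-ChildItem .\data\mmlu\test -Filter *_test.csv).Count     # expect 57 subject files
```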

#### Evaluation Methods

Once the MMLU benchmark is set up, you can use the `mmlu_benchmark.py` script to evaluate LLMs deployed with various backends. Refer to the examples below.

<details>
<summary>MMLU Benchmark with GenAI APIs for ORT-DML Deployment</summary>
<br>

To run the model with ORT-DML using GenAI, use the `--ep genai_dml` argument.
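
Before starting a long run, it can help to confirm that `<ONNX_model_folder>` is an onnxruntime-genai model directory. The layout below is an assumption based on the usual GenAI export, where a `genai_config.json` sits next to the ONNX weights and tokenizer files.

```powershell
# Sanity-check the GenAI model folder before a full benchmark run
Get-ChildItem <ONNX_model_folder> | Select-Object Name
Test-Path <ONNX_model_folder>\genai_config.json    # expected: True for a GenAI-exported model
```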

- **Test Suite**

  ```powershell
  python mmlu_benchmark.py `
      --model_name causal `
      --model_path <ONNX_model_folder> `
      --ep genai_dml `
      --output_file <output_log_file.json> `
      --ntrain 5
  ```

- **Specific Subjects**

  ```powershell
  python mmlu_benchmark.py `
      --model_name causal `
      --model_path <ONNX_model_folder> `
      --ep genai_dml `
      --output_file <output_log_file.json> `
      --subject abstract_algebra,anatomy,college_mathematics `
      --ntrain 5
  ```

</details>

<details>
<summary>MMLU Benchmark with ONNX Runtime APIs for DML, CUDA, or CPU Deployment</summary>
<br>

To run the model with the ORT-DML, ORT-CUDA, or ORT-CPU execution providers, use `--ep ort_dml`, `--ep ort_cuda`, or `--ep ort_cpu`, respectively.
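
Since these options select ONNX Runtime execution providers, it is worth confirming that the provider you plan to use is available in the installed ONNX Runtime build. With the `onnxruntime-directml` package from the setup table, the check below typically lists the DML and CPU providers; a CUDA build would list `CUDAExecutionProvider` instead.

```powershell
# List the execution providers available in the installed ONNX Runtime build
python -c "import onnxruntime; print(onnxruntime.get_available_providers())"
# e.g. ['DmlExecutionProvider', 'CPUExecutionProvider'] for onnxruntime-directml
```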

- **Test Suite**

  ```powershell
  python mmlu_benchmark.py `
      --model_name causal `
      --model_path <ONNX_model_folder> `
      --ep ort_dml `
      --output_file <output_log_file.json> `
      --ntrain 5
  ```

- **Specific Subjects**

  ```powershell
  python mmlu_benchmark.py `
      --model_name causal `
      --model_path <ONNX_model_folder> `
      --ep ort_dml `
      --output_file <output_log_file.json> `
      --subject abstract_algebra,anatomy,college_mathematics `
      --ntrain 5
  ```

</details>

<details>
<summary>MMLU Benchmark with Transformer APIs for PyTorch Hugging Face Models</summary>
<br>

To evaluate the PyTorch Hugging Face (HF) model, use the `--ep pt` argument.
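
The `--dtype` argument should match the `torch_dtype` recorded in the model's `config.json`. A quick way to look it up (the folder name is a placeholder for your local HF model directory):

```powershell
# Print the torch_dtype declared in the Hugging Face model's config.json
(Get-Content .\<HF_model_folder>\config.json | ConvertFrom-Json).torch_dtype
```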

- **Test Suite**

  ```powershell
  python mmlu_benchmark.py `
      --model_name causal `
      --model_path <HF_model_folder> `
      --ep pt `
      --output_file <output_log_file.json> `
      --ntrain 5 `
      --dtype <torch_dtype in model's config.json {float16|bfloat16}>
  ```

- **Specific Subjects**

  ```powershell
  python mmlu_benchmark.py `
      --model_name causal `
      --model_path <HF_model_folder> `
      --ep pt `
      --output_file <output_log_file.json> `
      --subject abstract_algebra,anatomy,college_mathematics `
      --ntrain 5 `
      --dtype <torch_dtype in model's config.json {float16|bfloat16}>
  ```

</details>

<details>
<summary>MMLU Benchmark with TensorRT-LLM APIs for TensorRT-LLM Deployment</summary>
<br>

1. **Install TensorRT-LLM and Compatible PyTorch**

   ```powershell
   pip install torch==2.4.0+cu121 --index-url https://download.pytorch.org/whl
   pip install tensorrt_llm==0.12.0 `
       --extra-index-url https://pypi.nvidia.com `
       --extra-index-url https://download.pytorch.org/whl/cu121/torch/
   ```

1. **Run the Benchmark** (the `--engine_dir` argument expects a prebuilt TensorRT-LLM engine; see the note at the end of this step)

   - **Test Suite**

     ```powershell
     python mmlu_benchmark.py `
         --model_name causal `
         --hf_model_dir <hf_model_path> `
         --engine_dir <engine_path> `
         --ep trt-llm `
         --ntrain 5 `
         --output_file result.json
     ```

   - **Specific Subjects**

     ```powershell
     python mmlu_benchmark.py `
         --model_name causal `
         --hf_model_dir <hf_model_path> `
         --engine_dir <engine_path> `
         --ep trt-llm `
         --ntrain 5 `
         --output_file result.json `
         --subject abstract_algebra,anatomy,college_mathematics
     ```
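
   A note on `<engine_path>`: it must point to a TensorRT-LLM engine built beforehand. The exact steps depend on the model family, but with TensorRT-LLM 0.12 the usual pattern is to convert the HF checkpoint with the model-specific `convert_checkpoint.py` script from the TensorRT-LLM examples and then build the engine. The commands below are an illustrative sketch with placeholder paths, not the exact recipe for any particular model.

   ```powershell
   # Convert the HF checkpoint to TensorRT-LLM format, then build the engine
   # (illustrative; the convert script and its flags vary per model family)
   python convert_checkpoint.py --model_dir <hf_model_path> --output_dir <trtllm_ckpt_path>
   trtllm-build --checkpoint_dir <trtllm_ckpt_path> --output_dir <engine_path>
   ```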

</details>