# MLPerf Inference reference implementation for GPT-OSS-120B
This is the reference implementation for GPT-OSS-120B. It is a proposal and a work in progress.

## Model and Dataset download

#### TODO: Replace this with mlc download link when available

* Model: `openai/gpt-oss-120b`, commit id: [`b5c939d`](https://huggingface.co/openai/gpt-oss-120b/tree/b5c939de8f754692c1647ca79fbf85e8c1e70f8a)
* Dataset: Please request access at [this link](https://drive.google.com/drive/folders/1DCfEXHqe69okrqKbSyV-8VUw413JqpPY?usp=drive_link) - **this is a tentative dataset**

Datasets are now provided in **Parquet format** (recommended) for better performance and smaller file size (50% smaller than pickle). Pickle format is still supported for backward compatibility.
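
Both formats load into a pandas DataFrame. Below is a minimal loading sketch (illustrative only; the exact column names depend on the dataset release and are not shown here):

```python
# Illustrative sketch: load the tokenized dataset in either format with pandas
from pathlib import Path

import pandas as pd


def load_dataset(path: str) -> pd.DataFrame:
    """Read the dataset from Parquet (preferred) or pickle, based on file extension."""
    p = Path(path)
    if p.suffix == ".parquet":
        return pd.read_parquet(p)  # requires pyarrow or fastparquet
    return pd.read_pickle(p)       # legacy .pkl files


df = load_dataset("/path/to/dataset.parquet")
print(df.shape, list(df.columns))
```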

## Environment setup
Work on the reference implementation is done using the SGLang containers at [https://hub.docker.com/r/lmsysorg/sglang/tags](https://hub.docker.com/r/lmsysorg/sglang/tags). For enroot setup, a script is provided under [`setup_enroot.sh`](./setup_enroot.sh). For all sections below, we assume this environment is instantiated.

Once in the environment, install additional requirements using [`setup.sh`](./setup.sh):
```bash
./setup.sh
```

## Running the reference implementation: SGLang
Use [`./sglang/run_server.sh`](./sglang/run_server.sh) to launch an SGLang server hosting `gpt-oss-120b`.

### Run the server
```bash
./run_server.sh \
    --model_path path/to/gpt-oss-120b/model \
    --dp N \
    --stream_interval 100 \
    --eagle_path optional/path/to/eagle/head
```
The script uses `python3 -m sglang.launch_server` to instantiate the model, with `tp=pp=ep=1` and `dp` as specified.

You may also use Docker:
```bash
docker run --runtime nvidia --gpus all --net host \
    -v ${HF_HOME}:/root/.cache/huggingface \
    --env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \
    --ipc=host lmsysorg/sglang:latest \
    python3 -m sglang.launch_server --model-path ${MODEL_NAME} \
        --host 0.0.0.0 --port 3000 --data-parallel-size=1 --max-running-requests 512 \
        --mem-fraction-static 0.85 --chunked-prefill-size 16384 --ep-size=1 \
        --enable-metrics --stream-interval 500
```
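
Before launching the benchmark, it can help to verify that the server is reachable. A minimal sketch (not part of the reference scripts; the port and the OpenAI-compatible `/v1/models` route are assumptions based on the Docker command above):

```python
# Illustrative sketch: check that the SGLang server is up before benchmarking
import requests

SERVER_URL = "http://localhost:3000"  # assumed to match the --port used above

resp = requests.get(f"{SERVER_URL}/v1/models", timeout=10)
resp.raise_for_status()
print("Server is up, serving:", [m["id"] for m in resp.json().get("data", [])])
```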

### Run the inference
Once the server is running, run the benchmark script, which drives a client to send requests to the server and receive the responses.

**Note:** All scripts now support both Parquet (`.parquet`) and Pickle (`.pkl`) formats for dataset files. Parquet is recommended as it offers:
- 50% smaller file size
- Faster loading times
- Cross-language compatibility
- Type-safe schema preservation

Example usage:
```bash
# first, install loadgen
pip install $(git rev-parse --show-toplevel)/loadgen

# Using Parquet format (recommended)
python3 run_mlperf.py \
    --scenario offline \
    --input-file /path/to/dataset.parquet \
    --accuracy

# Using Pickle format (backward compatible)
python3 run_mlperf.py \
    --scenario offline \
    --input-file /path/to/dataset.pkl \
    --accuracy
```

Full command-line options:
```bash
python3 run_mlperf.py --help
usage: run_mlperf.py [-h] [--scenario {offline,server}] --input-file INPUT_FILE [--max-samples MAX_SAMPLES] [--mlperf-conf MLPERF_CONF]
                     [--user-conf USER_CONF] [--accuracy] [--output-dir OUTPUT_DIR] [--backend {sglang}] [--server-url SERVER_URL]
                     [--generation-config GENERATION_CONFIG] [--max-new-tokens MAX_NEW_TOKENS] [--num-workers NUM_WORKERS]
                     [--max-concurrency MAX_CONCURRENCY]

Run MLPerf inference benchmarks for gpt-oss

options:
  -h, --help            show this help message and exit
  --scenario {offline,server}
                        MLPerf scenario mode
  --input-file INPUT_FILE
                        Path to tokenized dataset (parquet or pickle file)
  --max-samples MAX_SAMPLES
                        Maximum number of samples to use (None for all)
  --mlperf-conf MLPERF_CONF
                        Path to MLPerf configuration file
  --user-conf USER_CONF
                        Path to user configuration file
  --accuracy            Run accuracy mode instead of performance
  --output-dir OUTPUT_DIR
                        Directory for MLPerf output logs
  --backend {sglang}    Backend to use for inference
  --server-url SERVER_URL
                        Server URL for backend (SGLang)
  --generation-config GENERATION_CONFIG
                        Path to generation configuration JSON file
  --max-new-tokens MAX_NEW_TOKENS
                        Override max_new_tokens from generation config (default: use value from config)
  --num-workers NUM_WORKERS
                        Number of worker threads (for server scenario)
  --max-concurrency MAX_CONCURRENCY
                        Maximum concurrent requests to backend (SGLang handles batching internally)
```

### Evaluate the accuracy
Run `run_mlperf.py` with `--accuracy`, then pass the generated `mlperf_log_accuracy.json` to the evaluation script to score the run.

Example usage:
```bash
# Using Parquet format (recommended)
python3 eval_mlperf_accuracy.py \
    --mlperf-log mlperf_results/offline/accuracy/mlperf_log_accuracy.json \
    --reference-data /path/to/acc_eval_inputs.parquet \
    --tokenizer openai/gpt-oss-120b

# Using Pickle format (backward compatible)
python3 eval_mlperf_accuracy.py \
    --mlperf-log mlperf_results/offline/accuracy/mlperf_log_accuracy.json \
    --reference-data /path/to/acc_eval_inputs.pkl \
    --tokenizer openai/gpt-oss-120b
```
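
For debugging, the accuracy log can also be inspected directly before running the full evaluation. A minimal sketch, assuming the standard MLPerf accuracy-log layout (a JSON array of records with `qsl_idx` and a hex-encoded `data` buffer) and that the buffer holds int32 token IDs; the actual dtype used by this implementation may differ:

```python
# Illustrative sketch: decode one entry of mlperf_log_accuracy.json
import json

import numpy as np
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("openai/gpt-oss-120b")

with open("mlperf_results/offline/accuracy/mlperf_log_accuracy.json") as f:
    log = json.load(f)

entry = log[0]
# 'data' is a hex string of the raw output buffer; int32 token IDs are an assumption
token_ids = np.frombuffer(bytes.fromhex(entry["data"]), dtype=np.int32)
print(f"qsl_idx={entry['qsl_idx']}, {len(token_ids)} output tokens")
print(tokenizer.decode(token_ids.tolist(), skip_special_tokens=True)[:500])
```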

Full command-line options:
```bash
python3 eval_mlperf_accuracy.py --help
usage: eval_mlperf_accuracy.py [-h] --mlperf-log MLPERF_LOG --reference-data REFERENCE_DATA [--tokenizer TOKENIZER] [--output-file OUTPUT_FILE]
                               [--save-outputs SAVE_OUTPUTS] [--num-lcb-workers NUM_LCB_WORKERS] [--verbose]

Evaluate MLPerf accuracy logs for gpt-oss-120b

options:
  -h, --help            show this help message and exit
  --mlperf-log MLPERF_LOG
                        Path to mlperf_log_accuracy.json
  --reference-data REFERENCE_DATA
                        Path to reference parquet or pickle file (DataFrame with dataset, ground_truth, etc.)
  --tokenizer TOKENIZER
                        HuggingFace tokenizer name or path
  --output-file OUTPUT_FILE
                        Output JSON file for results (optional)
  --save-outputs SAVE_OUTPUTS
                        Save detokenized outputs to pickle file (ordered by qsl_idx) for debugging
  --num-lcb-workers NUM_LCB_WORKERS
                        Number of parallel workers for LiveCodeBench evaluation (default: 64)
  --verbose             Verbose logging
```