## Getting the dataset:

The dataset can be obtained from the LLM task-force, which is in the process of finalizing its contents for both performance and accuracy. The dataset is in Parquet format.
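Once obtained, the Parquet file can be given a quick look with pandas. This is a minimal sketch; the path in the usage comment and the column names are assumptions based on the conversion script later in this document:

```python
import pandas as pd

def summarize(df):
    """Return the row count and column names for a quick sanity check."""
    return len(df), list(df.columns)

# Usage (the path is an assumption; substitute your local copy of the dataset):
# df = pd.read_parquet('examples/04_GPTOSS120B_Example/data/perf_eval_ref.parquet')
# print(summarize(df))
```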

## Launch server:

Common configs:

```
export HF_HOME=<your_hf_home_dir>
export HF_TOKEN=<your_hf_token>
export MODEL_NAME=openai/gpt-oss-120b
```

`vLLM` can be launched via:

```
docker run --runtime nvidia --gpus all -v ${HF_HOME}:/root/.cache/huggingface --env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" -p 8000:8000 --ipc=host vllm/vllm-openai:latest --model ${MODEL_NAME} --gpu_memory_utilization 0.95
```
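Once the container is up, the OpenAI-compatible API can be probed to confirm the model is loaded. A minimal sketch using only the standard library; `model_ids` and `list_models` are hypothetical helpers, and `/v1/models` is the standard OpenAI-compatible listing endpoint both servers expose:

```python
import json
import urllib.request

def model_ids(payload):
    """Extract model ids from an OpenAI-compatible /v1/models response body."""
    return [m['id'] for m in payload.get('data', [])]

def list_models(base_url='http://localhost:8000'):
    """Query /v1/models on the server and return the served model ids."""
    with urllib.request.urlopen(f'{base_url}/v1/models', timeout=5) as resp:
        return model_ids(json.load(resp))
```

For the SGLang launch below, which serves on port `3000`, pass `base_url='http://localhost:3000'` instead.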

`SGLang` can be launched via:

```
docker run --runtime nvidia --gpus all --net host -v ${HF_HOME}:/root/.cache/huggingface --env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" --ipc=host lmsysorg/sglang:latest python3 -m sglang.launch_server --model-path ${MODEL_NAME} --host 0.0.0.0 --port 3000 --data-parallel-size=1 --max-running-requests 512 --mem-fraction-static 0.85 --chunked-prefill-size 16384 --ep-size=1 --enable-metrics --stream-interval 500
```

## Launch benchmark:

```
inference-endpoint benchmark from-config -c examples/04_GPTOSS120B_Example/gptoss_120b_example.yaml --timeout 6000
```

## vllm bench:

`vllm bench serve` supports custom datasets only via the `jsonl` format. We can convert the parquet files to `jsonl` with the following script:

```
import pandas as pd

parquet_file = 'examples/04_GPTOSS120B_Example/data/perf_eval_ref.parquet'
json_file = 'examples/04_GPTOSS120B_Example/data/perf_eval_ref.jsonl'

# 1. Read the original file
df = pd.read_parquet(parquet_file)

# 2. Rename the column(s)
# Use a dictionary mapping old names to new names
df = df.rename(columns={'prompt': 'raw_prompt'})
df = df.rename(columns={'text_input': 'prompt'})

# 3. Write the renamed DataFrame to a new file
df.to_json(json_file, orient='records', lines=True)
```

Note that the script also renames the column `text_input` to `prompt`, as the custom dataloader requires the `jsonl` to store the pre-processed prompt under that name.
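The converted file can be sanity-checked before benchmarking. The sketch below assumes the layout produced by the conversion script above; `check_jsonl` is a hypothetical helper, not part of `vllm bench`:

```python
import json

def check_jsonl(path, required_key='prompt'):
    """Verify that every record in a jsonl file carries the required key."""
    with open(path) as f:
        for i, line in enumerate(f):
            if required_key not in json.loads(line):
                raise ValueError(f"line {i} is missing '{required_key}'")
    return True
```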
We can then launch the benchmarking command, but it has to point at the `completions` endpoint instead of the `chat-completions` endpoint, since the prompt is already preprocessed. While the resulting numbers cannot be directly compared to inference-endpoint (which uses the `chat-completions` endpoint), they provide a good reference for relative performance given the output token distribution.

```
vllm bench serve --backend vllm --model ${MODEL_NAME} --endpoint /v1/completions --dataset-name custom --dataset-path ${PATH_TO_DATASETS}/acc_eval_inputs.jsonl --custom-output-len 2000 --num-prompts 6396 --max-concurrency 512 --save-result --save-detailed
```

## Debugging:

[mitmproxy](https://www.mitmproxy.org/) is a tool that helps debug HTTP requests and responses and understand how payloads differ across scenarios. For our use case, we want to inspect the HTTP requests and responses exchanged between the benchmarking client and the server. We can run `mitmproxy` in reverse-proxy mode as below:

```
mitmproxy -p 8001 --mode reverse:http://localhost:8000/
```

This launches `mitmproxy` on port `8001` and forwards traffic to port `8000` on the local machine. Now our server (`vLLM` or `SGLang`) can run on port `8000`, and our client sends requests to `8001`, which are logged and forwarded to the server. The client receives the response back transparently, with the responses logged as well. This allows us to inspect the exact payloads exchanged between the client and the server.
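To generate a flow that appears in the `mitmproxy` console, a single request can be sent through the proxy. A minimal sketch using only the standard library; `completion_payload` and `send_via_proxy` are hypothetical helpers, and the request body follows the OpenAI-compatible `/v1/completions` shape used earlier:

```python
import json
import urllib.request

def completion_payload(model, prompt, max_tokens=64):
    """Build an OpenAI-compatible /v1/completions request body."""
    return {'model': model, 'prompt': prompt, 'max_tokens': max_tokens}

def send_via_proxy(payload, proxy_url='http://localhost:8001'):
    """POST the payload through the mitmproxy instance so it gets logged."""
    req = urllib.request.Request(
        f'{proxy_url}/v1/completions',
        data=json.dumps(payload).encode(),
        headers={'Content-Type': 'application/json'},
    )
    with urllib.request.urlopen(req, timeout=60) as resp:
        return json.load(resp)
```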