# Benchmark KV Cache Offloading with Multi-Turn Conversations

The pip requirements for `benchmark_serving_multi_turn.py` are listed in `requirements.txt`.
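
For example, to install them into your current environment:

```bash
# Install the benchmark's Python dependencies
pip install -r requirements.txt
```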

First, start serving your model:

```bash
export MODEL_NAME=/models/meta-llama/Meta-Llama-3.1-8B-Instruct/

vllm serve $MODEL_NAME --disable-log-requests
```
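
Before benchmarking, you can sanity-check that the server is ready. This assumes vLLM's default address of `localhost:8000`; adjust if you changed the host or port:

```bash
# List the served models via the OpenAI-compatible API
# (assumes the default localhost:8000)
curl http://localhost:8000/v1/models
```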

## Synthetic Multi-Turn Conversations

Download the following text file (used to generate the synthetic conversations):

```bash
wget https://www.gutenberg.org/ebooks/1184.txt.utf-8
mv 1184.txt.utf-8 pg1184.txt
```
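
Equivalently, you can download straight to the expected filename in one step:

```bash
# Same download, without the separate rename
wget -O pg1184.txt https://www.gutenberg.org/ebooks/1184.txt.utf-8
```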

The filename `pg1184.txt` is referenced in `generate_multi_turn.json` (see the `"text_files"` field).

You may use other text files if you prefer; this specific file is not required.
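
If you do, point the config at your file instead. Here is a sketch using `jq` (assumes `jq` is installed and that `"text_files"` is a top-level array in the config; `my_book.txt` is a placeholder):

```bash
# Swap in a different source text (my_book.txt is hypothetical)
jq '.text_files = ["my_book.txt"]' generate_multi_turn.json > tmp.json \
    && mv tmp.json generate_multi_turn.json
```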

Then run the benchmarking script:

```bash
export MODEL_NAME=/models/meta-llama/Meta-Llama-3.1-8B-Instruct/

python benchmark_serving_multi_turn.py --model $MODEL_NAME --input-file generate_multi_turn.json \
--num-clients 2 --max-active-conversations 6
```

You can edit the file `generate_multi_turn.json` to change the conversation parameters (number of turns, etc.).
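
To see which parameters are available before editing, pretty-print the file (assumes `jq` is installed; plain `cat` works too):

```bash
# Pretty-print the conversation-generation config
jq . generate_multi_turn.json
```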

If successful, you will see output similar to the following (`ttft_ms` is time to first token, `tpot_ms` is time per output token):

```bash
----------------------------------------------------------------------------------------------------
Statistics summary:
runtime_sec = 215.810
requests_per_sec = 0.769
----------------------------------------------------------------------------------------------------
                      count     mean      std      min      25%      50%      75%      90%      99%      max
ttft_ms               166.0    78.22    67.63    45.91    59.94    62.26    64.43    69.66   353.18   567.54
tpot_ms               166.0    25.37     0.57    24.40    25.07    25.31    25.50    25.84    27.50    28.05
latency_ms            166.0  2591.07   326.90  1998.53  2341.62  2573.01  2860.10  3003.50  3268.46  3862.94
input_num_turns       166.0     7.43     4.57     1.00     3.00     7.00    11.00    13.00    17.00    17.00
input_num_tokens      166.0  2006.20   893.56   522.00  1247.75  2019.00  2718.00  3233.00  3736.45  3899.00
output_num_tokens     166.0   100.01    11.80    80.00    91.00    99.00   109.75   116.00   120.00   120.00
output_num_chunks     166.0    99.01    11.80    79.00    90.00    98.00   108.75   115.00   119.00   119.00
----------------------------------------------------------------------------------------------------
```

## ShareGPT Conversations

To run with ShareGPT data, download the following dataset:
`https://huggingface.co/datasets/philschmid/sharegpt-raw/blob/main/sharegpt_20230401_clean_lang_split.json`
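
For example, with `wget` (note that the raw file is served under `resolve/` rather than the `blob/` path of the web page):

```bash
# Download the raw JSON file from the Hugging Face Hub
wget https://huggingface.co/datasets/philschmid/sharegpt-raw/resolve/main/sharegpt_20230401_clean_lang_split.json
```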

Use the `convert_sharegpt_to_openai.py` script to convert the dataset to a format supported by `benchmark_serving_multi_turn.py`:

```bash
python convert_sharegpt_to_openai.py sharegpt_20230401_clean_lang_split.json sharegpt_conv_128.json --seed=99 --max-items=128
```
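
As a quick sanity check on the result (this assumes the converted file is a top-level JSON array of conversations; adjust if the layout differs):

```bash
# Should print 128, matching --max-items
jq length sharegpt_conv_128.json
```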

The script converts the ShareGPT conversations to the standard user/assistant role format.

The flag `--max-items=128` samples 128 conversations from the original dataset (change as needed).

Use the output JSON file `sharegpt_conv_128.json` as the `--input-file` for `benchmark_serving_multi_turn.py`.
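
For example, a run that reuses the flags from the synthetic benchmark above:

```bash
export MODEL_NAME=/models/meta-llama/Meta-Llama-3.1-8B-Instruct/

python benchmark_serving_multi_turn.py --model $MODEL_NAME --input-file sharegpt_conv_128.json \
--num-clients 2 --max-active-conversations 6
```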