servegen_benchmark is a benchmark runner for workloads generated by ServeGen.
It is designed to be run in a manner similar to vllm bench serve.
Warning
This is not an official benchmark. It is being developed primarily for personal use.
ServeGen is a powerful tool for generating realistic LLM inference workloads.
This tool runs benchmarks based on workload data generated by ServeGen.
Currently, it only supports loading and executing CSV files.
Therefore, you must export the generated workloads to CSV.
For details, please refer to the ServeGen example (examples/basic_usage.py).
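As a rough illustration of the export step, the sketch below writes a workload out as CSV with Python's standard csv module. It does not use ServeGen's real API: the requests list and the column names (arrival_time, input_len, output_len) are hypothetical placeholders; the actual schema comes from ServeGen's examples/basic_usage.py and from what servegen_benchmark.py expects.

# Minimal sketch only: the record fields and column names below are
# hypothetical placeholders, not ServeGen's real schema. See
# examples/basic_usage.py in the ServeGen repo for the actual export.
import csv

# Assume `requests` was produced by a ServeGen workload generator.
requests = [
    {"arrival_time": 0.00, "input_len": 512, "output_len": 128},
    {"arrival_time": 0.35, "input_len": 1024, "output_len": 64},
]

with open("workload.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["arrival_time", "input_len", "output_len"])
    writer.writeheader()
    writer.writerows(requests)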
I recommend the uv Python package manager, although any approach works.
uv sync --all
You can see the supported flags by passing --help.
These flags are a subset of those in vllm bench serve.
uv run python servegen_benchmark.py --help
INFO 10-20 14:05:22 [__init__.py:216] Automatically detected platform cpu.
usage: servegen_benchmark.py [-h] --csv CSV --model MODEL [--base-url BASE_URL] [--host HOST] [--port PORT]
[--endpoint ENDPOINT] [--header [KEY=VALUE ...]]
[--max-concurrency MAX_CONCURRENCY] [--tokenizer TOKENIZER]
[--tokenizer-mode {auto,slow,mistral,custom}] [--trust-remote-code]
[--disable_tqdm] [--seed SEED] [--percentile-metrics PERCENTILE_METRICS]
[--metric-percentiles METRIC_PERCENTILES] [--goodput GOODPUT [GOODPUT ...]]
[--dryrun] [--dryrun-out DRYRUN_OUT] [--dump-failure] [--dump-failure-details]
[--loglevel {DEBUG,INFO,WARNING,ERROR}]
This is description
options:
-h, --help show this help message and exit
--csv CSV CSV file generated by ServeGen
--model MODEL Model
--base-url BASE_URL Server or API base url if not using http host and port.
--host HOST
--port PORT
--endpoint ENDPOINT API endpoint.
--header [KEY=VALUE ...]
Key-value pairs (e.g, --header x-additional-info=0.3.3) for headers to be passed
with each request. These headers override per backend constants and values set via
environment variable, and will be overriden by other arguments (such as request
ids).
--max-concurrency MAX_CONCURRENCY
Maximum number of concurrent requests. This can be used to help simulate an
environment where a higher level component is enforcing a maximum number of
concurrent requests. While the --request-rate argument controls the rate at which
requests are initiated, this argument will control how many are actually allowed to
execute at a time. This means that when used in combination, the actual request rate
may be lower than specified with --request-rate, if the server is not processing
requests fast enough to keep up.
--tokenizer TOKENIZER
Name or path of the tokenizer, if not using the default tokenizer.
--tokenizer-mode {auto,slow,mistral,custom}
The tokenizer mode. * "auto" will use the fast tokenizer if available. * "slow" will
always use the slow tokenizer. * "mistral" will always use the `mistral_common`
tokenizer. *"custom" will use --tokenizer to select the preregistered tokenizer.
--trust-remote-code Trust remote code from huggingface
--disable_tqdm Specify to disable tqdm progress bar.
--seed SEED
--percentile-metrics PERCENTILE_METRICS
Comma-separated list of selected metrics to report percentils. This argument
specifies the metrics to report percentiles. Allowed metric names are "ttft",
"tpot", "itl", "e2el". If not specified, defaults to "ttft,tpot,itl" for generative
models and "e2el" for pooling models.
--metric-percentiles METRIC_PERCENTILES
Comma-separated list of percentiles for selected metrics. To report 25-th, 50-th,
and 75-th percentiles, use "25,50,75". Default value is "99".Use "--percentile-
metrics" to select metrics.
--goodput GOODPUT [GOODPUT ...]
Specify service level objectives for goodput as "KEY:VALUE" pairs, where the key is
a metric name, and the value is in milliseconds. Multiple "KEY:VALUE" pairs can be
provided, separated by spaces. Allowed request level metric names are "ttft",
"tpot", "e2el". For more context on the definition of goodput, refer to DistServe
paper: https://arxiv.org/pdf/2401.09670 and the blog: https://hao-ai-
lab.github.io/blogs/distserve
--dryrun Do not send requests. Dump planned requests as CSV.
--dryrun-out DRYRUN_OUT
Path to write the dryrun CSV (default: stdout).
--dump-failure If set, print failure request counts and success rate.
--dump-failure-details
If set, write details for failure request to failure_request.log. please use this
flag with --dump-failure
--loglevel {DEBUG,INFO,WARNING,ERROR}
Logging level (default: INFO).

First of all, you must obtain the generated CSV file from ServeGen, as shown below.
git clone https://github.com/alibaba/ServeGen.git
cd ServeGen
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
pip install -e .
# Generate 'workload.csv'
python examples/basic_usage.py

Then, you can run the benchmark using the generated workload CSV file.
# dryrun
uv run python servegen_benchmark.py \
--model openai/gpt-oss-20b \
--csv workload.csv \
--dryrun
# Run
uv run python servegen_benchmark.py \
--model openai/gpt-oss-20b \
--csv workload.csv \
--host 0.0.0.0 \
--port 8000
# Run w/ failure information
uv run python servegen_benchmark.py \
--model openai/gpt-oss-20b \
--csv workload.csv \
--host 0.0.0.0 \
--port 8000 \
--dump-failure
# Run w/ failure information and failure request details
uv run python servegen_benchmark.py \
--model openai/gpt-oss-20b \
--csv workload.csv \
--host 0.0.0.0 \
--port 8000 \
--dump-failure --dump-failure-details
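To sanity-check what would be sent, the planned-request CSV written by --dryrun with --dryrun-out can be inspected with plain Python. The sketch below only prints the header and the first few rows; "planned.csv" is just an example filename, and the column layout is whatever servegen_benchmark.py emits.

# Peek at a dryrun dump written via: --dryrun --dryrun-out planned.csv
# ("planned.csv" is an example filename; the columns are whatever
# servegen_benchmark.py writes).
import csv
import itertools

with open("planned.csv", newline="") as f:
    reader = csv.reader(f)
    header = next(reader)
    print("columns:", header)
    for row in itertools.islice(reader, 5):
        print(dict(zip(header, row)))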
At this time, we do not support seamless integration with ServeGen.
You must first save the workloads generated by ServeGen to a CSV file and then provide that file to this program.
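Until tighter integration exists, one way to glue the two steps together is a small driver script that shells out to both tools. This is only a sketch under the assumption that ServeGen is checked out in ./ServeGen, that examples/basic_usage.py writes workload.csv into that directory, and that a compatible server is already listening on the given host and port; adjust paths, model, and flags to your setup.

# Hypothetical glue script: generate a workload with ServeGen, then run
# servegen_benchmark.py against it. Paths assume ServeGen is checked out
# in ./ServeGen and that basic_usage.py writes workload.csv there.
import subprocess

# Step 1: generate the workload CSV with ServeGen's example script.
subprocess.run(
    ["python", "examples/basic_usage.py"],
    cwd="ServeGen",
    check=True,
)

# Step 2: run the benchmark against the exported CSV.
subprocess.run(
    [
        "uv", "run", "python", "servegen_benchmark.py",
        "--model", "openai/gpt-oss-20b",
        "--csv", "ServeGen/workload.csv",
        "--host", "0.0.0.0",
        "--port", "8000",
    ],
    check=True,
)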