servegen_benchmark is a benchmark runner for workloads generated by ServeGen.
It is designed to be run in a manner similar to vllm bench serve.
Warning
This is not an official benchmark. It is being developed primarily for personal use.
ServeGen is a powerful tool for generating realistic LLM inference workloads.
This tool runs benchmarks based on workload data generated by ServeGen.
Currently, it only supports loading and executing CSV files.
Therefore, you must export the generated workloads to CSV.
For details, please refer to the ServeGen example (examples/basic_usage.py).
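As a rough illustration of the export step, the sketch below writes a workload out as CSV with Python's standard csv module. It does not use ServeGen's real API: the requests list and the column names (arrival_time, input_len, output_len) are hypothetical placeholders; the actual schema comes from ServeGen's examples/basic_usage.py and from what servegen_benchmark.py expects.

# Minimal sketch only: the record fields and column names below are
# hypothetical placeholders, not ServeGen's real schema. See
# examples/basic_usage.py in the ServeGen repo for the actual export.
import csv

# Assume `requests` was produced by a ServeGen workload generator.
requests = [
    {"arrival_time": 0.00, "input_len": 512, "output_len": 128},
    {"arrival_time": 0.35, "input_len": 1024, "output_len": 64},
]

with open("workload.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["arrival_time", "input_len", "output_len"])
    writer.writeheader()
    writer.writerows(requests)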
I recommend the uv Python package manager, although any approach works.
uv sync --all
You can see the supported flags by passing --help.
These flags are a subset of those in vllm bench serve.
uv run python servegen_benchmark.py --help
INFO 10-20 14:05:22 [__init__.py:216] Automatically detected platform cpu.
usage: servegen_benchmark.py [-h] --csv CSV --model MODEL [--base-url BASE_URL] [--host HOST] [--port PORT]
[--endpoint ENDPOINT] [--header [KEY=VALUE ...]]
[--max-concurrency MAX_CONCURRENCY] [--tokenizer TOKENIZER]
[--tokenizer-mode {auto,slow,mistral,custom}] [--trust-remote-code]
[--disable_tqdm] [--seed SEED] [--percentile-metrics PERCENTILE_METRICS]
[--metric-percentiles METRIC_PERCENTILES] [--goodput GOODPUT [GOODPUT ...]]
[--dryrun] [--dryrun-out DRYRUN_OUT] [--dump-failure] [--dump-failure-details]
[--loglevel {DEBUG,INFO,WARNING,ERROR}]
This is description
options:
-h, --help show this help message and exit
--csv CSV CSV file generated by ServeGen
--model MODEL Model
--base-url BASE_URL Server or API base url if not using http host and port.
--host HOST
--port PORT
--endpoint ENDPOINT API endpoint.
--header [KEY=VALUE ...]
Key-value pairs (e.g, --header x-additional-info=0.3.3) for headers to be passed
with each request. These headers override per backend constants and values set via
environment variable, and will be overriden by other arguments (such as request
ids).
--max-concurrency MAX_CONCURRENCY
Maximum number of concurrent requests. This can be used to help simulate an
environment where a higher level component is enforcing a maximum number of
concurrent requests. While the --request-rate argument controls the rate at which
requests are initiated, this argument will control how many are actually allowed to
execute at a time. This means that when used in combination, the actual request rate
may be lower than specified with --request-rate, if the server is not processing
requests fast enough to keep up.
--tokenizer TOKENIZER
Name or path of the tokenizer, if not using the default tokenizer.
--tokenizer-mode {auto,slow,mistral,custom}
The tokenizer mode. * "auto" will use the fast tokenizer if available. * "slow" will
always use the slow tokenizer. * "mistral" will always use the `mistral_common`
tokenizer. *"custom" will use --tokenizer to select the preregistered tokenizer.
--trust-remote-code Trust remote code from huggingface
--disable_tqdm Specify to disable tqdm progress bar.
--seed SEED
--percentile-metrics PERCENTILE_METRICS
Comma-separated list of selected metrics to report percentils. This argument
specifies the metrics to report percentiles. Allowed metric names are "ttft",
"tpot", "itl", "e2el". If not specified, defaults to "ttft,tpot,itl" for generative
models and "e2el" for pooling models.
--metric-percentiles METRIC_PERCENTILES
Comma-separated list of percentiles for selected metrics. To report 25-th, 50-th,
and 75-th percentiles, use "25,50,75". Default value is "99".Use "--percentile-
metrics" to select metrics.
--goodput GOODPUT [GOODPUT ...]
Specify service level objectives for goodput as "KEY:VALUE" pairs, where the key is
a metric name, and the value is in milliseconds. Multiple "KEY:VALUE" pairs can be
provided, separated by spaces. Allowed request level metric names are "ttft",
"tpot", "e2el". For more context on the definition of goodput, refer to DistServe
paper: https://arxiv.org/pdf/2401.09670 and the blog: https://hao-ai-
lab.github.io/blogs/distserve
--dryrun Do not send requests. Dump planned requests as CSV.
--dryrun-out DRYRUN_OUT
Path to write the dryrun CSV (default: stdout).
--dump-failure If set, print failure request counts and success rate.
--dump-failure-details
If set, write details for failure request to failure_request.log. please use this
flag with --dump-failure
--loglevel {DEBUG,INFO,WARNING,ERROR}
Logging level (default: INFO).

First of all, you must obtain the generated CSV file from ServeGen, as shown below.
git clone https://github.com/alibaba/ServeGen.git
cd ServeGen
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
pip install -e .
# Generate 'workload.csv'
python examples/basic_usage.py

Then, you can run the benchmark using the generated workload CSV file.
# dryrun
uv run python servegen_benchmark.py \
--model openai/gpt-oss-20b \
--csv workload.csv \
--dryrun
# Run
uv run python servegen_benchmark.py \
--model openai/gpt-oss-20b \
--csv workload.csv \
--host 0.0.0.0 \
--port 8000
# Run w/ failure information
uv run python servegen_benchmark.py \
--model openai/gpt-oss-20b \
--csv workload.csv \
--host 0.0.0.0 \
--port 8000 \
--dump-failure
# Run w/ failure information and failure request details
uv run python servegen_benchmark.py \
--model openai/gpt-oss-20b \
--csv workload.csv \
--host 0.0.0.0 \
--port 8000 \
--dump-failure --dump-failure-details
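To sanity-check what would be sent, the planned-request CSV written by --dryrun with --dryrun-out can be inspected with plain Python. The sketch below only prints the header and the first few rows; "planned.csv" is just an example filename, and the column layout is whatever servegen_benchmark.py emits.

# Peek at a dryrun dump written via: --dryrun --dryrun-out planned.csv
# ("planned.csv" is an example filename; the columns are whatever
# servegen_benchmark.py writes).
import csv
import itertools

with open("planned.csv", newline="") as f:
    reader = csv.reader(f)
    header = next(reader)
    print("columns:", header)
    for row in itertools.islice(reader, 5):
        print(dict(zip(header, row)))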
At this time, we do not support seamless integration with ServeGen.
You must first save the workloads generated by ServeGen to a CSV file and then provide that file to this program.
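Until tighter integration exists, one way to glue the two steps together is a small driver script that shells out to both tools. This is only a sketch under the assumption that ServeGen is checked out in ./ServeGen, that examples/basic_usage.py writes workload.csv into that directory, and that a compatible server is already listening on the given host and port; adjust paths, model, and flags to your setup.

# Hypothetical glue script: generate a workload with ServeGen, then run
# servegen_benchmark.py against it. Paths assume ServeGen is checked out
# in ./ServeGen and that basic_usage.py writes workload.csv there.
import subprocess

# Step 1: generate the workload CSV with ServeGen's example script.
subprocess.run(
    ["python", "examples/basic_usage.py"],
    cwd="ServeGen",
    check=True,
)

# Step 2: run the benchmark against the exported CSV.
subprocess.run(
    [
        "uv", "run", "python", "servegen_benchmark.py",
        "--model", "openai/gpt-oss-20b",
        "--csv", "ServeGen/workload.csv",
        "--host", "0.0.0.0",
        "--port", "8000",
    ],
    check=True,
)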