
[Bug] Performance Drop in GuideLLM vs vllm bench serve at 1k Requests Throughput #470

@AiKiAi-stack

Description


Describe the bug
When running a throughput benchmark with GuideLLM v0.3.1 at 1000 requests (--max-requests 1000), measured performance is approximately 6% worse, in both throughput and total time, than the official vllm bench serve command with equivalent parameters (--num-prompts 1000, --random-input-len 1024, --random-output-len 256). The discrepancy does not occur at 100 requests, where the two tools align closely. This suggests a scaling or batching issue in GuideLLM that only manifests under higher load.

Expected behavior
GuideLLM should produce performance metrics (throughput, latency, total time) that are consistent with vllm bench serve under equivalent workload conditions (same model, prompt/output lengths, and number of requests), regardless of request count (100 vs. 1000).

Environment
Include all relevant environment information:

  1. OS: Ubuntu 20.04
  2. Python: 3.12
  3. GPU: 8 × A100
  4. vLLM: 0.10.2

To Reproduce
Exact steps to reproduce the behavior:

  1. Start a vLLM server:
    vllm serve /Qwen3-30B-A3B -tp 8 -pp 1 --enable-prefix-caching --enable-chunked-prefill
  2. Run GuideLLM benchmark:
    guidellm benchmark --target "http://localhost:8000" --rate-type throughput --max-requests 1000 --data "samples=1000,prompt_tokens=1024,output_tokens=256"
  3. Run official vLLM benchmark for comparison:
    vllm bench serve --model /Qwen3-30B-A3B --random-input-len 1024 --random-output-len 256 --disable-tqdm --num-prompts 1000
  4. Compare throughput (requests/sec) and total execution time (an independent probe is sketched after this list).
  5. Repeat with --max-requests 100 (GuideLLM) and --num-prompts 100 (vLLM) to observe alignment.
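
As an independent cross-check (a minimal sketch, not part of the original report), the script below fires the same 1000 fixed-shape completions at the server with a plain asyncio/aiohttp client and reports wall-clock throughput, giving a third data point to arbitrate between the two harnesses. The /v1/completions path, the model name, and the repeated-word prompt standing in for ~1024 prompt tokens are assumptions based on the commands above; ignore_eos is a vLLM request extension used to force full 256-token outputs.

    # independent_probe.py -- hypothetical third-party throughput check.
    # Assumes the vLLM OpenAI-compatible server from step 1 at localhost:8000.
    import asyncio
    import time

    import aiohttp

    URL = "http://localhost:8000/v1/completions"  # assumed endpoint path
    PAYLOAD = {
        "model": "/Qwen3-30B-A3B",     # model path from the repro commands
        "prompt": "hello " * 1024,     # rough stand-in for ~1024 prompt tokens
        "max_tokens": 256,             # matches --random-output-len 256
        "ignore_eos": True,            # vLLM extension: always emit 256 tokens
    }

    async def one_request(session: aiohttp.ClientSession) -> None:
        async with session.post(URL, json=PAYLOAD) as resp:
            await resp.read()  # drain the body; only timing matters here

    async def main(num_requests: int = 1000) -> None:
        # limit=0 disables aiohttp's default 100-connection pool cap, so
        # client-side pooling cannot throttle this measurement.
        connector = aiohttp.TCPConnector(limit=0)
        timeout = aiohttp.ClientTimeout(total=None)
        async with aiohttp.ClientSession(connector=connector, timeout=timeout) as session:
            start = time.perf_counter()
            await asyncio.gather(*(one_request(session) for _ in range(num_requests)))
            elapsed = time.perf_counter() - start
        print(f"{num_requests} requests in {elapsed:.1f}s "
              f"({num_requests / elapsed:.2f} req/s)")

    if __name__ == "__main__":
        asyncio.run(main())

If this probe lands near the vllm bench serve number at 1000 requests, the gap is likely on GuideLLM's client side rather than in the server.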

Errors
No explicit errors are raised.

Additional context

  • The model used is Qwen3-30B-A3B.
  • Both benchmarks use identical prompt/output token lengths and request counts.
  • The issue only appears at higher concurrency/request volume (1000), not at lower (100).
  • This may relate to request scheduling, connection pooling, or backend request-handling differences in GuideLLM under load (see the diagnostic sketch after this list).
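
To probe the connection-pooling theory specifically, the sweep below (a hypothetical diagnostic, not a claim about GuideLLM's internals) re-runs the probe sketched above with different client-side pool caps. aiohttp's default TCPConnector limit is 100 concurrent connections, which would be invisible at 100 requests but would throttle a 1000-request burst, matching the reported pattern if a similar cap were in play.

    # pool_sweep.py -- hypothetical diagnostic for the connection-pool theory.
    import asyncio
    import time

    import aiohttp

    from independent_probe import one_request  # the probe sketched above

    async def run_with_limit(limit: int, num_requests: int = 1000) -> float:
        """Return requests/sec with a given pool cap (0 = unlimited)."""
        connector = aiohttp.TCPConnector(limit=limit)
        timeout = aiohttp.ClientTimeout(total=None)
        async with aiohttp.ClientSession(connector=connector, timeout=timeout) as session:
            start = time.perf_counter()
            await asyncio.gather(*(one_request(session) for _ in range(num_requests)))
            return num_requests / (time.perf_counter() - start)

    async def main() -> None:
        for limit in (0, 1000, 100, 32):  # unlimited down to a tight pool
            rps = await run_with_limit(limit)
            print(f"pool limit {limit or 'unlimited'}: {rps:.2f} req/s")

    if __name__ == "__main__":
        asyncio.run(main())

If throughput only drops once the cap falls well below the server's effective concurrency, pooling is a plausible culprit; if all limits score the same, the cause is more likely request scheduling or response handling.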

Screenshots

1000 requests: [screenshots of GuideLLM and vllm bench serve results]

100 requests: [screenshots of GuideLLM and vllm bench serve results]
