
[Bug] Performance Drop in GuideLLM vs vllm bench serve at 1k Requests Throughput #470

@AiKiAi-stack

Description


Describe the bug
When running a throughput benchmark with GuideLLM v0.3.1 at 1000 requests (--max-requests 1000), measured performance is approximately 6% worse, in both throughput and total time, than the official vllm bench serve command with equivalent parameters (--num-prompts 1000, --random-input-len 1024, --random-output-len 256). The discrepancy does not occur at 100 requests, where the two tools align closely. This suggests a scaling or batching issue in GuideLLM that only manifests under higher load.

Expected behavior
GuideLLM should produce performance metrics (throughput, latency, total time) that are consistent with vllm bench serve under equivalent workload conditions (same model, prompt/output lengths, and number of requests), regardless of request count (100 vs. 1000).

Environment
Include all relevant environment information:

  1. OS: Ubuntu 20.04
  2. Python: 3.12
  3. GPU: 8 × A100
  4. vLLM: 0.10.2

To Reproduce
Exact steps to reproduce the behavior:

  1. Start a vLLM server:
    vllm serve /Qwen3-30B-A3B -tp 8 -pp 1 --enable-prefix-caching --enable-chunked-prefill
  2. Run GuideLLM benchmark:
    guidellm benchmark --target "http://localhost:8000" --rate-type throughput --max-requests 1000 --data "samples=1000,prompt_tokens=1024,output_tokens=256"
  3. Run official vLLM benchmark for comparison:
    vllm bench serve --model /Qwen3-30B-A3B --random-input-len 1024 --random-output-len 256 --disable-tqdm --num-prompts 1000
  4. Compare throughput (requests/sec) and total execution time (an independent probe is sketched after this list).
  5. Repeat with --max-requests 100 (GuideLLM) and --num-prompts 100 (vLLM) to observe alignment.
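
As an independent cross-check (a minimal sketch, not part of the original report), the script below fires the same 1000 fixed-shape completions at the server with a plain asyncio/aiohttp client and reports wall-clock throughput, giving a third data point to arbitrate between the two harnesses. The /v1/completions path, the model name, and the repeated-word prompt standing in for ~1024 prompt tokens are assumptions based on the commands above; ignore_eos is a vLLM request extension used to force full 256-token outputs.

    # independent_probe.py -- hypothetical third-party throughput check.
    # Assumes the vLLM OpenAI-compatible server from step 1 at localhost:8000.
    import asyncio
    import time

    import aiohttp

    URL = "http://localhost:8000/v1/completions"  # assumed endpoint path
    PAYLOAD = {
        "model": "/Qwen3-30B-A3B",     # model path from the repro commands
        "prompt": "hello " * 1024,     # rough stand-in for ~1024 prompt tokens
        "max_tokens": 256,             # matches --random-output-len 256
        "ignore_eos": True,            # vLLM extension: always emit 256 tokens
    }

    async def one_request(session: aiohttp.ClientSession) -> None:
        async with session.post(URL, json=PAYLOAD) as resp:
            await resp.read()  # drain the body; only timing matters here

    async def main(num_requests: int = 1000) -> None:
        # limit=0 disables aiohttp's default 100-connection pool cap, so
        # client-side pooling cannot throttle this measurement.
        connector = aiohttp.TCPConnector(limit=0)
        timeout = aiohttp.ClientTimeout(total=None)
        async with aiohttp.ClientSession(connector=connector, timeout=timeout) as session:
            start = time.perf_counter()
            await asyncio.gather(*(one_request(session) for _ in range(num_requests)))
            elapsed = time.perf_counter() - start
        print(f"{num_requests} requests in {elapsed:.1f}s "
              f"({num_requests / elapsed:.2f} req/s)")

    if __name__ == "__main__":
        asyncio.run(main())

If this probe lands near the vllm bench serve number at 1000 requests, the gap is likely on GuideLLM's client side rather than in the server.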

Errors
No explicit errors are raised.

Additional context

  • The model used is Qwen3-30B-A3B.
  • Both benchmarks use identical prompt/output token lengths and request counts.
  • The issue only appears at higher concurrency/request volume (1000), not at lower (100).
  • This may relate to request scheduling, connection pooling, or backend request-handling differences in GuideLLM under load (see the diagnostic sketch after this list).
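
To probe the connection-pooling theory specifically, the sweep below (a hypothetical diagnostic, not a claim about GuideLLM's internals) re-runs the probe sketched above with different client-side pool caps. aiohttp's default TCPConnector limit is 100 concurrent connections, which would be invisible at 100 requests but would throttle a 1000-request burst, matching the reported pattern if a similar cap were in play.

    # pool_sweep.py -- hypothetical diagnostic for the connection-pool theory.
    import asyncio
    import time

    import aiohttp

    from independent_probe import one_request  # the probe sketched above

    async def run_with_limit(limit: int, num_requests: int = 1000) -> float:
        """Return requests/sec with a given pool cap (0 = unlimited)."""
        connector = aiohttp.TCPConnector(limit=limit)
        timeout = aiohttp.ClientTimeout(total=None)
        async with aiohttp.ClientSession(connector=connector, timeout=timeout) as session:
            start = time.perf_counter()
            await asyncio.gather(*(one_request(session) for _ in range(num_requests)))
            return num_requests / (time.perf_counter() - start)

    async def main() -> None:
        for limit in (0, 1000, 100, 32):  # unlimited down to a tight pool
            rps = await run_with_limit(limit)
            print(f"pool limit {limit or 'unlimited'}: {rps:.2f} req/s")

    if __name__ == "__main__":
        asyncio.run(main())

If throughput only drops once the cap falls well below the server's effective concurrency, pooling is a plausible culprit; if all limits score the same, the cause is more likely request scheduling or response handling.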

Screenshots

1000 requests: [screenshots of GuideLLM and vllm bench serve results]

100 requests: [screenshots of GuideLLM and vllm bench serve results]
