Description
Describe the bug
Guidellm benchmarking fails when run against SGLang 0.4.6 servers. The same configuration works with SGLang 0.5.5.post3, but requests sent to SGLang 0.4.6 produce extremely long generation sequences (hundreds to thousands of tokens), causing requests to hang and constraints such as --max-requests to be ignored.
Expected behavior
Guidellm should be able to benchmark SGLang's OpenAI-compatible API normally: generation should stop at the requested output_tokens limit, and --max-requests should end the run after the specified number of requests.
Environment
- OS [e.g. Ubuntu 22.04]:
- Python version [e.g. 3.13.7]:
- Guidellm version: latest (from source)
- Working SGLang version: 0.5.5.post3
- Non-working SGLang version: 0.4.6
- Model: Qwen3-32B
To Reproduce
Exact steps to reproduce the behavior:
Run the following benchmark command against the target server:
guidellm benchmark \
  --target "http://ip:port" \
  --processor "/root/qwen-model/tokenizer/qwen3-32b" \
  --rate-type "concurrent" \
  --rate 1 \
  --max-requests 1 \
  --data "prompt_tokens=32,output_tokens=32,samples=1"
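To isolate whether the problem is in guidellm or in the server, it can help to bypass guidellm entirely and send one request by hand. Below is a minimal sketch (not part of the original report) that assumes the server exposes the standard OpenAI-compatible /v1/completions endpoint at http://ip:port and serves the model under the name qwen3-32b; adjust both to your deployment.

```python
# Hypothetical diagnostic: send one completion request directly to the
# SGLang server and check whether max_tokens is enforced, independent
# of guidellm. Endpoint URL and model name are assumptions.
import requests

resp = requests.post(
    "http://ip:port/v1/completions",
    json={
        "model": "qwen3-32b",  # assumed served model name
        "prompt": "Hello",
        "max_tokens": 32,      # the limit guidellm is expected to apply
    },
    timeout=60,
)
resp.raise_for_status()
usage = resp.json().get("usage", {})
# A value <= 32 here means the server enforces the cap and the bug is
# likely in how guidellm builds the request; a much larger number
# points at the server (or at a missing/ignored field) instead.
print("completion_tokens:", usage.get("completion_tokens"))
```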
When using guidellm to benchmark against SGLang 0.4.6, the following issues occur:
- Unlimited token generation: requests with output_tokens=64 generate hundreds or thousands of tokens instead of stopping at the specified limit
- Request state management failure: the affected requests never complete in guidellm, so the --max-requests constraint is never satisfied
Errors
Suspected parameter mapping issue: guidellm's output_tokens=64 setting does not appear to map to SGLang 0.4.6's max_tokens parameter, so the server never receives an effective generation limit.
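One plausible cause, offered here as an assumption rather than a confirmed diagnosis, is a field-name mismatch: OpenAI-style clients may send the newer max_completion_tokens field on /v1/chat/completions, which an older server can silently ignore, while the classic max_tokens field is still honored. The sketch below probes both names against the same server; the endpoint and model name are again assumptions.

```python
# Hypothetical probe: send the same chat request twice, once with each
# token-limit field, and compare how many tokens the server generates.
# If max_tokens is respected but max_completion_tokens is not, the
# mismatch would explain the runaway generation on SGLang 0.4.6.
import requests

for field in ("max_tokens", "max_completion_tokens"):
    resp = requests.post(
        "http://ip:port/v1/chat/completions",  # assumed endpoint
        json={
            "model": "qwen3-32b",  # assumed served model name
            "messages": [{"role": "user", "content": "Hello"}],
            field: 32,
        },
        timeout=60,
    )
    used = resp.json().get("usage", {}).get("completion_tokens")
    print(f"{field}: completion_tokens={used}")
```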