Guidellm fails with SGLang 0.4.6 due to unlimited output token generation #477

@git-jxj

Description

Describe the bug
Guidellm benchmarking fails when testing against SGLang 0.4.6 servers. The same configuration works with SGLang 0.5.5.post3, but requests sent to SGLang 0.4.6 produce extremely long generation sequences (hundreds to thousands of tokens), causing requests to hang and constraints such as --max-requests to be ignored.

Expected behavior
guidellm should be able to benchmark SGLang's OpenAI-compatible API as it does with 0.5.5.post3: generation stops at the requested output_tokens and each request completes normally.

Environment
Include all relevant environment information:

  1. OS [e.g. Ubuntu 22.04]:
  2. Python version [e.g. 3.13.7]:

To Reproduce
Exact steps to reproduce the behavior:

  • Guidellm version: latest (from source)
  • Working SGLang version: 0.5.5.post3
  • Non-working SGLang version: 0.4.6
  • Model: Qwen3-32B
  • Test command:

guidellm benchmark \
  --target "http://ip:port" \
  --processor "/root/qwen-model/tokenizer/qwen3-32b" \
  --rate-type "concurrent" \
  --rate 1 \
  --max-requests 1 \
  --data "prompt_tokens=32,output_tokens=32,samples=1"
When using guidellm to benchmark against SGLang 0.4.6, the following issues occur:

  1. Unlimited token generation: Requests with output_tokens=64 generate hundreds or thousands of tokens instead of respecting the specified limit
  2. Request state management failure: The generated requests don't complete properly in guidellm, causing --max-requests constraints to fail
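To isolate whether the failure is in guidellm or in the server, it can help to hit the endpoint directly with an explicit max_tokens and check the reported completion length. The sketch below is a minimal probe, assuming SGLang exposes the OpenAI-compatible /v1/chat/completions route at the same target; the URL and model name are placeholders, not values from the report:

```python
import json
import urllib.request


def build_probe(target: str, model: str, prompt: str, max_tokens: int):
    """Build a chat-completion request that pins the generation length."""
    url = f"{target}/v1/chat/completions"
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        # A server that honors this should return at most `max_tokens`
        # completion tokens in its usage stats.
        "max_tokens": max_tokens,
        "stream": False,
    }
    return url, body


def send(url: str, body: dict, timeout: float = 60.0) -> dict:
    """POST the probe and return the parsed JSON response."""
    req = urllib.request.Request(
        url,
        data=json.dumps(body).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)


# Example against a live server (not run here):
#   reply = send(*build_probe("http://ip:port", "Qwen3-32B", "Hi", 32))
#   reply["usage"]["completion_tokens"]  # should be <= 32 if max_tokens is honored
```

If a direct request like this also runs long on 0.4.6, the bug is on the server side rather than in guidellm's request handling.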

Errors

(two screenshots attached) Parameter mapping issue: guidellm's output_tokens=64 parameter does not appear to map to SGLang 0.4.6's max_tokens parameter
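If the root cause is indeed the parameter name, one possible workaround is to map the requested output length onto both the legacy max_tokens field and the newer max_completion_tokens field. This is a hypothetical sketch, not guidellm's actual code, and whether a given SGLang version accepts both fields simultaneously is an assumption worth verifying:

```python
def openai_body(prompt: str, output_tokens: int, model: str) -> dict:
    """Map a guidellm-style output_tokens target onto OpenAI request fields.

    Older OpenAI-compatible servers read `max_tokens`, while newer ones
    prefer `max_completion_tokens`. Sending both covers either case,
    assuming the server tolerates the extra field.
    """
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": output_tokens,
        "max_completion_tokens": output_tokens,
    }


body = openai_body("Hello", 64, "Qwen3-32B")
```

With this mapping, an output_tokens=64 benchmark request would carry an explicit 64-token cap regardless of which field the server honors.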

