[Bug]: max batched tokens not compatible with max model length on non-X86 CPU Backend #28981

@fadara01

Description

Your current environment

The output of python collect_env.py
Your output of `python collect_env.py` here

🐛 Describe the bug

For non-x86 CPU backends, chunked prefill isn't supported (`[arg_utils.py:1376] Chunked prefill is not supported for ARM and POWER, S390X and RISC-V CPUs; disabling it for V1 backend.`), but there is no code that raises `max_num_batched_tokens` to stay compatible with `max_model_len`:

```
(APIServer pid=63635) pydantic_core._pydantic_core.ValidationError: 1 validation error for SchedulerConfig
(APIServer pid=63635)   Value error, max_num_batched_tokens (2048) is smaller than max_model_len (40960). This effectively limits the maximum sequence length to max_num_batched_tokens and makes vLLM reject longer sequences. Please increase max_num_batched_tokens or decrease max_model_len. [type=value_error, input_value=ArgsKwargs((), {'runner_t..., 'stream_interval': 1}), input_type=ArgsKwargs]
```

This means the user currently has to set `max_num_batched_tokens` manually.
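For example, a minimal offline-inference sketch of the manual workaround (the model name and token value are illustrative; for `vllm serve` the equivalent knob is the `--max-num-batched-tokens` CLI flag):

```python
from vllm import LLM

# Manual workaround: explicitly raise max_num_batched_tokens to at least
# max_model_len so SchedulerConfig validation passes on CPU backends where
# chunked prefill has been disabled.
llm = LLM(
    model="Qwen/Qwen3-8B",         # illustrative model with a 40960-token context
    max_model_len=40960,
    max_num_batched_tokens=40960,  # must be >= max_model_len
)
```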

We should set `self.max_num_batched_tokens = model_config.max_model_len` when chunked prefill is disabled.
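A minimal sketch of the proposed change, assuming it sits next to the code in `vllm/engine/arg_utils.py` that disables chunked prefill for these CPU backends (attribute names follow the error message above and may not match the real code exactly):

```python
# Sketch only, not the actual vLLM implementation.
if not self.enable_chunked_prefill:
    # Without chunked prefill, an entire prompt must fit into a single batch,
    # so the batching limit has to cover the full context window.
    if (self.max_num_batched_tokens is None
            or self.max_num_batched_tokens < model_config.max_model_len):
        self.max_num_batched_tokens = model_config.max_model_len
```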

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.

Metadata

Labels

bug: Something isn't working
cpu: Related to CPU backends
