Your current environment
The output of `python collect_env.py`
Your output of `python collect_env.py` here
🐛 Describe the bug
For non-x86 CPU backends, chunked prefill isn't supported (`[arg_utils.py:1376] Chunked prefill is not supported for ARM and POWER, S390X and RISC-V CPUs; disabling it for V1 backend.`), but there is no code that updates `max_num_batched_tokens` to stay compatible with `max_model_len`:
```
(APIServer pid=63635) pydantic_core._pydantic_core.ValidationError: 1 validation error for SchedulerConfig
(APIServer pid=63635) Value error, max_num_batched_tokens (2048) is smaller than max_model_len (40960). This effectively limits the maximum sequence length to max_num_batched_tokens and makes vLLM reject longer sequences. Please increase max_num_batched_tokens or decrease max_model_len. [type=value_error, input_value=ArgsKwargs((), {'runner_t..., 'stream_interval': 1}), input_type=ArgsKwargs]
```
This means the user currently has to raise `max_num_batched_tokens` manually.
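As a workaround sketch (not from the original report), the two limits can be aligned explicitly when constructing the engine; the model name below is only an illustrative placeholder whose 40960-token context matches the error above:

```python
from vllm import LLM

# Hypothetical workaround: make the batch-token budget cover the full context
# length, since chunked prefill is disabled on this CPU backend.
llm = LLM(
    model="Qwen/Qwen3-8B",         # placeholder model with a 40960-token context
    max_model_len=40960,
    max_num_batched_tokens=40960,  # must be >= max_model_len without chunked prefill
)
```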
We should set `self.max_num_batched_tokens = model_config.max_model_len` when chunked prefill is disabled, along the lines of the sketch below.
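A minimal sketch of that idea, assuming it would sit next to the existing check in `vllm/engine/arg_utils.py` that disables chunked prefill for these CPU backends (the helper name and exact attribute access are illustrative, not the actual vLLM code):

```python
def _fallback_batched_tokens_without_chunked_prefill(self, model_config) -> None:
    # Illustrative only: when chunked prefill is disabled (e.g. on ARM/POWER/
    # S390X/RISC-V CPU backends) and the user did not set max_num_batched_tokens,
    # default it to max_model_len so SchedulerConfig validation does not reject
    # the configuration.
    if not self.enable_chunked_prefill and self.max_num_batched_tokens is None:
        self.max_num_batched_tokens = model_config.max_model_len
```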
Before submitting a new issue...
- Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.