Description
What would you like to be added:
The current Chat Completion types in the EPP scheduling code are missing fields that are required to accurately reconstruct the prompt a model will actually see from a Chat Completion request. This means downstream projects, like llm-d-inference-scheduler, cannot accurately do things like prefix-aware routing, because they cannot reconstruct the actual prompts that models will see.
Important fields missing from Message that make their way into prompts for most newer models:
- reasoning
- tool_calls
- tool_call_id
There are also additional content types supported by inference servers like vLLM for multi-modal content, beyond the text and image_url types implemented here - see https://docs.vllm.ai/en/v0.11.0/examples/online_serving/openai_chat_completion_client_for_multimodal.html for examples of some of those.
To be future-proof, you'd want to support the entire Chat Completions API surface as well as additional fields, like reasoning, that are commonly used in model chat templates and/or in libraries that turn requests into prompts, such as MistralTokenizer or openai-harmony used by vLLM. In short, doing this accurately requires a superset of all fields, across every supported inference server, that can influence the prompt itself.
Why is this needed:
Without support for all the Chat Completions fields (both those in the official spec and those additionally implemented by vLLM or other inference servers), the usefulness of prefix-aware routing and similar techniques in real production scenarios will be quite limited: the projects calculating the prefixes (such as llm-d-inference-scheduler) will not have all the information needed to accurately reconstruct prompts as the inference server will actually see them.