Bug Report: VLLMModel breaks when using vllm > 0.10.1
Description
VLLMModel in smolagents breaks when using vllm version 0.10.1 or higher, due to API changes in vllm that removed the guided_decoding_backend parameter.
Steps to Reproduce
- Install vllm > 0.10.1
- Install smolagents 1.22.0
- Initialize a VLLMModel
- Create a CodeAgent with the VLLMModel
- Run GradioUI with the CodeAgent
- Chat with the agent
Code to Reproduce
from smolagents import CodeAgent, GradioUI, VLLMModel


def main():
    model = VLLMModel(
        model_id="HuggingFaceTB/SmolLM3-3B",
        model_kwargs={
            "max_model_len": 4096,
            "max_num_batched_tokens": 4096,
        },
    )
    agent = CodeAgent(model=model, tools=[])
    gradio_ui = GradioUI(agent)
    gradio_ui.launch()


if __name__ == "__main__":
    main()
Expected Behavior
The agent should work normally with vllm 0.10.1+.
Actual Behavior
The following exception is raised:
gradio.exceptions.Error: "Error in interaction: Error in generating model output:\nLLM.generate() got an unexpected keyword argument 'guided_options_request'"
Root Cause
Starting from vllm 0.10.1, guided_decoding_backend was removed in PR #21347. According to the vllm structured outputs documentation, the migration path is to drop guided_decoding_backend and instead pass structured_outputs (a StructuredOutputsParams instance) inside the sampling_params.
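For illustration, here is a minimal, hedged sketch of the API change for the JSON-schema case. The model name and schema are placeholders; the old call shape is inferred from the error above, and the new one follows the vllm structured outputs documentation:

from vllm import LLM, SamplingParams
from vllm.sampling_params import StructuredOutputsParams

schema = {"type": "object", "properties": {"answer": {"type": "string"}}}
llm = LLM(model="HuggingFaceTB/SmolLM3-3B")

# Old path (older vllm): smolagents forwarded the constraint through the
# now-removed guided_options_request keyword of LLM.generate(), hence the TypeError.

# New path: attach the constraint to SamplingParams via structured_outputs.
sampling_params = SamplingParams(
    max_tokens=128,
    structured_outputs=StructuredOutputsParams(json=schema),
)
out = llm.generate("Reply as JSON: what is 2 + 2?", sampling_params=sampling_params)
print(out[0].outputs[0].text)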
Proposed Solution
The VLLMModel.generate() method needs to be updated to convert the old guided_options_request format to the new structured_outputs format. Here's a potential fix:
# Import paths below match smolagents 1.22.0; adjust if your version exposes
# these elsewhere (e.g. smolagents.models / smolagents.monitoring).
from smolagents import ChatMessage, MessageRole, TokenUsage, VLLMModel


class PatchedVLLMModel(VLLMModel):
    def generate(
        self,
        messages,
        stop_sequences=None,
        response_format=None,
        tools_to_call_from=None,
        **kwargs,
    ) -> ChatMessage:
        # NOTE: This overrides smolagents' VLLMModel.generate to convert
        # the old 'guided_options_request' to the new 'structured_outputs' format.
        from vllm import SamplingParams  # type: ignore
        from vllm.sampling_params import StructuredOutputsParams  # type: ignore

        completion_kwargs = self._prepare_completion_kwargs(
            messages=messages,
            flatten_messages_as_text=(not self._is_vlm),
            stop_sequences=stop_sequences,
            tools_to_call_from=tools_to_call_from,
            **kwargs,
        )
        messages = completion_kwargs.pop("messages")
        prepared_stop_sequences = completion_kwargs.pop("stop", [])
        tools = completion_kwargs.pop("tools", None)
        completion_kwargs.pop("tool_choice", None)

        prompt = self.tokenizer.apply_chat_template(
            messages,
            tools=tools,
            add_generation_prompt=True,
            tokenize=False,
        )

        # Convert the old guided_options_request format to the new structured_outputs format.
        structured_outputs_params = None
        if response_format:
            if "json_schema" in response_format:
                # Extract the JSON schema from the response_format
                json_schema = response_format["json_schema"]["schema"]
                structured_outputs_params = StructuredOutputsParams(json=json_schema)
            elif "choice" in response_format:
                # Handle choice-based structured outputs
                structured_outputs_params = StructuredOutputsParams(choice=response_format["choice"])
            elif "regex" in response_format:
                # Handle regex-based structured outputs
                structured_outputs_params = StructuredOutputsParams(regex=response_format["regex"])
            elif "grammar" in response_format:
                # Handle grammar-based structured outputs
                structured_outputs_params = StructuredOutputsParams(grammar=response_format["grammar"])
            elif "structural_tag" in response_format:
                # Handle structural-tag-based structured outputs
                structured_outputs_params = StructuredOutputsParams(structural_tag=response_format["structural_tag"])
            else:
                print(f"WARNING: Unsupported response_format type: {response_format}")
                structured_outputs_params = None

        sampling_params = SamplingParams(
            n=kwargs.get("n", 1),
            temperature=kwargs.get("temperature", 0.7),
            max_tokens=kwargs.get("max_tokens", 64),
            stop=prepared_stop_sequences,
            structured_outputs=structured_outputs_params,
        )

        out = self.model.generate(
            prompt,
            sampling_params=sampling_params,
        )
        output_text = out[0].outputs[0].text
        return ChatMessage(
            role=MessageRole.ASSISTANT,
            content=output_text,
            raw={"out": output_text, "completion_kwargs": completion_kwargs},
            token_usage=TokenUsage(
                input_tokens=len(out[0].prompt_token_ids),
                output_tokens=len(out[0].outputs[0].token_ids),
            ),
        )
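With this override in place, the reproduction script above only needs to swap in the patched class; everything else stays the same:

from smolagents import CodeAgent, GradioUI


def main():
    # Same setup as the reproduction above, but using the patched model class.
    model = PatchedVLLMModel(
        model_id="HuggingFaceTB/SmolLM3-3B",
        model_kwargs={
            "max_model_len": 4096,
            "max_num_batched_tokens": 4096,
        },
    )
    agent = CodeAgent(model=model, tools=[])
    GradioUI(agent).launch()


if __name__ == "__main__":
    main()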
Environment
- Python version: 3.12.10
- smolagents version: 1.22.0
- vllm version: 0.11.0
- OS: macOS 15.6.1
Additional Context
This is a breaking change in vllm that affects backward compatibility. The fix should maintain compatibility with both older and newer versions of vllm if possible.
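One possible direction for that, sketched below under the assumption that the installed vllm version is checked at runtime: the helper names are illustrative, not existing smolagents API, and the exact cutoff version should be confirmed against the vllm changelog.

# Sketch: pick the structured-output mechanism based on the installed vllm version.
# Only the StructuredOutputsParams path is taken from the vllm docs referenced above;
# the old-path branch mirrors what smolagents 1.22.0 currently passes to LLM.generate().
from importlib.metadata import version

from packaging.version import Version


def supports_structured_outputs() -> bool:
    # Per this report, guided_options_request was removed starting with vllm 0.10.1.
    return Version(version("vllm")) >= Version("0.10.1")


def build_generate_kwargs(sampling_params, guided_options_request=None):
    """Return keyword arguments for LLM.generate() matching the installed vllm."""
    if supports_structured_outputs() or guided_options_request is None:
        # New path: the constraint already lives on SamplingParams.structured_outputs.
        return {"sampling_params": sampling_params}
    # Old path: keep forwarding guided_options_request on pre-0.10.1 releases.
    return {
        "sampling_params": sampling_params,
        "guided_options_request": guided_options_request,
    }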