BUG: VLLMModel breaks when using vllm > 0.10.1 #1794

@qjflores

Description

VLLMModel in smolagents breaks when using vllm version 0.10.1 or higher due to API changes in vllm that removed the guided_decoding_backend parameter.

Steps to Reproduce

  1. Install vllm > 0.10.1
  2. Install smolagents 1.22.0
  3. Initialize a VLLMModel
  4. Create a CodeAgent with the VLLMModel
  5. Run GradioUI with the CodeAgent
  6. Chat with the agent

Code to Reproduce

from smolagents import VLLMModel, CodeAgent, GradioUI


def main():
    model = VLLMModel(
        model_id="HuggingFaceTB/SmolLM3-3B",
        model_kwargs={
            "max_model_len": 4096,
            "max_num_batched_tokens": 4096,
        }
    )
    
    agent = CodeAgent(model=model, tools=[])
    gradio_ui = GradioUI(agent)
    gradio_ui.launch()


if __name__ == "__main__":
    main()

Expected Behavior

The agent should work normally with vllm 0.10.1+.

Actual Behavior

The following exception is raised:

gradio.exceptions.Error: "Error in interaction: Error in generating model output:\nLLM.generate() got an unexpected keyword argument 'guided_options_request'"

Root Cause

Starting from vllm 0.10.1, guided_decoding_backend was removed (vllm PR #21347). According to the vllm structured outputs documentation, the migration path is to drop guided_decoding_backend and instead pass structured_outputs (a StructuredOutputsParams instance) inside SamplingParams.
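
For reference, a minimal sketch of the new-style call against vllm directly, based on the migration path above (the schema and prompt are illustrative; the model and max_model_len follow the reproduction script):

from vllm import LLM, SamplingParams
from vllm.sampling_params import StructuredOutputsParams

llm = LLM(model="HuggingFaceTB/SmolLM3-3B", max_model_len=4096)

# Old API: guided_decoding_backend on the engine plus guided_options_request
# passed to LLM.generate(); both are gone in newer vllm.
# New API: describe the constraint in SamplingParams via structured_outputs.
schema = {
    "type": "object",
    "properties": {"answer": {"type": "string"}},
    "required": ["answer"],
}
sampling_params = SamplingParams(
    max_tokens=128,
    structured_outputs=StructuredOutputsParams(json=schema),
)

out = llm.generate("Answer in JSON: what is 2 + 2?", sampling_params)
print(out[0].outputs[0].text)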

Proposed Solution

The VLLMModel.generate() method needs to be updated to convert the old guided_options_request format to the new structured_outputs format. Here's a potential fix:

# Imports assumed for smolagents 1.22.0; adjust paths to your installed version.
from smolagents import VLLMModel
from smolagents.models import ChatMessage, MessageRole
from smolagents.monitoring import TokenUsage


class PatchedVLLMModel(VLLMModel):
    def generate(
        self,
        messages,
        stop_sequences=None,
        response_format=None,
        tools_to_call_from=None,
        **kwargs,
    ) -> ChatMessage:
        # NOTE: This overrides smolagents' VLLMModel.generate to convert
        # the old 'guided_options_request' to the new 'structured_outputs' format.
        from vllm import SamplingParams  # type: ignore
        from vllm.sampling_params import StructuredOutputsParams  # type: ignore

        completion_kwargs = self._prepare_completion_kwargs(
            messages=messages,
            flatten_messages_as_text=(not self._is_vlm),
            stop_sequences=stop_sequences,
            tools_to_call_from=tools_to_call_from,
            **kwargs,
        )

        messages = completion_kwargs.pop("messages")
        prepared_stop_sequences = completion_kwargs.pop("stop", [])
        tools = completion_kwargs.pop("tools", None)
        completion_kwargs.pop("tool_choice", None)

        prompt = self.tokenizer.apply_chat_template(
            messages,
            tools=tools,
            add_generation_prompt=True,
            tokenize=False,
        )

        # Convert old guided_options_request format to new structured_outputs
        structured_outputs_params = None
        if response_format:
            if "json_schema" in response_format:
                # Extract the JSON schema from the response_format
                json_schema = response_format["json_schema"]["schema"]
                structured_outputs_params = StructuredOutputsParams(json=json_schema)
            elif "choice" in response_format:
                # Handle choice-based structured outputs
                structured_outputs_params = StructuredOutputsParams(choice=response_format["choice"])
            elif "regex" in response_format:
                # Handle regex-based structured outputs
                structured_outputs_params = StructuredOutputsParams(regex=response_format["regex"])
            elif "grammar" in response_format:
                # Handle grammar-based structured outputs
                structured_outputs_params = StructuredOutputsParams(grammar=response_format["grammar"])
            elif "structural_tag" in response_format:
                # Handle structural tag-based structured outputs
                structured_outputs_params = StructuredOutputsParams(structural_tag=response_format["structural_tag"])
            else:
                print(f"WARNING: Unsupported response_format type: {response_format}")
                structured_outputs_params = None

        sampling_params = SamplingParams(
            n=kwargs.get("n", 1),
            temperature=kwargs.get("temperature", 0.7),
            max_tokens=kwargs.get("max_tokens", 64),
            stop=prepared_stop_sequences,
            structured_outputs=structured_outputs_params,
        )

        out = self.model.generate(
            prompt,
            sampling_params=sampling_params,
        )

        output_text = out[0].outputs[0].text
        
        return ChatMessage(
            role=MessageRole.ASSISTANT,
            content=output_text,
            raw={"out": output_text, "completion_kwargs": completion_kwargs},
            token_usage=TokenUsage(
                input_tokens=len(out[0].prompt_token_ids),
                output_tokens=len(out[0].outputs[0].token_ids),
            ),
        )
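
With the patch in place, the reproduction script above only needs to swap the model class (same illustrative model and kwargs as before):

from smolagents import CodeAgent, GradioUI

model = PatchedVLLMModel(
    model_id="HuggingFaceTB/SmolLM3-3B",
    model_kwargs={
        "max_model_len": 4096,
        "max_num_batched_tokens": 4096,
    },
)
agent = CodeAgent(model=model, tools=[])
GradioUI(agent).launch()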

Environment

  • Python version: 3.12.10
  • smolagents version: 1.22.0
  • vllm version: 0.11.0
  • OS: macOS 15.6.1

Additional Context

This is a breaking change in vllm. Ideally, the fix in smolagents should stay compatible with both older and newer vllm versions.
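
One way to do that is feature detection rather than version pinning; a sketch (whether StructuredOutputsParams is importable on every affected vllm release is an assumption):

# Detect which structured-output API the installed vllm exposes.
try:
    from vllm.sampling_params import StructuredOutputsParams  # newer vllm
    HAS_STRUCTURED_OUTPUTS = True
except ImportError:
    HAS_STRUCTURED_OUTPUTS = False  # older vllm

# Inside VLLMModel.generate(), branch on the flag:
#   if HAS_STRUCTURED_OUTPUTS: build SamplingParams(structured_outputs=...)
#   else: keep passing guided_options_request to LLM.generate() as today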
