
Support OpenAI Responses API #306

@Xunzhuo

Description

Is your feature request related to a problem? Please describe.

The semantic router currently supports the OpenAI Chat Completions API (/v1/chat/completions) for routing and processing LLM requests. However, OpenAI and Azure OpenAI have introduced a new Responses API (/v1/responses) that provides a more powerful, stateful API experience. This new API combines the best capabilities of the Chat Completions and Assistants APIs in a unified interface.

The Responses API offers several advantages over the traditional Chat Completions API:

  • Stateful conversations: Built-in conversation state management with response chaining via previous_response_id
  • Advanced tool support: Native support for code interpreter, function calling, image generation, and MCP (Model Context Protocol) servers
  • Background tasks: Asynchronous processing for long-running tasks with polling support
  • Enhanced streaming: Better streaming capabilities with resumable streams and sequence tracking
  • File handling: Direct support for file inputs (PDFs, images, etc.) with automatic upload to containers
  • Reasoning models: First-class support for reasoning models (o1, o3, o4-mini) with encrypted reasoning items

Currently, users who want to leverage these advanced capabilities cannot route their requests through the semantic router, limiting the router's applicability for modern LLM workflows that require stateful interactions, advanced tooling, or reasoning capabilities.
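
For illustration, here is roughly how the same conversation turn looks under each API (a hedged sketch based only on the fields described above; "resp_abc123" is a placeholder id):

# Chat Completions: stateless, so the client resends the full history
chat_request = {
    "model": "gpt-4o",
    "messages": [
        {"role": "user", "content": "Solve the equation 3x + 11 = 14"},
        {"role": "assistant", "content": "x = 1"},
        {"role": "user", "content": "Now explain the solution step by step"},
    ],
}

# Responses API: server-side state, so a follow-up only references the
# previous response instead of replaying the transcript
responses_request = {
    "model": "gpt-4o",
    "input": "Now explain the solution step by step",
    "previous_response_id": "resp_abc123",
    "store": True,
}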

Describe the solution you'd like

Add support for the OpenAI Responses API to the semantic router, enabling intelligent routing and classification for Responses API requests while preserving all the advanced features of the API.

Key implementation requirements:

  1. New endpoint support: Handle POST /v1/responses and GET /v1/responses/{response_id} endpoints

  2. Request parsing: Parse the Responses API request format (a sketch covering items 2-4 follows this list), including:

    • input field (replaces messages)
    • previous_response_id for conversation chaining
    • tools with extended types (code_interpreter, image_generation, mcp)
    • background mode flag
    • stream parameter with sequence tracking
    • store parameter for stateful/stateless modes
  3. Semantic routing integration: Apply intent classification to Responses API requests:

    • Extract user content from the input field (which can be text, messages, or mixed content)
    • Classify intent and route to an appropriate model
    • Preserve conversation context when using previous_response_id
  4. Response handling: Process Responses API responses including:

    • Response object format with output array
    • Streaming events with sequence_number tracking
    • Background task status (queued, in_progress, completed)
    • VSR (vLLM Semantic Router) header injection for routing metadata
  5. Feature preservation: Ensure all Responses API features work through the router:

    • Function calling and tool execution
    • Code interpreter with container management
    • Image generation and editing
    • MCP server integration
    • File uploads and processing
    • Reasoning model support with encrypted reasoning items
  6. Backward compatibility: Maintain full support for the existing Chat Completions API while adding Responses API support
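
A minimal sketch of how items 2-4 could fit together in the routing layer. classify_intent and pick_model are hypothetical helpers, and the x-vsr-* header names are illustrative, not the router's actual ones:

def extract_user_text(body: dict) -> str:
    """Flatten the Responses API `input` field into plain text for
    classification: `input` may be a bare string or a list of message
    items whose `content` is a string or a list of typed parts."""
    inp = body.get("input", "")
    if isinstance(inp, str):
        return inp
    parts = []
    for item in inp:
        content = item.get("content", "")
        if isinstance(content, str):
            parts.append(content)
        else:
            for part in content:
                if part.get("type") in ("input_text", "text"):
                    parts.append(part.get("text", ""))
    return "\n".join(parts)

def route_responses_request(body: dict) -> tuple[dict, dict]:
    """Classify the request, rewrite `model`, and return routing
    metadata to surface as VSR response headers."""
    category = classify_intent(extract_user_text(body))      # hypothetical helper
    model = pick_model(category, body.get("tools", []))      # hypothetical helper
    body["model"] = model
    headers = {
        "x-vsr-selected-category": category,  # illustrative header names
        "x-vsr-selected-model": model,
    }
    return body, headers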

Example usage after implementation:

# Client using Responses API through semantic router
from openai import OpenAI

client = OpenAI(
    base_url="http://semantic-router:8801/openai/v1/",
    api_key="your-key"
)

# Router will classify intent and select best model
response = client.responses.create(
    model="auto",  # Router's intelligent model selection
    input="Solve the equation 3x + 11 = 14 using code",
    tools=[{"type": "code_interpreter", "container": {"type": "auto"}}]
)

# Chain responses with context preservation
second_response = client.responses.create(
    model="auto",
    previous_response_id=response.id,
    input="Now explain the solution step by step"
)
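
Background mode would flow through the same endpoint; a hedged sketch of polling a long-running task with the SDK's responses.retrieve, using the status values listed in item 4 above:

import time

# Long-running work is processed asynchronously on the backend
bg = client.responses.create(
    model="auto",
    input="Summarize this 500-page document",
    background=True,
)

# Poll until the task leaves the queue and finishes
while bg.status in ("queued", "in_progress"):
    time.sleep(2)
    bg = client.responses.retrieve(bg.id)

print(bg.status)  # "completed" on success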

Additional context

Benefits for semantic router users:

  • Enable routing for advanced LLM workflows (agents, code execution, multi-turn reasoning)
  • Support modern AI applications that require stateful conversations
  • Provide intelligent model selection for reasoning tasks and tool-heavy workloads
  • Maintain semantic router's value proposition (cost optimization, latency reduction) for next-generation LLM APIs

Implementation considerations:

  • The Responses API uses different request/response structures than Chat Completions
  • Conversation state management may require additional caching strategies
  • Background tasks introduce asynchronous processing patterns
  • Tool execution (especially code interpreter and MCP) may need special handling in the routing layer
  • The streaming format (typed semantic events with sequence numbers) differs from the chunk-based SSE stream used in Chat Completions
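
On the last point, the Responses API streams typed semantic events rather than raw chat-completion chunks; a sketch of what the router would need to pass through (event type names follow the public Responses API; resumption logic is omitted):

stream = client.responses.create(
    model="auto",
    input="Write a haiku about request routing",
    stream=True,
)

for event in stream:
    # Each event carries a sequence_number the router must preserve
    # so that clients can resume interrupted streams.
    if event.type == "response.output_text.delta":
        print(event.delta, end="")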

Priority justification:
As the Responses API becomes the recommended approach for building advanced LLM applications (especially with reasoning models and agents), supporting it in the semantic router is crucial for maintaining relevance and enabling users to leverage the router's intelligent routing capabilities with modern LLM workflows.
