**Is your feature request related to a problem? Please describe.**
The semantic router currently supports the OpenAI Chat Completions API (`/v1/chat/completions`) for routing and processing LLM requests. However, OpenAI and Azure OpenAI have introduced a new Responses API (`/v1/responses`) that provides a more powerful, stateful API experience. The new API brings together the best capabilities of the Chat Completions and Assistants APIs in a unified interface.
The Responses API offers several advantages over the traditional Chat Completions API:
- **Stateful conversations**: Built-in conversation state management with response chaining via `previous_response_id`
- **Advanced tool support**: Native support for code interpreter, function calling, image generation, and MCP (Model Context Protocol) servers
- **Background tasks**: Asynchronous processing for long-running tasks with polling support
- **Enhanced streaming**: Better streaming capabilities with resumable streams and sequence tracking
- **File handling**: Direct support for file inputs (PDFs, images, etc.) with automatic upload to containers
- **Reasoning models**: First-class support for reasoning models (o1, o3, o4-mini) with encrypted reasoning items
Currently, users who want to leverage these advanced capabilities cannot route their requests through the semantic router, limiting the router's applicability for modern LLM workflows that require stateful interactions, advanced tooling, or reasoning capabilities.
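To make the difference concrete, a hedged sketch of the two request shapes (the payload values are invented; field names follow the points above):

```python
# Chat Completions: stateless; the client resends history as `messages`
chat_request = {
    "model": "gpt-4o",
    "messages": [{"role": "user", "content": "Solve 3x + 11 = 14"}],
}

# Responses API: stateful; a turn can chain off a prior response by id
responses_request = {
    "model": "gpt-4o",
    "input": "Now explain the solution step by step",
    "previous_response_id": "resp_abc123",  # invented id from an earlier turn
    "store": True,  # server keeps conversation state so chaining works
}
```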
**Describe the solution you'd like**
Add support for the OpenAI Responses API to the semantic router, enabling intelligent routing and classification for Responses API requests while preserving all the advanced features of the API.
Key implementation requirements:
- **New endpoint support**: Handle the `POST /v1/responses` and `GET /v1/responses/{response_id}` endpoints
- **Request parsing**: Parse the Responses API request format, including:
  - `input` field (replaces `messages`)
  - `previous_response_id` for conversation chaining
  - `tools` with extended types (code_interpreter, image_generation, mcp)
  - `background` mode flag
  - `stream` parameter with sequence tracking
  - `store` parameter for stateful/stateless modes
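  As a rough sketch of this step (helper names here are invented, not the router's actual internals), the `input` field can be normalized whether the client sends a bare string or structured items, with the routing-relevant flags pulled alongside it:

  ```python
  from typing import Any

  def normalize_input(body: dict[str, Any]) -> list[dict[str, Any]]:
      """Normalize the Responses API `input` field to message dicts.

      `input` may be a plain string or a list of message-like items;
      a bare string is treated as a single user message.
      """
      raw = body.get("input", [])
      if isinstance(raw, str):
          return [{"role": "user", "content": raw}]
      return list(raw)

  def routing_flags(body: dict[str, Any]) -> dict[str, Any]:
      """Pull out the fields the router needs for a routing decision."""
      return {
          "previous_response_id": body.get("previous_response_id"),
          "background": body.get("background", False),
          "stream": body.get("stream", False),
          "store": body.get("store", True),  # assumed default; verify upstream
          "tool_types": [t.get("type") for t in body.get("tools", [])],
      }
  ```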
- **Semantic routing integration**: Apply intent classification to Responses API requests:
  - Extract user content from the `input` field (which can be text, messages, or mixed content)
  - Classify intent and route to appropriate models
  - Preserve conversation context when using `previous_response_id`
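  A minimal sketch of the extraction step (`classify` and `select_model` are placeholders for the router's existing classifier and model picker, not real function names):

  ```python
  def extract_user_text(messages: list[dict]) -> str:
      """Flatten user-authored text for intent classification.

      Structured content parts with a `text` key are kept; non-text
      parts such as images or files are skipped for classification.
      """
      chunks: list[str] = []
      for msg in messages:
          if msg.get("role") != "user":
              continue
          content = msg.get("content", "")
          if isinstance(content, str):
              chunks.append(content)
          else:
              chunks.extend(part["text"] for part in content
                            if isinstance(part, dict) and "text" in part)
      return "\n".join(chunks)

  # category = classify(extract_user_text(normalize_input(body)))  # placeholders
  # model = select_model(category)
  ```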
- **Response handling**: Process Responses API responses, including:
  - Response object format with the `output` array
  - Streaming events with `sequence_number` tracking
  - Background task status (`queued`, `in_progress`, `completed`)
  - VSR (vLLM Semantic Router) header injection for routing metadata
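  A hedged sketch of the streaming side (`sequence_number` comes from the API's events; the VSR header name shown is illustrative, not a settled spec):

  ```python
  import json
  from typing import Iterable, Iterator

  def relay_stream(upstream: Iterable[dict]) -> Iterator[str]:
      """Relay upstream streaming events as SSE while checking that
      `sequence_number` advances without gaps."""
      last_seq = -1
      for event in upstream:  # each event is an already-parsed JSON dict
          seq = event.get("sequence_number", last_seq + 1)
          if seq != last_seq + 1:
              pass  # gap detected: a resumable-stream client could re-sync here
          last_seq = seq
          yield f"data: {json.dumps(event)}\n\n"

  # For non-streaming responses, routing metadata would go into headers, e.g.:
  # headers["x-vsr-selected-model"] = selected_model  # illustrative header name
  ```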
- **Feature preservation**: Ensure all Responses API features work through the router:
  - Function calling and tool execution
  - Code interpreter with container management
  - Image generation and editing
  - MCP server integration
  - File uploads and processing
  - Reasoning model support with encrypted reasoning items
- **Backward compatibility**: Maintain full support for the existing Chat Completions API while adding Responses API support
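A sketch of how the two APIs could coexist behind a single dispatch point (handler names are placeholders, and the GET endpoint is elided for brevity):

```python
def handle_chat_completions(body: dict) -> dict:
    ...  # existing Chat Completions pipeline, unchanged

def handle_responses(body: dict) -> dict:
    ...  # new pipeline: normalize input, classify, route, proxy

def dispatch(path: str, body: dict) -> dict:
    """Route by request path so existing clients are unaffected."""
    if path == "/v1/chat/completions":
        return handle_chat_completions(body)
    if path.startswith("/v1/responses"):
        return handle_responses(body)
    raise ValueError(f"unsupported path: {path}")
```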
Example usage after implementation:
```python
# Client using Responses API through the semantic router
from openai import OpenAI

client = OpenAI(
    base_url="http://semantic-router:8801/openai/v1/",
    api_key="your-key",
)

# Router will classify intent and select the best model
response = client.responses.create(
    model="auto",  # Router's intelligent model selection
    input="Solve the equation 3x + 11 = 14 using code",
    tools=[{"type": "code_interpreter", "container": {"type": "auto"}}],
)

# Chain responses with context preservation
second_response = client.responses.create(
    model="auto",
    previous_response_id=response.id,
    input="Now explain the solution step by step",
)
```
**Additional context**
Related documentation:
Benefits for semantic router users:
- Enable routing for advanced LLM workflows (agents, code execution, multi-turn reasoning)
- Support modern AI applications that require stateful conversations
- Provide intelligent model selection for reasoning tasks and tool-heavy workloads
- Maintain semantic router's value proposition (cost optimization, latency reduction) for next-generation LLM APIs
Implementation considerations:
- The Responses API uses different request/response structures than Chat Completions
- Conversation state management may require additional caching strategies (see the sticky-routing sketch after this list)
- Background tasks introduce asynchronous processing patterns
- Tool execution (especially code interpreter and MCP) may need special handling in the routing layer
- The streaming format (typed events with `sequence_number` tracking) differs from the chunked SSE deltas used in Chat Completions
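For the caching consideration above, one illustrative strategy is pinning chained turns to the backend that owns the conversation state:

```python
# Illustrative sticky-routing cache: response_id -> backend holding its state
response_owner: dict[str, str] = {}

def pick_backend(body: dict, classified_backend: str) -> str:
    """Send chained turns to wherever `previous_response_id` lives;
    fresh conversations are free to follow intent classification."""
    prev = body.get("previous_response_id")
    if prev and prev in response_owner:
        return response_owner[prev]
    return classified_backend

def remember(response_id: str, backend: str) -> None:
    """Record ownership after proxying a stored response."""
    response_owner[response_id] = backend
```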
Priority justification:
As the Responses API becomes the recommended approach for building advanced LLM applications (especially with reasoning models and agents), supporting it in the semantic router is crucial for maintaining relevance and enabling users to leverage the router's intelligent routing capabilities with modern LLM workflows.