Commit af899d2

[TRTLLM-9860][doc] Add docs and examples for Responses API (#9946)
Signed-off-by: Junyi Xu <219237550+JunyiXu-nv@users.noreply.github.com>
1 parent f2aee0d commit af899d2

File tree: 11 files changed (+590 −13 lines)

docs/source/commands/trtllm-serve/trtllm-serve.rst

Lines changed: 19 additions & 1 deletion
```diff
@@ -34,7 +34,7 @@ For the full syntax and argument descriptions, refer to :ref:`syntax`.
 Inference Endpoints
 -------------------
 
-After you start the server, you can send inference requests through completions API and Chat API, which are compatible with corresponding OpenAI APIs. We use `TinyLlama-1.1B-Chat-v1.0 <https://huggingface.co/TinyLlama/TinyLlama-1.1B-Chat-v1.0>`_ for examples in the following sections.
+After you start the server, you can send inference requests through the Completions API, Chat API, and Responses API, which are compatible with the corresponding OpenAI APIs. We use `TinyLlama-1.1B-Chat-v1.0 <https://huggingface.co/TinyLlama/TinyLlama-1.1B-Chat-v1.0>`_ for examples in the following sections.
 
 Chat API
 ~~~~~~~~
@@ -66,6 +66,24 @@ Another example uses ``curl``:
    :language: bash
    :linenos:
 
+Responses API
+~~~~~~~~~~~~~~~
+
+You can query the Responses API with any HTTP client; a typical example uses the OpenAI Python client:
+
+.. literalinclude:: ../../../../examples/serve/openai_responses_client.py
+   :language: python
+   :linenos:
+
+Another example uses ``curl``:
+
+.. literalinclude:: ../../../../examples/serve/curl_responses_client.sh
+   :language: bash
+   :linenos:
+
+
+More OpenAI-compatible examples can be found in the `compatibility examples <https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/serve/compatibility>`_ directory.
+
 Multimodal Serving
 ~~~~~~~~~~~~~~~~~~
 
```
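The two literalinclude targets (`openai_responses_client.py` and `curl_responses_client.sh`) are not reproduced in this excerpt. As a rough sketch of what such a request looks like at the HTTP level, assuming only the `/v1/responses` payload shape used by the examples added later in this commit:

```python
# Hypothetical raw-HTTP sketch of a /v1/responses request. The real
# openai_responses_client.py / curl_responses_client.sh referenced above are
# not shown in this diff; the payload shape is inferred from the examples
# added later in this commit.
import requests

payload = {
    "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    "input": "What is the capital of France?",
    "max_output_tokens": 256,
}
resp = requests.post("http://localhost:8000/v1/responses", json=payload)
resp.raise_for_status()

# A completed response carries a list of output items; assistant text lives
# in the output_text content parts of message items.
for item in resp.json().get("output", []):
    for part in item.get("content", []) or []:
        if part.get("type") == "output_text":
            print(part["text"])
```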

examples/serve/compatibility/README.md

Lines changed: 21 additions & 11 deletions
```diff
@@ -34,17 +34,27 @@ python examples/serve/compatibility/chat_completions/example_01_basic_chat.py
 
 ### 📋 Complete Example List
 
-All examples demonstrate the `/v1/chat/completions` endpoint:
+#### Chat Completions (`/v1/chat/completions`)
 
 | Example | File | Description |
 |---------|------|-------------|
-| **01** | `example_01_basic_chat.py` | Basic non-streaming chat completion |
-| **02** | `example_02_streaming_chat.py` | Streaming responses with real-time delivery |
-| **03** | `example_03_multi_turn_conversation.py` | Multi-turn conversation with context |
-| **04** | `example_04_streaming_with_usage.py` | Streaming with continuous token usage stats |
-| **05** | `example_05_json_mode.py` | Structured output with JSON schema |
-| **06** | `example_06_tool_calling.py` | Function/tool calling with tools |
-| **07** | `example_07_advanced_sampling.py` | TensorRT-LLM extended sampling parameters |
+| **01** | `chat_completions/example_01_basic_chat.py` | Basic non-streaming chat completion |
+| **02** | `chat_completions/example_02_streaming_chat.py` | Streaming responses with real-time delivery |
+| **03** | `chat_completions/example_03_multi_turn_conversation.py` | Multi-turn conversation with context |
+| **04** | `chat_completions/example_04_streaming_with_usage.py` | Streaming with continuous token usage stats |
+| **05** | `chat_completions/example_05_json_mode.py` | Structured output with JSON schema |
+| **06** | `chat_completions/example_06_tool_calling.py` | Function/tool calling with tools |
+| **07** | `chat_completions/example_07_advanced_sampling.py` | TensorRT-LLM extended sampling parameters |
+
+#### Responses (`/v1/responses`)
+
+| Example | File | Description |
+|---------|------|-------------|
+| **01** | `responses/example_01_basic_chat.py` | Basic non-streaming response |
+| **02** | `responses/example_02_streaming_chat.py` | Streaming with event handling |
+| **03** | `responses/example_03_multi_turn_conversation.py` | Multi-turn using `previous_response_id` |
+| **04** | `responses/example_04_json_mode.py` | Structured output with JSON schema |
+| **05** | `responses/example_05_tool_calling.py` | Function/tool calling with tools |
 
 ## Configuration
 
@@ -68,8 +78,8 @@ client = OpenAI(
 
 Some examples require specific model capabilities:
 
-| Example | Model Requirement |
+| Feature | Model Requirement |
 |---------|------------------|
-| 05 (JSON Mode) | xgrammar support |
-| 06 (Tool Calling) | Tool-capable model (Qwen3, GPT OSS) |
+| JSON Mode | xgrammar support |
+| Tool Calling | Tool-capable model (Qwen3, GPT-OSS, Kimi K2) |
 | Others | Any model |
```
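`responses/example_04_json_mode.py` is listed above but not included in this excerpt. A hedged sketch of what a JSON-mode Responses request looks like with the OpenAI client (the `text.format` shape follows the OpenAI Responses convention; the schema below is illustrative, and server-side enforcement depends on xgrammar support):

```python
# Hypothetical sketch of JSON-mode output via text.format; the actual
# example_04_json_mode.py is not shown in this diff, so details are inferred
# from the README notes above.
import json

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="tensorrt_llm")
model = client.models.list().data[0].id

response = client.responses.create(
    model=model,
    input="Extract the name and age: 'Alice is 30 years old.'",
    text={
        "format": {
            "type": "json_schema",
            "name": "person",  # schema name, required by the Responses API
            "schema": {
                "type": "object",
                "properties": {
                    "name": {"type": "string"},
                    "age": {"type": "integer"},
                },
                "required": ["name", "age"],
            },
        }
    },
)
print(json.loads(response.output_text))
```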
examples/serve/compatibility/responses/README.md

Lines changed: 102 additions & 0 deletions
@@ -0,0 +1,102 @@
# Responses API Examples

Examples for the `/v1/responses` endpoint. All examples in this directory use the Responses API, demonstrating features such as streaming, tool/function calling, and multi-turn dialogue.

## Quick Start

```bash
# Run the basic example
python example_01_basic_chat.py
```

## Examples Overview

### Basic Examples

1. **`example_01_basic_chat.py`** - Start here!
   - Simple request/response
   - Non-streaming mode
   - Uses the `input` parameter for the user message

2. **`example_02_streaming_chat.py`** - Real-time responses
   - Streams tokens as they are generated
   - Handles various event types (`response.created`, `response.output_text.delta`, etc.)
   - Server-Sent Events (SSE)

3. **`example_03_multi_turn_conversation.py`** - Context management
   - Multiple conversation turns
   - Uses `previous_response_id` to maintain context
   - Follow-up questions without resending history

### Advanced Examples

4. **`example_04_json_mode.py`** - Structured output
   - JSON schema validation via `text.format`
   - Structured data extraction
   - Requires xgrammar support

5. **`example_05_tool_calling.py`** - Function calling
   - External tool integration
   - Function definitions with the `tools` parameter
   - Tool result handling with `function_call_output` (see the sketch after the Tool Calling section below)
   - Requires a compatible model (Qwen3, GPT-OSS, Kimi K2)

## Key Concepts

### Non-Streaming vs Streaming

**Non-Streaming** (`stream=False`):
- Waits for the complete response
- Single response object
- Simple to use

**Streaming** (`stream=True`):
- Tokens delivered as they are generated
- Better perceived latency
- Server-Sent Events (SSE)

### Multi-turn Context

Use `previous_response_id` to continue conversations:

```python
# First turn
response1 = client.responses.create(
    model=model,
    input="What is 15 multiplied by 23?",
)

# Second turn - references the previous response
response2 = client.responses.create(
    model=model,
    input="Now divide that result by 5",
    previous_response_id=response1.id,
)
```

### Tool Calling

Define functions the model can call:

```python
tools = [{
    "name": "get_weather",
    "type": "function",
    "description": "Get the current weather in a location",
    "parameters": {
        "type": "object",
        "properties": {
            "location": {"type": "string"},
        },
        "required": ["location"],
    }
}]
```
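The full tool-calling flow lives in `example_05_tool_calling.py`, which is not shown in this diff. As a rough sketch of the round trip, assuming the generic OpenAI Responses convention (`run_tool` below is a hypothetical local dispatcher, not part of this commit):

```python
# Rough sketch of the tool-call round trip; the flow follows the generic
# OpenAI Responses convention rather than the example file's actual contents.
import json

response = client.responses.create(
    model=model,
    input="What's the weather in Paris?",
    tools=tools,
)

tool_outputs = []
for item in response.output:
    if item.type == "function_call":
        args = json.loads(item.arguments)
        result = run_tool(item.name, args)  # run_tool: hypothetical dispatcher
        # Feed the result back as a function_call_output input item.
        tool_outputs.append({
            "type": "function_call_output",
            "call_id": item.call_id,
            "output": json.dumps(result),
        })

# Continue the conversation with the tool results attached.
final = client.responses.create(
    model=model,
    input=tool_outputs,
    previous_response_id=response.id,
    tools=tools,
)
print(final.output_text)
```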
## Model Requirements

| Feature | Requirement |
|---------|-------------|
| Basic chat | Any model |
| Streaming | Any model |
| Multi-turn | Any model |
| JSON mode | xgrammar support |
| Tool calling | Compatible model (Qwen3, GPT-OSS, Kimi K2) |
examples/serve/compatibility/responses/example_01_basic_chat.py

Lines changed: 48 additions & 0 deletions
@@ -0,0 +1,48 @@
```python
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

#!/usr/bin/env python3
"""Example 1: Basic Non-Streaming Responses.

Demonstrates a simple Responses API request with the OpenAI-compatible client.
"""

from openai import OpenAI

# Initialize the client
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="tensorrt_llm",
)

# Get the model name from the server
models = client.models.list()
model = models.data[0].id

print("=" * 80)
print("Example 1: Basic Non-Streaming Responses")
print("=" * 80)
print()

# Create a simple Responses API request
response = client.responses.create(
    model=model,
    input="What is the capital of France?",
    max_output_tokens=4096,
)

# Print the response
print("Response:")
print(f"Content: {response.output_text}")
```
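Beyond `output_text`, the returned object carries structured output items and usage accounting. A short sketch of inspecting them, assuming the standard OpenAI Responses object shape that this endpoint targets (availability of individual fields may vary by server version):

```python
# Sketch: inspecting the structured response object returned above.
print(f"Response id: {response.id}")
print(f"Status: {response.status}")
for item in response.output:
    print(f"Output item: {item.type}")
if response.usage is not None:
    print(f"Tokens: input={response.usage.input_tokens}, "
          f"output={response.usage.output_tokens}")
```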
examples/serve/compatibility/responses/example_02_streaming_chat.py

Lines changed: 98 additions & 0 deletions
@@ -0,0 +1,98 @@
```python
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

#!/usr/bin/env python3
"""Example 2: Streaming Responses.

Demonstrates streaming responses with real-time token delivery.
"""

from openai import OpenAI


def print_streaming_responses_item(item, show_events=True):
    """Pretty-print a single streaming event from the Responses API."""
    event_type = getattr(item, "type", "")

    if event_type == "response.created":
        if show_events:
            print(f"[Response Created: {getattr(item.response, 'id', 'unknown')}]")
    elif event_type == "response.in_progress":
        if show_events:
            print("[Response In Progress]")
    elif event_type == "response.output_item.added":
        if show_events:
            item_type = getattr(item.item, "type", "unknown")
            item_id = getattr(item.item, "id", "unknown")
            print(f"\n[Output Item Added: {item_type} (id: {item_id})]")
    elif event_type == "response.content_part.added":
        if show_events:
            part_type = getattr(item.part, "type", "unknown")
            print(f"[Content Part Added: {part_type}]")
    elif event_type == "response.reasoning_text.delta":
        print(item.delta, end="", flush=True)
    elif event_type == "response.output_text.delta":
        print(item.delta, end="", flush=True)
    elif event_type == "response.reasoning_text.done":
        if show_events:
            print(f"\n[Reasoning Text Done: {len(item.text)} chars]")
    elif event_type == "response.output_text.done":
        if show_events:
            print(f"\n[Output Text Done: {len(item.text)} chars]")
    elif event_type == "response.content_part.done":
        if show_events:
            part_type = getattr(item.part, "type", "unknown")
            print(f"[Content Part Done: {part_type}]")
    elif event_type == "response.output_item.done":
        if show_events:
            item_type = getattr(item.item, "type", "unknown")
            item_id = getattr(item.item, "id", "unknown")
            print(f"[Output Item Done: {item_type} (id: {item_id})]")
    elif event_type == "response.completed":
        if show_events:
            print("\n[Response Completed]")


# Initialize the client
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="tensorrt_llm",
)

# Get the model name from the server
models = client.models.list()
model = models.data[0].id

print("=" * 80)
print("Example 2: Streaming Responses")
print("=" * 80)
print()

print("Prompt: Write a haiku about artificial intelligence\n")

# Create a streaming Responses API request
stream = client.responses.create(
    model=model,
    input="Write a haiku about artificial intelligence",
    max_output_tokens=4096,
    stream=True,
)

# Print tokens as they arrive
print("Response (streaming):")
print("Assistant: ", end="", flush=True)

for event in stream:
    print_streaming_responses_item(event)
```
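For callers that only want the generated text, the event handler above can be reduced to a filter on `response.output_text.delta` events. A minimal sketch reusing the same `client` and `model`:

```python
# Minimal variant: ignore bookkeeping events and collect only the text deltas.
# Event names match those handled in print_streaming_responses_item above.
stream = client.responses.create(
    model=model,
    input="Write a haiku about artificial intelligence",
    max_output_tokens=4096,
    stream=True,
)
chunks = [
    event.delta
    for event in stream
    if getattr(event, "type", "") == "response.output_text.delta"
]
print("".join(chunks))
```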
examples/serve/compatibility/responses/example_03_multi_turn_conversation.py

Lines changed: 63 additions & 0 deletions
@@ -0,0 +1,63 @@
```python
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

#!/usr/bin/env python3
"""Example 3: Multi-turn Conversation.

Demonstrates maintaining conversation context across multiple turns.
"""

from openai import OpenAI

# Initialize the client
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="tensorrt_llm",
)

# Get the model name from the server
models = client.models.list()
model = models.data[0].id

print("=" * 80)
print("Example 3: Multi-turn Conversation")
print("=" * 80)
print()

# First turn: the user asks a question
print("USER: What is 15 multiplied by 23?")

response1 = client.responses.create(
    model=model,
    input="What is 15 multiplied by 23?",
    max_output_tokens=4096,
)

assistant_reply_1 = response1.output_text
print(f"ASSISTANT: {assistant_reply_1}\n")

# Second turn: the user asks a follow-up question
print("USER: Now divide that result by 5")

# No context needs to be resent for the second turn; only the previous
# response ID is included.
response2 = client.responses.create(
    model=model,
    input="Now divide that result by 5",
    max_output_tokens=4096,
    previous_response_id=response1.id,
)

assistant_reply_2 = response2.output_text
print(f"ASSISTANT: {assistant_reply_2}")
```
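An alternative to `previous_response_id` is managing history client-side and resending prior turns as a message list in `input`. A hedged sketch (message-list input follows the OpenAI Responses convention; server support for it is an assumption here, not shown in this commit):

```python
# Alternative sketch: manage history client-side instead of relying on
# previous_response_id, resending prior turns as a message list in `input`.
history = [
    {"role": "user", "content": "What is 15 multiplied by 23?"},
    {"role": "assistant", "content": assistant_reply_1},
    {"role": "user", "content": "Now divide that result by 5"},
]
response2_alt = client.responses.create(
    model=model,
    input=history,
    max_output_tokens=4096,
)
print(f"ASSISTANT (client-side history): {response2_alt.output_text}")
```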
