Commit af899d2

[TRTLLM-9860][doc] Add docs and examples for Responses API (#9946)
Signed-off-by: Junyi Xu <219237550+JunyiXu-nv@users.noreply.github.com>
1 parent f2aee0d commit af899d2

File tree: 11 files changed (+590 −13 lines)

docs/source/commands/trtllm-serve/trtllm-serve.rst

Lines changed: 19 additions & 1 deletion
```diff
@@ -34,7 +34,7 @@ For the full syntax and argument descriptions, refer to :ref:`syntax`.
 Inference Endpoints
 -------------------
 
-After you start the server, you can send inference requests through completions API and Chat API, which are compatible with corresponding OpenAI APIs. We use `TinyLlama-1.1B-Chat-v1.0 <https://huggingface.co/TinyLlama/TinyLlama-1.1B-Chat-v1.0>`_ for examples in the following sections.
+After you start the server, you can send inference requests through the Completions API, Chat API, and Responses API, which are compatible with the corresponding OpenAI APIs. We use `TinyLlama-1.1B-Chat-v1.0 <https://huggingface.co/TinyLlama/TinyLlama-1.1B-Chat-v1.0>`_ for examples in the following sections.
 
 Chat API
 ~~~~~~~~
@@ -66,6 +66,24 @@ Another example uses ``curl``:
    :language: bash
    :linenos:
 
+Responses API
+~~~~~~~~~~~~~~~
+
+You can query the Responses API with any HTTP client; a typical example uses the OpenAI Python client:
+
+.. literalinclude:: ../../../../examples/serve/openai_responses_client.py
+   :language: python
+   :linenos:
+
+Another example uses ``curl``:
+
+.. literalinclude:: ../../../../examples/serve/curl_responses_client.sh
+   :language: bash
+   :linenos:
+
+
+More OpenAI-compatible examples can be found in the `compatibility examples <https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/serve/compatibility>`_ directory.
+
 Multimodal Serving
 ~~~~~~~~~~~~~~~~~~
 
```
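The two literalinclude targets (`openai_responses_client.py` and `curl_responses_client.sh`) are not reproduced in this excerpt. As a rough sketch of what such a request looks like at the HTTP level, assuming only the `/v1/responses` payload shape used by the examples added later in this commit:

```python
# Hypothetical raw-HTTP sketch of a /v1/responses request. The real
# openai_responses_client.py / curl_responses_client.sh referenced above are
# not shown in this diff; the payload shape is inferred from the examples
# added later in this commit.
import requests

payload = {
    "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    "input": "What is the capital of France?",
    "max_output_tokens": 256,
}
resp = requests.post("http://localhost:8000/v1/responses", json=payload)
resp.raise_for_status()

# A completed response carries a list of output items; assistant text lives
# in the output_text content parts of message items.
for item in resp.json().get("output", []):
    for part in item.get("content", []) or []:
        if part.get("type") == "output_text":
            print(part["text"])
```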

examples/serve/compatibility/README.md

Lines changed: 21 additions & 11 deletions
```diff
@@ -34,17 +34,27 @@ python examples/serve/compatibility/chat_completions/example_01_basic_chat.py
 
 ### 📋 Complete Example List
 
-All examples demonstrate the `/v1/chat/completions` endpoint:
+#### Chat Completions (`/v1/chat/completions`)
 
 | Example | File | Description |
 |---------|------|-------------|
-| **01** | `example_01_basic_chat.py` | Basic non-streaming chat completion |
-| **02** | `example_02_streaming_chat.py` | Streaming responses with real-time delivery |
-| **03** | `example_03_multi_turn_conversation.py` | Multi-turn conversation with context |
-| **04** | `example_04_streaming_with_usage.py` | Streaming with continuous token usage stats |
-| **05** | `example_05_json_mode.py` | Structured output with JSON schema |
-| **06** | `example_06_tool_calling.py` | Function/tool calling with tools |
-| **07** | `example_07_advanced_sampling.py` | TensorRT-LLM extended sampling parameters |
+| **01** | `chat_completions/example_01_basic_chat.py` | Basic non-streaming chat completion |
+| **02** | `chat_completions/example_02_streaming_chat.py` | Streaming responses with real-time delivery |
+| **03** | `chat_completions/example_03_multi_turn_conversation.py` | Multi-turn conversation with context |
+| **04** | `chat_completions/example_04_streaming_with_usage.py` | Streaming with continuous token usage stats |
+| **05** | `chat_completions/example_05_json_mode.py` | Structured output with JSON schema |
+| **06** | `chat_completions/example_06_tool_calling.py` | Function/tool calling with tools |
+| **07** | `chat_completions/example_07_advanced_sampling.py` | TensorRT-LLM extended sampling parameters |
+
+#### Responses (`/v1/responses`)
+
+| Example | File | Description |
+|---------|------|-------------|
+| **01** | `responses/example_01_basic_chat.py` | Basic non-streaming response |
+| **02** | `responses/example_02_streaming_chat.py` | Streaming with event handling |
+| **03** | `responses/example_03_multi_turn_conversation.py` | Multi-turn using `previous_response_id` |
+| **04** | `responses/example_04_json_mode.py` | Structured output with JSON schema |
+| **05** | `responses/example_05_tool_calling.py` | Function/tool calling with tools |
 
 ## Configuration
 
@@ -68,8 +78,8 @@ client = OpenAI(
 
 Some examples require specific model capabilities:
 
-| Example | Model Requirement |
+| Feature | Model Requirement |
 |---------|------------------|
-| 05 (JSON Mode) | xgrammar support |
-| 06 (Tool Calling) | Tool-capable model (Qwen3, GPT OSS) |
+| JSON Mode | xgrammar support |
+| Tool Calling | Tool-capable model (Qwen3, GPT-OSS, Kimi K2) |
 | Others | Any model |
```
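`responses/example_04_json_mode.py` is listed above but not included in this excerpt. A hedged sketch of what a JSON-mode Responses request looks like with the OpenAI client (the `text.format` shape follows the OpenAI Responses convention; the schema below is illustrative, and server-side enforcement depends on xgrammar support):

```python
# Hypothetical sketch of JSON-mode output via text.format; the actual
# example_04_json_mode.py is not shown in this diff, so details are inferred
# from the README notes above.
import json

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="tensorrt_llm")
model = client.models.list().data[0].id

response = client.responses.create(
    model=model,
    input="Extract the name and age: 'Alice is 30 years old.'",
    text={
        "format": {
            "type": "json_schema",
            "name": "person",  # schema name, required by the Responses API
            "schema": {
                "type": "object",
                "properties": {
                    "name": {"type": "string"},
                    "age": {"type": "integer"},
                },
                "required": ["name", "age"],
            },
        }
    },
)
print(json.loads(response.output_text))
```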
examples/serve/compatibility/responses/README.md

Lines changed: 102 additions & 0 deletions
@@ -0,0 +1,102 @@
# Responses API Examples

Examples for the `/v1/responses` endpoint. All examples in this directory use the Responses API, demonstrating features such as streaming, tool/function calling, and multi-turn dialogue.

## Quick Start

```bash
# Run the basic example
python example_01_basic_chat.py
```

## Examples Overview

### Basic Examples

1. **`example_01_basic_chat.py`** - Start here!
   - Simple request/response
   - Non-streaming mode
   - Uses the `input` parameter for the user message

2. **`example_02_streaming_chat.py`** - Real-time responses
   - Streams tokens as they are generated
   - Handles various event types (`response.created`, `response.output_text.delta`, etc.)
   - Server-Sent Events (SSE)

3. **`example_03_multi_turn_conversation.py`** - Context management
   - Multiple conversation turns
   - Uses `previous_response_id` to maintain context
   - Follow-up questions without resending history

### Advanced Examples

4. **`example_04_json_mode.py`** - Structured output
   - JSON schema validation via `text.format`
   - Structured data extraction
   - Requires xgrammar support

5. **`example_05_tool_calling.py`** - Function calling
   - External tool integration
   - Function definitions with the `tools` parameter
   - Tool result handling with `function_call_output` (see the sketch after the Tool Calling section below)
   - Requires a compatible model (Qwen3, GPT-OSS, Kimi K2)

## Key Concepts

### Non-Streaming vs Streaming

**Non-Streaming** (`stream=False`):
- Waits for the complete response
- Single response object
- Simple to use

**Streaming** (`stream=True`):
- Tokens delivered as they are generated
- Better perceived latency
- Server-Sent Events (SSE)

### Multi-turn Context

Use `previous_response_id` to continue conversations:

```python
# First turn
response1 = client.responses.create(
    model=model,
    input="What is 15 multiplied by 23?",
)

# Second turn - references the previous response
response2 = client.responses.create(
    model=model,
    input="Now divide that result by 5",
    previous_response_id=response1.id,
)
```

### Tool Calling

Define functions the model can call:

```python
tools = [{
    "name": "get_weather",
    "type": "function",
    "description": "Get the current weather in a location",
    "parameters": {
        "type": "object",
        "properties": {
            "location": {"type": "string"},
        },
        "required": ["location"],
    }
}]
```
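The full tool-calling flow lives in `example_05_tool_calling.py`, which is not shown in this diff. As a rough sketch of the round trip, assuming the generic OpenAI Responses convention (`run_tool` below is a hypothetical local dispatcher, not part of this commit):

```python
# Rough sketch of the tool-call round trip; the flow follows the generic
# OpenAI Responses convention rather than the example file's actual contents.
import json

response = client.responses.create(
    model=model,
    input="What's the weather in Paris?",
    tools=tools,
)

tool_outputs = []
for item in response.output:
    if item.type == "function_call":
        args = json.loads(item.arguments)
        result = run_tool(item.name, args)  # run_tool: hypothetical dispatcher
        # Feed the result back as a function_call_output input item.
        tool_outputs.append({
            "type": "function_call_output",
            "call_id": item.call_id,
            "output": json.dumps(result),
        })

# Continue the conversation with the tool results attached.
final = client.responses.create(
    model=model,
    input=tool_outputs,
    previous_response_id=response.id,
    tools=tools,
)
print(final.output_text)
```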
## Model Requirements

| Feature | Requirement |
|---------|-------------|
| Basic chat | Any model |
| Streaming | Any model |
| Multi-turn | Any model |
| JSON mode | xgrammar support |
| Tool calling | Compatible model (Qwen3, GPT-OSS, Kimi K2) |
examples/serve/compatibility/responses/example_01_basic_chat.py

Lines changed: 48 additions & 0 deletions
@@ -0,0 +1,48 @@
```python
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

#!/usr/bin/env python3
"""Example 1: Basic Non-Streaming Responses.

Demonstrates a simple Responses API request with the OpenAI-compatible client.
"""

from openai import OpenAI

# Initialize the client
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="tensorrt_llm",
)

# Get the model name from the server
models = client.models.list()
model = models.data[0].id

print("=" * 80)
print("Example 1: Basic Non-Streaming Responses")
print("=" * 80)
print()

# Create a simple Responses API request
response = client.responses.create(
    model=model,
    input="What is the capital of France?",
    max_output_tokens=4096,
)

# Print the response
print("Response:")
print(f"Content: {response.output_text}")
```
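Beyond `output_text`, the returned object carries structured output items and usage accounting. A short sketch of inspecting them, assuming the standard OpenAI Responses object shape that this endpoint targets (availability of individual fields may vary by server version):

```python
# Sketch: inspecting the structured response object returned above.
print(f"Response id: {response.id}")
print(f"Status: {response.status}")
for item in response.output:
    print(f"Output item: {item.type}")
if response.usage is not None:
    print(f"Tokens: input={response.usage.input_tokens}, "
          f"output={response.usage.output_tokens}")
```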
examples/serve/compatibility/responses/example_02_streaming_chat.py

Lines changed: 98 additions & 0 deletions
@@ -0,0 +1,98 @@
```python
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

#!/usr/bin/env python3
"""Example 2: Streaming Responses.

Demonstrates streaming responses with real-time token delivery.
"""

from openai import OpenAI


def print_streaming_responses_item(item, show_events=True):
    """Pretty-print a single streaming event from the Responses API."""
    event_type = getattr(item, "type", "")

    if event_type == "response.created":
        if show_events:
            print(f"[Response Created: {getattr(item.response, 'id', 'unknown')}]")
    elif event_type == "response.in_progress":
        if show_events:
            print("[Response In Progress]")
    elif event_type == "response.output_item.added":
        if show_events:
            item_type = getattr(item.item, "type", "unknown")
            item_id = getattr(item.item, "id", "unknown")
            print(f"\n[Output Item Added: {item_type} (id: {item_id})]")
    elif event_type == "response.content_part.added":
        if show_events:
            part_type = getattr(item.part, "type", "unknown")
            print(f"[Content Part Added: {part_type}]")
    elif event_type == "response.reasoning_text.delta":
        print(item.delta, end="", flush=True)
    elif event_type == "response.output_text.delta":
        print(item.delta, end="", flush=True)
    elif event_type == "response.reasoning_text.done":
        if show_events:
            print(f"\n[Reasoning Text Done: {len(item.text)} chars]")
    elif event_type == "response.output_text.done":
        if show_events:
            print(f"\n[Output Text Done: {len(item.text)} chars]")
    elif event_type == "response.content_part.done":
        if show_events:
            part_type = getattr(item.part, "type", "unknown")
            print(f"[Content Part Done: {part_type}]")
    elif event_type == "response.output_item.done":
        if show_events:
            item_type = getattr(item.item, "type", "unknown")
            item_id = getattr(item.item, "id", "unknown")
            print(f"[Output Item Done: {item_type} (id: {item_id})]")
    elif event_type == "response.completed":
        if show_events:
            print("\n[Response Completed]")


# Initialize the client
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="tensorrt_llm",
)

# Get the model name from the server
models = client.models.list()
model = models.data[0].id

print("=" * 80)
print("Example 2: Streaming Responses")
print("=" * 80)
print()

print("Prompt: Write a haiku about artificial intelligence\n")

# Create a streaming Responses API request
stream = client.responses.create(
    model=model,
    input="Write a haiku about artificial intelligence",
    max_output_tokens=4096,
    stream=True,
)

# Print tokens as they arrive
print("Response (streaming):")
print("Assistant: ", end="", flush=True)

for event in stream:
    print_streaming_responses_item(event)
```
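For callers that only want the generated text, the event handler above can be reduced to a filter on `response.output_text.delta` events. A minimal sketch reusing the same `client` and `model`:

```python
# Minimal variant: ignore bookkeeping events and collect only the text deltas.
# Event names match those handled in print_streaming_responses_item above.
stream = client.responses.create(
    model=model,
    input="Write a haiku about artificial intelligence",
    max_output_tokens=4096,
    stream=True,
)
chunks = [
    event.delta
    for event in stream
    if getattr(event, "type", "") == "response.output_text.delta"
]
print("".join(chunks))
```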
examples/serve/compatibility/responses/example_03_multi_turn_conversation.py

Lines changed: 63 additions & 0 deletions
@@ -0,0 +1,63 @@
```python
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

#!/usr/bin/env python3
"""Example 3: Multi-turn Conversation.

Demonstrates maintaining conversation context across multiple turns.
"""

from openai import OpenAI

# Initialize the client
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="tensorrt_llm",
)

# Get the model name from the server
models = client.models.list()
model = models.data[0].id

print("=" * 80)
print("Example 3: Multi-turn Conversation")
print("=" * 80)
print()

# First turn: the user asks a question
print("USER: What is 15 multiplied by 23?")

response1 = client.responses.create(
    model=model,
    input="What is 15 multiplied by 23?",
    max_output_tokens=4096,
)

assistant_reply_1 = response1.output_text
print(f"ASSISTANT: {assistant_reply_1}\n")

# Second turn: the user asks a follow-up question
print("USER: Now divide that result by 5")

# No context needs to be resent for the second turn; only the previous
# response ID is included.
response2 = client.responses.create(
    model=model,
    input="Now divide that result by 5",
    max_output_tokens=4096,
    previous_response_id=response1.id,
)

assistant_reply_2 = response2.output_text
print(f"ASSISTANT: {assistant_reply_2}")
```
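An alternative to `previous_response_id` is managing history client-side and resending prior turns as a message list in `input`. A hedged sketch (message-list input follows the OpenAI Responses convention; server support for it is an assumption here, not shown in this commit):

```python
# Alternative sketch: manage history client-side instead of relying on
# previous_response_id, resending prior turns as a message list in `input`.
history = [
    {"role": "user", "content": "What is 15 multiplied by 23?"},
    {"role": "assistant", "content": assistant_reply_1},
    {"role": "user", "content": "Now divide that result by 5"},
]
response2_alt = client.responses.create(
    model=model,
    input=history,
    max_output_tokens=4096,
)
print(f"ASSISTANT (client-side history): {response2_alt.output_text}")
```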
