
Commit f35e9c4

Add docs for additional outputs
1 parent 5e605ca commit f35e9c4

2 files changed: +112 −0 lines changed

README.md

Lines changed: 5 additions & 0 deletions
@@ -203,6 +203,11 @@ you need to specify a different `shm-region-prefix-name` for each server. See
 [here](https://github.com/triton-inference-server/python_backend#running-multiple-instances-of-triton-server)
 for more information.
 
+## Additional vLLM outputs
+
+Additional vLLM outputs may optionally be requested on a per-request basis. See
+[these docs](docs/additional_outputs.md) for more information.
+
 ## Triton Metrics
 Starting with the 24.08 release of Triton, users can now obtain specific
 vLLM metrics by querying the Triton metrics endpoint (see complete vLLM metrics

docs/additional_outputs.md

Lines changed: 107 additions & 0 deletions
@@ -0,0 +1,107 @@
<!--
# Copyright 2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions
# are met:
#  * Redistributions of source code must retain the above copyright
#    notice, this list of conditions and the following disclaimer.
#  * Redistributions in binary form must reproduce the above copyright
#    notice, this list of conditions and the following disclaimer in the
#    documentation and/or other materials provided with the distribution.
#  * Neither the name of NVIDIA CORPORATION nor the names of its
#    contributors may be used to endorse or promote products derived
#    from this software without specific prior written permission.
#
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR
# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
-->
# Additional Outputs from vLLM

The vLLM backend supports sending additional outputs from vLLM on top of the
usual `text_output` when requested.

All additional outputs are disabled by default and must be enabled on a
per-request basis. If enabled, the corresponding output tensor will be set for
all responses from the request.

## Supported Additional Outputs

### Finish Reason

The reason why the sequence finished. See
[here](https://github.com/vllm-project/vllm/blob/v0.6.3.post1/vllm/outputs.py#L26)
for more details.

To enable, set the `output_finish_reason` input tensor to `True`. The reason
will be sent as a string on the `finish_reason` output tensor.

Supported since r24.11.

### Cumulative Log Probabilities

The cumulative log probability of the generated output text. See
[here](https://github.com/vllm-project/vllm/blob/v0.6.3.post1/vllm/outputs.py#L22)
for more details.

To enable, set the `output_cumulative_logprob` input tensor to `True`. The
floating-point value will be sent on the `cumulative_logprob` output tensor.

Supported since r24.11.
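For instance, following the pattern of the finish-reason example below, the request only needs one extra boolean input tensor. This is a minimal sketch showing just the lines specific to this output; the rest of the request setup is unchanged:

```python
import numpy as np
import tritonclient.grpc as grpcclient

# Ask for the cumulative log probability on this request.
inputs = [grpcclient.InferInput("output_cumulative_logprob", [1], "BOOL")]
inputs[-1].set_data_from_numpy(np.array([True], dtype=bool))

# In the stream callback, read the value from the `cumulative_logprob`
# output tensor of each response.
def callback(result, error):
    print(result.as_numpy(name="cumulative_logprob"))
```
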
### Number of Token IDs

The number of token IDs of the generated output text sent on this response. It
is the difference in the length of the generated token IDs between the previous
response and this response. If this is the first response, the previous length
is presumed to be zero. See
[here](https://github.com/vllm-project/vllm/blob/v0.6.3.post1/vllm/outputs.py#L21)
for more details on the token IDs of the generated output text.

To enable, set the `output_num_token_ids` input tensor to `True`. The unsigned
integer value will be sent on the `num_token_ids` output tensor.

Supported since r24.11.
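Because each response reports only the tokens added since the previous response, summing the per-response values yields the total number of generated tokens for the request. This is a minimal sketch of that accumulation, reusing the streaming-callback pattern from the example below; the aggregation itself is illustrative and not part of the backend:

```python
import numpy as np
import tritonclient.grpc as grpcclient

# Ask for the per-response token count on this request.
inputs = [grpcclient.InferInput("output_num_token_ids", [1], "BOOL")]
inputs[-1].set_data_from_numpy(np.array([True], dtype=bool))

total_token_ids = 0

def callback(result, error):
    global total_token_ids
    # Each response carries only the newly generated token count, so the
    # running sum is the total for the request so far.
    total_token_ids += int(result.as_numpy(name="num_token_ids")[0])
```
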
## Examples

### Add Finish Reason to Outputs

```python
import numpy as np
import tritonclient.grpc as grpcclient

inputs = []

# The usual prompt input.
inputs.append(grpcclient.InferInput("text_input", [1], "BYTES"))
inputs[-1].set_data_from_numpy(
    np.array(["example prompt".encode("utf-8")], dtype=np.object_)
)

# Request the finish reason as an additional output.
inputs.append(grpcclient.InferInput("output_finish_reason", [1], "BOOL"))
inputs[-1].set_data_from_numpy(np.array([True], dtype=bool))

def callback(result, error):
    ...
    # The reason is returned as a string on the `finish_reason` output tensor.
    print(result.as_numpy(name="finish_reason"))

with grpcclient.InferenceServerClient("localhost:8001") as client:
    client.start_stream(callback)
    client.async_stream_infer("vLLM_model_name", inputs=inputs, ...)
    client.stop_stream()
```
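
### Request All Additional Outputs

All additional outputs can be requested together on a single streaming request by adding one boolean input tensor per output. This is a minimal sketch combining the tensor names documented above with the same client flow as the previous example; the server address, model name, and prompt are placeholders:

```python
import numpy as np
import tritonclient.grpc as grpcclient

inputs = []

inputs.append(grpcclient.InferInput("text_input", [1], "BYTES"))
inputs[-1].set_data_from_numpy(
    np.array(["example prompt".encode("utf-8")], dtype=np.object_)
)

# One boolean input tensor per additional output.
for name in ["output_finish_reason", "output_cumulative_logprob", "output_num_token_ids"]:
    inputs.append(grpcclient.InferInput(name, [1], "BOOL"))
    inputs[-1].set_data_from_numpy(np.array([True], dtype=bool))

def callback(result, error):
    if error is not None:
        return
    # Each enabled output tensor is set on every response of the request.
    print(result.as_numpy(name="finish_reason"))
    print(result.as_numpy(name="cumulative_logprob"))
    print(result.as_numpy(name="num_token_ids"))

with grpcclient.InferenceServerClient("localhost:8001") as client:
    client.start_stream(callback)
    client.async_stream_infer("vLLM_model_name", inputs=inputs)
    client.stop_stream()
```
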
## Notes
* Enabling additional outputs may impact performance; only request additional
  outputs when necessary.
