
Commit c8a1bca

pskiran1 and yinggeh authored
ci: Add support for max_inflight_requests parameter to prevent unbounded memory growth in ensemble models (#8458)
Co-authored-by: Yingge He <[email protected]>
1 parent ebf8bff commit c8a1bca

File tree

7 files changed: +671 −7 lines changed

docs/user_guide/decoupled_models.md

Lines changed: 10 additions & 2 deletions
```diff
@@ -1,5 +1,5 @@
 <!--
-# Copyright 2022-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# Copyright 2022-2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 #
 # Redistribution and use in source and binary forms, with or without
 # modification, are permitted provided that the following conditions
@@ -95,7 +95,15 @@ your application should be cognizant that the callback function you registered w
 `TRITONSERVER_InferenceRequestSetResponseCallback` can be invoked any number of times,
 each time with a new response. You can take a look at [grpc_server.cc](https://github.com/triton-inference-server/server/blob/main/src/grpc/grpc_server.cc)
 
-### Knowing When a Decoupled Inference Request is Complete
+### Using Decoupled Models in Ensembles
+
+When using decoupled models within an [ensemble pipeline](ensemble_models.md), you may encounter unbounded memory growth if the decoupled model produces responses faster than downstream models can consume them.
+
+To prevent unbounded memory growth in this scenario, consider using the `max_inflight_requests` configuration field. This field limits the maximum number of concurrent inflight requests permitted at each ensemble step for each inference request.
+
+For more details and examples, see [Managing Memory Usage in Ensemble Models](ensemble_models.md#managing-memory-usage-in-ensemble-models).
+
+## Knowing When a Decoupled Inference Request is Complete
 
 An inference request is considered complete when a response containing the
 `TRITONSERVER_RESPONSE_COMPLETE_FINAL` flag is received from a model/backend.
```
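As a concrete illustration of the documentation changes above, the sketch below streams a request to a decoupled model with `tritonclient.grpc` and stops once the completion signal arrives. This is an illustrative sketch, not part of the commit: the model name `decoupled_producer` and the URL are placeholders, and it assumes the server attaches the `triton_final_response` parameter to streamed responses (the default for decoupled models over gRPC streaming).

```python
import queue
from functools import partial

import numpy as np
import tritonclient.grpc as grpcclient


def callback(responses, result, error):
    # start_stream() invokes this for every response (or error) on the stream.
    responses.put((result, error))


responses = queue.Queue()
client = grpcclient.InferenceServerClient(url="localhost:8001")
client.start_stream(callback=partial(callback, responses))

# Ask the (placeholder) decoupled model for 4 responses.
inputs = [grpcclient.InferInput("IN", [1], "INT32")]
inputs[0].set_data_from_numpy(np.array([4], dtype=np.int32))
client.async_stream_infer(model_name="decoupled_producer", inputs=inputs)

# Drain the stream until the final-response marker is seen.
while True:
    result, error = responses.get()
    if error is not None:
        raise error
    params = result.get_response().parameters
    is_final = (
        "triton_final_response" in params
        and params["triton_final_response"].bool_param
    )
    if is_final:
        break  # corresponds to TRITONSERVER_RESPONSE_COMPLETE_FINAL
    print(result.as_numpy("OUT"))

client.stop_stream()
```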

docs/user_guide/ensemble_models.md

Lines changed: 62 additions & 1 deletion
```diff
@@ -1,5 +1,5 @@
 <!--
-# Copyright 2018-2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# Copyright 2018-2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 #
 # Redistribution and use in source and binary forms, with or without
 # modification, are permitted provided that the following conditions
@@ -183,6 +183,67 @@ performance, you can use
 [Model Analyzer](https://github.com/triton-inference-server/model_analyzer)
 to find the optimal model configurations.
 
+## Managing Memory Usage in Ensemble Models
+
+An *inflight request* refers to an intermediate request generated by an upstream model that is queued and held in memory until it is processed by a downstream model within an ensemble pipeline. When upstream models process requests significantly faster than downstream models, these inflight requests can accumulate and potentially lead to unbounded memory growth. This problem occurs when there is a speed mismatch between different steps in the pipeline and is particularly common with *decoupled models* that produce multiple responses per request more quickly than downstream models can consume them.
+
+Consider an example ensemble model with two steps where the upstream model is 10× faster:
+1. **Preprocessing model**: Produces 100 preprocessed requests/sec
+2. **Inference model**: Consumes 10 requests/sec
+
+Without backpressure, requests accumulate in the pipeline faster than they can be processed, eventually leading to out-of-memory errors.
+
+The `max_inflight_requests` field in the ensemble configuration sets a limit on the number of concurrent inflight requests permitted at each ensemble step for a single inference request.
+When this limit is reached, faster upstream models are paused (blocked) until downstream models finish processing, effectively preventing unbounded memory growth.
+
+```
+ensemble_scheduling {
+  max_inflight_requests: 16
+
+  step [
+    {
+      model_name: "dali_preprocess"
+      model_version: -1
+      input_map { key: "RAW_IMAGE", value: "IMAGE" }
+      output_map { key: "PREPROCESSED_IMAGE", value: "preprocessed" }
+    },
+    {
+      model_name: "onnx_inference"
+      model_version: -1
+      input_map { key: "INPUT", value: "preprocessed" }
+      output_map { key: "OUTPUT", value: "RESULT" }
+    }
+  ]
+}
+```
+
+**Configuration:**
+* **`max_inflight_requests: 16`**: For each ensemble request (not globally), at most 16 requests from `dali_preprocess`
+can wait for `onnx_inference` to process. Once this per-step limit is reached, `dali_preprocess` is blocked until the downstream step completes a response.
+* **Default (`0`)**: No limit; allows unlimited inflight requests (original behavior).
+
+### When to Use This Feature
+
+Use `max_inflight_requests` when your ensemble pipeline includes:
+* **Streaming or decoupled models**: When models produce multiple responses per request more quickly than downstream steps can process them.
+* **Memory constraints**: Risk of unbounded memory growth from accumulating requests.
+
+### Choosing the Right Value
+
+The optimal value depends on your specific deployment, including batch size, request rate, available memory, and throughput.
+
+* **Too low**: The producer step is blocked too often, which underutilizes faster models.
+* **Too high**: Memory usage increases, diminishing the effectiveness of backpressure.
+* **Recommendation**: Start with a small value and adjust it based on memory usage and throughput monitoring.
+
+### Performance Considerations
+
+* **Zero overhead when disabled**: If `max_inflight_requests: 0` (default),
+no synchronization overhead is incurred.
+* **Minimal overhead when enabled**: Uses a blocking/wakeup mechanism per ensemble step, where upstream models are paused ("blocked") when the inflight request limit is reached and resumed ("woken up") as downstream models complete processing. This synchronization keeps memory usage within bounds, though it may increase latency.
+
+**Note**: This blocking does not cancel or internally time out intermediate requests, but clients may experience increased end-to-end latency.
+
 ## Additional Resources
 
 You can find additional end-to-end ensemble examples in the links below:
```
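The backpressure behavior documented in this new section can be pictured with a plain producer/consumer analogy. The following is a minimal Python sketch, not Triton code: a bounded queue stands in for the per-step inflight limit (`max_inflight_requests: 16`), and the sleep intervals mirror the 100 vs. 10 requests/sec rates from the example.

```python
import queue
import threading
import time

# A bounded queue plays the role of the per-step inflight-request limit.
inflight = queue.Queue(maxsize=16)
NUM_REQUESTS = 50


def producer():
    # ~100 requests/sec when not blocked; put() blocks once 16 items
    # are inflight, which is exactly the backpressure behavior.
    for i in range(NUM_REQUESTS):
        inflight.put(i)
        time.sleep(0.01)


def consumer():
    # ~10 requests/sec: the slow downstream step.
    for _ in range(NUM_REQUESTS):
        inflight.get()
        time.sleep(0.1)
        inflight.task_done()


threads = [threading.Thread(target=producer), threading.Thread(target=consumer)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# At no point do more than 16 items sit in the queue, so the memory
# held by intermediate requests stays bounded.
print("done")
```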
Lines changed: 54 additions & 0 deletions
@@ -0,0 +1,54 @@
```python
# Copyright 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions
# are met:
#  * Redistributions of source code must retain the above copyright
#    notice, this list of conditions and the following disclaimer.
#  * Redistributions in binary form must reproduce the above copyright
#    notice, this list of conditions and the following disclaimer in the
#    documentation and/or other materials provided with the distribution.
#  * Neither the name of NVIDIA CORPORATION nor the names of its
#    contributors may be used to endorse or promote products derived
#    from this software without specific prior written permission.
#
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR
# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.


import numpy as np
import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    """
    Decoupled model that produces N responses based on input value.
    """

    def execute(self, requests):
        for request in requests:
            # Get input - number of responses to produce
            in_tensor = pb_utils.get_input_tensor_by_name(request, "IN")
            count = in_tensor.as_numpy()[0]

            response_sender = request.get_response_sender()

            # Produce 'count' responses, each with 0.5 as the output value
            for i in range(count):
                out_tensor = pb_utils.Tensor("OUT", np.array([0.5], dtype=np.float32))
                response = pb_utils.InferenceResponse(output_tensors=[out_tensor])
                response_sender.send(response)

            # Send final flag
            response_sender.send(flags=pb_utils.TRITONSERVER_RESPONSE_COMPLETE_FINAL)

        return None
```
Lines changed: 58 additions & 0 deletions
@@ -0,0 +1,58 @@
```
# Copyright 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions
# are met:
#  * Redistributions of source code must retain the above copyright
#    notice, this list of conditions and the following disclaimer.
#  * Redistributions in binary form must reproduce the above copyright
#    notice, this list of conditions and the following disclaimer in the
#    documentation and/or other materials provided with the distribution.
#  * Neither the name of NVIDIA CORPORATION nor the names of its
#    contributors may be used to endorse or promote products derived
#    from this software without specific prior written permission.
#
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR
# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.


name: "decoupled_producer"
backend: "python"
max_batch_size: 0

input [
  {
    name: "IN"
    data_type: TYPE_INT32
    dims: [ 1 ]
  }
]

output [
  {
    name: "OUT"
    data_type: TYPE_FP32
    dims: [ 1 ]
  }
]

instance_group [
  {
    count: 1
    kind: KIND_CPU
  }
]

model_transaction_policy {
  decoupled: true
}
```
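The ensemble configuration in the next file wires this producer to a `slow_consumer` model whose source is not included in this excerpt. A plausible sketch of such a consumer, hypothetical and for illustration only, delays briefly and then echoes its input; the tensor names `INPUT0`/`OUTPUT0` match the ensemble's `input_map`/`output_map` below.

```python
import time

import numpy as np
import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    """Hypothetical slow consumer: sleeps, then echoes its input tensor."""

    def execute(self, requests):
        responses = []
        for request in requests:
            in_tensor = pb_utils.get_input_tensor_by_name(request, "INPUT0")
            time.sleep(0.1)  # simulate a downstream step that is ~10x slower
            out_tensor = pb_utils.Tensor(
                "OUTPUT0", in_tensor.as_numpy().astype(np.float32)
            )
            responses.append(
                pb_utils.InferenceResponse(output_tensors=[out_tensor])
            )
        return responses
```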
Lines changed: 75 additions & 0 deletions
@@ -0,0 +1,75 @@
```
# Copyright 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions
# are met:
#  * Redistributions of source code must retain the above copyright
#    notice, this list of conditions and the following disclaimer.
#  * Redistributions in binary form must reproduce the above copyright
#    notice, this list of conditions and the following disclaimer in the
#    documentation and/or other materials provided with the distribution.
#  * Neither the name of NVIDIA CORPORATION nor the names of its
#    contributors may be used to endorse or promote products derived
#    from this software without specific prior written permission.
#
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR
# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.


platform: "ensemble"
max_batch_size: 0

input [
  {
    name: "IN"
    data_type: TYPE_INT32
    dims: [ 1 ]
  }
]

output [
  {
    name: "OUT"
    data_type: TYPE_FP32
    dims: [ 1 ]
  }
]

ensemble_scheduling {
  step [
    {
      model_name: "decoupled_producer"
      model_version: -1
      input_map {
        key: "IN"
        value: "IN"
      }
      output_map {
        key: "OUT"
        value: "intermediate"
      }
    },
    {
      model_name: "slow_consumer"
      model_version: -1
      input_map {
        key: "INPUT0"
        value: "intermediate"
      }
      output_map {
        key: "OUTPUT0"
        value: "OUT"
      }
    }
  ]
}
```
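Note that the `ensemble_scheduling` block above does not set `max_inflight_requests`, so this pipeline runs with the default unbounded behavior. Following the documentation added in this same commit, enabling backpressure would amount to adding the field at the top of the block; a sketch (the value 8 is arbitrary):

```
ensemble_scheduling {
  # Cap concurrent inflight requests per step for each ensemble request.
  max_inflight_requests: 8

  step [
    # ... same decoupled_producer and slow_consumer steps as above ...
  ]
}
```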
