|
1 | 1 | <!-- |
2 | | -# Copyright 2018-2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved. |
| 2 | +# Copyright 2018-2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved. |
3 | 3 | # |
4 | 4 | # Redistribution and use in source and binary forms, with or without |
5 | 5 | # modification, are permitted provided that the following conditions |
@@ -183,6 +183,67 @@ performance, you can use |
183 | 183 | [Model Analyzer](https://github.com/triton-inference-server/model_analyzer) |
184 | 184 | to find the optimal model configurations. |
185 | 185 |
|
| 186 | +## Managing Memory Usage in Ensemble Models |
| 187 | + |
| 188 | +An *inflight request* is an intermediate request generated by an upstream model that is queued and held in memory until a downstream model in the ensemble pipeline processes it. When upstream models process requests significantly faster than downstream models, these inflight requests can accumulate and lead to unbounded memory growth. This problem occurs whenever there is a speed mismatch between pipeline steps, and it is particularly common with *decoupled models* that produce multiple responses per request faster than downstream models can consume them.
| 189 | + |
| 190 | +Consider an example ensemble model with two steps where the upstream model is 10× faster: |
| 191 | +1. **Preprocessing model**: Produces 100 preprocessed requests/sec |
| 192 | +2. **Inference model**: Consumes 10 requests/sec |
| 193 | + |
| 194 | +Without backpressure, requests accumulate in the pipeline faster than they can be processed, eventually leading to out-of-memory errors. |
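The effect of backpressure can be sketched with a bounded queue standing in for the per-step inflight limit. This is a minimal simulation, not Triton code; the step names and the request count mirror the example above and are purely illustrative:

```python
import threading
import queue

INFLIGHT_LIMIT = 16   # analogous to max_inflight_requests
TOTAL_REQUESTS = 100  # the fast preprocessing step emits 100 requests

# A bounded queue blocks the producer once INFLIGHT_LIMIT items are
# waiting, which is the backpressure max_inflight_requests provides.
inflight = queue.Queue(maxsize=INFLIGHT_LIMIT)
peak_queued = 0
processed = []

def preprocess():
    # Fast upstream step: put() blocks while the queue is full.
    for i in range(TOTAL_REQUESTS):
        inflight.put(i)

def infer():
    # Slow downstream step: drains the queue one request at a time.
    global peak_queued
    for _ in range(TOTAL_REQUESTS):
        peak_queued = max(peak_queued, inflight.qsize())
        processed.append(inflight.get())

producer = threading.Thread(target=preprocess)
consumer = threading.Thread(target=infer)
producer.start(); consumer.start()
producer.join(); consumer.join()
```

After the run, the producer was never more than `INFLIGHT_LIMIT` requests ahead of the consumer, so intermediate memory stays bounded regardless of the speed mismatch.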
| 195 | + |
| 196 | +The `max_inflight_requests` field in the ensemble configuration sets a limit on the number of concurrent inflight requests permitted at each ensemble step for a single inference request. |
| 197 | +When this limit is reached, faster upstream models are paused (blocked) until downstream models finish processing, effectively preventing unbounded memory growth. |
| 198 | + |
| 199 | +``` |
| 200 | +ensemble_scheduling { |
| 201 | + max_inflight_requests: 16 |
| 202 | +
|
| 203 | + step [ |
| 204 | + { |
| 205 | + model_name: "dali_preprocess" |
| 206 | + model_version: -1 |
| 207 | + input_map { key: "RAW_IMAGE", value: "IMAGE" } |
| 208 | + output_map { key: "PREPROCESSED_IMAGE", value: "preprocessed" } |
| 209 | + }, |
| 210 | + { |
| 211 | + model_name: "onnx_inference" |
| 212 | + model_version: -1 |
| 213 | + input_map { key: "INPUT", value: "preprocessed" } |
| 214 | + output_map { key: "OUTPUT", value: "RESULT" } |
| 215 | + } |
| 216 | + ] |
| 217 | +} |
| 218 | +``` |
| 219 | + |
| 220 | +**Configuration:** |
| 221 | +* **`max_inflight_requests: 16`**: For each ensemble request (not globally), at most 16 requests from `dali_preprocess`
| 222 | +  can be queued awaiting processing by `onnx_inference`. Once this per-step limit is reached, `dali_preprocess` is blocked until the downstream step completes a response.
| 223 | +* **Default (`0`)**: No limit - allows unlimited inflight requests (original behavior). |
| 224 | + |
| 225 | +### When to Use This Feature |
| 226 | + |
| 227 | +Use `max_inflight_requests` when your ensemble pipeline includes: |
| 228 | +* **Streaming or decoupled models**: When models produce multiple responses per request more quickly than downstream steps can process them. |
| 229 | +* **Memory constraints**: Risk of unbounded memory growth from accumulating requests. |
| 230 | + |
| 231 | +### Choosing the Right Value |
| 232 | + |
| 233 | +The optimal value depends on your specific deployment, including batch size, request rate, available memory, and throughput. |
| 234 | + |
| 235 | +* **Too low**: The producer step is blocked too often, which underutilizes faster models. |
| 236 | +* **Too high**: Memory usage increases, diminishing the effectiveness of backpressure. |
| 237 | +* **Recommendation**: Start with a small value and adjust it based on memory usage and throughput monitoring. |
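The limit also translates directly into a worst-case memory bound for each step, which can guide the starting value. A quick back-of-the-envelope sketch, where the 4 MiB per-request figure is purely an assumption for illustration:

```python
# Hypothetical sizing: intermediate tensor size per inflight request.
bytes_per_request = 4 * 1024 * 1024   # assume a 4 MiB preprocessed image
max_inflight = 16                     # candidate max_inflight_requests value

# Worst-case memory held by queued intermediate requests at this step.
memory_bound = max_inflight * bytes_per_request
print(memory_bound // (1024 * 1024), "MiB")  # 64 MiB
```

Working backward from available memory in the same way gives an upper bound on the value to try.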
| 238 | + |
| 239 | +### Performance Considerations |
| 240 | + |
| 241 | +* **Zero overhead when disabled**: If `max_inflight_requests: 0` (default), |
| 242 | + no synchronization overhead is incurred. |
| 243 | +* **Minimal overhead when enabled**: Uses a blocking/wakeup mechanism per ensemble step, where upstream models are paused ("blocked") when the inflight requests limit is reached and resumed ("woken up") as downstream models complete processing them. This synchronization ensures memory usage stays within bounds, though it may increase latency. |
| 244 | + |
| 245 | + **Note**: This blocking does not cancel or internally time out intermediate requests, but clients may experience increased end-to-end latency. |
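A minimal sketch of this blocking/wakeup pattern, using a counting semaphore in place of Triton's internal per-step counter. All names here are illustrative, not part of the Triton API:

```python
import threading
import time

class InflightLimiter:
    """Sketch of a per-step inflight counter: dispatch blocks at the
    limit and is woken up when a downstream completion frees a slot."""

    def __init__(self, max_inflight):
        self._slots = threading.Semaphore(max_inflight)

    def dispatch(self):
        # Blocks the upstream step once max_inflight requests are pending.
        self._slots.acquire()

    def complete(self):
        # A downstream completion wakes up one blocked upstream dispatch.
        self._slots.release()

limiter = InflightLimiter(max_inflight=2)
inflight_now = 0
peak_inflight = 0
lock = threading.Lock()

def handle(request_id):
    global inflight_now, peak_inflight
    limiter.dispatch()
    with lock:
        inflight_now += 1
        peak_inflight = max(peak_inflight, inflight_now)
    time.sleep(0.01)  # stand-in for the slow downstream model
    with lock:
        inflight_now -= 1
    limiter.complete()

threads = [threading.Thread(target=handle, args=(i,)) for i in range(8)]
for t in threads: t.start()
for t in threads: t.join()
```

Even with eight concurrent requests, at most two are ever inflight at once; the rest wait in `dispatch()` until a slot is released, which is the memory/latency trade-off described above.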
| 246 | + |
186 | 247 | ## Additional Resources |
187 | 248 |
|
188 | 249 | You can find additional end-to-end ensemble examples in the links below: |
|