|
1 | 1 | <!-- |
2 | | -# Copyright 2018-2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved. |
| 2 | +# Copyright 2018-2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved. |
3 | 3 | # |
4 | 4 | # Redistribution and use in source and binary forms, with or without |
5 | 5 | # modification, are permitted provided that the following conditions |
@@ -183,6 +183,67 @@ performance, you can use |
183 | 183 | [Model Analyzer](https://github.com/triton-inference-server/model_analyzer) |
184 | 184 | to find the optimal model configurations. |
185 | 185 |
|
| 186 | +## Managing Memory Usage in Ensemble Models |
| 187 | + |
| 188 | +An *inflight request* is an intermediate request generated by an upstream model that is queued and held in memory until a downstream model in the ensemble pipeline processes it. When upstream models process requests significantly faster than downstream models, these inflight requests can accumulate and lead to unbounded memory growth. This problem occurs whenever there is a speed mismatch between pipeline steps, and it is particularly common with *decoupled models* that produce multiple responses per request faster than downstream models can consume them.
| 189 | + |
| 190 | +Consider an example ensemble model with two steps where the upstream model is 10× faster: |
| 191 | +1. **Preprocessing model**: Produces 100 preprocessed requests/sec |
| 192 | +2. **Inference model**: Consumes 10 requests/sec |
| 193 | + |
| 194 | +Without backpressure, requests accumulate in the pipeline faster than they can be processed, eventually leading to out-of-memory errors. |
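The effect of backpressure can be sketched with a bounded queue standing in for the per-step inflight limit. This is a minimal simulation, not Triton code; the step names and the request count mirror the example above and are purely illustrative:

```python
import threading
import queue

INFLIGHT_LIMIT = 16   # analogous to max_inflight_requests
TOTAL_REQUESTS = 100  # the fast preprocessing step emits 100 requests

# A bounded queue blocks the producer once INFLIGHT_LIMIT items are
# waiting, which is the backpressure max_inflight_requests provides.
inflight = queue.Queue(maxsize=INFLIGHT_LIMIT)
peak_queued = 0
processed = []

def preprocess():
    # Fast upstream step: put() blocks while the queue is full.
    for i in range(TOTAL_REQUESTS):
        inflight.put(i)

def infer():
    # Slow downstream step: drains the queue one request at a time.
    global peak_queued
    for _ in range(TOTAL_REQUESTS):
        peak_queued = max(peak_queued, inflight.qsize())
        processed.append(inflight.get())

producer = threading.Thread(target=preprocess)
consumer = threading.Thread(target=infer)
producer.start(); consumer.start()
producer.join(); consumer.join()
```

After the run, the producer was never more than `INFLIGHT_LIMIT` requests ahead of the consumer, so intermediate memory stays bounded regardless of the speed mismatch.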
| 195 | + |
| 196 | +The `max_inflight_requests` field in the ensemble configuration sets a limit on the number of concurrent inflight requests permitted at each ensemble step for a single inference request. |
| 197 | +When this limit is reached, faster upstream models are paused (blocked) until downstream models finish processing, effectively preventing unbounded memory growth. |
| 198 | + |
| 199 | +``` |
| 200 | +ensemble_scheduling { |
| 201 | + max_inflight_requests: 16 |
| 202 | +
|
| 203 | + step [ |
| 204 | + { |
| 205 | + model_name: "dali_preprocess" |
| 206 | + model_version: -1 |
| 207 | + input_map { key: "RAW_IMAGE", value: "IMAGE" } |
| 208 | + output_map { key: "PREPROCESSED_IMAGE", value: "preprocessed" } |
| 209 | + }, |
| 210 | + { |
| 211 | + model_name: "onnx_inference" |
| 212 | + model_version: -1 |
| 213 | + input_map { key: "INPUT", value: "preprocessed" } |
| 214 | + output_map { key: "OUTPUT", value: "RESULT" } |
| 215 | + } |
| 216 | + ] |
| 217 | +} |
| 218 | +``` |
| 219 | + |
| 220 | +**Configuration:** |
| 221 | +* **`max_inflight_requests: 16`**: For each ensemble request (not globally), at most 16 requests from `dali_preprocess`
| 222 | +  can be queued awaiting processing by `onnx_inference`. Once this per-step limit is reached, `dali_preprocess` is blocked until the downstream step completes a response.
| 223 | +* **Default (`0`)**: No limit - allows unlimited inflight requests (original behavior). |
| 224 | + |
| 225 | +### When to Use This Feature |
| 226 | + |
| 227 | +Use `max_inflight_requests` when your ensemble pipeline includes: |
| 228 | +* **Streaming or decoupled models**: When models produce multiple responses per request more quickly than downstream steps can process them. |
| 229 | +* **Memory constraints**: Risk of unbounded memory growth from accumulating requests. |
| 230 | + |
| 231 | +### Choosing the Right Value |
| 232 | + |
| 233 | +The optimal value depends on your specific deployment, including batch size, request rate, available memory, and throughput. |
| 234 | + |
| 235 | +* **Too low**: The producer step is blocked too often, which underutilizes faster models. |
| 236 | +* **Too high**: Memory usage increases, diminishing the effectiveness of backpressure. |
| 237 | +* **Recommendation**: Start with a small value and adjust it based on memory usage and throughput monitoring. |
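The limit also translates directly into a worst-case memory bound for each step, which can guide the starting value. A quick back-of-the-envelope sketch, where the 4 MiB per-request figure is purely an assumption for illustration:

```python
# Hypothetical sizing: intermediate tensor size per inflight request.
bytes_per_request = 4 * 1024 * 1024   # assume a 4 MiB preprocessed image
max_inflight = 16                     # candidate max_inflight_requests value

# Worst-case memory held by queued intermediate requests at this step.
memory_bound = max_inflight * bytes_per_request
print(memory_bound // (1024 * 1024), "MiB")  # 64 MiB
```

Working backward from available memory in the same way gives an upper bound on the value to try.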
| 238 | + |
| 239 | +### Performance Considerations |
| 240 | + |
| 241 | +* **Zero overhead when disabled**: If `max_inflight_requests: 0` (default), |
| 242 | + no synchronization overhead is incurred. |
| 243 | +* **Minimal overhead when enabled**: Uses a blocking/wakeup mechanism per ensemble step, where upstream models are paused ("blocked") when the inflight requests limit is reached and resumed ("woken up") as downstream models complete processing them. This synchronization ensures memory usage stays within bounds, though it may increase latency. |
| 244 | + |
| 245 | + **Note**: This blocking does not cancel or internally time out intermediate requests, but clients may experience increased end-to-end latency. |
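A minimal sketch of this blocking/wakeup pattern, using a counting semaphore in place of Triton's internal per-step counter. All names here are illustrative, not part of the Triton API:

```python
import threading
import time

class InflightLimiter:
    """Sketch of a per-step inflight counter: dispatch blocks at the
    limit and is woken up when a downstream completion frees a slot."""

    def __init__(self, max_inflight):
        self._slots = threading.Semaphore(max_inflight)

    def dispatch(self):
        # Blocks the upstream step once max_inflight requests are pending.
        self._slots.acquire()

    def complete(self):
        # A downstream completion wakes up one blocked upstream dispatch.
        self._slots.release()

limiter = InflightLimiter(max_inflight=2)
inflight_now = 0
peak_inflight = 0
lock = threading.Lock()

def handle(request_id):
    global inflight_now, peak_inflight
    limiter.dispatch()
    with lock:
        inflight_now += 1
        peak_inflight = max(peak_inflight, inflight_now)
    time.sleep(0.01)  # stand-in for the slow downstream model
    with lock:
        inflight_now -= 1
    limiter.complete()

threads = [threading.Thread(target=handle, args=(i,)) for i in range(8)]
for t in threads: t.start()
for t in threads: t.join()
```

Even with eight concurrent requests, at most two are ever inflight at once; the rest wait in `dispatch()` until a slot is released, which is the memory/latency trade-off described above.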
| 246 | + |
186 | 247 | ## Additional Resources |
187 | 248 |
|
188 | 249 | You can find additional end-to-end ensemble examples in the links below: |
|