Commit 93702e4

feat: SLA Profiling and Recommending Parallelization Mapping (ai-dynamo#1114)
1 parent eb821be

File tree

6 files changed: +692 −4 lines

container/deps/requirements.txt

Lines changed: 1 addition & 0 deletions

@@ -19,6 +19,7 @@ ftfy
 grpcio-tools==1.66.0
 httpx
 kubernetes==32.0.1
+matplotlib
 msgspec
 mypy
 numpy
Two binary image files added (209 KB and 118 KB): the prefill and decode performance plots referenced in docs/planner.md.

docs/planner.md

Lines changed: 39 additions & 1 deletion
@@ -31,7 +31,7 @@ The planner is a component that monitors the state of the system and makes adjus
   * Disaggregated ✅
 * Planner actions:
   * Load-based scaling up/down prefill/decode workers ✅
-  * SLA-based scaling up/down prefill/decode workers
+  * SLA-based scaling up/down prefill/decode workers ✅ (with some limitations)
   * Adjusting engine knobs ❌

 ## Load-based Scaling Up/Down Prefill/Decode Workers
@@ -48,6 +48,44 @@ There are two additional rules set by planner to prevent over-compensation:
 1. After a new decode worker is added, since it needs time to populate the kv cache, planner will not scale down the number of decode workers in the next `NEW_DECODE_WORKER_GRACE_PERIOD=3` adjustment intervals.
 1. We do not scale up prefill workers if the prefill queue size is estimated to drop below the `--prefill-queue-scale-up-threshold` within the next `NEW_PREFILL_WORKER_QUEUE_BUFFER_PERIOD=3` adjustment intervals, following the trend observed in the current adjustment interval.
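The second over-compensation rule can be sketched as follows. This is a minimal illustration, assuming the planner extrapolates the trend of the current adjustment interval linearly; the function and variable names are hypothetical, not the planner's actual API:

```python
def should_scale_up_prefill(prev_queue_size: float,
                            curr_queue_size: float,
                            scale_up_threshold: float,
                            buffer_period: int = 3) -> bool:
    """Decide whether to add a prefill worker.

    Extrapolates the queue-size trend observed in the current adjustment
    interval over the next `buffer_period` intervals; if the queue is
    projected to fall below the threshold on its own, scaling up is
    skipped to avoid over-compensation.
    """
    if curr_queue_size < scale_up_threshold:
        return False  # queue is already below the scale-up threshold
    trend = curr_queue_size - prev_queue_size  # change per interval
    projected = curr_queue_size + trend * buffer_period
    return projected >= scale_up_threshold
```

For example, a queue at 0.9 that shrank from 1.5 in the last interval is projected to drain on its own, so no new worker is added.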
51+
## Comply with SLA
52+
53+
To ensure dynamo serve complies with the SLA, we provide a pre-deployment script to profile the model performance with different parallelization mappings and recommend the parallelization mapping for prefill and decode workers and planner configurations. To use this script, the user needs to provide the target ISL, OSL, TTFT SLA, and ITL SLA.
54+
55+
> [!NOTE]
56+
> Currently, the script considers a fixed ISL/OSL without KV cache reuse. If the real ISL/OSL has a large variance or a significant amount of KV cache can be reused, the result might be inaccurate.
57+
> Currently, we assume there is no piggy-backed prefill requests in the decode engine. Even if there are some short piggy-backed prefill requests in the decode engine, it should not affect the ITL too much in most conditions. However, if the piggy-backed prefill requests are too much, the ITL might be inaccurate.
58+
59+
```bash
60+
python -m utils.profile_sla \
61+
--config <path-to-dynamo-config-file> \
62+
--output-dir <path-to-profile-results-dir> \
63+
--isl <target-isl> \
64+
--osl <target-osl> \
65+
--ttft <target-ttft-(ms)> \
66+
--itl <target-itl-(ms)>
67+
```
68+
The script first detects the number of GPUs available on the current node (multi-node engines are not supported yet). It then profiles prefill and decode performance under different TP sizes. For prefill, since there is no in-flight batching (assuming the ISL is long enough to saturate the GPU), the script directly measures the TTFT of a request with the given ISL, without KV reuse. For decode, since the ITL (or iteration time) depends on how many requests are in flight, the script measures the ITL under different numbers of in-flight requests, ranging from 1 to the maximum number of requests the engine's KV cache can hold. To measure the ITL without interference from piggybacked prefill requests, the script enables KV reuse and warms up the engine by issuing the same prompts before measuring. Since the KV cache is large enough for all the requests, it holds the KV cache of the pre-computed prompts and skips the prefill phase during measurement.
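The upper end of the in-flight request sweep described above can be estimated with a quick sketch. This is a hypothetical helper, assuming each request occupies roughly ISL + OSL tokens of KV cache; the real script may account for block granularity and engine overheads:

```python
def max_inflight_requests(kv_cache_size_tokens: int, isl: int, osl: int) -> int:
    """Upper end of the in-flight request sweep: the number of requests
    whose full KV footprint (input + output tokens) fits in the engine's
    KV cache at once."""
    tokens_per_request = isl + osl
    return max(1, kv_cache_size_tokens // tokens_per_request)

# Sweep from 1 concurrent request up to the KV-cache-bound maximum.
sweep = range(1, max_inflight_requests(1_000_000, 3000, 150) + 1)
```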
After the profiling finishes, two plots will be generated in the `output-dir`. For example, here are the profiling results for `examples/llm/configs/disagg.yaml`:

![Prefill Performance](images/h100_prefill_performance.png)
![Decode Performance](images/h100_decode_performance.png)
For the prefill performance, the script plots the TTFT for different TP sizes and selects the best TP size that meets the target TTFT SLA and delivers the best throughput per GPU. Based on how close the TTFT of the selected TP size is to the SLA, the script also recommends the upper and lower bounds of the prefill queue size to be used in the planner.

For the decode performance, the script plots the ITL for different TP sizes and numbers of in-flight requests. Similarly, it selects the best point that satisfies the ITL SLA and delivers the best throughput per GPU, and recommends the upper and lower bounds of the kv cache utilization rate to be used in the planner.
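The decode selection step amounts to picking the SLA-feasible profiled point with the best per-GPU throughput. A simplified sketch (the real script's data structures and field names may differ):

```python
def pick_best_decode_config(profile, itl_sla_ms):
    """profile: list of dicts with keys 'tp', 'inflight', 'itl_ms',
    'tokens_per_s_per_gpu'. Returns the SLA-compliant point with the
    best per-GPU throughput, or None if nothing meets the SLA."""
    feasible = [p for p in profile if p["itl_ms"] <= itl_sla_ms]
    if not feasible:
        return None  # no configuration meets the ITL SLA
    return max(feasible, key=lambda p: p["tokens_per_s_per_gpu"])

# Hypothetical profiled points (illustrative numbers only).
profile = [
    {"tp": 1, "inflight": 8,  "itl_ms": 12.0, "tokens_per_s_per_gpu": 90.0},
    {"tp": 2, "inflight": 32, "itl_ms": 6.1,  "tokens_per_s_per_gpu": 70.0},
    {"tp": 4, "inflight": 64, "itl_ms": 4.8,  "tokens_per_s_per_gpu": 51.0},
]
best = pick_best_decode_config(profile, itl_sla_ms=7.0)  # picks tp=2
```

Note that the highest-throughput point (tp=1 here) is rejected because it violates the ITL SLA.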
The following information will be printed out in the terminal:

```
2025-05-16 15:20:24 - __main__ - INFO - Analyzing results and generate recommendations...
2025-05-16 15:20:24 - __main__ - INFO - Suggested prefill TP:4 (TTFT 48.37 ms, throughput 15505.23 tokens/s/GPU)
2025-05-16 15:20:24 - __main__ - INFO - Suggested planner upper/lower bound for prefill queue size: 0.24/0.10
2025-05-16 15:20:24 - __main__ - INFO - Suggested decode TP:4 (ITL 4.83 ms, throughput 51.22 tokens/s/GPU)
2025-05-16 15:20:24 - __main__ - INFO - Suggested planner upper/lower bound for decode kv cache utilization: 0.10/0.2
```
## Usage
The planner is started automatically as part of Dynamo pipelines when running `dynamo serve`. You can configure the planner just as you would any other component in your pipeline, either via YAML configuration or through CLI arguments.

examples/llm/components/planner.py

Lines changed: 5 additions & 3 deletions

@@ -232,7 +232,9 @@ async def make_adjustments(self):
         )
         logger.info(f"Current engines use {curr_gpu_usage} GPUs")

-        avg_prefill_queue_load = np.mean(self.prefill_queue_load)
+        avg_prefill_queue_load = np.mean(self.prefill_queue_load) / len(
+            self.p_endpoints
+        )
         avg_kv_load = np.mean(self.kv_load)
         # first check if we need to scale down any workers
         if (
@@ -467,13 +469,13 @@ async def start_planner(runtime: DistributedRuntime, args: argparse.Namespace):
         "--prefill-queue-scale-up-threshold",
         type=float,
         default=PlannerDefaults.prefill_queue_scale_up_threshold,
-        help="Queue utilization threshold to scale up prefill workers",
+        help="Queue utilization threshold to scale up prefill workers; this threshold is per prefill worker",
     )
     parser.add_argument(
         "--prefill-queue-scale-down-threshold",
         type=float,
         default=PlannerDefaults.prefill_queue_scale_down_threshold,
-        help="Queue utilization threshold to scale down prefill workers",
+        help="Queue utilization threshold to scale down prefill workers; this threshold is per prefill worker",
     )
     parser.add_argument(
         "--decode-engine-num-gpu",
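The planner.py change above divides the averaged queue load by the number of prefill endpoints so it can be compared against the per-worker thresholds. The idea can be sketched as a standalone helper (a hypothetical simplification of the planner's actual check):

```python
import numpy as np

def avg_queue_load_per_worker(queue_load_samples, num_prefill_workers):
    """Normalize the interval-averaged prefill queue load by the number
    of prefill workers, so it can be compared against the per-worker
    --prefill-queue-scale-up/down-threshold values."""
    return float(np.mean(queue_load_samples)) / num_prefill_workers
```

For example, with 2 prefill workers and queue-load samples [4, 6], the per-worker load is 2.5; without the normalization, the same total load would look twice as high once a second worker is added.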
