
Commit b4e5df0

Breaking change: perf: Enable scheduling overlap by default (#4174)
Signed-off-by: Kaiyu Xie <[email protected]>
1 parent 404fbe9 commit b4e5df0
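
With this change, the PyTorch backend runs the overlap scheduler unless it is explicitly turned off, and the old opt-in flag is removed. A minimal sketch of the new opt-out; the `PyTorchConfig` import path is assumed from the example files touched below:

```python
# Import path assumed from the examples in this commit; verify against your install.
from tensorrt_llm._torch.pyexecutor.config import PyTorchConfig

# Before: overlap scheduling had to be requested with enable_overlap_scheduler=True.
# After: it is the default, so a plain config already overlaps scheduling with execution.
config_default = PyTorchConfig()

# Opt out only where overlap must stay off, e.g. disaggregated context servers (see below).
config_no_overlap = PyTorchConfig(disable_overlap_scheduler=True)
```

Downstream code that still passes `enable_overlap_scheduler` must be updated, as the file changes below show.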

File tree

54 files changed: +110 −127 lines

Some content is hidden: large commits do not show every changed file by default.


docs/source/blogs/Best_perf_practice_on_DeepSeek-R1_in_TensorRT-LLM.md

Lines changed: 0 additions & 4 deletions

@@ -135,7 +135,6 @@ YOUR_DATA_PATH=<your dataset file following the format>
 
 cat >./extra-llm-api-config.yml<<EOF
 pytorch_backend_config:
-  enable_overlap_scheduler: true
   use_cuda_graph: true
   moe_backend: TRTLLM
 speculative_config:
@@ -218,7 +217,6 @@ pytorch_backend_config:
   - 256
   - 384
   print_iter_log: true
-  enable_overlap_scheduler: true
   enable_attention_dp: true
 EOF
 
@@ -260,7 +258,6 @@ YOUR_DATA_PATH=<your dataset file following the format>
 
 cat >./extra-llm-api-config.yml<<EOF
 pytorch_backend_config:
-  enable_overlap_scheduler: true
   use_cuda_graph: true
 speculative_config:
   decoding_type: MTP
@@ -314,7 +311,6 @@ pytorch_backend_config:
   use_cuda_graph: true
   cuda_graph_batch_sizes:
   - 128
-  enable_overlap_scheduler: true
   enable_attention_dp: true
 EOF
 
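
The documentation changes above simply delete `enable_overlap_scheduler: true` from the YAML snippets, since that is now the default. Existing `extra-llm-api-config.yml` files can be migrated the same way: drop the old key when it was `true`, or replace it with `disable_overlap_scheduler: true` when it was `false`. A hedged helper sketch (the use of PyYAML and the in-place rewrite are illustrative, not part of this commit):

```python
# Hypothetical migration helper, not part of this commit.
import yaml  # PyYAML

def migrate_overlap_key(path: str) -> None:
    """Rewrite a config file for the renamed overlap-scheduler option."""
    with open(path) as f:
        cfg = yaml.safe_load(f) or {}
    backend_cfg = cfg.get("pytorch_backend_config") or {}
    if "enable_overlap_scheduler" in backend_cfg:
        was_enabled = backend_cfg.pop("enable_overlap_scheduler")
        if not was_enabled:
            # Overlap scheduling is now on by default; only opting out needs a key.
            backend_cfg["disable_overlap_scheduler"] = True
        cfg["pytorch_backend_config"] = backend_cfg
    with open(path, "w") as f:
        yaml.safe_dump(cfg, f, sort_keys=False)

migrate_overlap_key("extra-llm-api-config.yml")
```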

examples/disaggregated/README.md

Lines changed: 2 additions & 2 deletions

@@ -9,7 +9,7 @@ You can use multiple `trtllm-serve` commands to launch the context and generatio
 for disaggregated serving. For example, you could launch two context servers and one generation servers as follows:
 
 ```
-echo -e "pytorch_backend_config:\n enable_overlap_scheduler: False\ncache_transceiver_config:\n max_num_tokens: 2048" > context_extra-llm-api-config.yml
+echo -e "pytorch_backend_config:\n disable_overlap_scheduler: True\ncache_transceiver_config:\n max_num_tokens: 2048" > context_extra-llm-api-config.yml
 echo -e "cache_transceiver_config:\n max_num_tokens: 2048" > gen_extra-llm-api-config.yml
 
 export TRTLLM_USE_UCX_KVCACHE=1
@@ -65,7 +65,7 @@ model: TinyLlama/TinyLlama-1.1B-Chat-v1.0
 backend: "pytorch"
 pytorch_backend_config:
   use_cuda_graph: False
-  enable_overlap_scheduler: False
+  disable_overlap_scheduler: True
 context_servers:
   num_instances: 1
   tensor_parallel_size: 1

examples/disaggregated/disagg_config.yaml

Lines changed: 1 addition & 1 deletion

@@ -5,7 +5,7 @@ free_gpu_memory_fraction: 0.25
 backend: "pytorch"
 pytorch_backend_config:
   use_cuda_graph: False
-  enable_overlap_scheduler: False
+  disable_overlap_scheduler: True
 context_servers:
   num_instances: 1
   tensor_parallel_size: 1

examples/llm-api/llm_inference_kv_events.py

Lines changed: 1 addition & 2 deletions

@@ -6,8 +6,7 @@
 
 
 def main():
-    pytorch_config = PyTorchConfig(enable_overlap_scheduler=True,
-                                   autotuner_enabled=False,
+    pytorch_config = PyTorchConfig(autotuner_enabled=False,
                                    kv_cache_dtype='auto')
 
     llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",

examples/llm-api/llm_mgmn_trtllm_bench.sh

Lines changed: 0 additions & 1 deletion

@@ -76,7 +76,6 @@ srun -l \
 cat > /tmp/pytorch_extra_args.txt << EOF
 pytorch_backend_config:
   use_cuda_graph: false
-  enable_overlap_scheduler: true
   cuda_graph_padding_enabled: false
   print_iter_log: true
   enable_attention_dp: false

examples/models/core/deepseek_v3/README.md

Lines changed: 4 additions & 5 deletions

@@ -21,7 +21,10 @@ Please refer to [this guide](https://nvidia.github.io/TensorRT-LLM/installation/
 - [Quick Start](#quick-start)
 - [Run a single inference](#run-a-single-inference)
 - [Multi-Token Prediction (MTP)](#multi-token-prediction-mtp)
+- [Relaxed acceptance](#relaxed-acceptance)
 - [Long context support](#long-context-support)
+- [ISL-64k-OSL-1024](#isl-64k-osl-1024)
+- [ISL-128k-OSL-1024](#isl-128k-osl-1024)
 - [Evaluation](#evaluation)
 - [Serving](#serving)
 - [Use trtllm-serve](#use-trtllm-serve)
@@ -36,6 +39,7 @@ Please refer to [this guide](https://nvidia.github.io/TensorRT-LLM/installation/
 - [FP8 KV Cache and MLA](#fp8-kv-cache-and-mla)
 - [W4AFP8](#w4afp8)
 - [Notes and Troubleshooting](#notes-and-troubleshooting)
+- [Known Issues](#known-issues)
 
 
 ## Hardware Requirements
@@ -136,7 +140,6 @@ python /app/tensorrt_llm/benchmarks/cpp/prepare_dataset.py \
 
 cat <<EOF > /tmp/extra-llm-api-config.yml
 pytorch_backend_config:
-  enable_overlap_scheduler: true
   use_cuda_graph: true
   cuda_graph_padding_enabled: true
   cuda_graph_batch_sizes: [1, 4, 8, 12]
@@ -165,7 +168,6 @@ python /app/tensorrt_llm/benchmarks/cpp/prepare_dataset.py \
 
 cat <<EOF > /tmp/extra-llm-api-config.yml
 pytorch_backend_config:
-  enable_overlap_scheduler: true
   use_cuda_graph: true
   cuda_graph_padding_enabled: true
   cuda_graph_batch_sizes: [1, 2]
@@ -192,7 +194,6 @@ Evaluate the model accuracy using `trtllm-eval`.
 cat >./extra-llm-api-config.yml <<EOF
 pytorch_backend_config:
   use_cuda_graph: true
-  enable_overlap_scheduler: true
   enable_attention_dp: true
 EOF
 ```
@@ -249,7 +250,6 @@ pytorch_backend_config:
   - 256
   - 384
   print_iter_log: true
-  enable_overlap_scheduler: true
   enable_attention_dp: true
 EOF
 
@@ -441,7 +441,6 @@ pytorch_backend_config:
   - 256
   - 384
   print_iter_log: true
-  enable_overlap_scheduler: true
   enable_attention_dp: true
 EOF
 ```

examples/models/core/qwen/README.md

Lines changed: 1 addition & 2 deletions

@@ -22,7 +22,7 @@ This document shows how to build and run a [Qwen](https://huggingface.co/Qwen) m
 - [Run a single inference](#run-a-single-inference)
 - [Evaluation](#evaluation)
 - [Serving](#serving)
-- [Notes and Troubleshooting](#notes-and-troubleshooting)
+- [Notes and Troubleshooting](#notes-and-troubleshooting)
 - [Credits](#credits)
 
 ## Overview
@@ -668,7 +668,6 @@ pytorch_backend_config:
   - 256
   - 384
   print_iter_log: true
-  enable_overlap_scheduler: true
   enable_attention_dp: true
 EOF
 

examples/pytorch/quickstart_advanced.py

Lines changed: 2 additions & 2 deletions

@@ -72,7 +72,7 @@ def add_llm_args(parser):
     parser.add_argument("--kv_cache_fraction", type=float, default=None)
 
     # Runtime
-    parser.add_argument('--enable_overlap_scheduler',
+    parser.add_argument('--disable_overlap_scheduler',
                         default=False,
                         action='store_true')
     parser.add_argument('--enable_chunked_prefill',
@@ -124,7 +124,7 @@ def parse_arguments():
 
 def setup_llm(args):
     pytorch_config = PyTorchConfig(
-        enable_overlap_scheduler=args.enable_overlap_scheduler,
+        disable_overlap_scheduler=args.disable_overlap_scheduler,
         kv_cache_dtype=args.kv_cache_dtype,
         attn_backend=args.attention_backend,
         use_cuda_graph=args.use_cuda_graph,
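
Scripts built on top of this example that want to keep an "enable"-style CLI flag can simply invert the value before it reaches `PyTorchConfig`. A hypothetical adaptation (the flag name, wrapper, and import path are illustrative, not part of this commit):

```python
# Hypothetical adaptation for a script that keeps an enable-style CLI flag.
import argparse

from tensorrt_llm._torch.pyexecutor.config import PyTorchConfig  # import path assumed

parser = argparse.ArgumentParser()
parser.add_argument("--enable_overlap_scheduler",
                    default=True,
                    action=argparse.BooleanOptionalAction)  # also adds --no-enable_overlap_scheduler
args = parser.parse_args()

# The renamed PyTorchConfig field takes the inverted value.
pytorch_config = PyTorchConfig(
    disable_overlap_scheduler=not args.enable_overlap_scheduler)
```

This keeps existing launch commands working while targeting the renamed option.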

examples/scaffolding/run_best_of_n_with_reward.py

Lines changed: 1 addition & 1 deletion

@@ -39,7 +39,7 @@ def main():
         max_batch_size=args.sample_num,
         max_num_tokens=8192,
         kv_cache_free_gpu_memory_fraction=0.2,
-        enable_overlap_scheduler=False)
+        disable_overlap_scheduler=True)
     workers[NativeGenerationController.WorkerTag.GENERATION] = gen_worker
     workers[QwenRewardController.WorkerTag.REWARD] = reward_worker
 

tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py

Lines changed: 1 addition & 1 deletion

@@ -302,7 +302,7 @@ def create_autodeploy_executor(
         model_engine=engine,
         decoder=decoder,
         dist=mpi_dist,
-        enable_overlap_scheduler=py_config.enable_overlap_scheduler,
+        disable_overlap_scheduler=py_config.disable_overlap_scheduler,
         max_input_len=executor_config.max_input_len,
         max_batch_size=executor_config.max_batch_size,
         max_draft_tokens=executor_config.speculative_config.max_draft_tokens
