
Commit 8602c64

h-guo18 and lucaslie authored and committed:

[None][chore] AutoDeploy: cleanup old inference optimizer configs (NVIDIA#8039)

Signed-off-by: h-guo18 <67671475+h-guo18@users.noreply.github.com>
Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
Co-authored-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>

1 parent: 3aa44f7


46 files changed: +559 −576 lines

docs/source/features/auto_deploy/advanced/benchmarking_with_trtllm_bench.md

Lines changed: 17 additions & 15 deletions

````diff
@@ -40,29 +40,31 @@ trtllm-bench \
 #### Basic Performance Configuration (`autodeploy_config.yaml`)
 
 ```yaml
-# Compilation backend
-compile_backend: torch-opt
-
-# Runtime engine
+# runtime engine
 runtime: trtllm
 
-# Model loading
+# model loading
 skip_loading_weights: false
 
-# Fraction of free memory to use for kv-caches
-free_mem_ratio: 0.8
-
-# CUDA Graph optimization
-cuda_graph_batch_sizes: [1, 2, 4, 8, 16, 32, 64, 128, 256]
-
-# Attention backend
-attn_backend: flashinfer
-
 # Sequence configuration
 max_batch_size: 256
+
+# transform options
+transforms:
+  insert_cached_attention:
+    # attention backend
+    backend: flashinfer
+  resize_kv_cache:
+    # fraction of free memory to use for kv-caches
+    free_mem_ratio: 0.8
+  compile_model:
+    # compilation backend
+    backend: torch-opt
+    # CUDA Graph optimization
+    cuda_graph_batch_sizes: [1, 2, 4, 8, 16, 32, 64, 128, 256]
 ```
 
-Enable multi-GPU execution by specifying `--tp n`, where `n` is the number of GPUs
+Enable multi-GPU execution by specifying `--tp n`, where `n` is the number of GPUs.
 
 ## Configuration Options Reference
 
````
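For readers migrating their own configs, the relocation of the old flat keys can be seen by treating the new YAML as a plain Python dict (an illustrative stand-in for the parsed file, not TensorRT-LLM code):

```python
# Parsed equivalent of the new autodeploy_config.yaml above (plain dict
# standing in for YAML); the old flat keys such as `compile_backend`,
# `attn_backend`, and `free_mem_ratio` now live under `transforms.<name>`.
config = {
    "runtime": "trtllm",
    "skip_loading_weights": False,
    "max_batch_size": 256,
    "transforms": {
        "insert_cached_attention": {"backend": "flashinfer"},
        "resize_kv_cache": {"free_mem_ratio": 0.8},
        "compile_model": {
            "backend": "torch-opt",
            "cuda_graph_batch_sizes": [1, 2, 4, 8, 16, 32, 64, 128, 256],
        },
    },
}

# old: config["compile_backend"]; new:
assert config["transforms"]["compile_model"]["backend"] == "torch-opt"
# old: config["attn_backend"]; new:
assert config["transforms"]["insert_cached_attention"]["backend"] == "flashinfer"
# old: config["free_mem_ratio"]; new:
assert config["transforms"]["resize_kv_cache"]["free_mem_ratio"] == 0.8
```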

docs/source/features/auto_deploy/advanced/expert_configurations.md

Lines changed: 18 additions & 26 deletions

````diff
@@ -63,29 +63,22 @@ args:
   num_hidden_layers: 12
   hidden_size: 1024
   world_size: 4
-  compile_backend: torch-compile
-  attn_backend: triton
   max_seq_len: 2048
   max_batch_size: 16
   transforms:
-    sharding:
-      strategy: auto
-    quantization:
-      enabled: false
+    detect_sharding:
+      support_partial_config: true
+    insert_cached_attention:
+      backend: triton
+    compile_model:
+      backend: torch-compile
 
 prompt:
   batch_size: 8
   sp_kwargs:
     max_tokens: 150
     temperature: 0.8
     top_k: 50
-
-benchmark:
-  enabled: true
-  num: 20
-  bs: 4
-  isl: 1024
-  osl: 256
 ```
 
 Create an additional override file (e.g., `production.yaml`):
@@ -94,11 +87,10 @@ Create an additional override file (e.g., `production.yaml`):
 # production.yaml
 args:
   world_size: 8
-  compile_backend: torch-opt
   max_batch_size: 32
-
-benchmark:
-  enabled: false
+  transforms:
+    compile_model:
+      backend: torch-opt
 ```
 
 Then use these configurations:
@@ -107,26 +99,26 @@ Then use these configurations:
 # Using single YAML config
 python build_and_run_ad.py \
   --model "meta-llama/Meta-Llama-3.1-8B-Instruct" \
-  --yaml-configs my_config.yaml
+  --yaml-extra my_config.yaml
 
 # Using multiple YAML configs (deep merged in order, later files have higher priority)
 python build_and_run_ad.py \
   --model "meta-llama/Meta-Llama-3.1-8B-Instruct" \
-  --yaml-configs my_config.yaml production.yaml
+  --yaml-extra my_config.yaml production.yaml
 
 # Targeting nested AutoDeployConfig with separate YAML
 python build_and_run_ad.py \
   --model "meta-llama/Meta-Llama-3.1-8B-Instruct" \
-  --yaml-configs my_config.yaml \
-  --args.yaml-configs autodeploy_overrides.yaml
+  --yaml-extra my_config.yaml \
+  --args.yaml-extra autodeploy_overrides.yaml
 ```
 
 ## Configuration Precedence and Deep Merging
 
 The configuration system follows a precedence order in which higher priority sources override lower priority ones:
 
 1. **CLI Arguments** (highest priority) - Direct command line arguments
-1. **YAML Configs** - Files specified via `--yaml-configs` and `--args.yaml-configs`
+1. **YAML Configs** - Files specified via `--yaml-extra` and `--args.yaml-extra`
 1. **Default Settings** (lowest priority) - Built-in defaults from the config classes
 
 **Deep Merging**: Unlike simple overwriting, deep merging recursively combines nested dictionaries. For example:
@@ -152,12 +144,12 @@ args:
 **Nested Config Behavior**: When using nested configurations, outer YAML configuration files become initialization settings for inner objects, giving them higher precedence:
 
 ```bash
-# The outer yaml-configs affects the entire ExperimentConfig
-# The inner args.yaml-configs affects only the AutoDeployConfig
+# The outer yaml-extra affects the entire ExperimentConfig
+# The inner args.yaml-extra affects only the AutoDeployConfig
 python build_and_run_ad.py \
   --model "meta-llama/Meta-Llama-3.1-8B-Instruct" \
-  --yaml-configs experiment_config.yaml \
-  --args.yaml-configs autodeploy_config.yaml \
+  --yaml-extra experiment_config.yaml \
+  --args.yaml-extra autodeploy_config.yaml \
   --args.world-size=8 # CLI override beats both YAML configs
 ```
 
````
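The deep-merge behavior this doc describes (later configs win on conflicts, but nested dictionaries combine recursively rather than being replaced wholesale) can be sketched in a few lines. This is an illustrative stand-in, not the TensorRT-LLM implementation:

```python
# Illustrative deep-merge: `override` wins on conflicts, but nested dicts
# are merged recursively instead of replaced wholesale.
def deep_merge(base: dict, override: dict) -> dict:
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged

# Shapes mirror my_config.yaml and production.yaml from the diff above.
my_config = {
    "args": {
        "world_size": 4,
        "max_batch_size": 16,
        "transforms": {"insert_cached_attention": {"backend": "triton"}},
    }
}
production = {
    "args": {
        "world_size": 8,
        "max_batch_size": 32,
        "transforms": {"compile_model": {"backend": "torch-opt"}},
    }
}

merged = deep_merge(my_config, production)
assert merged["args"]["world_size"] == 8  # later file wins
# both transform entries survive the merge:
assert merged["args"]["transforms"] == {
    "insert_cached_attention": {"backend": "triton"},
    "compile_model": {"backend": "torch-opt"},
}
```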

docs/source/features/auto_deploy/advanced/workflow.md

Lines changed: 0 additions & 2 deletions

```diff
@@ -18,9 +18,7 @@ llm = LLM(
     attn_page_size=64,  # page size for attention (tokens_per_block, should be == max_seq_len for triton)
     skip_loading_weights=False,
     model_factory="AutoModelForCausalLM",  # choose appropriate model factory
-    mla_backend="MultiHeadLatentAttention",  # for models that support MLA
     free_mem_ratio=0.8,  # fraction of available memory for cache
-    simple_shard_only=False,  # tensor parallelism sharding strategy
     max_seq_len=<MAX_SEQ_LEN>,
     max_batch_size=<MAX_BATCH_SIZE>,
 )
```

docs/source/features/auto_deploy/support_matrix.md

Lines changed: 1 addition & 0 deletions

```diff
@@ -113,6 +113,7 @@ Optimize attention operations with different attention kernel implementations:
 
 | `"attn_backend"` | Description |
 |----------------------|-------------|
+| `torch` | Custom fused multi-head attention (MHA) with KV Cache reference implementation in pure PyTorch (slow!) |
 | `triton` | Custom fused multi-head attention (MHA) with KV Cache kernels for efficient attention processing. |
 | `flashinfer` | Uses optimized attention kernels with KV Cache from the [`flashinfer`](https://github.com/flashinfer-ai/flashinfer.git) library. |
 
```

docs/source/torch/auto_deploy/advanced/benchmarking_with_trtllm_bench.md

Lines changed: 17 additions & 15 deletions

(Identical diff to docs/source/features/auto_deploy/advanced/benchmarking_with_trtllm_bench.md above.)

docs/source/torch/auto_deploy/advanced/expert_configurations.md

Lines changed: 18 additions & 26 deletions

(Identical diff to docs/source/features/auto_deploy/advanced/expert_configurations.md above.)

docs/source/torch/auto_deploy/advanced/serving_with_trtllm_serve.md

Lines changed: 25 additions & 17 deletions

````diff
@@ -42,23 +42,31 @@ trtllm-serve \
 Example `autodeploy_config.yaml`:
 
 ```yaml
-# Compilation backend for AutoDeploy
-compile_backend: torch-opt # options: torch-simple, torch-compile, torch-cudagraph, torch-opt
-
-# Runtime engine
-runtime: trtllm # options: trtllm, demollm
-
-# Model loading
-skip_loading_weights: false # set true for architecture-only perf runs
-
-# KV cache memory
-free_mem_ratio: 0.8 # fraction of free GPU mem for KV cache
-
-# CUDA graph optimization
-cuda_graph_batch_sizes: [1, 2, 4, 8, 16, 32, 64]
-
-# Attention backend
-attn_backend: flashinfer # recommended for best performance
+# runtime engine
+runtime: trtllm
+
+# model loading
+skip_loading_weights: false
+
+# Sequence configuration
+max_batch_size: 256
+
+# multi-gpu execution
+world_size: 1
+
+# transform options
+transforms:
+  insert_cached_attention:
+    # attention backend
+    backend: flashinfer
+  resize_kv_cache:
+    # fraction of free memory to use for kv-caches
+    free_mem_ratio: 0.8
+  compile_model:
+    # compilation backend
+    backend: torch-opt
+    # CUDA Graph optimization
+    cuda_graph_batch_sizes: [1, 2, 4, 8, 16, 32, 64, 128, 256]
 ```
 
 ## Limitations and tips
````

docs/source/torch/auto_deploy/advanced/workflow.md

Lines changed: 10 additions & 7 deletions

```diff
@@ -12,15 +12,18 @@ from tensorrt_llm._torch.auto_deploy import LLM
 llm = LLM(
     model=<HF_MODEL_CARD_OR_DIR>,
     world_size=<DESIRED_WORLD_SIZE>,
-    compile_backend="torch-compile",
+    model_factory="AutoModelForCausalLM",  # choose appropriate model factory
     model_kwargs={"num_hidden_layers": 2},  # test with smaller model configuration
-    attn_backend="flashinfer",  # choose between "triton" and "flashinfer"
-    attn_page_size=64,  # page size for attention (tokens_per_block, should be == max_seq_len for triton)
+    transforms={
+        "insert_cached_attention": {"backend": "flashinfer"},  # or "triton"
+        "insert_cached_mla_attention": {"backend": "MultiHeadLatentAttention"},
+        "resize_kv_cache": {"free_mem_ratio": 0.8},
+        "compile_model": {"backend": "torch-compile"},
+        "detect_sharding": {"simple_shard_only": False},
+
+    },
+    attn_page_size=64,  # page size for attention
     skip_loading_weights=False,
-    model_factory="AutoModelForCausalLM",  # choose appropriate model factory
-    mla_backend="MultiHeadLatentAttention",  # for models that support MLA
-    free_mem_ratio=0.8,  # fraction of available memory for cache
-    simple_shard_only=False,  # tensor parallelism sharding strategy
     max_seq_len=<MAX_SEQ_LEN>,
     max_batch_size=<MAX_BATCH_SIZE>,
 )
```
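The workflow diff above folds several flat `LLM(...)` kwargs into the nested `transforms` dict. The mapping can be sketched with a hypothetical helper (`migrate_legacy_kwargs` is illustrative only, not a tensorrt_llm API; the key names are taken directly from the diff):

```python
# Hypothetical helper: fold the removed flat kwargs from the old LLM(...)
# signature into the new nested `transforms` layout. Illustrative only;
# not part of tensorrt_llm. Key names come from the diff above.
_LEGACY_TO_TRANSFORM = {
    "attn_backend": ("insert_cached_attention", "backend"),
    "mla_backend": ("insert_cached_mla_attention", "backend"),
    "free_mem_ratio": ("resize_kv_cache", "free_mem_ratio"),
    "compile_backend": ("compile_model", "backend"),
    "simple_shard_only": ("detect_sharding", "simple_shard_only"),
}

def migrate_legacy_kwargs(kwargs: dict) -> dict:
    """Return kwargs with legacy flat keys folded into `transforms`."""
    out = dict(kwargs)
    transforms = dict(out.pop("transforms", {}))
    for legacy_key, (transform, option) in _LEGACY_TO_TRANSFORM.items():
        if legacy_key in out:
            transforms.setdefault(transform, {})[option] = out.pop(legacy_key)
    if transforms:
        out["transforms"] = transforms
    return out

new_kwargs = migrate_legacy_kwargs(
    {"compile_backend": "torch-compile", "free_mem_ratio": 0.8, "attn_page_size": 64}
)
# new_kwargs == {"attn_page_size": 64,
#                "transforms": {"resize_kv_cache": {"free_mem_ratio": 0.8},
#                               "compile_model": {"backend": "torch-compile"}}}
```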

examples/auto_deploy/.vscode/launch.json

Lines changed: 2 additions & 2 deletions

```diff
@@ -10,9 +10,9 @@
       "--model=meta-llama/Meta-Llama-3.1-8B-Instruct",
       "--args.world-size=2",
       "--args.runtime=demollm",
-      "--args.compile-backend=torch-simple",
+      "--args.transforms.compile-model.backend=torch-simple",
       "--args.attn-page-size=16",
-      "--args.attn-backend=flashinfer",
+      "--args.transforms.insert-cached-attention.backend=flashinfer",
       "--args.model-factory=AutoModelForCausalLM",
       "--benchmark.enabled=false",
       "--prompt.batch-size=2",
```
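The updated launch arguments address nested config fields through dotted, kebab-case keys such as `--args.transforms.compile-model.backend=torch-simple`. A minimal sketch of how one such key could expand into a nested dict (illustrative only; the actual CLI parsing in build_and_run_ad.py may differ, and values are kept as strings here):

```python
# Illustrative expansion of a dotted CLI override into a nested dict.
# Sketch only; not the real parser used by build_and_run_ad.py.
def parse_override(arg: str) -> dict:
    key, _, value = arg.lstrip("-").partition("=")
    # CLI keys use kebab-case; config field names use snake_case
    parts = [p.replace("-", "_") for p in key.split(".")]
    nested: dict = {parts[-1]: value}
    for part in reversed(parts[:-1]):
        nested = {part: nested}
    return nested

print(parse_override("--args.transforms.compile-model.backend=torch-simple"))
# {'args': {'transforms': {'compile_model': {'backend': 'torch-simple'}}}}
```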
