AutoDeploy is an experimental feature, currently in beta, designed to simplify and accelerate the deployment of PyTorch models, including off-the-shelf models like those from Hugging Face, to TensorRT-LLM. It automates graph transformations to integrate inference optimizations such as tensor parallelism, KV-caching, and quantization. AutoDeploy supports optimized in-framework deployment, minimizing the amount of manual modification needed.
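
For instance, assuming you are working from this examples directory, the demo script discussed below can be pointed directly at a Hugging Face model card in a single command (the model card here is only an illustration):

```bash
# Minimal AutoDeploy run on an off-the-shelf Hugging Face checkpoint
python build_and_run_ad.py --model "meta-llama/Meta-Llama-3.1-8B-Instruct"
```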
Below is a non-exhaustive list of common config options:

| Configuration Key | Description |
|-------------------|-------------|
|`--args.skip-loading-weights`| Only load the architecture, not the weights |
|`--args.model-kwargs`| Extra kwargs that are being passed to the model initializer in the model factory |
|`--args.tokenizer-kwargs`| Extra kwargs that are being passed to the tokenizer initializer in the model factory |
|`--args.world-size`| The number of GPUs used for auto-sharding the model |
|`--args.runtime`| Specifies which type of Engine to use during runtime (`"demollm"` or `"trtllm"`) |
|`--args.compile-backend`| Specifies how to compile the graph at the end |
|`--args.attn-backend`| Specifies kernel implementation for attention |
|`--prompt.batch-size`| Number of queries to generate |
|`--benchmark.enabled`| Whether to run the built-in benchmark (true/false) |

For default values and additional configuration options, refer to the [`ExperimentConfig`](./build_and_run_ad.py) class in the [build_and_run_ad.py](./build_and_run_ad.py) file.

Here is a more complete example of using the script:
```bash
python build_and_run_ad.py \
--model "meta-llama/Meta-Llama-3.1-8B-Instruct" \
--args.world-size 2 \
--args.compile-backend "torch-compile" \
--args.attn-backend "flashinfer" \
--benchmark.enabled True
```

### Logging Level
Use the following env variable to specify the logging level of our built-in logger, ordered by decreasing verbosity:
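
For example, a minimal sketch of enabling the most verbose level for a single run (the variable name `AUTO_DEPLOY_LOG_LEVEL` is an assumption here; check the logger utilities for the exact name used in your version):

```bash
# Assumed env var name; run the demo script with maximum logging verbosity
AUTO_DEPLOY_LOG_LEVEL=DEBUG python build_and_run_ad.py \
--model "meta-llama/Meta-Llama-3.1-8B-Instruct"
```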
AutoDeploy can be seamlessly integrated into your existing workflows using TRT-LLM's LLM API.
Here is an example of how you can build an LLM object with AutoDeploy integration:
```
from tensorrt_llm._torch.auto_deploy import LLM


# Construct the LLM high-level interface object with autodeploy as backend
llm = LLM(
    model=<HF_MODEL_CARD_OR_DIR>,
    world_size=<DESIRED_WORLD_SIZE>,
    compile_backend="torch-compile",
    model_kwargs={"num_hidden_layers": 2}, # test with smaller model configuration
    attn_backend="flashinfer", # choose between "triton" and "flashinfer"
    # ...
)
```
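
Once constructed, the object behaves like any other TRT-LLM `LLM` instance. The following is a minimal usage sketch, assuming the standard `generate`/`SamplingParams` interface of the LLM API; the model card and world size are illustrative placeholders:

```
from tensorrt_llm._torch.auto_deploy import LLM
from tensorrt_llm.llmapi import SamplingParams  # assumed import path for the LLM API sampling config

# Illustrative values only; use your own model card/directory and world size
llm = LLM(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    world_size=2,
    compile_backend="torch-compile",
    attn_backend="flashinfer",
)

# Generation goes through the regular LLM API; AutoDeploy stays transparent at this point
outputs = llm.generate(
    ["What is the capital of France?"],
    sampling_params=SamplingParams(max_tokens=32, temperature=0.8),
)
print(outputs[0].outputs[0].text)
```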
Please consult the [AutoDeploy `LLM` API](../../tensorrt_llm/_torch/auto_deploy/llm.py) and the [`AutoDeployConfig` class](../../tensorrt_llm/_torch/auto_deploy/llm_args.py) for more detail on the available configuration options.

For expert users, `build_and_run_ad.py` provides advanced configuration capabilities through a flexible argument parser powered by Pydantic Settings and OmegaConf. You can use dot notation for CLI arguments, provide multiple YAML configuration files, and leverage sophisticated configuration precedence rules to create complex deployment configurations.
<details>
<summary>Click to expand for detailed configuration examples</summary>

#### CLI Arguments with Dot Notation
The script supports flexible CLI argument parsing using dot notation to modify nested configurations dynamically. You can target any field in both the [`ExperimentConfig`](./build_and_run_ad.py) and nested [`AutoDeployConfig`](../../tensorrt_llm/_torch/auto_deploy/llm_args.py)/[`LlmArgs`](../../tensorrt_llm/_torch/auto_deploy/llm_args.py) objects:
```bash
# Configure model parameters
# NOTE: config values like num_hidden_layers are automatically resolved into the appropriate nested
# dict value ``{"args": {"model_kwargs": {"num_hidden_layers": 10}}}`` although not explicitly
# specified as CLI arg
python build_and_run_ad.py \
--model "meta-llama/Meta-Llama-3.1-8B-Instruct" \
--args.model-kwargs.num-hidden-layers=10 \
--args.model-kwargs.hidden-size=2048 \
--args.tokenizer-kwargs.padding-side=left
```

#### YAML Configuration Files
Both [`ExperimentConfig`](./build_and_run_ad.py) and [`AutoDeployConfig`](../../tensorrt_llm/_torch/auto_deploy/llm_args.py)/[`LlmArgs`](../../tensorrt_llm/_torch/auto_deploy/llm_args.py) inherit from [`DynamicYamlMixInForSettings`](../../tensorrt_llm/_torch/auto_deploy/utils/_config.py), enabling you to provide multiple YAML configuration files that are automatically deep-merged at runtime.

Create a YAML configuration file (e.g., `my_config.yaml`):

```yaml
# my_config.yaml
args:
  model_kwargs:
    num_hidden_layers: 12
    hidden_size: 1024
  world_size: 4
  compile_backend: torch-compile
  attn_backend: triton
  max_seq_len: 2048
  max_batch_size: 16
  transforms:
    sharding:
      strategy: auto
    quantization:
      enabled: false

prompt:
  batch_size: 8
  sp_kwargs:
    max_tokens: 150
    temperature: 0.8
    top_k: 50

benchmark:
  enabled: true
  num: 20
  bs: 4
  isl: 1024
  osl: 256
```

Create an additional override file (e.g., `production.yaml`):

```yaml
# production.yaml
args:
  world_size: 8
  compile_backend: torch-opt
  max_batch_size: 32

benchmark:
  enabled: false
```

Then use these configurations:

```bash
# Using single YAML config
python build_and_run_ad.py \
--model "meta-llama/Meta-Llama-3.1-8B-Instruct" \
--yaml-configs my_config.yaml

# Using multiple YAML configs (deep merged in order, later files have higher priority)
python build_and_run_ad.py \
--model "meta-llama/Meta-Llama-3.1-8B-Instruct" \
--yaml-configs my_config.yaml production.yaml

# Targeting nested AutoDeployConfig with separate YAML
python build_and_run_ad.py \
--model "meta-llama/Meta-Llama-3.1-8B-Instruct" \
--yaml-configs my_config.yaml \
--args.yaml-configs autodeploy_overrides.yaml
```
#### Configuration Precedence and Deep Merging

The configuration system follows a strict precedence order where higher priority sources override lower priority ones:

1. **CLI Arguments** (highest priority) - Direct command line arguments
1. **YAML Configs** - Files specified via `--yaml-configs` and `--args.yaml-configs`
1. **Default Settings** (lowest priority) - Built-in defaults from the config classes

**Deep Merging**: Unlike simple overwriting, deep merging intelligently combines nested dictionaries recursively. For example:

```yaml
# Base config
args:
  model_kwargs:
    num_hidden_layers: 10
    hidden_size: 1024
  max_seq_len: 2048
```

```yaml
# Override config
args:
  model_kwargs:
    hidden_size: 2048 # This will override
    # num_hidden_layers: 10 remains unchanged
  world_size: 4 # This gets added
```
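
For illustration, deep merging the two snippets above yields the following effective configuration:

```yaml
# Result of deep merging base + override
args:
  model_kwargs:
    num_hidden_layers: 10 # kept from the base config
    hidden_size: 2048 # overridden by the override config
  max_seq_len: 2048 # kept from the base config
  world_size: 4 # added by the override config
```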

**Nested Config Behavior**: When using nested configurations, outer YAML configs become init settings for inner objects, giving them higher precedence:

```bash
# The outer yaml-configs affects the entire ExperimentConfig
# The inner args.yaml-configs affects only the AutoDeployConfig
python build_and_run_ad.py \
--model "meta-llama/Meta-Llama-3.1-8B-Instruct" \
--yaml-configs experiment_config.yaml \
--args.yaml-configs autodeploy_config.yaml \
--args.world-size=8 # CLI override beats both YAML configs
```
#### Built-in Default Configuration
Both [`AutoDeployConfig`](../../tensorrt_llm/_torch/auto_deploy/llm_args.py) and [`LlmArgs`](../../tensorrt_llm/_torch/auto_deploy/llm_args.py) classes automatically load a built-in [`default.yaml`](../../tensorrt_llm/_torch/auto_deploy/config/default.yaml) configuration file that provides sensible defaults for the AutoDeploy inference optimizer pipeline. This file is specified in the [`_get_config_dict()`](../../tensorrt_llm/_torch/auto_deploy/llm_args.py) function and defines default transform configurations for graph optimization stages.

The built-in defaults are automatically merged with your configurations at the lowest priority level, ensuring that your custom settings always override the defaults. You can inspect the current default configuration to understand the baseline transform pipeline:
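
For example, a simple way to print the shipped defaults from the repository root (using the `default.yaml` path linked above):

```bash
# Print the built-in defaults that sit at the lowest precedence level
cat tensorrt_llm/_torch/auto_deploy/config/default.yaml
```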