Commit 5234502

[nvbug/5361223] doc: Update Llama4 deployment guide: update config & note concurrency (NVIDIA#6222)

Signed-off-by: raayandhar <[email protected]>

1 parent ef4878d commit 5234502

File tree

1 file changed: +4 −2 lines

docs/source/blogs/tech_blog/blog6_Llama4_maverick_eagle_guide.md

Lines changed: 4 additions & 2 deletions
```diff
@@ -68,7 +68,7 @@ docker run -d --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 \
   -p 8000:8000 --gpus=all -e "TRTLLM_ENABLE_PDL=1" \
   -v /path/to/maverick:/config/models/maverick -v /path/to/eagle:/config/models/eagle \
   docker.io/<username>/tensorrt_llm:main sh \
-  -c "echo -e 'enable_attention_dp: false\nenable_min_latency: true\nenable_autotuner: false\ncuda_graph_config:\n max_batch_size: 8\nspeculative_config:\n decoding_type: Eagle\n max_draft_len: 3\n speculative_model_dir: /config/models/eagle\nkv_cache_config:\n enable_block_reuse: false' > c.yaml && \
+  -c "echo -e 'enable_autotuner: false\nenable_attention_dp: false\nenable_min_latency: true\ncuda_graph_config:\n max_batch_size: 8\nspeculative_config:\n decoding_type: Eagle\n max_draft_len: 3\n speculative_model_dir: /config/models/eagle\n eagle3_one_model: true\nkv_cache_config:\n enable_block_reuse: false' > c.yaml && \
   TRT_LLM_DISABLE_LOAD_WEIGHTS_IN_PARALLEL=True \
   trtllm-serve /config/models/maverick \
   --host 0.0.0.0 --port 8000 \
```
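For readability, here is what the updated `echo -e` string in the `+` line above writes to `c.yaml` once the literal `\n` sequences are expanded (reconstructed directly from the diff; note the added `enable_autotuner: false` ordering and the new `eagle3_one_model: true` key):

```yaml
enable_autotuner: false
enable_attention_dp: false
enable_min_latency: true
cuda_graph_config:
 max_batch_size: 8
speculative_config:
 decoding_type: Eagle
 max_draft_len: 3
 speculative_model_dir: /config/models/eagle
 eagle3_one_model: true
kv_cache_config:
 enable_block_reuse: false
```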
```diff
@@ -141,7 +141,9 @@ docker kill <container_id>
 
 ## Performance Tuning
 
-The configuration provided is optimized for 8xB200 GPUs, but you can adjust several parameters for your specific workload:
+The configuration provided is optimized for 8xB200 GPUs, but you can adjust several parameters for your specific workload.
+
+**Note:** This configuration is optimized for minimum latency (`enable_min_latency: true`). When increasing the concurrency of requests, the tokens per second (TPS) per user degrades rapidly. This setup is designed to maximize single-user performance rather than high-concurrency throughput. For workloads with many concurrent users, you may need to adjust the configuration accordingly.
 
 - `max_batch_size`: Controls how many requests can be batched together
 - `max_draft_len`: The number of tokens Eagle can speculate ahead
```
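As a rough illustration of the tuning the new note alludes to, a higher-concurrency deployment might change the same YAML keys the guide already uses; the values below are hypothetical examples, not recommendations from the guide, and should be benchmarked for your workload:

```yaml
# Hypothetical higher-concurrency variant (illustrative values only)
enable_autotuner: false
enable_attention_dp: false
enable_min_latency: false   # trade single-user latency for throughput
cuda_graph_config:
 max_batch_size: 32         # allow more concurrent requests per batch
speculative_config:
 decoding_type: Eagle
 max_draft_len: 3
 speculative_model_dir: /config/models/eagle
 eagle3_one_model: true
kv_cache_config:
 enable_block_reuse: false
```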
