Commit 5234502

[nvbug/5361223] doc: Update Llama4 deployment guide: update config & note concurrency (NVIDIA#6222)

Signed-off-by: raayandhar <[email protected]>

1 parent ef4878d commit 5234502

File tree

1 file changed: +4 −2 lines

docs/source/blogs/tech_blog/blog6_Llama4_maverick_eagle_guide.md

Lines changed: 4 additions & 2 deletions
```diff
@@ -68,7 +68,7 @@ docker run -d --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 \
   -p 8000:8000 --gpus=all -e "TRTLLM_ENABLE_PDL=1" \
   -v /path/to/maverick:/config/models/maverick -v /path/to/eagle:/config/models/eagle \
   docker.io/<username>/tensorrt_llm:main sh \
-  -c "echo -e 'enable_attention_dp: false\nenable_min_latency: true\nenable_autotuner: false\ncuda_graph_config:\n max_batch_size: 8\nspeculative_config:\n decoding_type: Eagle\n max_draft_len: 3\n speculative_model_dir: /config/models/eagle\nkv_cache_config:\n enable_block_reuse: false' > c.yaml && \
+  -c "echo -e 'enable_autotuner: false\nenable_attention_dp: false\nenable_min_latency: true\ncuda_graph_config:\n max_batch_size: 8\nspeculative_config:\n decoding_type: Eagle\n max_draft_len: 3\n speculative_model_dir: /config/models/eagle\n eagle3_one_model: true\nkv_cache_config:\n enable_block_reuse: false' > c.yaml && \
   TRT_LLM_DISABLE_LOAD_WEIGHTS_IN_PARALLEL=True \
   trtllm-serve /config/models/maverick \
   --host 0.0.0.0 --port 8000 \
```
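For readability, here is what the updated `echo -e` string in the `+` line above writes to `c.yaml` once the literal `\n` sequences are expanded (reconstructed directly from the diff; note the added `enable_autotuner: false` ordering and the new `eagle3_one_model: true` key):

```yaml
enable_autotuner: false
enable_attention_dp: false
enable_min_latency: true
cuda_graph_config:
 max_batch_size: 8
speculative_config:
 decoding_type: Eagle
 max_draft_len: 3
 speculative_model_dir: /config/models/eagle
 eagle3_one_model: true
kv_cache_config:
 enable_block_reuse: false
```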
```diff
@@ -141,7 +141,9 @@ docker kill <container_id>
 
 ## Performance Tuning
 
-The configuration provided is optimized for 8xB200 GPUs, but you can adjust several parameters for your specific workload:
+The configuration provided is optimized for 8xB200 GPUs, but you can adjust several parameters for your specific workload.
+
+**Note:** This configuration is optimized for minimum latency (`enable_min_latency: true`). When increasing the concurrency of requests, the tokens per second (TPS) per user degrades rapidly. This setup is designed to maximize single-user performance rather than high-concurrency throughput. For workloads with many concurrent users, you may need to adjust the configuration accordingly.
 
 - `max_batch_size`: Controls how many requests can be batched together
 - `max_draft_len`: The number of tokens Eagle can speculate ahead
```
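As a rough illustration of the tuning the new note alludes to, a higher-concurrency deployment might change the same YAML keys the guide already uses; the values below are hypothetical examples, not recommendations from the guide, and should be benchmarked for your workload:

```yaml
# Hypothetical higher-concurrency variant (illustrative values only)
enable_autotuner: false
enable_attention_dp: false
enable_min_latency: false   # trade single-user latency for throughput
cuda_graph_config:
 max_batch_size: 32         # allow more concurrent requests per batch
speculative_config:
 decoding_type: Eagle
 max_draft_len: 3
 speculative_model_dir: /config/models/eagle
 eagle3_one_model: true
kv_cache_config:
 enable_block_reuse: false
```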
