Note: we highly recommend turning on attn_temperature_tuning to improve accuracy for contexts longer than 32K tokens; VLLM_DISABLE_COMPILE_CACHE=1 is required when this setting is enabled.
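As a rough sketch, a serving command that enables both settings might look like the following; the model name, parallelism, and context-length flags are illustrative placeholders, and the exact flag that exposes `attn_temperature_tuning` may differ across vLLM versions:

```bash
# Sketch: enable attention temperature tuning for long-context accuracy.
# VLLM_DISABLE_COMPILE_CACHE=1 is required when this setting is on.
VLLM_DISABLE_COMPILE_CACHE=1 vllm serve meta-llama/Llama-4-Scout-17B-16E-Instruct \
    --tensor-parallel-size 8 \
    --max-model-len 1000000 \
    --override-generation-config='{"attn_temperature_tuning": true}'
```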
**Multimodality:**
The Llama 4 models excel at image understanding with up to 8-10 images per prompt. By default, the vLLM server accepts one image per request. Pass `--limit-mm-per-prompt image=10` to serve up to 10 images per request with the OpenAI-compatible API. We also recommend checking out our multi-image offline inference example with Llama-4 [here](https://github.com/vllm-project/vllm/blob/v0.8.3/examples/offline_inference/vision_language_multi_image.py).
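For example, a multi-image serving setup might look like the sketch below; the model name, parallelism, and image URLs are illustrative placeholders, and the request body follows the standard OpenAI chat-completions format:

```bash
# Sketch: serve Llama 4 accepting up to 10 images per request.
vllm serve meta-llama/Llama-4-Scout-17B-16E-Instruct \
    --tensor-parallel-size 8 \
    --limit-mm-per-prompt image=10

# Then send a multi-image request through the OpenAI-compatible API
# (placeholder image URLs).
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "meta-llama/Llama-4-Scout-17B-16E-Instruct",
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": "Compare these two images."},
                {"type": "image_url", "image_url": {"url": "https://example.com/a.jpg"}},
                {"type": "image_url", "image_url": {"url": "https://example.com/b.jpg"}}
            ]
        }]
    }'
```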
We extend our sincere thanks to the Meta team for their implementation of the model.
We also thank the AMD team for their support in enabling these models on MI300X: [Hongxia Yang](https://github.com/hongxiayang) and Weijun Jiang.
The vLLM team’s performance benchmarks were run on hardware generously provided by Nebius and NVIDIA.