Commit f6761f8

Merge pull request #726 from ROCm/akaratza_env_vars
Minor modifications in env var description and bucketing
2 parents c0622cc + 3ea794b commit f6761f8

1 file changed: 18 additions & 18 deletions

@@ -1,20 +1,20 @@
 NAME,DEFAULT,DESCRIPTION,BUCKET,TEAM,LABELS,RECOMMENDED
-VLLM_ROCM_CUSTOM_PAGED_ATTN,1,Custom paged attention kernel for MI3* cards.|`VLLM_ROCM_CUSTOM_PAGED_ATTN` is recommended versus AITER because right now there may be some inaccuracy issues with AITER. Specifically AITER has sometimes been found to be unstable meaning that it will randomly either break or produce a window of inaccurate results. My experiments were only on throughput so I have not verified those claims. This custom ROCm kernel has so far never broken or returned inaccurate results. That is why it is also activated by default in upstream.,3,vllm,v0,1
-VLLM_ROCM_FP8_PADDING,1,Pad the fp8 weights to 256 bytes for ROCm.|This is an optimization on the ROCm platform which can benefit from tensors being located far enough from one another in memory.|This is an fp8-only setting.,3,vllm,gfx942,X|1|0
-VLLM_ROCM_MOE_PADDING,1,Pad the weights for the MoE kernel.|This applies only to Fused MoE.,3,vllm,gfx942,X|1|0
-VLLM_ROCM_QUICK_REDUCE_CAST_BF16_TO_FP16,1,Converts input from bf16 to fp16.|Due to the lack of a bfloat16 asm instruction bfloat16 kernels are sometimes slower than fp16.,3,vllm,bf16|fp16,1
+VLLM_ROCM_CUSTOM_PAGED_ATTN,1,Custom paged attention kernel for MI3* cards.|For stability and accuracy `VLLM_ROCM_CUSTOM_PAGED_ATTN` is recommended over AITER because this custom ROCm kernel has so far never broken or returned inaccurate results.,2,vllm,llama|dense|fp8,1
+VLLM_ROCM_FP8_PADDING,1,Pad the fp8 weights to 256 bytes for ROCm.|This is an optimization on the ROCm platform which can benefit from tensors being located far enough from one another in memory.|This is an fp8-only setting.,2,vllm,gfx942,X|1|0
+VLLM_ROCM_MOE_PADDING,1,Pad the weights for the MoE kernel.|This applies only to Fused MoE.,2,vllm,moe,1
+VLLM_ROCM_QUICK_REDUCE_CAST_BF16_TO_FP16,1,Converts input from bf16 to fp16.|Due to the lack of a bfloat16 asm instruction bfloat16 kernels are sometimes slower than fp16.,2,vllm,bf16|fp16,1
 VLLM_ROCM_QUICK_REDUCE_MAX_SIZE_BYTES_MB,None,Controls the maximum allowed data size (MB) for custom quick `allreduce` communication.|The default is later set to 2048 MB if the variable is `None`. Data exceeding this size will use either custom allreduce or RCCL communication.,2,vllm,bf16|fp16,None
-VLLM_ROCM_QUICK_REDUCE_QUANTIZATION,"NONE",Custom quick allreduce kernel for MI3* cards.|Choice of quantization level: `FP` `INT8` `INT6` `INT4` or `NONE`.|Recommended for large models to speed up allreduce.,2,vllm,bf16|fp16,"NONE"
-VLLM_ROCM_USE_AITER,0,Enables AITER ops.|Acts as a parent switch that enables the rest of the AITER operations.|The `rocm/vllm-dev:main` image has experimental [AITER](https://github.com/ROCm/aiter) support and can yield significant performance increases for some model/input/output/batch-size configurations. To enable the feature make sure the following environment variable is set: `VLLM_ROCM_USE_AITER=1`.|Some use cases include: `amd/Mixtral-8x7B-Instruct-v0.1-FP8-KV` `amd/Mixtral-8x22B-Instruct-v0.1-FP8-KV`.,3,vllm,moe|dense,1
-VLLM_ROCM_USE_AITER_FP8BMM,1,Uses AITER Triton fused FP8 per-token group quant + FP8 batched GEMM.|This kernel is invoked in `MLACommonImpl` and named `aiter_triton_fp8_bmm`.,4,vllm,fp8|mla,1
-VLLM_ROCM_USE_AITER_LINEAR,1,Uses the AITER linear op **IF** AITER ops are enabled.|Related ops: `scaled_mm` (per-tensor / rowwise).,3,vllm,moe|dense,X|1|0
-VLLM_ROCM_USE_AITER_MHA,1,Uses the AITER MHA op.|`VLLM_ROCM_USE_AITER_MHA` is the default AITER attention mechanism. That is why it is also activated by default. The reason I did not observe any regression in quantized models is that there may be some other fused kernel taking its place which has comparable performance.,1,vllm,moe|dense,1
-VLLM_ROCM_USE_AITER_MLA,1,Uses the AITER MLA (Latent Attention) op.,3,vllm,moe|dense,X|1|0
-VLLM_ROCM_USE_AITER_MOE,1,Uses the AITER MoE op.,4,vllm,moe,X|0|1
-VLLM_ROCM_USE_AITER_PAGED_ATTN,0,Uses AITER paged attention.,1,vllm,moe|dense|v0,X|0|1
-VLLM_ROCM_USE_AITER_RMSNORM,1,Enables the AITER implementation of `RMSnorm`.,4,vllm,dense,1d|0m
-VLLM_ROCM_FP8_MFMA_PAGE_ATTN,0,Uses the fp8 MFMA in ROCm paged attention and otherwise uses fp16.|The affected kernel is `torch.ops._rocm_C.paged_attention` under `paged_attention_rocm`.,4,vllm,fp8,0
-VLLM_ROCM_USE_SKINNY_GEMM,1,Uses ROCm skinny GEMMs.|These skinny GEMM kernels are useful for unquantized linear layers on ROCm.,3,vllm,bf16|fp16|bsleq4,1
-VLLM_USE_AITER_UNIFIED_ATTENTION,0,Uses AITER Triton unified attention.|Activated with `VLLM_ROCM_USE_AITER_MHA` set to `0`.|Sets `self.unified_attention = aiter.ops.triton.unified_attention.unified_attention` inside `TritonAttentionImpl`.,4,vllm,gpt|moe,1g|1m|0d
-VLLM_USE_TRITON_FLASH_ATTN,1,Enables Triton flash attention. Used by default on ROCm systems.|If the platform is ROCm we need to set `VLLM_USE_TRITON_FLASH_ATTN=0` for phi3v & paligemma models because ROCm Triton FA can run into shared-memory issues with these models; use other backends in the meantime. There is a similar note under the `test_quark` file for the Quark model test.|The default attention function on ROCm uses the Triton attention kernel. To fall back to the https://github.com/ROCm/flash-attention implementation set the following environment variable: `VLLM_USE_TRITON_FLASH_ATTN=0`,2,vllm,v0,X|1|0
-VLLM_V1_USE_PREFILL_DECODE_ATTENTION,0,Use separate prefill and decode kernels for V1 attention instead of the unified Triton kernel.|It usually improves prefill performance at the cost of higher GPU memory utilization.|If activated it uses `PagedAttention.split_kv_cache()` and `chunked_prefill_paged_decode`.,4,vllm,dense|llama,1d|0m
+VLLM_ROCM_QUICK_REDUCE_QUANTIZATION,"NONE",Custom quick allreduce kernel for MI3* cards.|Choice of quantization level: `FP` `INT8` `INT6` `INT4` or `NONE`.|Recommended for large models to speed up allreduce.,2,vllm,allreduce,"NONE"|"INT4"
+VLLM_ROCM_USE_AITER,0,Enables AITER ops.|Acts as a parent switch that enables the rest of the AITER operations.|AITER is mostly useful in MoE architectures such as GPT-OSS Qwen and DeepSeek because of the AITER MoE kernel which is enabled by default. For dense models like Llama AITER does not usually offer an advantage over the custom paged attention kernel.,4,vllm,moe,0d|1m
+VLLM_ROCM_USE_AITER_FP8BMM,1,Uses AITER Triton fused FP8 per-token group quant + FP8 batched GEMM.|This kernel is invoked in `MLACommonImpl` and named `aiter_triton_fp8_bmm`.,2,vllm,fp8|mla,1
+VLLM_ROCM_USE_AITER_LINEAR,1,Uses the AITER linear op **IF** AITER ops are enabled.|Related ops: `scaled_mm` (per-tensor / rowwise).|It was found to be statistically associated with a performance boost in experiments with the Qwen3-235B-A22-FP8 model.,2,vllm,moe|qwen,X|1|0
+VLLM_ROCM_USE_AITER_MHA,1,Uses the AITER MHA op.|`VLLM_ROCM_USE_AITER_MHA` is the default AITER attention mechanism. That is why it is also activated by default.|Setting it to 0 is recommended for most models.,3,vllm,moe|dense,0
+VLLM_ROCM_USE_AITER_MLA,1,Uses the AITER MLA (Latent Attention) op.,2,vllm,moe|dense,1
+VLLM_ROCM_USE_AITER_MOE,1,Uses the AITER MoE op.,2,vllm,moe,1
+VLLM_ROCM_USE_AITER_PAGED_ATTN,0,Uses AITER paged attention.,1,vllm,moe|dense|v0,0
+VLLM_ROCM_USE_AITER_RMSNORM,1,Enables the AITER implementation of `RMSnorm`.|It was weakly associated with a performance boost in Qwen3 and Llama3.3-70B.,4,vllm,dense,1
+VLLM_ROCM_FP8_MFMA_PAGE_ATTN,0,Uses the fp8 MFMA in ROCm paged attention and otherwise uses fp16.|The affected kernel is `torch.ops._rocm_C.paged_attention` under `paged_attention_rocm`.,2,vllm,fp8,0
+VLLM_ROCM_USE_SKINNY_GEMM,1,Uses ROCm skinny GEMMs.|These skinny GEMM kernels are useful for unquantized linear layers on ROCm.,2,vllm,bf16|fp16|bsleq4,1
+VLLM_USE_AITER_UNIFIED_ATTENTION,0,Uses AITER Triton unified attention.|Activated with `VLLM_ROCM_USE_AITER_MHA` set to `0`.|Sets `self.unified_attention = aiter.ops.triton.unified_attention.unified_attention` inside `TritonAttentionImpl`.|It was strongly associated with a performance boost in DeepSeek-R1 Qwen3 and gpt-oss.,4,vllm,gpt|moe,1g|1m|0d
+VLLM_USE_TRITON_FLASH_ATTN,1,Enables Triton flash attention. Used by default on ROCm systems.|If the platform is ROCm we need to set `VLLM_USE_TRITON_FLASH_ATTN=0` for phi3v & paligemma models because ROCm Triton FA can run into shared-memory issues with these models; use other backends in the meantime. There is a similar note under the `test_quark` file for the Quark model test.|The default attention function on ROCm uses the Triton attention kernel. To fall back to the https://github.com/ROCm/flash-attention implementation set the following environment variable: `VLLM_USE_TRITON_FLASH_ATTN=0`,1,vllm,v0,X|1|0
+VLLM_V1_USE_PREFILL_DECODE_ATTENTION,0,Use separate prefill and decode kernels for V1 attention instead of the unified Triton kernel.|It usually improves prefill performance at the cost of higher GPU memory utilization.|If activated it uses `PagedAttention.split_kv_cache()` and `chunked_prefill_paged_decode`.|It was strongly associated with a performance boost in Llama3.3-70B and Llama3.1-405B. On gfx942 it was also associated with a performance boost in DeepSeek-R1.,4,vllm,dense|llama,1d|0m
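
The variables documented above are plain environment variables read when the vLLM engine starts, so they can be toggled per run. Below is a minimal sketch (not part of this commit) assuming vLLM's offline Python API; the model id, tensor-parallel size, prompt, and sampling values are placeholders chosen only to illustrate combining the AITER-related flags from the table.

```python
# Minimal sketch: combining a few of the ROCm/AITER environment variables from
# the table above. Model id and parallelism are placeholders, not recommendations.
import os

# Set the variables before importing vllm so they are visible when the engine initializes.
os.environ["VLLM_ROCM_USE_AITER"] = "1"               # parent switch for AITER ops
os.environ["VLLM_ROCM_USE_AITER_MHA"] = "0"           # disable AITER MHA ...
os.environ["VLLM_USE_AITER_UNIFIED_ATTENTION"] = "1"  # ... so unified attention can take effect

from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen3-30B-A3B", tensor_parallel_size=1)  # placeholder MoE model
out = llm.generate(["Hello from ROCm"], SamplingParams(max_tokens=32))
print(out[0].outputs[0].text)
```

The same variables apply when exported in the shell before launching the online server with `vllm serve`.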
