Commit c0622cc

Merge pull request #688 from ROCm/akaratza_env_vars
Akaratza env vars
2 parents feec0c2 + 7ac271d commit c0622cc

Lines changed: 5 additions & 35 deletions
@@ -1,50 +1,20 @@
NAME,DEFAULT,DESCRIPTION,BUCKET,TEAM,LABELS,RECOMMENDED
-HSA_NO_SCRATCH_RECLAIM,0,If `HSA_NO_SCRATCH_RECLAIM` is set to 1 it disables a memory optimization that sometimes causes problems on GPUs with smaller VRAM.,3,rocm,moe,1
-HSA_ENABLE_SDMA,0,If the underlying hardware has limitations regarding SDMA engines (DMA copy engines) and you suspect issues during large data transfers then you can try disabling SDMA in ROCm by setting `HSA_ENABLE_SDMA` to 0 before running the workload. This forces ROCm to use PCIe for transfers instead of the GPU's DMA engines. It might avoid certain hangs at the cost of performance.,4,rocm,NaN,0
-AMDGCN_SCALARIZE_PACKED_FOPS,0,Break down packed math ops such as `v_pk_mul` `v_pk_add` into their unpacked versions i.e. `v_mul` `v_add`. Packed math instructions cannot co-execute with mfma. For a compute-bound kernel it is better to increase the number of instructions that can overlap with mfma than to have fewer instructions not hidden by mfma.,2,triton,moe,0
-AMDGCN_USE_BUFFER_ATOMICS,0,Enable `buffer_atomics` instructions to atomically update data onto global memory.,2,triton,NaN,0
-AMDGCN_USE_BUFFER_OPS,0,Enable `buffer_load`/`store` instructions to load/store data from/to global memory. `buffer_load`/`store` has the following benefits/limitations compared to global_load/store: (i) Fewer vgprs are required for addresses (ii) Offset must be within 32-bit (iii) No need for if-else branches to implement masked load/store.,3,triton,dense,1
-TRITON_HIP_ASYNC_COPY_BYPASS_PERMUTE,0,Optimizes the swizzling computations of direct-to-lds loads to avoid a ds_permute by doing pointer arithmetic which generally improves performance. It's safe for all workloads.,1,triton,dense,depr|1d|0m
-TRITON_HIP_ASYNC_FAST_SWIZZLE,0,Changes the address calculations for direct-to-lds loads to help LLVM hoist more computations in front of the loop. It's safe for all workloads.,1,triton,dense,depr|1d|0m
-TRITON_HIP_GLOBAL_PREFETCH,0,Sets the number of stages between load and local_store. This is used for simple experiments with the pipeline design. Users should **NOT** use it.,2,triton,testing,0
-TRITON_HIP_LOCAL_PREFETCH,0,Sets the number of stages between local_load and dot. Again users should **NOT** use it.,2,triton,testing,0
-TRITON_HIP_USE_ASYNC_COPY,0,Enable loading data from global memory into LDS a.k.a. Direct-to-LDS.,4,triton,dense,1d|0m
-TRITON_HIP_USE_BLOCK_PINGPONG,0,Enable pingpong scheduling for matmul kernels.|There is a WIP for a new IR design for Pingpong.,4,triton,dense,1d|0m
-TRITON_HIP_USE_IN_THREAD_TRANSPOSE,0,Transpose the loaded elements within the thread before storing them to LDS. This guarantees the largest vectorization when doing ds_read for mfma instructions at the price of potentially sacrificing ds_write vectorization.|This is a workaround on MI300 for the case of loading a tensor that is not k-contig. On MI350 this is replaced with ds_read_tr instructions.,2,triton,mi300,0
-TRITON_HIP_PRESHUFFLE_SCALES,1,Apply preshuffling for mxfp4 scales for the ROCm backend.,4,vllm,moe|fp4,1m|0d
VLLM_ROCM_CUSTOM_PAGED_ATTN,1,Custom paged attention kernel for MI3* cards.|`VLLM_ROCM_CUSTOM_PAGED_ATTN` is recommended versus AITER because right now there may be some inaccuracy issues with AITER. Specifically AITER has sometimes been found to be unstable meaning that it will randomly either break or produce windows of inaccurate results. My experiments were only on throughput so I have not verified those claims. This custom ROCm kernel has so far never broken or returned inaccurate results. That is why it is also activated by default upstream.,3,vllm,v0,1
VLLM_ROCM_FP8_PADDING,1,Pad the fp8 weights to 256 bytes for ROCm.|This is an optimization on the ROCm platform which can benefit from tensors located far enough from one another in memory.|This is an fp8-only setting.,3,vllm,gfx942,X|1|0
VLLM_ROCM_MOE_PADDING,1,Pad the weights for the MoE kernel.|This is only for Fused MoE.,3,vllm,gfx942,X|1|0
VLLM_ROCM_QUICK_REDUCE_CAST_BF16_TO_FP16,1,Converts input from bf16 to fp16.|bfloat16 kernels are sometimes slower than fp16 because there is no bfloat16 asm instruction.,3,vllm,bf16|fp16,1
-VLLM_ROCM_QUICK_REDUCE_MAX_SIZE_BYTES_MB,NaN,Controls the maximum allowed number of data bytes (MB) for custom quick `allreduce` communication.|The default is later modified to 2048 MB if the variable is `None`. Data exceeding this size will use either custom allreduce or RCCL communication.,2,vllm,bf16|fp16,NaN
-VLLM_ROCM_QUICK_REDUCE_QUANTIZATION,"NONE",Custom quick allreduce kernel for MI3* cards.|Choice of quantization level: `FP` `INT8` `INT6` `INT4` or `NONE`.|Recommended for large models to get allreduce.,2,vllm,bf16|fp16,NaN
+VLLM_ROCM_QUICK_REDUCE_MAX_SIZE_BYTES_MB,None,Controls the maximum allowed number of data bytes (MB) for custom quick `allreduce` communication.|The default is later modified to 2048 MB if the variable is `None`. Data exceeding this size will use either custom allreduce or RCCL communication.,2,vllm,bf16|fp16,None
+VLLM_ROCM_QUICK_REDUCE_QUANTIZATION,"NONE",Custom quick allreduce kernel for MI3* cards.|Choice of quantization level: `FP` `INT8` `INT6` `INT4` or `NONE`.|Recommended for large models to get allreduce.,2,vllm,bf16|fp16,"NONE"
VLLM_ROCM_USE_AITER,0,Enables aiter ops.|Acts as a parent switch for enabling the other AITER operations.|The `rocm/vllm-dev:main` image has experimental [AITER](https://github.com/ROCm/aiter) support and can yield significant performance increases for some model/input/output/batch size configurations. To enable the feature make sure the following environment variable is set: `VLLM_ROCM_USE_AITER=1`.|Some use cases include: `amd/Mixtral-8x7B-Instruct-v0.1-FP8-KV` `amd/Mixtral-8x22B-Instruct-v0.1-FP8-KV`.,3,vllm,moe|dense,1
-VLLM_ROCM_USE_AITER_CK_TILE_LINEAR,1,CK stands for Composable Kernel which is AMD's library of high-performance kernels. It's composable because it uses building blocks to dynamically compose good kernels depending on input shapes. Tile means that this implementation was re-based to a tile-based one.,4,vllm,NaN,beta
+VLLM_ROCM_USE_AITER_FP8BMM,1,Uses AITER Triton fused FP8 per-token group quant + FP8 batched GEMM.|This kernel is invoked in `MLACommonImpl` and named `aiter_triton_fp8_bmm`.,4,vllm,fp8|mla,1
VLLM_ROCM_USE_AITER_LINEAR,1,Uses AITER linear op **IF** AITER ops are enabled.|Related ops: `scaled_mm` (per-tensor / rowwise).,3,vllm,moe|dense,X|1|0
VLLM_ROCM_USE_AITER_MHA,1,Uses AITER MHA op.|`VLLM_ROCM_USE_AITER_MHA` is the default AITER attention mechanism which is why it is also activated by default. The reason I did not observe any regression in quantized models is that since they are quantized there may be some other fused kernel taking its place which has comparable performance.,1,vllm,moe|dense,1
VLLM_ROCM_USE_AITER_MLA,1,Uses AITER MLA (Latent Attention) op.,3,vllm,moe|dense,X|1|0
VLLM_ROCM_USE_AITER_MOE,1,Uses AITER MoE op.,4,vllm,moe,X|0|1
VLLM_ROCM_USE_AITER_PAGED_ATTN,0,Uses AITER paged attention.,1,vllm,moe|dense|v0,X|0|1
-VLLM_ROCM_USE_AITER_RMSNORM,0,Enables AITER implementation of `RMSnorm`.,4,vllm,dense,1d|0m
-VLLM_ROCM_USE_AITER_TRITON_LINEAR,0,This flag is mutually exclusive with `VLLM_ROCM_USE_AITER_LINEAR`.|It activates `aiter.ops.triton.gemm_a8w8_blockscale` for the `gemm_a8w8_blockscale` which utilizes the Triton backend while the other flag activates `aiter.gemm_a8w8_blockscale` inside `rocm_aiter_gemm_w8a8_blockscale_impl` which utilizes CK/ASM (file: vllm/model_executor/layers/quantization/utils/fp8_utils.py).,4,vllm,NaN,beta
-VLLM_ROCM_USE_AITER_TRITON_BF16_GEMM,1,Utilizes `gemm_a16w16` kernel from `aiter.ops.triton.gemm_a16w16` instead of `torch.nn.Linear` for routing inside the `MLPBlock`.|This is a GPT-OSS-specific variable.|This is not an upstream variable.,4,vllm,gpt|moe|bf16,beta
-VLLM_ROCM_USE_AITER_TRITON_FP8_BMM,1,Uses AITER Triton fused FP8 per-token group quant + FP8 batched GEMM.,4,vllm,fp8|mla,beta
-VLLM_ROCM_USE_AITER_TRITON_FP8_BMM_MAX_BATCH_SIZE,256,Sets batch size for AITER FP8 batched GEMM kernel **IF** activated.,4,vllm,fp8|mla,beta
-VLLM_ROCM_USE_AITER_TRITON_FUSED_ADD_RMSNORM_PAD,1,Activates AITER `fused_add_rmsnorm_pad` kernel versus Triton `gemm_a16w16` kernel.|This is a GPT-OSS-specific variable.,4,vllm,gpt|moe,X|1|0
-VLLM_ROCM_USE_AITER_TRITON_FUSED_MUL_ADD,1,Uses AITER Triton fused elementwise multiply + elementwise addition.|This is a DeepSeekV2-specific variable.,4,vllm,deepseekv2,beta
-VLLM_ROCM_USE_AITER_TRITON_FUSED_RMSNORM_FP8_QUANT,1,Uses AITER Triton fused RMSNORM + Quantization.|This is a DeepSeekV2-specific variable.,4,vllm,deepseekv2,beta
-VLLM_ROCM_USE_AITER_TRITON_FUSED_ROPE_ZEROS_KV_CACHE,1,Use AITER Triton fused rope + zeros + `reshape_and_cache`.|It works together with `AiterMLAImpl` and `TritonAttentionImpl` (depending on which is activated for unified attention -- see file `vllm/attention/layer.py` func `unified_attention_with_output`).,3,vllm,gpt|llama|deepseekv2|triton,1
-VLLM_ROCM_USE_AITER_TRITON_SILU_MUL_FP4_QUANT,0,Activates a customized `SwiGLU`.,4,vllm,fp4,beta
-VLLM_ROCM_USE_AITER_TRITON_SILU_MUL_FP8_QUANT,1,This is a DeepSeekV2-specific variable.|It activates the `act_mul_and_fp8_group_quant` activation from AITER Triton and defines the `act_mul_and_fp8_group_quant_impl` implementation.,4,vllm,fp8,beta
+VLLM_ROCM_USE_AITER_RMSNORM,1,Enables AITER implementation of `RMSnorm`.,4,vllm,dense,1d|0m
+VLLM_ROCM_FP8_MFMA_PAGE_ATTN,0,Uses the fp8 mfma in ROCm paged attention; otherwise uses fp16.|The affected kernel is `torch.ops._rocm_C.paged_attention` under `paged_attention_rocm`.,4,vllm,fp8,0
VLLM_ROCM_USE_SKINNY_GEMM,1,Uses ROCm skinny GEMMs.|These skinny GEMM kernels are useful for unquantized linear layers on ROCm.,3,vllm,bf16|fp16|bsleq4,1
-VLLM_TRITON_FP4_GEMM_BPRESHUFFLE,0,Deprecated,1,vllm,fp4,Deprecated.|Did not find in `https://github.com/ROCm/vllm/tree/355_wip`.
-VLLM_TRITON_FP4_GEMM_SPLITK_USE_BF16,0,Deprecated,1,vllm,fp4,Deprecated.|Did not find in `https://github.com/ROCm/vllm/tree/355_wip`.
-VLLM_TRITON_FP4_GEMM_USE_ASM,0,Uses AITER fp4 GEMM ASM.|Found in activation layer for `SiluAndMul` and quantized `gemm_with_dynamic_quant`.,4,vllm,fp4|llama3.1-MXFP4|dense,1l|Xee|0
-VLLM_USE_AITER_TRITON_FUSED_ADD_RMSNORM_PAD,0,GPT-OSS-specific variable that activates `fused_add_rmsnorm_pad` kernel instead of traditional `RMSNorm` kernel.,4,vllm,gpt,1g|Xee|0
-VLLM_USE_AITER_TRITON_FUSED_SPLIT_QKV_ROPE,0,GPT-OSS-specific variable that activates the `fused_qkv_split_qk_rope` kernel instead of `QKVParallelLinear` which is the default otherwise.,4,vllm,gpt,1g|Xee|0
-VLLM_USE_AITER_TRITON_GEMM,0,Deprecated.,1,vllm,moe,depr|1m|0d
-VLLM_USE_AITER_TRITON_ROPE,0,Uses AITER Rope.|Activates `torch.ops.vllm.rocm_aiter_rotary_emb_with_key_forward_triton` instead of `vllm._custom_ops.batched_rotary_embedding` or plain `vllm._custom_ops.rotary_embedding` inside the `RotaryEmbedding` <- `TritonAttentionImpl` kernel.,4,vllm,dense|llama,1d|0m
VLLM_USE_AITER_UNIFIED_ATTENTION,0,Uses AITER triton unified attention.|Activated with `VLLM_ROCM_USE_AITER_MHA` set to `0`.|Sets `self.unified_attention = aiter.ops.triton.unified_attention.unified_attention` inside `TritonAttentionImpl`.,4,vllm,gpt|moe,1g|1m|0d
-VLLM_USE_ROCM_FP8_FLASH_ATTN,1,Uses quantized <q k v softmax(qk^T)> attn output during prefill.|AITER must be set to `0`.,2,vllm,fp8,X|1|0
VLLM_USE_TRITON_FLASH_ATTN,1,Enable Triton flash attention. Used by default especially on ROCm systems.|If the platform is ROCm we need to set `VLLM_USE_TRITON_FLASH_ATTN=0` for phi3v & paligemma models because ROCm Triton FA can run into shared memory issues with these models; use other backends for them in the meantime. There is a similar note under the `test_quark` file for the Quark model test.|The default attention function on ROCm uses the triton attention kernel. To fall back to the https://github.com/ROCm/flash-attention implementation set the following environment variable: `VLLM_USE_TRITON_FLASH_ATTN=0`,2,vllm,v0,X|1|0
VLLM_V1_USE_PREFILL_DECODE_ATTENTION,0,Use separate prefill and decode kernels for V1 attention instead of the unified triton kernel.|It usually improves prefill performance at the cost of higher GPU memory utilization.|If activated then uses `PagedAttention.split_kv_cache()` and `chunked_prefill_paged_decode`.,4,vllm,dense|llama,1d|0m
