Commit c0622cc

Merge pull request #688 from ROCm/akaratza_env_vars
Akaratza env vars
2 parents feec0c2 + 7ac271d commit c0622cc

Lines changed: 5 additions & 35 deletions
@@ -1,50 +1,20 @@
NAME,DEFAULT,DESCRIPTION,BUCKET,TEAM,LABELS,RECOMMENDED
-HSA_NO_SCRATCH_RECLAIM,0,If `HSA_NO_SCRATCH_RECLAIM` is set to 1 it disables a memory optimization that sometimes causes problems on GPUs with smaller VRAM.,3,rocm,moe,1
-HSA_ENABLE_SDMA,0,If the underlying hardware has limitations regarding SDMA engines (DMA copy engines) and you suspect issues during large data transfers then you can try disabling SDMA in ROCm by setting `HSA_ENABLE_SDMA` to 0 before running the workload. This forces ROCm to use PCIe for transfers instead of the GPU's DMA engines. It might avoid certain hangs at the cost of performance.,4,rocm,NaN,0
-AMDGCN_SCALARIZE_PACKED_FOPS,0,Break down packed math ops such as `v_pk_mul` `v_pk_add` into their unpacked versions i.e. `v_mul` `v_add`. Packed math instructions cannot co-execute with mfma. For a compute-bound kernel it is better to increase the number of instructions that can overlap with mfma than to have fewer instructions not hidden by mfma.,2,triton,moe,0
-AMDGCN_USE_BUFFER_ATOMICS,0,Enable `buffer_atomics` instructions to atomically update data onto global memory.,2,triton,NaN,0
-AMDGCN_USE_BUFFER_OPS,0,Enable `buffer_load`/`store` instructions to load/store data from/to global memory. `buffer_load`/`store` has the following benefits/limitations compared to global_load/store: (i) Fewer vgprs are required for addresses (ii) Offset must be within 32-bit (iii) No need for if-else branches to implement masked load/store.,3,triton,dense,1
-TRITON_HIP_ASYNC_COPY_BYPASS_PERMUTE,0,Optimizes the swizzling computations of direct-to-lds loads to avoid a ds_permute by doing pointer arithmetic which generally improves performance. It's safe for all workloads.,1,triton,dense,depr|1d|0m
-TRITON_HIP_ASYNC_FAST_SWIZZLE,0,Changes the address calculations for direct-to-lds loads to help LLVM hoist more computations in front of the loop. It's safe for all workloads.,1,triton,dense,depr|1d|0m
-TRITON_HIP_GLOBAL_PREFETCH,0,Sets the number of stages between load and local_store. This is used for simple experiments with the pipeline design. Users should **NOT** use it.,2,triton,testing,0
-TRITON_HIP_LOCAL_PREFETCH,0,Sets the number of stages between local_load and dot. Again users should **NOT** use it.,2,triton,testing,0
-TRITON_HIP_USE_ASYNC_COPY,0,Enable loading data from global memory into LDS a.k.a. Direct-to-LDS.,4,triton,dense,1d|0m
-TRITON_HIP_USE_BLOCK_PINGPONG,0,Enable pingpong scheduling for matmul kernels.|There is a WIP for a new IR design for Pingpong.,4,triton,dense,1d|0m
-TRITON_HIP_USE_IN_THREAD_TRANSPOSE,0,Transpose the loaded elements within the thread before storing them to LDS. This guarantees the largest vectorization when doing ds_read for mfma instructions at the price of potentially sacrificing ds_write vectorization.|This is a workaround on MI300 for the case of loading a tensor that is not k-contig. On MI350 this is replaced with ds_read_tr instructions.,2,triton,mi300,0
-TRITON_HIP_PRESHUFFLE_SCALES,1,Apply preshuffling for mxfp4 scales for the ROCm backend.,4,vllm,moe|fp4,1m|0d
VLLM_ROCM_CUSTOM_PAGED_ATTN,1,Custom paged attention kernel for MI3* cards.|`VLLM_ROCM_CUSTOM_PAGED_ATTN` is recommended versus AITER because right now there may be some inaccuracy issues with AITER. Specifically AITER has sometimes been found to be unstable meaning that it will randomly either break or produce windows of inaccurate results. My experiments were only on throughput so I have not verified those claims. This custom ROCm kernel has so far never broken or returned inaccurate results. That is why it is also activated by default upstream.,3,vllm,v0,1
VLLM_ROCM_FP8_PADDING,1,Pad the fp8 weights to 256 bytes for ROCm.|This is an optimization on the ROCm platform which can benefit from tensors located far enough from one another in memory.|This is an fp8-only setting.,3,vllm,gfx942,X|1|0
VLLM_ROCM_MOE_PADDING,1,Pad the weights for the MoE kernel.|This is only for Fused MoE.,3,vllm,gfx942,X|1|0
VLLM_ROCM_QUICK_REDUCE_CAST_BF16_TO_FP16,1,Converts input from bf16 to fp16.|bfloat16 kernels are sometimes slower than fp16 because there is no bfloat16 asm instruction.,3,vllm,bf16|fp16,1
-VLLM_ROCM_QUICK_REDUCE_MAX_SIZE_BYTES_MB,NaN,Controls the maximum allowed number of data bytes (MB) for custom quick `allreduce` communication.|The default is later modified to 2048 MB if the variable is `None`. Data exceeding this size will use either custom allreduce or RCCL communication.,2,vllm,bf16|fp16,NaN
-VLLM_ROCM_QUICK_REDUCE_QUANTIZATION,"NONE",Custom quick allreduce kernel for MI3* cards.|Choice of quantization level: `FP` `INT8` `INT6` `INT4` or `NONE`.|Recommended for large models to get allreduce.,2,vllm,bf16|fp16,NaN
+VLLM_ROCM_QUICK_REDUCE_MAX_SIZE_BYTES_MB,None,Controls the maximum allowed number of data bytes (MB) for custom quick `allreduce` communication.|The default is later modified to 2048 MB if the variable is `None`. Data exceeding this size will use either custom allreduce or RCCL communication.,2,vllm,bf16|fp16,None
+VLLM_ROCM_QUICK_REDUCE_QUANTIZATION,"NONE",Custom quick allreduce kernel for MI3* cards.|Choice of quantization level: `FP` `INT8` `INT6` `INT4` or `NONE`.|Recommended for large models to get allreduce.,2,vllm,bf16|fp16,"NONE"
VLLM_ROCM_USE_AITER,0,Enables aiter ops.|Acts as a parent switch for enabling the other AITER operations.|The `rocm/vllm-dev:main` image has experimental [AITER](https://github.com/ROCm/aiter) support and can yield significant performance increases for some model/input/output/batch size configurations. To enable the feature make sure the following environment variable is set: `VLLM_ROCM_USE_AITER=1`.|Some use cases include: `amd/Mixtral-8x7B-Instruct-v0.1-FP8-KV` `amd/Mixtral-8x22B-Instruct-v0.1-FP8-KV`.,3,vllm,moe|dense,1
-VLLM_ROCM_USE_AITER_CK_TILE_LINEAR,1,CK stands for Composable Kernel which is AMD's library of high-performance kernels. It's composable because it uses building blocks to dynamically compose good kernels depending on input shapes. Tile means that this implementation was re-based to a tile-based one.,4,vllm,NaN,beta
+VLLM_ROCM_USE_AITER_FP8BMM,1,Uses AITER Triton fused FP8 per-token group quant + FP8 batched GEMM.|This kernel is invoked in `MLACommonImpl` and named `aiter_triton_fp8_bmm`.,4,vllm,fp8|mla,1
VLLM_ROCM_USE_AITER_LINEAR,1,Uses AITER linear op **IF** AITER ops are enabled.|Related ops: `scaled_mm` (per-tensor / rowwise).,3,vllm,moe|dense,X|1|0
VLLM_ROCM_USE_AITER_MHA,1,Uses AITER MHA op.|`VLLM_ROCM_USE_AITER_MHA` is the default AITER attention mechanism which is why it is also activated by default. The reason I did not observe any regression in quantized models is that since they are quantized there may be some other fused kernel taking its place which has comparable performance.,1,vllm,moe|dense,1
VLLM_ROCM_USE_AITER_MLA,1,Uses AITER MLA (Latent Attention) op.,3,vllm,moe|dense,X|1|0
VLLM_ROCM_USE_AITER_MOE,1,Uses AITER MoE op.,4,vllm,moe,X|0|1
VLLM_ROCM_USE_AITER_PAGED_ATTN,0,Uses AITER paged attention.,1,vllm,moe|dense|v0,X|0|1
-VLLM_ROCM_USE_AITER_RMSNORM,0,Enables AITER implementation of `RMSnorm`.,4,vllm,dense,1d|0m
-VLLM_ROCM_USE_AITER_TRITON_LINEAR,0,This flag is mutually exclusive with `VLLM_ROCM_USE_AITER_LINEAR`.|It activates `aiter.ops.triton.gemm_a8w8_blockscale` for the `gemm_a8w8_blockscale` which utilizes the Triton backend while the other flag activates `aiter.gemm_a8w8_blockscale` inside `rocm_aiter_gemm_w8a8_blockscale_impl` which utilizes CK/ASM (file: vllm/model_executor/layers/quantization/utils/fp8_utils.py).,4,vllm,NaN,beta
-VLLM_ROCM_USE_AITER_TRITON_BF16_GEMM,1,Utilizes `gemm_a16w16` kernel from `aiter.ops.triton.gemm_a16w16` instead of `torch.nn.Linear` for routing inside the `MLPBlock`.|This is a GPT-OSS-specific variable.|This is not an upstream variable.,4,vllm,gpt|moe|bf16,beta
-VLLM_ROCM_USE_AITER_TRITON_FP8_BMM,1,Uses AITER Triton fused FP8 per-token group quant + FP8 batched GEMM.,4,vllm,fp8|mla,beta
-VLLM_ROCM_USE_AITER_TRITON_FP8_BMM_MAX_BATCH_SIZE,256,Sets batch size for AITER FP8 batched GEMM kernel **IF** activated.,4,vllm,fp8|mla,beta
-VLLM_ROCM_USE_AITER_TRITON_FUSED_ADD_RMSNORM_PAD,1,Activates AITER `fused_add_rmsnorm_pad` kernel versus Triton `gemm_a16w16` kernel.|This is a GPT-OSS-specific variable.,4,vllm,gpt|moe,X|1|0
-VLLM_ROCM_USE_AITER_TRITON_FUSED_MUL_ADD,1,Uses AITER Triton fused elementwise multiply + elementwise addition.|This is a DeepSeekV2-specific variable.,4,vllm,deepseekv2,beta
-VLLM_ROCM_USE_AITER_TRITON_FUSED_RMSNORM_FP8_QUANT,1,Uses AITER Triton fused RMSNORM + Quantization.|This is a DeepSeekV2-specific variable.,4,vllm,deepseekv2,beta
-VLLM_ROCM_USE_AITER_TRITON_FUSED_ROPE_ZEROS_KV_CACHE,1,Use AITER Triton fused rope + zeros + `reshape_and_cache`.|It works together with `AiterMLAImpl` and `TritonAttentionImpl` (depending on which is activated for unified attention -- see file `vllm/attention/layer.py` func `unified_attention_with_output`).,3,vllm,gpt|llama|deepseekv2|triton,1
-VLLM_ROCM_USE_AITER_TRITON_SILU_MUL_FP4_QUANT,0,Activates a customized `SwiGLU`.,4,vllm,fp4,beta
-VLLM_ROCM_USE_AITER_TRITON_SILU_MUL_FP8_QUANT,1,This is a DeepSeekV2-specific variable.|It activates the `act_mul_and_fp8_group_quant` activation from AITER Triton and defines the `act_mul_and_fp8_group_quant_impl` implementation.,4,vllm,fp8,beta
+VLLM_ROCM_USE_AITER_RMSNORM,1,Enables AITER implementation of `RMSnorm`.,4,vllm,dense,1d|0m
+VLLM_ROCM_FP8_MFMA_PAGE_ATTN,0,Uses the fp8 mfma in ROCm paged attention; otherwise uses fp16.|The affected kernel is `torch.ops._rocm_C.paged_attention` under `paged_attention_rocm`.,4,vllm,fp8,0
VLLM_ROCM_USE_SKINNY_GEMM,1,Uses ROCm skinny GEMMs.|These skinny GEMM kernels are useful for unquantized linear layers on ROCm.,3,vllm,bf16|fp16|bsleq4,1
-VLLM_TRITON_FP4_GEMM_BPRESHUFFLE,0,Deprecated,1,vllm,fp4,Deprecated.|Did not find in `https://github.com/ROCm/vllm/tree/355_wip`.
-VLLM_TRITON_FP4_GEMM_SPLITK_USE_BF16,0,Deprecated,1,vllm,fp4,Deprecated.|Did not find in `https://github.com/ROCm/vllm/tree/355_wip`.
-VLLM_TRITON_FP4_GEMM_USE_ASM,0,Uses AITER fp4 GEMM ASM.|Found in activation layer for `SiluAndMul` and quantized `gemm_with_dynamic_quant`.,4,vllm,fp4|llama3.1-MXFP4|dense,1l|Xee|0
-VLLM_USE_AITER_TRITON_FUSED_ADD_RMSNORM_PAD,0,GPT-OSS-specific variable that activates `fused_add_rmsnorm_pad` kernel instead of traditional `RMSNorm` kernel.,4,vllm,gpt,1g|Xee|0
-VLLM_USE_AITER_TRITON_FUSED_SPLIT_QKV_ROPE,0,GPT-OSS-specific variable that activates the `fused_qkv_split_qk_rope` kernel instead of `QKVParallelLinear` which is the default otherwise.,4,vllm,gpt,1g|Xee|0
-VLLM_USE_AITER_TRITON_GEMM,0,Deprecated.,1,vllm,moe,depr|1m|0d
-VLLM_USE_AITER_TRITON_ROPE,0,Uses AITER Rope.|Activates `torch.ops.vllm.rocm_aiter_rotary_emb_with_key_forward_triton` instead of `vllm._custom_ops.batched_rotary_embedding` or plain `vllm._custom_ops.rotary_embedding` inside the `RotaryEmbedding` <- `TritonAttentionImpl` kernel.,4,vllm,dense|llama,1d|0m
VLLM_USE_AITER_UNIFIED_ATTENTION,0,Uses AITER triton unified attention.|Activated with `VLLM_ROCM_USE_AITER_MHA` set to `0`.|Sets `self.unified_attention = aiter.ops.triton.unified_attention.unified_attention` inside `TritonAttentionImpl`.,4,vllm,gpt|moe,1g|1m|0d
-VLLM_USE_ROCM_FP8_FLASH_ATTN,1,Uses quantized <q k v softmax(qk^T)> attn output during prefill.|AITER must be set to `0`.,2,vllm,fp8,X|1|0
VLLM_USE_TRITON_FLASH_ATTN,1,Enable Triton flash attention. Used by default especially on ROCm systems.|If the platform is ROCm we need to set `VLLM_USE_TRITON_FLASH_ATTN=0` for phi3v & paligemma models because ROCm Triton FA can run into shared memory issues with these models; use other backends for them in the meantime. There is a similar note under the `test_quark` file for the Quark model test.|The default attention function on ROCm uses the triton attention kernel. To fall back to the https://github.com/ROCm/flash-attention implementation set the following environment variable: `VLLM_USE_TRITON_FLASH_ATTN=0`,2,vllm,v0,X|1|0
VLLM_V1_USE_PREFILL_DECODE_ATTENTION,0,Use separate prefill and decode kernels for V1 attention instead of the unified triton kernel.|It usually improves prefill performance at the cost of higher GPU memory utilization.|If activated then uses `PagedAttention.split_kv_cache()` and `chunked_prefill_paged_decode`.,4,vllm,dense|llama,1d|0m
