Commit 7efcbd8

Author: Claude Code
Commit message: set --enable_fa3 to --disable_fa3

1 parent: 16c8c79

24 files changed (+64 additions, -61 deletions)

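
The commit flips the FA3 switch from opt-in (`--enable_fa3`) to opt-out (`--disable_fa3`): the fa3 attention kernels become the default for prefill and decoding, and every call site changes `get_env_start_args().enable_fa3` to `not get_env_start_args().disable_fa3`. The block below is a minimal sketch of how such a default-on flag can be declared with argparse; it is illustrative only, not lightllm's actual argument definition.

```python
# Hypothetical sketch of a default-on feature exposed through an opt-out flag.
# Not lightllm's real parser code; it only illustrates the flag semantics.
import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    "--disable_fa3",
    action="store_true",
    help="do not use the fa3 attention kernel for prefill and decoding "
    "(FA3 is enabled by default)",
)

args = parser.parse_args([])  # no flag given: FA3 stays enabled
assert not args.disable_fa3

args = parser.parse_args(["--disable_fa3"])  # explicit opt-out
assert args.disable_fa3
```

Call sites then branch on `not args.disable_fa3`, which is the pattern repeated throughout the file diffs below.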

docs/CN/source/tutorial/api_server_args_zh.rst
Lines changed: 2 additions & 2 deletions

@@ -333,9 +333,9 @@ attention type selection parameters

     The inference backend will use flashinfer's attention kernel for decoding

-.. option:: --enable_fa3
+.. option:: --disable_fa3

-    The inference backend will use the fa3 attention kernel for prefill and decoding
+    The inference backend will not use the fa3 attention kernel for prefill and decoding (FA3 is enabled by default)

 .. option:: --disable_cudagraph

docs/CN/source/tutorial/deepseek_deployment.rst
Lines changed: 1 addition & 13 deletions

@@ -32,13 +32,11 @@ LightLLM supports the following deployment modes:
 # H200 single-node DeepSeek-R1 TP mode
 LOADWORKER=18 python -m lightllm.server.api_server --port 8088 \
 --model_dir /path/DeepSeek-R1 \
---tp 8 \
---enable_fa3
+--tp 8

 **Parameter description:**
 - `LOADWORKER=18`: number of model-loading threads, speeds up loading
 - `--tp 8`: tensor parallelism degree, uses 8 GPUs
-- `--enable_fa3`: enable Flash Attention 3.0
 - `--port 8088`: service port

 1.2 Single-node DP + EP mode (Data Parallel + Expert Parallel)
@@ -55,13 +53,11 @@ LightLLM supports the following deployment modes:
 --model_dir /path/DeepSeek-R1 \
 --tp 8 \
 --dp 8 \
---enable_fa3

 **Parameter description:**
 - `MOE_MODE=EP`: set the expert parallel mode
 - `--tp 8`: tensor parallelism degree
 - `--dp 8`: data parallelism degree, usually set to the same value as tp
-- `--enable_fa3`: enable Flash Attention 3.0

 **Optional optimization parameters:**
 - `--enable_prefill_microbatch_overlap`: enable prefill micro-batch overlap
@@ -85,7 +81,6 @@ LightLLM supports the following deployment modes:
 LOADWORKER=18 python -m lightllm.server.api_server --port 8088 \
 --model_dir /path/DeepSeek-R1 \
 --tp 16 \
---enable_fa3 \
 --nnodes 2 \
 --node_rank 0 \
 --nccl_host $nccl_host \
@@ -101,7 +96,6 @@ LightLLM supports the following deployment modes:
 LOADWORKER=18 python -m lightllm.server.api_server --port 8088 \
 --model_dir /path/DeepSeek-R1 \
 --tp 16 \
---enable_fa3 \
 --nnodes 2 \
 --node_rank 1 \
 --nccl_host $nccl_host \
@@ -129,7 +123,6 @@ LightLLM supports the following deployment modes:
 --model_dir /path/DeepSeek-R1 \
 --tp 16 \
 --dp 16 \
---enable_fa3 \
 --nnodes 2 \
 --node_rank 0 \
 --nccl_host $nccl_host \
@@ -146,7 +139,6 @@ LightLLM supports the following deployment modes:
 --model_dir /path/DeepSeek-R1 \
 --tp 16 \
 --dp 16 \
---enable_fa3 \
 --nnodes 2 \
 --node_rank 1 \
 --nccl_host $nccl_host \
@@ -195,7 +187,6 @@ PD (Prefill-Decode) disaggregation deploys the prefill and decode stages separately, which can
 --host $host \
 --port 8019 \
 --nccl_port 2732 \
---enable_fa3 \
 --disable_cudagraph \
 --pd_master_ip $pd_master_ip \
 --pd_master_port 60011
@@ -219,7 +210,6 @@ PD (Prefill-Decode) disaggregation deploys the prefill and decode stages separately, which can
 --host $host \
 --port 8121 \
 --nccl_port 12322 \
---enable_fa3 \
 --disable_cudagraph \
 --pd_master_ip $pd_master_ip \
 --pd_master_port 60011
@@ -287,7 +277,6 @@ PD (Prefill-Decode) disaggregation deploys the prefill and decode stages separately, which can
 --tp 8 \
 --dp 8 \
 --nccl_port 2732 \
---enable_fa3 \
 --disable_cudagraph \
 --config_server_host $config_server_host \
 --config_server_port 60088
@@ -306,7 +295,6 @@ PD (Prefill-Decode) disaggregation deploys the prefill and decode stages separately, which can
 --nccl_port 12322 \
 --tp 8 \
 --dp 8 \
---enable_fa3 \
 --config_server_host $config_server_host \
 --config_server_port 60088
 # To enable micro-batch overlap, uncomment the following line

docs/EN/source/tutorial/api_server_args_zh.rst
Lines changed: 2 additions & 2 deletions

@@ -332,9 +332,9 @@ Performance Optimization Parameters

     The inference backend will use flashinfer's attention kernel for decoding

-.. option:: --enable_fa3
+.. option:: --disable_fa3

-    The inference backend will use fa3 attention kernel for prefill and decoding
+    The inference backend will not use fa3 attention kernel for prefill and decoding (FA3 is enabled by default)

 .. option:: --disable_cudagraph

docs/EN/source/tutorial/deepseek_deployment.rst
Lines changed: 1 addition & 13 deletions

@@ -32,13 +32,11 @@ Suitable for deploying DeepSeek-R1 model on a single H200 node.
 # H200 Single node DeepSeek-R1 TP Mode
 LOADWORKER=18 python -m lightllm.server.api_server --port 8088 \
 --model_dir /path/DeepSeek-R1 \
---tp 8 \
---enable_fa3
+--tp 8

 **Parameter Description:**
 - `LOADWORKER=18`: Model loading thread count, improves loading speed
 - `--tp 8`: Tensor parallelism, using 8 GPUs
-- `--enable_fa3`: Enable Flash Attention 3.0
 - `--port 8088`: Service port

 1.2 Single node DP + EP Mode (Data Parallel + Expert Parallel)
@@ -55,13 +53,11 @@ Suitable for expert parallelism deployment of MoE models like DeepSeek-V2/V3.
 --model_dir /path/DeepSeek-R1 \
 --tp 8 \
 --dp 8 \
---enable_fa3

 **Parameter Description:**
 - `MOE_MODE=EP`: Set expert parallelism mode
 - `--tp 8`: Tensor parallelism
 - `--dp 8`: Data parallelism, usually set to the same value as tp
-- `--enable_fa3`: Enable Flash Attention 3.0

 **Optional Optimization Parameters:**
 - `--enable_prefill_microbatch_overlap`: Enable prefill microbatch overlap
@@ -85,7 +81,6 @@ Suitable for deployment across multiple H200/H100 nodes.
 LOADWORKER=18 python -m lightllm.server.api_server --port 8088 \
 --model_dir /path/DeepSeek-R1 \
 --tp 16 \
---enable_fa3 \
 --nnodes 2 \
 --node_rank 0 \
 --nccl_host $nccl_host \
@@ -101,7 +96,6 @@ Suitable for deployment across multiple H200/H100 nodes.
 LOADWORKER=18 python -m lightllm.server.api_server --port 8088 \
 --model_dir /path/DeepSeek-R1 \
 --tp 16 \
---enable_fa3 \
 --nnodes 2 \
 --node_rank 1 \
 --nccl_host $nccl_host \
@@ -129,7 +123,6 @@ Suitable for deploying MoE models across multiple nodes.
 --model_dir /path/DeepSeek-R1 \
 --tp 16 \
 --dp 16 \
---enable_fa3 \
 --nnodes 2 \
 --node_rank 0 \
 --nccl_host $nccl_host \
@@ -146,7 +139,6 @@ Suitable for deploying MoE models across multiple nodes.
 --model_dir /path/DeepSeek-R1 \
 --tp 16 \
 --dp 16 \
---enable_fa3 \
 --nnodes 2 \
 --node_rank 1 \
 --nccl_host $nccl_host \
@@ -195,7 +187,6 @@ PD (Prefill-Decode) disaggregation mode separates prefill and decode stages for
 --host $host \
 --port 8019 \
 --nccl_port 2732 \
---enable_fa3 \
 --disable_cudagraph \
 --pd_master_ip $pd_master_ip
@@ -216,7 +207,6 @@ PD (Prefill-Decode) disaggregation mode separates prefill and decode stages for
 --host $host \
 --port 8121 \
 --nccl_port 12322 \
---enable_fa3 \
 --disable_cudagraph \
 --pd_master_ip $pd_master_ip \
 --pd_master_port 60011
@@ -284,7 +274,6 @@ Supports multiple PD Master nodes, providing better load balancing and high avai
 --tp 8 \
 --dp 8 \
 --nccl_port 2732 \
---enable_fa3 \
 --disable_cudagraph \
 --config_server_host $config_server_host \
 --config_server_port 60088
@@ -303,7 +292,6 @@ Supports multiple PD Master nodes, providing better load balancing and high avai
 --nccl_port 12322 \
 --tp 8 \
 --dp 8 \
---enable_fa3 \
 --config_server_host $config_server_host \
 --config_server_port 60088
 # if you want to enable microbatch overlap, you can uncomment the following lines

lightllm/common/offline_fp8_quant_mem_manager.py
Lines changed: 7 additions & 7 deletions

@@ -32,7 +32,7 @@ def __init__(
         self.abs_max = None

         if is_export_mode:
-            scales_shape = [layer_num, 2 * head_num] if get_env_start_args().enable_fa3 else [layer_num, 2]
+            scales_shape = [layer_num, 2 * head_num] if not get_env_start_args().disable_fa3 else [layer_num, 2]
             self.abs_max = torch.zeros(scales_shape, dtype=torch.float32, device="cuda")
         elif get_env_start_args().kv_quant_calibration_config_path is not None:
             logger.info(
@@ -43,15 +43,15 @@ def __init__(

             self.scales_list = cfg["scales"]
             self.scales = torch.tensor(self.scales_list, dtype=torch.float32, device="cuda").view(cfg["scales_shape"])
-            if not get_env_start_args().enable_fa3:
+            if get_env_start_args().disable_fa3:
                 self.scales = torch.repeat_interleave(self.scales, head_num, dim=-1)
             elif cfg["num_head"] > self.total_head_num:
                 factor = cfg["num_head"] // self.total_head_num
                 self.scales = self.scales[..., ::factor].contiguous()
             elif cfg["num_head"] < self.total_head_num:
                 factor = self.total_head_num // cfg["num_head"]
                 self.scales = torch.repeat_interleave(self.scales, factor, dim=-1).contiguous()
-            if get_env_start_args().enable_fa3 and dist.is_initialized() and dist.get_world_size() > 1:
+            if not get_env_start_args().disable_fa3 and dist.is_initialized() and dist.get_world_size() > 1:
                 half_head = self.total_head_num // 2
                 start_head = dist.get_rank() * head_num
                 end_head = start_head + head_num
@@ -86,7 +86,7 @@ def _load_and_check_config(self):
            raise ValueError(
                f"num_head {cfg['num_head']} in config " f"not match current model head num {self.total_head_num}"
            )
-        if get_env_start_args().enable_fa3:
+        if not get_env_start_args().disable_fa3:
            if cfg["quant_type"] != "per_head":
                raise ValueError(f"quant type {cfg['num_head']} in config not match fa3 backend")
        else:
@@ -109,7 +109,7 @@ def update_calibration_data(self, kv_buffer: torch.Tensor, layer_index: int):
            logger.info("kv cache calibration mode will collect kv cache data for quantization calibration")

        if self.abs_max is not None and self.count >= warmup_counts:
-            if get_env_start_args().enable_fa3:
+            if not get_env_start_args().disable_fa3:
                kv_max = kv_buffer.abs().amax(dim=(0, 2)).to(torch.float32)
            else:
                k_max = kv_buffer[:, : self.head_num, :].abs().amax(dim=()).to(torch.float32)
@@ -119,7 +119,7 @@ def update_calibration_data(self, kv_buffer: torch.Tensor, layer_index: int):
        if self.count == warmup_counts + inference_counts - 1 and layer_index == self.layer_num - 1:
            final_abs_max = self.abs_max
            if dist.is_initialized() and dist.get_world_size() > 1:
-                if get_env_start_args().enable_fa3:
+                if not get_env_start_args().disable_fa3:
                    k_max, v_max = torch.chunk(self.abs_max, 2, dim=-1)
                    k_max = k_max.contiguous()
                    v_max = v_max.contiguous()
@@ -148,7 +148,7 @@ def _export_calibration_data(self):
        cfg = {
            "version": "1.0",
            "architectures": model_arch,
-            "quant_type": "per_head" if get_env_start_args().enable_fa3 else "per_tensor",
+            "quant_type": "per_head" if not get_env_start_args().disable_fa3 else "per_tensor",
            "qmin": self.qmin,
            "qmax": self.qmax,
            "num_layers": self.layer_num,

lightllm/models/deepseek2/layer_infer/transformer_layer_infer.py
Lines changed: 2 additions & 2 deletions

@@ -95,7 +95,7 @@ def _bind_attention(self):
            )
        else:
            self._copy_kv_to_mem_cache = partial(Deepseek2TransformerLayerInfer._copy_kv_to_mem_cache_normal, self)
-        if get_env_start_args().enable_fa3:
+        if not get_env_start_args().disable_fa3:
            self._token_attention_kernel = partial(
                Deepseek2TransformerLayerInfer._token_gqa_decode_attention_flashattention, self
            )
@@ -118,7 +118,7 @@ def _bind_attention(self):
                Deepseek2TransformerLayerInfer._context_attention_kernel_with_CC_fp8, self
            )
        else:
-            if get_env_start_args().enable_fa3:
+            if not get_env_start_args().disable_fa3:
                self._context_attention_kernel = partial(
                    Deepseek2TransformerLayerInfer._context_attention_flashattention_kernel_with_CC, self
                )
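
`_bind_attention` selects kernels by binding unbound methods with `functools.partial`, and the flag flip only changes which branch is the default. A minimal sketch of that dispatch pattern follows; the kernel methods are placeholders, not lightllm's real implementations.

```python
# Hedged sketch of the kernel-binding pattern; kernel methods are placeholders.
from functools import partial


class LayerInferSketch:
    def __init__(self, disable_fa3: bool):
        if not disable_fa3:  # FA3 is the default path after this commit
            self._token_attention_kernel = partial(LayerInferSketch._decode_fa3, self)
        else:
            self._token_attention_kernel = partial(LayerInferSketch._decode_fallback, self)

    def _decode_fa3(self, q=None, infer_state=None):
        return "fa3 decode kernel"

    def _decode_fallback(self, q=None, infer_state=None):
        return "fallback decode kernel"


print(LayerInferSketch(disable_fa3=False)._token_attention_kernel())  # fa3 decode kernel
print(LayerInferSketch(disable_fa3=True)._token_attention_kernel())   # fallback decode kernel
```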

lightllm/models/deepseek2/model.py
Lines changed: 1 addition & 1 deletion

@@ -69,7 +69,7 @@ def __init__(self, kvargs):
        return

    def _init_inferstate_cls(self):
-        if get_env_start_args().enable_fa3:
+        if not get_env_start_args().disable_fa3:
            self.infer_state_class = Deepseek2FlashAttentionStateInfo
        elif self.enable_flashinfer:
            self.infer_state_class = Deepseek2FlashInferStateInfo

lightllm/models/llama/layer_infer/transformer_layer_infer.py
Lines changed: 1 addition & 1 deletion

@@ -68,7 +68,7 @@ def _bind_norm(self):
        return

    def _bind_attention(self):
-        if get_env_start_args().enable_fa3:
+        if not get_env_start_args().disable_fa3:
            if "offline_calibration_fp8kv" in self.mode:
                self._context_attention_kernel = partial(
                    LlamaTransformerLayerInfer._context_attention_flashattention_fp8, self

lightllm/models/llama/model.py
Lines changed: 1 addition & 1 deletion

@@ -97,7 +97,7 @@ def _init_mem_manager(self):
        return

    def _init_inferstate_cls(self):
-        if get_env_start_args().enable_fa3:
+        if not get_env_start_args().disable_fa3:
            self.infer_state_class = FlashAttentionStateInfo
        elif self.enable_flashinfer:
            self.infer_state_class = LlamaFlashInferStateInfo

lightllm/models/qwen2_vl/model.py
Lines changed: 1 addition & 1 deletion

@@ -106,7 +106,7 @@ def __init__(self, kvargs):
        return

    def _init_inferstate_cls(self):
-        if get_env_start_args().enable_fa3:
+        if not get_env_start_args().disable_fa3:
            self.infer_state_class = Qwen2VLFlashAttentionStateInfo

    def _init_config(self):
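
The model classes above (deepseek2, llama, qwen2_vl) pick their inference-state class with the same cascade, so FA3 now takes precedence unless it is explicitly disabled and flashinfer is only reached through the opt-out. Below is a small sketch of that precedence; the empty classes are stand-ins, and the qwen2_vl hunk only shows the FA3 branch.

```python
# Hedged sketch of the inferstate-class precedence implied by the diffs above.
class FlashAttentionStateInfo: ...  # stand-in for the FA3 state class
class FlashInferStateInfo: ...      # stand-in for the flashinfer state class
class DefaultStateInfo: ...         # stand-in for the default state class


def pick_inferstate_cls(disable_fa3: bool, enable_flashinfer: bool):
    if not disable_fa3:  # default branch after this commit
        return FlashAttentionStateInfo
    elif enable_flashinfer:
        return FlashInferStateInfo
    return DefaultStateInfo


print(pick_inferstate_cls(disable_fa3=False, enable_flashinfer=True).__name__)
# -> FlashAttentionStateInfo: FA3 wins unless explicitly disabled
print(pick_inferstate_cls(disable_fa3=True, enable_flashinfer=True).__name__)
# -> FlashInferStateInfo
```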
