
Commit fc2bc3d

update router_aux_loss_coef (#5318)
1 parent 8cf7603 commit fc2bc3d


8 files changed, 22 additions and 24 deletions


docs/source/Instruction/Megatron-SWIFT训练.md

Lines changed: 2 additions & 1 deletion
@@ -404,7 +404,8 @@ swift export \
 - moe_enable_deepep: Experimental feature. Enables DeepSeek/DeepEP for efficient token dispatch and combination in MoE models. Only takes effect when the flexible token dispatcher is selected via `--moe_token_dispatcher_type flex`.
 - 🔥moe_grouped_gemm: When each rank contains multiple experts, improves utilization and performance by launching multiple local GEMM kernels across multiple streams, using GroupedLinear from TransformerEngine. Defaults to False.
 - 🔥moe_permute_fusion: Fuses token permutation operations during token dispatch. Defaults to False.
-- 🔥moe_aux_loss_coeff: Scaling coefficient for the auxiliary loss; a recommended initial value is 1e-2. Defaults to None and is read automatically from config.json.
+- 🔥moe_aux_loss_coeff: Defaults to 0, which disables the aux_loss.
+  - Note: In ms-swift < 3.7.1, the default is None and the value is read automatically from config.json.
 - moe_z_loss_coeff: Scaling coefficient for the z-loss. Defaults to None.
 - moe_expert_capacity_factor: Capacity factor for each expert; None means no tokens are dropped. Defaults to None and is read automatically from config.json.
 - 🔥moe_shared_expert_overlap: Enables overlap between shared expert computation and dispatcher communication. If not enabled, shared experts execute after the routed experts. Only effective when `moe_shared_expert_intermediate_size` is set. Defaults to False.

docs/source/Instruction/命令行参数.md

Lines changed: 2 additions & 1 deletion
@@ -163,7 +163,8 @@
 - 🔥report_to: Defaults to `tensorboard`. You can also specify `--report_to tensorboard wandb swanlab` or `--report_to all`.
 - logging_first_step: Whether to log the first step. Defaults to True.
 - logging_steps: Logging interval. Defaults to 5.
-- router_aux_loss_coef: Sets the weight of the aux_loss when training MoE models. Defaults to None, which uses the value from the config. If set to 0, the aux_loss is not computed.
+- router_aux_loss_coef: Sets the weight of the aux_loss when training MoE models. Defaults to `0.`.
+  - Note: In ms-swift == 3.7.0, the default is None and the value is read from config.json; this behavior was changed in ms-swift >= 3.7.1.
 - logging_dir: TensorBoard log path. Defaults to None, i.e. set to `f'{self.output_dir}/runs'`.
 - predict_with_generate: Whether to use generation during validation. Defaults to False.
 - metric_for_best_model: Defaults to None, i.e. set to 'loss' when `predict_with_generate` is False, otherwise 'rouge-l' (no default is set for PPO training; for GRPO training it is set to 'reward').

docs/source_en/Instruction/Command-line-parameters.md

Lines changed: 2 additions & 1 deletion
@@ -166,7 +166,8 @@ This parameter list inherits from transformers `Seq2SeqTrainingArguments`, with
 - 🔥report_to: Default value is `tensorboard`. You can also specify `--report_to tensorboard wandb swanlab` or `--report_to all`.
 - logging_first_step: Whether to log the first step, defaults to True.
 - logging_steps: Interval for logging, defaults to 5.
-- router_aux_loss_coef: Weight for aux_loss when training MoE models. Defaults to None, meaning the value from the config is used. If set to 0, aux_loss is not computed.
+- router_aux_loss_coef: Sets the weight of the aux_loss when training MoE models; default is `0.`.
+  - Note: In ms-swift == 3.7.0, the default is None and the value is read from config.json; this behavior was changed starting with ms-swift >= 3.7.1.
 - logging_dir: The path for TensorBoard logs. Defaults to None, which means it is set to `f'{self.output_dir}/runs'`.
 - predict_with_generate: Whether to use generative method during validation, default is False.
 - metric_for_best_model: Default is None, which means that when predict_with_generate is set to False, it is set to 'loss'; otherwise, it is set to 'rouge-l' (during PPO training, the default value is not set; in GRPO training, it is set to 'reward').
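
A minimal sketch of what this coefficient does downstream, assuming an HF-style MoE causal-LM head (e.g. Mixtral-like models) that adds the router load-balancing loss to the language-modeling loss when `output_router_logits=True`; the function and variable names are illustrative, not ms-swift APIs:

```python
import torch

def combined_moe_loss(lm_loss: torch.Tensor,
                      aux_loss: torch.Tensor,
                      router_aux_loss_coef: float = 0.) -> torch.Tensor:
    # With the new default of 0., the balancing term contributes nothing and
    # training optimizes the language-modeling loss alone; a positive value
    # (e.g. 0.001) re-introduces pressure to spread tokens across experts.
    return lm_loss + router_aux_loss_coef * aux_loss
```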

docs/source_en/Instruction/Megatron-SWIFT-Training.md

Lines changed: 2 additions & 1 deletion
@@ -421,7 +421,8 @@ seq_length: Defaults to None, meaning it is set to `max_length`. To restrict the
 - moe_enable_deepep: Experimental feature; enables DeepSeek/DeepEP for efficient token dispatching and combination in MoE models. Only works when using the flexible token dispatcher by setting `--moe_token_dispatcher_type flex`.
 - 🔥moe_grouped_gemm: When each rank contains multiple experts, multiple local GEMM kernels can be launched in parallel streams to improve utilization and performance by using GroupedLinear from TransformerEngine. Default is False.
 - 🔥moe_permute_fusion: Fuses token permutation operations during token dispatch. Default is False.
-- 🔥moe_aux_loss_coeff: Scaling coefficient for the auxiliary loss; a recommended initial value is 1e-2. Default is None and is automatically read from config.json.
+- 🔥moe_aux_loss_coeff: Default is 0, which disables the aux_loss.
+  - Note: In ms-swift versions earlier than 3.7.1, the default is None and the value is automatically loaded from config.json.
 - moe_z_loss_coeff: Scaling coefficient for z-loss. Default is None.
 - moe_expert_capacity_factor: Capacity factor for each expert. None means no token will be dropped. Default is None and will be automatically read from config.json.
 - 🔥moe_shared_expert_overlap: Enables overlap between shared expert computation and dispatcher communication. If not enabled, shared expert computation will be performed after the routed experts. Only effective when `moe_shared_expert_intermediate_size` is set. Default is False.
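
For reference on what `moe_aux_loss_coeff` scales: MoE routers are commonly regularized with a Switch-Transformer-style load-balancing term, roughly coeff * N * sum_i(f_i * P_i). The sketch below is a generic illustration under that assumption and not Megatron-Core's exact implementation (which also offers sequence-level aux loss and other balancing types):

```python
import torch

def load_balancing_aux_loss(router_probs: torch.Tensor,
                            expert_mask: torch.Tensor,
                            moe_aux_loss_coeff: float = 1e-2) -> torch.Tensor:
    """Switch-Transformer-style balancing loss: coeff * N * sum_i f_i * P_i."""
    num_experts = router_probs.shape[-1]
    # f_i: fraction of tokens dispatched to expert i (from the one-hot routing mask)
    fraction_routed = expert_mask.float().mean(dim=0)
    # P_i: mean router probability assigned to expert i
    mean_router_prob = router_probs.mean(dim=0)
    return moe_aux_loss_coeff * num_experts * torch.sum(fraction_routed * mean_router_prob)
```

With the new default of 0, this term is simply dropped from the total loss.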

swift/megatron/argument/megatron_args.py

Lines changed: 1 addition & 3 deletions
@@ -217,7 +217,7 @@ class MegatronArguments(ExtraMegatronArguments):
     moe_enable_deepep: bool = False
     moe_grouped_gemm: bool = False
     moe_permute_fusion: bool = False
-    moe_aux_loss_coeff: Optional[float] = None
+    moe_aux_loss_coeff: float = 0.
     moe_z_loss_coeff: Optional[float] = None
     moe_expert_capacity_factor: Optional[float] = None
     moe_shared_expert_overlap: bool = False
@@ -315,8 +315,6 @@ def _set_default(self):
             self.moe_router_topk = 2
         if self.moe_router_pre_softmax is None:
             self.moe_router_pre_softmax = False
-        if self.moe_aux_loss_coeff is None:
-            self.moe_aux_loss_coeff = 0.
         if self.moe_router_load_balancing_type is None:
             self.moe_router_load_balancing_type = 'aux_loss'
         if self.moe_router_enable_expert_bias is None:

swift/megatron/model/config.py

Lines changed: 0 additions & 1 deletion
@@ -27,7 +27,6 @@
     'moe_router_topk': ['num_experts_per_tok', 'n_group', 'moe_topk', 'moe_k'],
     'num_experts': ['num_experts', 'n_routed_experts', 'moe_num_experts'],
     'moe_router_pre_softmax': ['norm_topk_prob'],
-    'moe_aux_loss_coeff': ['router_aux_loss_coef'],
     # deepseek
     'q_lora_rank': ['q_lora_rank'],
     'kv_lora_rank': ['kv_lora_rank'],

swift/trainers/arguments.py

Lines changed: 1 addition & 1 deletion
@@ -30,7 +30,7 @@ class TrainArgumentsMixin:
     gradient_checkpointing_kwargs: Optional[Union[dict, str]] = None
     logging_first_step: bool = True
     logging_steps: int = 5
-    router_aux_loss_coef: Optional[float] = None
+    router_aux_loss_coef: float = 0.

     weight_decay: float = 0.1
     adam_beta2: float = 0.95

swift/trainers/trainers.py

Lines changed: 12 additions & 15 deletions
@@ -303,6 +303,8 @@ def prediction_step(
         return None, response_list, labels_list

     def _prepare_inputs(self, inputs):
+        from swift.llm import HfConfigFactory
+        args = self.args
         inputs = super()._prepare_inputs(inputs)
         from swift.plugin.loss import get_loss_func
         loss_kwargs = {}
@@ -315,7 +315,8 @@ def _prepare_inputs(self, inputs):

         sample_channels = inputs.pop('channel', None)
         position_ids = inputs.pop('_position_ids', None)
-        if self.args.channels is not None:
+        if args.channels is not None:
             assert sample_channels is not None, f'sample_channels: {sample_channels}'
             state = self.state
             setattr(state, 'local_step', getattr(state, 'local_step', 0))
@@ -334,22 +336,17 @@ def _prepare_inputs(self, inputs):
         inputs['labels'], logits_to_keep = self.get_logits_to_keep(inputs['labels'])
         if logits_to_keep is not None:
             inputs['logits_to_keep'] = logits_to_keep
-            if self.args.tuner_backend == 'unsloth' and isinstance(logits_to_keep, torch.Tensor):
+            if args.tuner_backend == 'unsloth' and isinstance(logits_to_keep, torch.Tensor):
                 inputs['logits_to_keep'] = int(logits_to_keep.sum())

-        if self.model.model_info.is_moe_model:
-            base_model = self.template.get_base_model(self.model)
-            router_aux_loss_coef = self.args.router_aux_loss_coef
-            if router_aux_loss_coef is None:
-                router_aux_loss_coef = getattr(base_model.config, 'router_aux_loss_coef', None)
-            if router_aux_loss_coef is not None:
-                from swift.llm import HfConfigFactory
-                HfConfigFactory.set_config_attr(base_model.config, 'router_aux_loss_coef', router_aux_loss_coef)
-                base_model.router_aux_loss_coef = router_aux_loss_coef
-                logger.info_once(f'router_aux_loss_coef: {router_aux_loss_coef}')
-                if router_aux_loss_coef > 0 and 'output_router_logits' in inspect.signature(
-                        base_model.forward).parameters:
-                    inputs['output_router_logits'] = True
+        base_model = self.template.get_base_model(self.model)
+        if self.model.model_info.is_moe_model and 'output_router_logits' in inspect.signature(
+                base_model.forward).parameters:
+            HfConfigFactory.set_config_attr(base_model.config, 'router_aux_loss_coef', args.router_aux_loss_coef)
+            base_model.router_aux_loss_coef = args.router_aux_loss_coef
+            logger.info_once(f'router_aux_loss_coef: {args.router_aux_loss_coef}')
+            if args.router_aux_loss_coef > 0:
+                inputs['output_router_logits'] = True
         inputs['compute_loss_func'] = compute_loss_func
         inputs['loss_kwargs'] = loss_kwargs
         return inputs
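
The key behavioral change above is that the trainer no longer falls back to `router_aux_loss_coef` from config.json; it always writes `args.router_aux_loss_coef` into the base model's config and only requests router logits when the coefficient is positive and the model's forward accepts the flag. A standalone, hedged sketch of that decision (the helper name is illustrative, not part of the repo):

```python
import inspect

def should_output_router_logits(base_model, router_aux_loss_coef: float) -> bool:
    # Only MoE architectures expose `output_router_logits` in forward();
    # checking the signature avoids passing an unexpected kwarg to dense models.
    accepts_flag = 'output_router_logits' in inspect.signature(base_model.forward).parameters
    # A coefficient of 0. (the new default) would make the aux_loss a no-op,
    # so the extra router-logit bookkeeping is skipped entirely.
    return accepts_flag and router_aux_loss_coef > 0
```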
