Commit 59c2642

[docs] update docs (#6090)
1 parent 532b83f

File tree

7 files changed: +143 −116 lines

docs/source/Instruction/命令行参数.md

Lines changed: 2 additions & 2 deletions
@@ -213,7 +213,7 @@
 - train_dataloader_shuffle: Whether to shuffle the dataloader during CPT/SFT training. Default is True. This parameter has no effect on IterableDataset, which is read sequentially.
 - 🔥neftune_noise_alpha: The noise coefficient added by NEFTune. Default is 0; it is typically set to 5, 10, or 15.
 - 🔥use_liger_kernel: Whether to enable the [Liger](https://github.com/linkedin/Liger-Kernel) kernel to accelerate training and reduce GPU memory usage. Default is False. For an example shell script, see [here](https://github.com/modelscope/ms-swift/blob/main/examples/train/liger).
-  - Note: liger_kernel does not support device_map; use DDP/DeepSpeed for multi-GPU training.
+  - Note: liger_kernel does not support device_map; use DDP/DeepSpeed for multi-GPU training. liger_kernel currently only supports `task_type='causal_lm'`.
 - average_tokens_across_devices: Whether to average token counts across devices. If set to True, `num_tokens_in_batch` is synchronized via all_reduce for accurate loss computation. Default is False.
 - max_grad_norm: Gradient clipping. Default is 1.
   - Note: The grad_norm recorded in the logs is the value before clipping.
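A hedged sketch of the note above (the model ID and dataset are placeholders, not from this commit): since liger_kernel rules out device_map, multi-GPU training goes through DDP, and the model is a causal LM.

```bash
# Minimal sketch, assuming a causal LM and ms-swift CLI conventions
# (NPROC_PER_NODE launches DDP); model/dataset are placeholders.
CUDA_VISIBLE_DEVICES=0,1 \
NPROC_PER_NODE=2 \
swift sft \
    --model Qwen/Qwen2.5-7B-Instruct \
    --dataset <your_dataset> \
    --train_type lora \
    --use_liger_kernel true
```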
@@ -448,7 +448,7 @@ Vera uses the three parameters `target_modules`, `target_regex`, and `modules_to_save`,
 RLHF arguments inherit from the [training arguments](#训练参数).

 - 🔥rlhf_type: The human alignment algorithm type; supports 'dpo', 'orpo', 'simpo', 'kto', 'cpo', 'rm', 'ppo', 'grpo', and 'gkd'. Default is 'dpo'.
-- ref_model: Required when using the dpo, kto, ppo, or grpo algorithms with full-parameter training. Default is None.
+- ref_model: Required when using the dpo, kto, ppo, or grpo algorithms with full-parameter training. Default is None, in which case it is set to `--model`.
 - ref_adapters: Default is `[]`. To use the LoRA weights produced by SFT for DPO/KTO/GRPO, use "ms-swift>=3.8" and set `--adapters sft_ckpt --ref_adapters sft_ckpt` during training. For resuming from a checkpoint in this scenario, set `--resume_from_checkpoint rlhf_ckpt --ref_adapters sft_ckpt`.
 - ref_model_type: Same as model_type. Default is None.
 - ref_model_revision: Same as model_revision. Default is None.
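To make the ref_adapters workflow concrete, a hedged sketch (checkpoint path and dataset are placeholders) of DPO on top of SFT LoRA weights, using the flags documented above:

```bash
# Sketch: reuse the SFT LoRA checkpoint as both the trainable adapter and the
# reference model's adapter (requires ms-swift>=3.8, per the note above).
swift rlhf \
    --rlhf_type dpo \
    --model Qwen/Qwen2.5-7B-Instruct \
    --adapters output/sft_ckpt \
    --ref_adapters output/sft_ckpt \
    --dataset <your_dpo_dataset>
```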

docs/source/Megatron-SWIFT/命令行参数.md

Lines changed: 65 additions & 53 deletions
Large diffs are not rendered by default.

docs/source/Megatron-SWIFT/快速开始.md

Lines changed: 2 additions & 1 deletion
@@ -160,7 +160,8 @@ I am a language model developed by swift, you can call me swift-robot. How can I
 - Megatron-SWIFT uses the same dataset and template processing modules as ms-swift, so it likewise supports packing, loss_scale, agent training, and similar techniques. For the custom dataset format, see the [custom dataset documentation](../Customization/自定义数据集.md).
 - **More examples**: covering packing, multi-node training, 32K context, DPO, MoE models, and pre-training, can be found [here](https://github.com/modelscope/ms-swift/tree/main/examples/megatron).

-Training tips:
+
+## Training Tips
 - Ways to increase training throughput: use packing, increase DP, reduce recomputation, and increase computation-communication overlap.
 - Choosing parallelism techniques:
   - Megatron-SWIFT's parallelism uses zero1 (use_distributed_optimizer enabled by default) combined with the various parallelism techniques.
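A hedged illustration of these throughput tips (the mcore checkpoint and dataset are placeholders; flag names follow the Megatron-SWIFT command-line parameters document rather than this excerpt):

```bash
# Sketch: raise throughput with packing and reduced recomputation; zero1
# (use_distributed_optimizer) is already enabled by default.
NPROC_PER_NODE=4 \
megatron sft \
    --load Qwen2.5-7B-Instruct-mcore \
    --dataset <your_dataset> \
    --packing true \
    --tensor_model_parallel_size 2 \
    --recompute_granularity selective
```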

docs/source_en/Instruction/Command-line-parameters.md

Lines changed: 6 additions & 6 deletions
@@ -214,7 +214,7 @@ Other important parameters:
 - train_dataloader_shuffle: Whether to shuffle the dataloader in CPT/SFT training. Default is `True`. Not effective for `IterableDataset`, which uses sequential loading.
 - 🔥neftune_noise_alpha: Noise magnitude for NEFTune. Default is 0. Common values: 5, 10, 15.
 - 🔥use_liger_kernel: Whether to enable the [Liger](https://github.com/linkedin/Liger-Kernel) kernel to accelerate training and reduce GPU memory consumption. Defaults to False. Example shell script can be found [here](https://github.com/modelscope/ms-swift/blob/main/examples/train/liger).
-  - Note: Liger kernel does not support `device_map`. Use DDP or DeepSpeed for multi-GPU training.
+  - Note: Liger kernel does not support `device_map`. Use DDP or DeepSpeed for multi-GPU training. Currently, liger_kernel only supports `task_type='causal_lm'`.
 - average_tokens_across_devices: Whether to average token counts across devices. If `True`, `num_tokens_in_batch` is synchronized via `all_reduce` for accurate loss computation. Default is `False`.
 - max_grad_norm: Gradient clipping. Default is 1.
   - Note: The logged `grad_norm` reflects the value **before** clipping.
@@ -423,7 +423,7 @@ Training arguments include the [base arguments](#base-arguments), [Seq2SeqTraine
 - packing_length: the length to use for packing. Defaults to None, in which case it is set to max_length.
 - lazy_tokenize: Whether to use lazy tokenization. If set to `False`, all dataset samples will be tokenized (and for multimodal models, images will be loaded from disk) before training begins. Default is `None`: in LLM training, it defaults to `False`; in MLLM training, it defaults to `True` to save memory.
   - Note: If you want to perform image data augmentation, you need to set `lazy_tokenize` (or `streaming`) to True and modify the `encode` method in the Template class.
-- cached_dataset: Use cached datasets during training (generated via the command `swift export --to_cached_dataset true ...`) to avoid GPU memory being occupied by tokenization when training with large datasets. Default is `[]`. Example: [here](https://github.com/modelscope/ms-swift/tree/main/examples/export/cached_dataset).
+- cached_dataset: Use a cached dataset (generated with `swift export --to_cached_dataset true ...`) during training to avoid GPU time spent on tokenizing large datasets. Default is `[]`. Example: [here](https://github.com/modelscope/ms-swift/tree/main/examples/export/cached_dataset).
   - Note: cached_dataset supports `--packing` but does not support `--lazy_tokenize` or `--streaming`.
 - use_logits_to_keep: Pass `logits_to_keep` in the `forward` method based on labels to reduce the computation and storage of unnecessary logits, thereby reducing memory usage and accelerating training. The default is `None`, which enables automatic selection.
 - acc_strategy: Strategy for calculating accuracy during training and validation. Options are `seq`-level and `token`-level accuracy, with `token` as the default.
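A hedged two-step sketch of the cached_dataset flow (model, dataset, and output path are placeholders): export the tokenized cache once, then point training at it.

```bash
# Step 1: pre-tokenize the dataset (export command taken from the bullet above).
swift export \
    --model Qwen/Qwen2.5-7B-Instruct \
    --dataset <your_dataset> \
    --to_cached_dataset true \
    --output_dir ./cached_dataset

# Step 2: train from the cache; --packing is supported,
# while --lazy_tokenize and --streaming are not.
swift sft \
    --model Qwen/Qwen2.5-7B-Instruct \
    --cached_dataset ./cached_dataset \
    --packing true
```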
@@ -456,7 +456,7 @@ Training arguments include the [base arguments](#base-arguments), [Seq2SeqTraine
 RLHF arguments inherit from the [training arguments](#training-arguments).

 - 🔥rlhf_type: Type of human alignment algorithm, supporting 'dpo', 'orpo', 'simpo', 'kto', 'cpo', 'rm', 'ppo', 'grpo' and 'gkd'. Default is 'dpo'.
-- ref_model: Required for full parameter training when using the dpo, kto, ppo or grpo algorithms. Default is None.
+- ref_model: Required for full parameter training when using the dpo, kto, ppo or grpo algorithms. Default is None, in which case it is set to `--model`.
 - ref_adapters: Default is `[]`. If you want to use the LoRA weights generated from SFT for DPO/KTO/GRPO, please use "ms-swift>=3.8" and set `--adapters sft_ckpt --ref_adapters sft_ckpt`. For resuming training from a checkpoint in this scenario, set `--resume_from_checkpoint rlhf_ckpt --ref_adapters sft_ckpt`.
 - ref_model_type: Same as model_type. Default is None.
 - ref_model_revision: Same as model_revision. Default is None.
@@ -749,14 +749,14 @@ The parameter meanings are the same as in the `qwen_vl_utils>=0.0.14` library

 - SPATIAL_MERGE_SIZE: default 2.
 - IMAGE_MIN_TOKEN_NUM: default `4`, denotes the minimum number of image tokens per image.
-- 🔥 IMAGE_MAX_TOKEN_NUM: default `16384`, denotes the maximum number of image tokens per image. (used to avoid OOM)
+- 🔥IMAGE_MAX_TOKEN_NUM: default `16384`, denotes the maximum number of image tokens per image. (used to avoid OOM)
 - VIDEO_MIN_TOKEN_NUM: default `128`, denotes the minimum number of video tokens per frame.
-- 🔥 VIDEO_MAX_TOKEN_NUM: default `768`, denotes the maximum number of video tokens per frame. (used to avoid OOM)
+- 🔥VIDEO_MAX_TOKEN_NUM: default `768`, denotes the maximum number of video tokens per frame. (used to avoid OOM)
 - MAX_RATIO: default 200.
 - FRAME_FACTOR: default 2.
 - FPS: default 2.0.
 - FPS_MIN_FRAMES: default 4, denotes the minimum number of sampled frames for a video segment.
-- 🔥 FPS_MAX_FRAMES: default 768, denotes the maximum number of sampled frames for a video segment. (used to avoid OOM)
+- 🔥FPS_MAX_FRAMES: default 768, denotes the maximum number of sampled frames for a video segment. (used to avoid OOM)
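Since these are environment variables, one hedged way to apply the OOM-related caps (the values, model, and dataset below are illustrative placeholders, not recommendations) is to set them inline before the training command:

```bash
# Sketch: tighten per-image and per-frame token budgets to avoid OOM.
IMAGE_MAX_TOKEN_NUM=1024 \
VIDEO_MAX_TOKEN_NUM=128 \
FPS_MAX_FRAMES=24 \
swift sft \
    --model Qwen/Qwen2.5-VL-7B-Instruct \
    --dataset <your_multimodal_dataset>
```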
### internvl, internvl_phi3
