Commit 59c2642

[docs] update docs (#6090)
1 parent 532b83f

File tree

7 files changed: +143 −116 lines

docs/source/Instruction/命令行参数.md

Lines changed: 2 additions & 2 deletions
@@ -213,7 +213,7 @@
 - train_dataloader_shuffle: Whether to shuffle the dataloader during CPT/SFT training. Default is True. This parameter has no effect on IterableDataset, which is read sequentially.
 - 🔥neftune_noise_alpha: The noise coefficient added by NEFTune. Default is 0; it is typically set to 5, 10, or 15.
 - 🔥use_liger_kernel: Whether to enable the [Liger](https://github.com/linkedin/Liger-Kernel) kernel to accelerate training and reduce GPU memory usage. Default is False. For an example shell script, see [here](https://github.com/modelscope/ms-swift/blob/main/examples/train/liger).
-  - Note: liger_kernel does not support device_map; use DDP/DeepSpeed for multi-GPU training.
+  - Note: liger_kernel does not support device_map; use DDP/DeepSpeed for multi-GPU training. liger_kernel currently only supports `task_type='causal_lm'`.
 - average_tokens_across_devices: Whether to average token counts across devices. If set to True, `num_tokens_in_batch` is synchronized via all_reduce for accurate loss computation. Default is False.
 - max_grad_norm: Gradient clipping. Default is 1.
   - Note: The grad_norm recorded in the logs is the value before clipping.
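A hedged sketch of the note above (the model ID and dataset are placeholders, not from this commit): since liger_kernel rules out device_map, multi-GPU training goes through DDP, and the model is a causal LM.

```bash
# Minimal sketch, assuming a causal LM and ms-swift CLI conventions
# (NPROC_PER_NODE launches DDP); model/dataset are placeholders.
CUDA_VISIBLE_DEVICES=0,1 \
NPROC_PER_NODE=2 \
swift sft \
    --model Qwen/Qwen2.5-7B-Instruct \
    --dataset <your_dataset> \
    --train_type lora \
    --use_liger_kernel true
```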
@@ -448,7 +448,7 @@ Vera uses the three parameters `target_modules`, `target_regex`, and `modules_to_save`,
 RLHF arguments inherit from the [training arguments](#训练参数).

 - 🔥rlhf_type: The human alignment algorithm type; supports 'dpo', 'orpo', 'simpo', 'kto', 'cpo', 'rm', 'ppo', 'grpo', and 'gkd'. Default is 'dpo'.
-- ref_model: Required when using the dpo, kto, ppo, or grpo algorithms with full-parameter training. Default is None.
+- ref_model: Required when using the dpo, kto, ppo, or grpo algorithms with full-parameter training. Default is None, in which case it is set to `--model`.
 - ref_adapters: Default is `[]`. To use the LoRA weights produced by SFT for DPO/KTO/GRPO, use "ms-swift>=3.8" and set `--adapters sft_ckpt --ref_adapters sft_ckpt` during training. For resuming from a checkpoint in this scenario, set `--resume_from_checkpoint rlhf_ckpt --ref_adapters sft_ckpt`.
 - ref_model_type: Same as model_type. Default is None.
 - ref_model_revision: Same as model_revision. Default is None.
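To make the ref_adapters workflow concrete, a hedged sketch (checkpoint path and dataset are placeholders) of DPO on top of SFT LoRA weights, using the flags documented above:

```bash
# Sketch: reuse the SFT LoRA checkpoint as both the trainable adapter and the
# reference model's adapter (requires ms-swift>=3.8, per the note above).
swift rlhf \
    --rlhf_type dpo \
    --model Qwen/Qwen2.5-7B-Instruct \
    --adapters output/sft_ckpt \
    --ref_adapters output/sft_ckpt \
    --dataset <your_dpo_dataset>
```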

docs/source/Megatron-SWIFT/命令行参数.md

Lines changed: 65 additions & 53 deletions
Large diffs are not rendered by default.

docs/source/Megatron-SWIFT/快速开始.md

Lines changed: 2 additions & 1 deletion
@@ -160,7 +160,8 @@ I am a language model developed by swift, you can call me swift-robot. How can I
 - Megatron-SWIFT uses the same dataset and template processing modules as ms-swift, so it likewise supports packing, loss_scale, agent training, and similar techniques. For the custom dataset format, see the [custom dataset documentation](../Customization/自定义数据集.md).
 - **More examples**: covering packing, multi-node training, 32K context, DPO, MoE models, and pre-training, can be found [here](https://github.com/modelscope/ms-swift/tree/main/examples/megatron).

-Training tips:
+
+## Training Tips
 - Ways to increase training throughput: use packing, increase DP, reduce recomputation, and increase computation-communication overlap.
 - Choosing parallelism techniques:
   - Megatron-SWIFT's parallelism uses zero1 (use_distributed_optimizer enabled by default) combined with the various parallelism techniques.
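A hedged illustration of these throughput tips (the mcore checkpoint and dataset are placeholders; flag names follow the Megatron-SWIFT command-line parameters document rather than this excerpt):

```bash
# Sketch: raise throughput with packing and reduced recomputation; zero1
# (use_distributed_optimizer) is already enabled by default.
NPROC_PER_NODE=4 \
megatron sft \
    --load Qwen2.5-7B-Instruct-mcore \
    --dataset <your_dataset> \
    --packing true \
    --tensor_model_parallel_size 2 \
    --recompute_granularity selective
```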

docs/source_en/Instruction/Command-line-parameters.md

Lines changed: 6 additions & 6 deletions
@@ -214,7 +214,7 @@ Other important parameters:
 - train_dataloader_shuffle: Whether to shuffle the dataloader in CPT/SFT training. Default is `True`. Not effective for `IterableDataset`, which uses sequential loading.
 - 🔥neftune_noise_alpha: Noise magnitude for NEFTune. Default is 0. Common values: 5, 10, 15.
 - 🔥use_liger_kernel: Whether to enable the [Liger](https://github.com/linkedin/Liger-Kernel) kernel to accelerate training and reduce GPU memory consumption. Defaults to False. Example shell script can be found [here](https://github.com/modelscope/ms-swift/blob/main/examples/train/liger).
-  - Note: Liger kernel does not support `device_map`. Use DDP or DeepSpeed for multi-GPU training.
+  - Note: Liger kernel does not support `device_map`. Use DDP or DeepSpeed for multi-GPU training. Currently, liger_kernel only supports `task_type='causal_lm'`.
 - average_tokens_across_devices: Whether to average token counts across devices. If `True`, `num_tokens_in_batch` is synchronized via `all_reduce` for accurate loss computation. Default is `False`.
 - max_grad_norm: Gradient clipping. Default is 1.
   - Note: The logged `grad_norm` reflects the value **before** clipping.
@@ -423,7 +423,7 @@ Training arguments include the [base arguments](#base-arguments), [Seq2SeqTraine
 - packing_length: the length to use for packing. Defaults to None, in which case it is set to max_length.
 - lazy_tokenize: Whether to use lazy tokenization. If set to `False`, all dataset samples will be tokenized (and for multimodal models, images will be loaded from disk) before training begins. Default is `None`: in LLM training, it defaults to `False`; in MLLM training, it defaults to `True` to save memory.
   - Note: If you want to perform image data augmentation, you need to set `lazy_tokenize` (or `streaming`) to True and modify the `encode` method in the Template class.
-- cached_dataset: Use cached datasets during training (generated via the command `swift export --to_cached_dataset true ...`) to avoid GPU memory being occupied by tokenization when training with large datasets. Default is `[]`. Example: [here](https://github.com/modelscope/ms-swift/tree/main/examples/export/cached_dataset).
+- cached_dataset: Use a cached dataset (generated with `swift export --to_cached_dataset true ...`) during training to avoid GPU time spent on tokenizing large datasets. Default is `[]`. Example: [here](https://github.com/modelscope/ms-swift/tree/main/examples/export/cached_dataset).
   - Note: cached_dataset supports `--packing` but does not support `--lazy_tokenize` or `--streaming`.
 - use_logits_to_keep: Pass `logits_to_keep` in the `forward` method based on labels to reduce the computation and storage of unnecessary logits, thereby reducing memory usage and accelerating training. The default is `None`, which enables automatic selection.
 - acc_strategy: Strategy for calculating accuracy during training and validation. Options are `seq`-level and `token`-level accuracy, with `token` as the default.
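A hedged two-step sketch of the cached_dataset flow (model, dataset, and output path are placeholders): export the tokenized cache once, then point training at it.

```bash
# Step 1: pre-tokenize the dataset (export command taken from the bullet above).
swift export \
    --model Qwen/Qwen2.5-7B-Instruct \
    --dataset <your_dataset> \
    --to_cached_dataset true \
    --output_dir ./cached_dataset

# Step 2: train from the cache; --packing is supported,
# while --lazy_tokenize and --streaming are not.
swift sft \
    --model Qwen/Qwen2.5-7B-Instruct \
    --cached_dataset ./cached_dataset \
    --packing true
```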
@@ -456,7 +456,7 @@ Training arguments include the [base arguments](#base-arguments), [Seq2SeqTraine
 RLHF arguments inherit from the [training arguments](#training-arguments).

 - 🔥rlhf_type: Type of human alignment algorithm, supporting 'dpo', 'orpo', 'simpo', 'kto', 'cpo', 'rm', 'ppo', 'grpo' and 'gkd'. Default is 'dpo'.
-- ref_model: Required for full parameter training when using the dpo, kto, ppo or grpo algorithms. Default is None.
+- ref_model: Required for full parameter training when using the dpo, kto, ppo or grpo algorithms. Default is None, in which case it is set to `--model`.
 - ref_adapters: Default is `[]`. If you want to use the LoRA weights generated from SFT for DPO/KTO/GRPO, please use "ms-swift>=3.8" and set `--adapters sft_ckpt --ref_adapters sft_ckpt`. For resuming training from a checkpoint in this scenario, set `--resume_from_checkpoint rlhf_ckpt --ref_adapters sft_ckpt`.
 - ref_model_type: Same as model_type. Default is None.
 - ref_model_revision: Same as model_revision. Default is None.
@@ -749,14 +749,14 @@ The parameter meanings are the same as in the `qwen_vl_utils>=0.0.14` library

 - SPATIAL_MERGE_SIZE: default 2.
 - IMAGE_MIN_TOKEN_NUM: default `4`, denotes the minimum number of image tokens per image.
-- 🔥 IMAGE_MAX_TOKEN_NUM: default `16384`, denotes the maximum number of image tokens per image. (used to avoid OOM)
+- 🔥IMAGE_MAX_TOKEN_NUM: default `16384`, denotes the maximum number of image tokens per image. (used to avoid OOM)
 - VIDEO_MIN_TOKEN_NUM: default `128`, denotes the minimum number of video tokens per frame.
-- 🔥 VIDEO_MAX_TOKEN_NUM: default `768`, denotes the maximum number of video tokens per frame. (used to avoid OOM)
+- 🔥VIDEO_MAX_TOKEN_NUM: default `768`, denotes the maximum number of video tokens per frame. (used to avoid OOM)
 - MAX_RATIO: default 200.
 - FRAME_FACTOR: default 2.
 - FPS: default 2.0.
 - FPS_MIN_FRAMES: default 4, denotes the minimum number of sampled frames for a video segment.
-- 🔥 FPS_MAX_FRAMES: default 768, denotes the maximum number of sampled frames for a video segment. (used to avoid OOM)
+- 🔥FPS_MAX_FRAMES: default 768, denotes the maximum number of sampled frames for a video segment. (used to avoid OOM)
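Since these are environment variables, one hedged way to apply the OOM-related caps (the values, model, and dataset below are illustrative placeholders, not recommendations) is to set them inline before the training command:

```bash
# Sketch: tighten per-image and per-frame token budgets to avoid OOM.
IMAGE_MAX_TOKEN_NUM=1024 \
VIDEO_MAX_TOKEN_NUM=128 \
FPS_MAX_FRAMES=24 \
swift sft \
    --model Qwen/Qwen2.5-VL-7B-Instruct \
    --dataset <your_multimodal_dataset>
```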
### internvl, internvl_phi3
