You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Copy file name to clipboardExpand all lines: docs/source_en/Instruction/Command-line-parameters.md
+6-6Lines changed: 6 additions & 6 deletions
Display the source diff
Display the rich diff
Original file line number
Diff line number
Diff line change
@@ -214,7 +214,7 @@ Other important parameters:
214
214
- train_dataloader_shuffle: Whether to shuffle the dataloader in CPT/SFT training. Default is `True`. Not effective for `IterableDataset`, which uses sequential loading.
215
215
- 🔥neftune_noise_alpha: Noise magnitude for NEFTune. Default is 0. Common values: 5, 10, 15.
216
216
- 🔥use_liger_kernel: Whether to enable the [Liger](https://github.com/linkedin/Liger-Kernel) kernel to accelerate training and reduce GPU memory consumption. Defaults to False. Example shell script can be found [here](https://github.com/modelscope/ms-swift/blob/main/examples/train/liger).
217
-
- Note: Liger kernel does not support `device_map`. Use DDP or DeepSpeed for multi-GPU training.
217
+
- Note: Liger kernel does not support `device_map`. Use DDP or DeepSpeed for multi-GPU training. Currently, liger_kernel only supports `task_type='causal_lm'`.
218
218
- average_tokens_across_devices: Whether to average token counts across devices. If `True`, `num_tokens_in_batch` is synchronized via `all_reduce` for accurate loss computation. Default is `False`.
219
219
- max_grad_norm: Gradient clipping. Default is 1.
220
220
- Note: The logged `grad_norm` reflects the value **before** clipping.
@@ -423,7 +423,7 @@ Training arguments include the [base arguments](#base-arguments), [Seq2SeqTraine
423
423
- packing_length: the length to use for packing. Defaults to None, in which case it is set to max_length.
424
424
- lazy_tokenize: Whether to use lazy tokenization. If set to `False`, all dataset samples will be tokenized (and for multimodal models, images will be loaded from disk) before training begins. Default is `None`: in LLM training, it defaults to `False`; in MLLM training, it defaults to `True` to save memory.
425
425
- Note: If you want to perform image data augmentation, you need to set `lazy_tokenize` (or `streaming`) to True and modify the `encode` method in the Template class.
426
-
- cached_dataset: Use cached datasets during training (generated via the command `swift export --to_cached_dataset true ...`) to avoid GPU memory being occupied by tokenization when training with large datasets. Default is `[]`. Example: [here](https://github.com/modelscope/ms-swift/tree/main/examples/export/cached_dataset).
426
+
- cached_dataset: Use a cached dataset (generated with `swift export --to_cached_dataset true ...`) during training to avoid GPU time spent on tokenizing large datasets. Default is `[]`. Example: [here](https://github.com/modelscope/ms-swift/tree/main/examples/export/cached_dataset).
427
427
- Note: cached_dataset supports `--packing` but does not support `--lazy_tokenize` or `--streaming`.
428
428
- use_logits_to_keep: Pass `logits_to_keep` in the `forward` method based on labels to reduce the computation and storage of unnecessary logits, thereby reducing memory usage and accelerating training. The default is `None`, which enables automatic selection.
429
429
- acc_strategy: Strategy for calculating accuracy during training and validation. Options are `seq`-level and `token`-level accuracy, with `token` as the default.
@@ -456,7 +456,7 @@ Training arguments include the [base arguments](#base-arguments), [Seq2SeqTraine
456
456
RLHF arguments inherit from the [training arguments](#training-arguments).
457
457
458
458
- 🔥rlhf_type: Type of human alignment algorithm, supporting 'dpo', 'orpo', 'simpo', 'kto', 'cpo', 'rm', 'ppo', 'grpo' and 'gkd'. Default is 'dpo'.
459
-
- ref_model: Required for full parameter training when using the dpo, kto, ppo or grpo algorithms. Default is None.
459
+
- ref_model: Required for full parameter training when using the dpo, kto, ppo or grpo algorithms. Default is None, set to `--model`.
460
460
- ref_adapters: Default is `[]`. If you want to use the LoRA weights generated from SFT for DPO/KTO/GRPO, please use "ms-swift>=3.8" and set `--adapters sft_ckpt --ref_adapters sft_ckpt`. For resuming training from a checkpoint in this scenario, set `--resume_from_checkpoint rlhf_ckpt --ref_adapters sft_ckpt`.
461
461
- ref_model_type: Same as model_type. Default is None.
462
462
- ref_model_revision: Same as model_revision. Default is None.
@@ -749,14 +749,14 @@ The parameter meanings are the same as in the `qwen_vl_utils>=0.0.14` library
749
749
750
750
- SPATIAL_MERGE_SIZE: default 2.
751
751
- IMAGE_MIN_TOKEN_NUM: default `4`, denotes the minimum number of image tokens per image.
752
-
- 🔥IMAGE_MAX_TOKEN_NUM: default `16384`, denotes the maximum number of image tokens per image. (used to avoid OOM)
752
+
- 🔥IMAGE_MAX_TOKEN_NUM: default `16384`, denotes the maximum number of image tokens per image. (used to avoid OOM)
753
753
- VIDEO_MIN_TOKEN_NUM: default `128`, denotes the minimum number of video tokens per frame.
754
-
- 🔥VIDEO_MAX_TOKEN_NUM: default `768`, denotes the maximum number of video tokens per frame. (used to avoid OOM)
754
+
- 🔥VIDEO_MAX_TOKEN_NUM: default `768`, denotes the maximum number of video tokens per frame. (used to avoid OOM)
755
755
- MAX_RATIO: default 200.
756
756
- FRAME_FACTOR: default 2.
757
757
- FPS: default 2.0.
758
758
- FPS_MIN_FRAMES: default 4, denotes the minimum number of sampled frames for a video segment.
759
-
- 🔥FPS_MAX_FRAMES: default 768, denotes the maximum number of sampled frames for a video segment. (used to avoid OOM)
759
+
- 🔥FPS_MAX_FRAMES: default 768, denotes the maximum number of sampled frames for a video segment. (used to avoid OOM)
0 commit comments