# If you are using multi-node training, please additionally set the `MODELSCOPE_CACHE` environment variable to a shared storage path.
# This will ensure that the dataset cache is shared, thereby speeding up preprocessing.
# Note: This step is crucial; otherwise multi-node training may hang due to data inconsistencies caused by randomness in data preprocessing.
export MODELSCOPE_CACHE='/xxx/shared'

# Megatron-LM
# The training module in the dependent library Megatron-LM will be cloned and installed by swift via `git clone`. Alternatively, you can use the environment variable `MEGATRON_LM_PATH` to point to the path of an already downloaded repository (in offline environments, use the [core_r0.13.0 branch](https://github.com/NVIDIA/Megatron-LM/tree/core_r0.13.0)).
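# For example, in an offline environment (the path below is a placeholder for wherever you cloned the core_r0.13.0 branch):
# export MEGATRON_LM_PATH='/xxx/Megatron-LM'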

# flash-attn
# Choose an appropriate version to install: https://github.com/Dao-AILab/flash-attention/releases/tag/v2.7.4.post1
# Note: Do not install a version higher than the maximum supported by transformer_engine: https://github.com/NVIDIA/TransformerEngine/blob/release_v2.6/transformer_engine/pytorch/attention/dot_product_attention/utils.py#L109
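# For example (version shown is illustrative; confirm compatibility with your torch/CUDA build and the transformer_engine limit above):
# pip install flash-attn==2.7.4.post1 --no-build-isolation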
```

Alternatively, you can use the image:
- Megatron-SWIFT uses the same dataset and template processing modules as ms-swift, thus supporting techniques such as packing, loss scale, and agent training. For custom dataset formats, please refer to the [Custom Dataset Documentation](../Customization/Custom-dataset.md).
- **More Examples**, including packing, multi-node training, 32K context length, DPO, MoE models, and pre-training, can be found [here](https://github.com/modelscope/ms-swift/tree/main/examples/megatron).

Training tips:
- Ways to increase training throughput: use packing, increase DP (data parallelism), reduce recomputation, and increase computation-communication overlap.
- Parallelism choices (see the launch sketch after this list):
  - Megatron-SWIFT uses ZeRO-1 (`use_distributed_optimizer` enabled by default) combined with various parallelism techniques.
  - DP is the fastest but consumes the most memory; use the other parallelism techniques to reduce memory usage.
  - TP/EP involve heavy communication, so keep them within the NVLink domain when possible; for cross-domain setups, prefer PP/DP. For expert layers, prefer EP over ETP: ETP saves memory but is slower.
  - MoE parallel folding: the MoE parallel groups are separated from the dense groups. Attention uses the tp-cp-dp-pp groups, while MoE uses the etp-ep-dp-pp groups.
- Choosing parallelism for weight conversion: Megatron-SWIFT uses the `torch_dist` storage format on the MCore side, so you can adjust parallelism at training time and do not need to specify it during weight conversion (a conversion sketch follows below).
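A minimal launch sketch tying these tips together, assuming a single 8-GPU node, an MoE model already converted to mcore format, and placeholder model/dataset paths: TP and EP are kept inside the node's NVLink domain and packing is enabled. Verify the flag names and parallel sizes against the Megatron-SWIFT command-line parameter documentation for your version.

```shell
# Hypothetical single-node launch (paths and sizes are placeholders).
# TP/EP groups stay within one 8-GPU NVLink domain; packing improves throughput;
# the distributed optimizer (ZeRO-1) is enabled by default.
NPROC_PER_NODE=8 \
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
megatron sft \
    --load /xxx/model-mcore \
    --dataset /xxx/dataset.jsonl \
    --tensor_model_parallel_size 2 \
    --expert_model_parallel_size 4 \
    --pipeline_model_parallel_size 1 \
    --micro_batch_size 1 \
    --global_batch_size 16 \
    --packing true \
    --max_length 8192 \
    --save megatron_output/xxx
```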
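For the weight-conversion point above, a sketch (with placeholder paths) of converting Hugging Face weights to the mcore `torch_dist` format might look like the following; note that no TP/PP/EP sizes are passed, since those are chosen later at training time. Confirm the exact `swift export` arguments against the Megatron-SWIFT documentation for your version.

```shell
# Hypothetical HF -> mcore conversion (placeholder paths); no parallelism flags are needed here,
# because the resulting torch_dist checkpoint can be loaded under any training-time parallel layout.
CUDA_VISIBLE_DEVICES=0 \
swift export \
    --model /xxx/model-hf \
    --to_mcore true \
    --torch_dtype bfloat16 \
    --output_dir /xxx/model-mcore
```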
## Benchmark
The following compares the training speed of `megatron sft` and `swift sft` for full-parameter training of dense models with an 8K context length on a single node with eight A800 GPUs: