README.md: 5 additions & 5 deletions
@@ -35,7 +35,7 @@ Therefore, to facilitate the training of LLaMA-based models and reduce the cost

# OverlappedDistributedOptimizer

-In the vanilla Megatron-LM, users can leverage [`DistributedOptimizer`]("https://github.com/NVIDIA/Megatron-LM/blob/main/docs/distrib_optimizer.md") to partition gradients and optimizer states to reduce GPU memory occupation. After accumulating all gradients in GA, `DistributedOptimizer` employs a `ReduceScatter` operation to scatter the gradients to the corresponding ranks. Each rank then updates its local parameters and collects the remaining parameters from all other ranks through an `AllGather` operation. However, we observe a significant communication overhead under small GA settings (over 50% of the time consumption without GA).
+In the vanilla Megatron-LM, users can leverage [`DistributedOptimizer`](https://github.com/NVIDIA/Megatron-LM/blob/main/docs/distrib_optimizer.md) to partition gradients and optimizer states to reduce GPU memory occupation. After accumulating all gradients in GA, `DistributedOptimizer` employs a `ReduceScatter` operation to scatter the gradients to the corresponding ranks. Each rank then updates its local parameters and collects the remaining parameters from all other ranks through an `AllGather` operation. However, we observe a significant communication overhead under small GA settings (over 50% of the time consumption without GA).

To mitigate the overhead, we try to overlap the collective communication with computation, following the partition strategy of DeepSpeed ZeRO Stage-2. However, this strategy fails to scale: it issues too many small `Reduce` operations at large scale, which under-utilizes the inter-connection bandwidth.
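For reference, the hunk above describes a ReduceScatter-then-AllGather update. The following is a minimal PyTorch sketch of that general pattern, not Megatron-LLaMA's actual implementation: the `distributed_optimizer_step` helper, the flat gradient/parameter buffers, the plain-SGD update, and the even shard split are all simplifying assumptions.

```python
# Sketch of the ReduceScatter -> local update -> AllGather pattern described above.
# Assumes torch.distributed is already initialized (e.g. via init_process_group)
# and that the buffers split evenly across ranks.
import torch
import torch.distributed as dist

def distributed_optimizer_step(flat_grad: torch.Tensor,
                               flat_param: torch.Tensor,
                               lr: float = 1e-3) -> None:
    world_size = dist.get_world_size()
    rank = dist.get_rank()
    shard_size = flat_grad.numel() // world_size  # even split assumed for brevity

    # 1. ReduceScatter: each rank receives the fully reduced gradient shard it owns.
    grad_shards = list(flat_grad.split(shard_size))
    my_grad = torch.empty_like(grad_shards[rank])
    dist.reduce_scatter(my_grad, grad_shards, op=dist.ReduceOp.SUM)

    # 2. Local update: each rank updates only its own parameter shard
    #    (plain SGD here stands in for the real optimizer).
    param_shards = list(flat_param.split(shard_size))
    param_shards[rank].add_(my_grad, alpha=-lr / world_size)

    # 3. AllGather: collect the updated shards from all ranks so every rank
    #    ends up with the full, updated parameter buffer.
    gathered = [torch.empty_like(s) for s in param_shards]
    dist.all_gather(gathered, param_shards[rank].contiguous())
    flat_param.copy_(torch.cat(gathered))
```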
@@ -125,7 +125,7 @@ In particular, we recommend to increase the micro-batch size to fully occupy the

|`--tokenizer-type=PretrainedFromHF`| Use a tokenizer from HuggingFace (loaded via `transformers.AutoTokenizer`) |
|`--distributed-checkpointing`| Distributed saving of checkpoint files. |

-Megatron-LLaMA supports the canonical [data preprocessing]("https://github.com/NVIDIA/Megatron-LM/blob/main/README.md#data-preprocessing") and [evaluation]("https://github.com/NVIDIA/Megatron-LM/blob/main/README.md#evaluation-and-tasks") as described in the Megatron-LM library.
+Megatron-LLaMA supports the canonical [data preprocessing](https://github.com/NVIDIA/Megatron-LM/blob/main/README.md#data-preprocessing) and [evaluation](https://github.com/NVIDIA/Megatron-LM/blob/main/README.md#evaluation-and-tasks) as described in the Megatron-LM library.

### Future work
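For context on the `--tokenizer-type=PretrainedFromHF` row in the hunk above: it indicates that the tokenizer is loaded through HuggingFace's `transformers.AutoTokenizer`. A minimal sketch, assuming a converted LLaMA checkpoint at a placeholder path:

```python
from transformers import AutoTokenizer

# The path below is a placeholder for a HuggingFace-format LLaMA checkpoint;
# it is not a path shipped with Megatron-LLaMA.
tokenizer = AutoTokenizer.from_pretrained("path/to/llama-hf-checkpoint")

ids = tokenizer("Megatron-LLaMA overlaps communication with computation.")["input_ids"]
print(ids)
```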
@@ -144,9 +144,9 @@ Megatron-LLaMA is developed by Aicheng Technology, Alibaba Group and is based on

The following repositories are used in Megatron-LLaMA, either in close to original form or as an inspiration: