Commit 827c3e1

Fix broken links in README
1 parent d2e05b8 commit 827c3e1

2 files changed: +9 −9 lines changed

README.md

Lines changed: 5 additions & 5 deletions
@@ -35,7 +35,7 @@ Therefore, to facilitate the training of LLaMA-based models and reduce the cost
 
 # OverlappedDistributedOptimizer
 
-In the vanilla Megatron-LM, users can leverage [`DistributedOptimizer`]("https://github.com/NVIDIA/Megatron-LM/blob/main/docs/distrib_optimizer.md") to partition gradients and optimizer states to reduce GPU memory occupation. After accumulating all gradients in GA, `DistributedOptimizer` employs a `ReduceScatter` operation to scatter the gradients to the corresponding ranks. Each rank then updates its local parameters and collects the remaining parameters from all other ranks through an `AllGather` operation. However, we observe significant communication overhead under small GA settings (over 50% of the total time without GA).
+In the vanilla Megatron-LM, users can leverage [`DistributedOptimizer`](https://github.com/NVIDIA/Megatron-LM/blob/main/docs/distrib_optimizer.md) to partition gradients and optimizer states to reduce GPU memory occupation. After accumulating all gradients in GA, `DistributedOptimizer` employs a `ReduceScatter` operation to scatter the gradients to the corresponding ranks. Each rank then updates its local parameters and collects the remaining parameters from all other ranks through an `AllGather` operation. However, we observe significant communication overhead under small GA settings (over 50% of the total time without GA).
 
 To mitigate the overhead, we tried to overlap the collective communication with computation, following the partition strategy of DeepSpeed ZeRO Stage-2. However, this strategy fails to scale: at large scale it issues too many small `Reduce` operations, which under-utilizes the interconnect bandwidth.
 
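The changed paragraph above walks through `DistributedOptimizer`'s communication pattern: `ReduceScatter` the accumulated gradients so each rank owns one shard, update that shard locally, then `AllGather` the updated parameters back to every rank. A minimal sketch of that pattern with `torch.distributed` follows; the helper `distributed_optimizer_step`, the flattened `flat_grads`/`flat_params` tensors, and the plain SGD update are illustrative assumptions, not Megatron-LLaMA's actual implementation.

```python
# Minimal sketch (not Megatron-LLaMA code) of the ReduceScatter -> local update
# -> AllGather pattern described above. Assumes torch.distributed is already
# initialized (e.g. via torchrun) and that the flattened gradient/parameter
# tensors have the same length, divisible by the world size.
import torch
import torch.distributed as dist


def distributed_optimizer_step(flat_grads: torch.Tensor,
                               flat_params: torch.Tensor,
                               lr: float = 1e-3) -> None:
    world_size = dist.get_world_size()
    rank = dist.get_rank()
    shard_size = flat_params.numel() // world_size

    # 1) ReduceScatter: each rank receives the summed gradients of its own shard.
    grad_shards = list(flat_grads.chunk(world_size))
    my_grad_shard = torch.empty_like(grad_shards[rank])
    dist.reduce_scatter(my_grad_shard, grad_shards, op=dist.ReduceOp.SUM)
    my_grad_shard.div_(world_size)  # average over data-parallel ranks

    # 2) Local update: only the parameter shard owned by this rank is updated
    #    (plain SGD here as a stand-in for the real optimizer step).
    my_param_shard = flat_params.narrow(0, rank * shard_size, shard_size)
    my_param_shard.add_(my_grad_shard, alpha=-lr)

    # 3) AllGather: collect the updated shards from every rank so that all
    #    ranks end up with the full, updated parameter vector.
    gathered = [torch.empty_like(my_param_shard) for _ in range(world_size)]
    dist.all_gather(gathered, my_param_shard)
    flat_params.copy_(torch.cat(gathered))
```

In this pattern both collectives run only after all gradients have been accumulated, which is the serialized communication that `OverlappedDistributedOptimizer` aims to overlap with computation.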
@@ -125,7 +125,7 @@ In particular, we recommend to increase the micro-batch size to fully occupy the
 | `--tokenizer-type=PretrainedFromHF` | Use a tokenizer from Hugging Face (loaded via `transformers.AutoTokenizer`). |
 | `--distributed-checkpointing` | Distributed saving of checkpoint files. |
 
-Megatron-LLaMA supports the canonical [data preprocessing]("https://github.com/NVIDIA/Megatron-LM/blob/main/README.md#data-preprocessing") and [evaluation]("https://github.com/NVIDIA/Megatron-LM/blob/main/README.md#evaluation-and-tasks") workflows described in the Megatron-LM library.
+Megatron-LLaMA supports the canonical [data preprocessing](https://github.com/NVIDIA/Megatron-LM/blob/main/README.md#data-preprocessing) and [evaluation](https://github.com/NVIDIA/Megatron-LM/blob/main/README.md#evaluation-and-tasks) workflows described in the Megatron-LM library.
 
 ### Future work
 
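The `--tokenizer-type=PretrainedFromHF` entry above notes that the tokenizer is loaded through `transformers.AutoTokenizer`. A minimal sketch of that loading step, assuming a locally available LLaMA tokenizer directory (the path below is a placeholder, not a real checkpoint):

```python
# Minimal sketch: load a Hugging Face tokenizer the way a PretrainedFromHF-style
# setting would, via transformers.AutoTokenizer. The path is a placeholder.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("/path/to/llama-tokenizer")

# Encode a sample sentence to verify the tokenizer loads and produces token ids.
ids = tokenizer("Megatron-LLaMA reduces the cost of LLaMA training.")["input_ids"]
print(len(ids), ids[:8])
```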
@@ -144,9 +144,9 @@ Megatron-LLaMA is developed by Aicheng Technology, Alibaba Group and is based on
 
 The following repositories are used in Megatron-LLaMA, either in close to original form or as an inspiration:
 
-[Megatron-LM]("https://github.com/NVIDIA/Megatron-LM")
+[Megatron-LM](https://github.com/NVIDIA/Megatron-LM)
 
-[LLaMA]("https://github.com/facebookresearch/llama")
+[LLaMA](https://github.com/facebookresearch/llama)
 
-[DeepSpeed]("https://github.com/microsoft/DeepSpeed")
+[DeepSpeed](https://github.com/microsoft/DeepSpeed)
 
README_zh.md

Lines changed: 4 additions & 4 deletions
@@ -36,7 +36,7 @@ LLaMA is an important work in today's open-source large language model community.
 
 ## 2. Overview of `OverlappedDistributedOptimizer` in Megatron-LLaMA
 
-In vanilla Megatron-LM, users can use [`DistributedOptimizer`]("https://github.com/NVIDIA/Megatron-LM/blob/main/docs/distrib_optimizer.md") to partition gradients and optimizer states and reduce the GPU memory footprint of training. Each time a preset gradient-accumulation group of gradients has been collected, `DistributedOptimizer` distributes all previously accumulated gradients to the different ranks via a `ReduceScatter` operator. After each rank has updated the parameters it owns, the updated parameters are copied to all ranks via an `AllGather` operator. In practice, we observe that the collective communication of `DistributedOptimizer` introduces a very large extra overhead when gradient accumulation is small; in the extreme case without gradient accumulation, the extra overhead exceeds 50% of the total time.
+In vanilla Megatron-LM, users can use [`DistributedOptimizer`](https://github.com/NVIDIA/Megatron-LM/blob/main/docs/distrib_optimizer.md) to partition gradients and optimizer states and reduce the GPU memory footprint of training. Each time a preset gradient-accumulation group of gradients has been collected, `DistributedOptimizer` distributes all previously accumulated gradients to the different ranks via a `ReduceScatter` operator. After each rank has updated the parameters it owns, the updated parameters are copied to all ranks via an `AllGather` operator. In practice, we observe that the collective communication of `DistributedOptimizer` introduces a very large extra overhead when gradient accumulation is small; in the extreme case without gradient accumulation, the extra overhead exceeds 50% of the total time.
 
 While trying to overlap communication with computation, we experimented with the gradient and optimizer-state partitioning scheme of DeepSpeed ZeRO-2. At very large scale, we observed that this scheme requires a large number of small, fragmented communication kernels, which cannot fully utilize the communication bandwidth; communication takes too long, and the model's computation is insufficient to fully overlap with it.
 
@@ -142,8 +142,8 @@ Megatron-LLaMA is released under the Apache 2.0 open-source license, which permits commercial use. For details
 
 ### Referenced works
 
-[Megatron-LM]("https://github.com/NVIDIA/Megatron-LM")
+[Megatron-LM](https://github.com/NVIDIA/Megatron-LM)
 
-[LLaMA]("https://github.com/facebookresearch/llama")
+[LLaMA](https://github.com/facebookresearch/llama)
 
-[DeepSpeed]("https://github.com/microsoft/DeepSpeed")
+[DeepSpeed](https://github.com/microsoft/DeepSpeed)
