Commit 8fe7698

Update megatron.md
1 parent aebc662 commit 8fe7698

1 file changed: +0 -1 lines changed

docs/_tutorials/megatron.md

Lines changed: 0 additions & 1 deletion
@@ -400,4 +400,3 @@ More concretely, DeepSpeed and ZeRO-2 excel in four aspects (as visualized in Fi
 **Democratizing large model training**: ZeRO-2 empowers model scientists to efficiently train models of up to 13 billion parameters without any model parallelism, which typically requires model refactoring (Figure 2, bottom right). At 13 billion parameters, these models are larger than most of the largest state-of-the-art models (such as Google T5, with 11 billion parameters). Model scientists can therefore experiment freely with large models without worrying about model parallelism. In comparison, classic data-parallelism implementations (such as PyTorch Distributed Data Parallel) run out of memory with 1.4-billion-parameter models, while ZeRO-1 supports up to 6 billion parameters.

 Furthermore, in the absence of model parallelism, these models can be trained on low-bandwidth clusters while still achieving significantly better throughput than with model parallelism. For example, the GPT-2 model can be trained nearly 4x faster with ZeRO-powered data parallelism than with model parallelism on a four-node cluster connected with a 40 Gbps InfiniBand interconnect, where each node has four NVIDIA 16 GB V100 GPUs connected via PCIe. With this performance improvement, large model training is no longer limited to GPU clusters with ultra-fast interconnects, but is also accessible on modest clusters with limited bandwidth.
-

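For context on the ZeRO-powered data parallelism described in the diff above: DeepSpeed enables ZeRO stage 2 purely through its runtime configuration, with no changes to the model definition. The sketch below is illustrative only and is not part of this commit or of megatron.md; the toy model, batch size, learning rate, and other values are placeholder assumptions.

```python
# Illustrative sketch (not from megatron.md): enabling ZeRO stage-2 data
# parallelism through a DeepSpeed config dict. Model and hyperparameters
# are placeholders.
import torch
import deepspeed

# Stand-in model; in the tutorial this would be the Megatron-LM GPT-2 model.
model = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096),
    torch.nn.GELU(),
    torch.nn.Linear(4096, 1024),
)

ds_config = {
    "train_batch_size": 32,                     # placeholder global batch size
    "fp16": {"enabled": True},                  # mixed-precision training
    "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},
    "zero_optimization": {"stage": 2},          # partition optimizer states and gradients
}

# Wrap the unmodified model in a ZeRO-2 data-parallel engine; no model
# parallelism (and hence no model refactoring) is required.
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)
```

A script like this would normally be started with the DeepSpeed launcher (for example, `deepspeed train.py`), which sets up the distributed environment across the data-parallel GPUs.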
0 commit comments
