2 changes: 1 addition & 1 deletion docs/source/concept_guides/fsdp1_vs_fsdp2.md
@@ -53,7 +53,7 @@ Each Parameter of the original `Layer` is sharded across the 0th dimension, and

`FSDP2` is a new and improved version of PyTorch's fully-sharded data parallel training API. Its main advantage is using `DTensor` to represent sharded parameters. Compared to `FSDP1`, it offers:
- Simpler internal implementation, where each `Parameter` is a separate `DTensor`
- Enables simple partial parameter freezing because of the above, which makes methods such as [`LORA`](https://arxiv.org/abs/2106.09685) work out of the box
- Enables simple partial parameter freezing because of the above, which makes methods such as [`LORA`](https://huggingface.co/papers/2106.09685) work out of the box
- With `DTensor`, `FSDP2` supports mixing `fp8` and other parameter types in the same model out of the box
- Faster and simpler checkpointing without extra communication across ranks using `SHARDED_STATE_DICT` and [`torch.distributed.checkpoint`](https://pytorch.org/docs/stable/distributed.checkpoint.html); this way, each rank only saves its own shard and the corresponding metadata
- For loading, it uses a `state_dict` of the sharded model to directly load the sharded parameters
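
To ground the partial-freezing bullet above, here is a minimal sketch (not part of the diffed guide) of freezing a submodule before applying `FSDP2`. It assumes PyTorch >= 2.6, where `fully_shard` is exported from `torch.distributed.fsdp`, a CUDA machine, and a `torchrun` launch; the two-layer toy model is made up.

```python
# Launch with: torchrun --nproc_per_node=2 fsdp2_freeze_sketch.py
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.fsdp import fully_shard  # FSDP2 entry point (PyTorch >= 2.6)

dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank())

model = nn.Sequential(nn.Linear(1024, 1024), nn.Linear(1024, 1024)).cuda()

# Freeze the first layer only, emulating adapter-style partial fine-tuning.
for param in model[0].parameters():
    param.requires_grad = False

fully_shard(model)  # each Parameter is now represented as a sharded DTensor

for name, param in model.named_parameters():
    # The per-parameter requires_grad flags survive sharding, which is what
    # lets LoRA-style methods work without extra bookkeeping.
    print(name, param.requires_grad)

dist.destroy_process_group()
```
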
14 changes: 7 additions & 7 deletions docs/source/usage_guides/deepspeed.md
@@ -15,7 +15,7 @@ rendered properly in your Markdown viewer.

# DeepSpeed

[DeepSpeed](https://github.com/deepspeedai/DeepSpeed) implements everything described in the [ZeRO paper](https://arxiv.org/abs/1910.02054). Some of the salient optimizations are:
[DeepSpeed](https://github.com/deepspeedai/DeepSpeed) implements everything described in the [ZeRO paper](https://huggingface.co/papers/1910.02054). Some of the salient optimizations are:

1. Optimizer state partitioning (ZeRO stage 1)
2. Gradient partitioning (ZeRO stage 2)
@@ -25,8 +25,8 @@ rendered properly in your Markdown viewer.
6. ZeRO-Offload to CPU and Disk/NVMe
7. Hierarchical partitioning of model parameters (ZeRO++)

ZeRO-Offload has its own dedicated paper: [ZeRO-Offload: Democratizing Billion-Scale Model Training](https://arxiv.org/abs/2101.06840). NVMe support is described in the paper [ZeRO-Infinity: Breaking the GPU
Memory Wall for Extreme Scale Deep Learning](https://arxiv.org/abs/2104.07857).
ZeRO-Offload has its own dedicated paper: [ZeRO-Offload: Democratizing Billion-Scale Model Training](https://huggingface.co/papers/2101.06840). NVMe support is described in the paper [ZeRO-Infinity: Breaking the GPU
Memory Wall for Extreme Scale Deep Learning](https://huggingface.co/papers/2104.07857).

DeepSpeed ZeRO-2 is primarily used for training, as its features are of no use for inference.
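
As a rough sketch of what ZeRO stage 2 looks like when configured from Python with Accelerate's `DeepSpeedPlugin` (rather than via `accelerate config`): the plugin arguments shown exist in recent `accelerate` releases, but the values and the toy model are illustrative only.

```python
# Run with: accelerate launch deepspeed_zero2_sketch.py  (requires the `deepspeed` package)
import torch
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator, DeepSpeedPlugin

# ZeRO stage 2: optimizer states and gradients are partitioned across ranks.
# Setting offload_optimizer_device="cpu" would additionally enable ZeRO-Offload.
deepspeed_plugin = DeepSpeedPlugin(
    zero_stage=2,
    gradient_accumulation_steps=1,
    gradient_clipping=1.0,
)
accelerator = Accelerator(deepspeed_plugin=deepspeed_plugin)

model = torch.nn.Linear(128, 2)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
dataset = TensorDataset(torch.randn(64, 128), torch.randint(0, 2, (64,)))
dataloader = DataLoader(dataset, batch_size=8)

model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

for inputs, labels in dataloader:
    loss = torch.nn.functional.cross_entropy(model(inputs), labels)
    accelerator.backward(loss)  # routed through the DeepSpeed engine
    optimizer.step()
    optimizer.zero_grad()
```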

@@ -728,10 +728,10 @@ The documentation for the internals related to deepspeed can be found [here](../

Papers:

- [ZeRO: Memory Optimizations Toward Training Trillion Parameter Models](https://arxiv.org/abs/1910.02054)
- [ZeRO-Offload: Democratizing Billion-Scale Model Training](https://arxiv.org/abs/2101.06840)
- [ZeRO-Infinity: Breaking the GPU Memory Wall for Extreme Scale Deep Learning](https://arxiv.org/abs/2104.07857)
- [ZeRO++: Extremely Efficient Collective Communication for Giant Model Training](https://arxiv.org/abs/2306.10209)
- [ZeRO: Memory Optimizations Toward Training Trillion Parameter Models](https://huggingface.co/papers/1910.02054)
- [ZeRO-Offload: Democratizing Billion-Scale Model Training](https://huggingface.co/papers/2101.06840)
- [ZeRO-Infinity: Breaking the GPU Memory Wall for Extreme Scale Deep Learning](https://huggingface.co/papers/2104.07857)
- [ZeRO++: Extremely Efficient Collective Communication for Giant Model Training](https://huggingface.co/papers/2306.10209)


Finally, please remember that `Accelerate` only integrates DeepSpeed; therefore, if you
4 changes: 2 additions & 2 deletions docs/source/usage_guides/local_sgd.md
@@ -100,9 +100,9 @@ The current implementation works only with basic multi-GPU (or multi-CPU) traini
back to at least:

Zhang, J., De Sa, C., Mitliagkas, I., & Ré, C. (2016). [Parallel SGD: When does averaging help?. arXiv preprint
arXiv:1606.07365.](https://arxiv.org/abs/1606.07365)
arXiv:1606.07365.](https://huggingface.co/papers/1606.07365)

We credit the term Local SGD to the following paper (but there might be earlier references we are not aware of).

Stich, Sebastian Urban. ["Local SGD Converges Fast and Communicates Little." ICLR 2019-International Conference on
Learning Representations. No. CONF. 2019.](https://arxiv.org/abs/1805.09767)
Learning Representations. No. CONF. 2019.](https://huggingface.co/papers/1805.09767)
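
For completeness, this is roughly how the `LocalSGD` helper documented in this guide is used; the toy model and data and the choice of `local_sgd_steps=8` are illustrative, and the helper only has an effect when launched with more than one process.

```python
# Launch with: accelerate launch --multi_gpu local_sgd_sketch.py (or a multi-CPU setup)
import torch
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator
from accelerate.local_sgd import LocalSGD

accelerator = Accelerator()
model = torch.nn.Linear(32, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
dataset = TensorDataset(torch.randn(256, 32), torch.randint(0, 2, (256,)))
dataloader = DataLoader(dataset, batch_size=16)
model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

# Gradients are not synced on every step; instead, model parameters are
# averaged across ranks every `local_sgd_steps` optimizer steps.
with LocalSGD(accelerator=accelerator, model=model, local_sgd_steps=8, enabled=True) as local_sgd:
    for inputs, labels in dataloader:
        loss = torch.nn.functional.cross_entropy(model(inputs), labels)
        accelerator.backward(loss)
        optimizer.step()
        optimizer.zero_grad()
        local_sgd.step()  # triggers the periodic parameter averaging
```
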
12 changes: 6 additions & 6 deletions docs/source/usage_guides/megatron_lm.md
@@ -18,7 +18,7 @@ rendered properly in your Markdown viewer.

[Megatron-LM](https://github.com/NVIDIA/Megatron-LM) enables training large transformer language models at scale.
It provides efficient tensor, pipeline, and sequence-based model parallelism for pre-training transformer-based
Language Models such as [GPT](https://arxiv.org/abs/2005.14165) (Decoder Only), [BERT](https://arxiv.org/pdf/1810.04805.pdf) (Encoder Only) and [T5](https://arxiv.org/abs/1910.10683) (Encoder-Decoder).
Language Models such as [GPT](https://huggingface.co/papers/2005.14165) (Decoder Only), [BERT](https://huggingface.co/papers/1810.04805) (Encoder Only) and [T5](https://huggingface.co/papers/1910.10683) (Encoder-Decoder).
For detailed information and how things work behind the scenes, please refer to the GitHub [repo](https://github.com/NVIDIA/Megatron-LM).

## What is integrated?
@@ -31,7 +31,7 @@ Each tensor is split into multiple chunks with each shard residing on separate G
independently and in parallel by each shard followed by syncing across all GPUs (`all-reduce` operation).
In a simple transformer layer, this leads to 2 `all-reduces` in the forward path and 2 in the backward path.
For more details, please refer to the research paper [Megatron-LM: Training Multi-Billion Parameter Language Models Using
Model Parallelism](https://arxiv.org/pdf/1909.08053.pdf) and
Model Parallelism](https://huggingface.co/papers/1909.08053) and
this section of the blog post [The Technology Behind BLOOM Training](https://huggingface.co/blog/bloom-megatron-deepspeed#tensor-parallelism).
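
To make the split-then-`all-reduce` idea concrete, here is a single-process toy (plain tensors, no GPUs or process groups, not from the Megatron-LM code base) that mimics a row-parallel linear layer: each simulated rank holds one slice of the input and of the weight, computes a partial product independently, and summing the partials plays the role of the `all-reduce`.

```python
import torch

torch.manual_seed(0)
tp_size = 4
x = torch.randn(8, 1024)          # (batch, hidden)
weight = torch.randn(1024, 1024)  # full weight of a linear layer

# Row-parallel split: each simulated rank owns one slice of the hidden dimension.
x_shards = x.chunk(tp_size, dim=1)       # each (8, 256)
w_shards = weight.chunk(tp_size, dim=0)  # each (256, 1024)

# Each rank computes a partial result independently...
partials = [xs @ ws for xs, ws in zip(x_shards, w_shards)]
# ...and summing the partials is exactly what the all-reduce does.
y_parallel = torch.stack(partials).sum(dim=0)

y_reference = x @ weight
print(torch.allclose(y_parallel, y_reference, atol=1e-4))  # True
```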


@@ -40,7 +40,7 @@ Reduces the bubble of naive PP via PipeDream-Flush schedule/1F1B schedule and In
Layers are distributed uniformly across PP stages. For example, if a model has `24` layers and we have `4` GPUs for
pipeline parallelism, each GPU will have `6` layers (24/4). For more details on schedules to reduce the idle time of PP,
please refer to the research paper [Efficient Large-Scale Language Model Training on GPU Clusters
Using Megatron-LM](https://arxiv.org/pdf/2104.04473.pdf) and
Using Megatron-LM](https://huggingface.co/papers/2104.04473) and
this section of the blog post [The Technology Behind BLOOM Training](https://huggingface.co/blog/bloom-megatron-deepspeed#pipeline-parallelism).
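
The uniform layer-to-stage assignment described above is simple enough to spell out directly; this toy snippet shows only the placement arithmetic, not the 1F1B schedule itself.

```python
# Uniform layer-to-stage assignment: 24 layers across 4 pipeline stages.
num_layers, pp_size = 24, 4
layers_per_stage = num_layers // pp_size  # 6 layers per GPU

assignment = {
    stage: list(range(stage * layers_per_stage, (stage + 1) * layers_per_stage))
    for stage in range(pp_size)
}
print(assignment)  # stage 0 holds layers 0-5, stage 1 holds 6-11, and so on
```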

c. **Sequence Parallelism (SP)**: Reduces memory footprint without any additional communication. Only applicable when using TP.
@@ -50,21 +50,21 @@ As `all-reduce = reduce-scatter + all-gather`, this saves a ton of activation me
To put it simply, it shards the outputs of each transformer layer along the sequence dimension, e.g.,
if the sequence length is `1024` and the TP size is `4`, each GPU will have `256` tokens (1024/4) for each sample.
This increases the batch size that can be supported for training. For more details, please refer to the research paper
[Reducing Activation Recomputation in Large Transformer Models](https://arxiv.org/pdf/2205.05198.pdf).
[Reducing Activation Recomputation in Large Transformer Models](https://huggingface.co/papers/2205.05198).
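
A toy illustration of the sharding described above, using plain tensors in a single process (not Megatron-LM code): the activations of one sample are split along the sequence dimension, which is the layout a `reduce-scatter` produces and an `all-gather` later undoes.

```python
import torch

seq_len, tp_size, hidden = 1024, 4, 768
activations = torch.randn(seq_len, hidden)  # one sample's layer output

# Sequence parallelism keeps only a slice of the sequence on each rank.
shards = activations.chunk(tp_size, dim=0)
print([s.shape for s in shards])  # four tensors of shape (256, 768)

# all-gather along the sequence dimension reassembles the full activation when needed.
reassembled = torch.cat(shards, dim=0)
print(torch.equal(reassembled, activations))  # True
```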

d. **Data Parallelism (DP)** via Distributed Optimizer: Reduces the memory footprint by sharding optimizer states and gradients across DP ranks
(versus the traditional method of replicating the optimizer state across data parallel ranks).
For example, when using the Adam optimizer with mixed-precision training, each parameter accounts for 12 bytes of memory.
This gets distributed equally across the GPUs, i.e., each parameter would account for 3 bytes (12/4) if we have 4 GPUs.
For more details, please refer to the research paper [ZeRO: Memory Optimizations Toward Training Trillion
Parameter Models](https://arxiv.org/pdf/1910.02054.pdf) and the following section of the blog post
Parameter Models](https://huggingface.co/papers/1910.02054) and the following section of the blog post
[The Technology Behind BLOOM Training](https://huggingface.co/blog/bloom-megatron-deepspeed#zero-data-parallelism).
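
The 12-bytes-per-parameter figure above is the fp32 master weight plus the two Adam moments; a quick back-of-the-envelope check (the 7B-parameter model at the end is purely hypothetical, to show the scale):

```python
# Per-parameter optimizer/master state for Adam with mixed precision:
#   fp32 master copy of the weight: 4 bytes
#   fp32 Adam momentum:             4 bytes
#   fp32 Adam variance:             4 bytes
bytes_per_param = 4 + 4 + 4   # 12 bytes
dp_size = 4
print(bytes_per_param / dp_size)  # 3.0 bytes per parameter per GPU once sharded

# For a hypothetical 7B-parameter model:
params = 7e9
print(params * bytes_per_param / dp_size / 2**30, "GiB of optimizer state per GPU")
```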

e. **Selective Activation Recomputation**: Reduces the memory footprint of activations significantly via smart activation checkpointing.
It doesn't store activations that occupy a lot of memory yet are fast to recompute, thereby achieving a good tradeoff between memory and recomputation.
For example, for GPT-3, this leads to a 70% reduction in the memory required for activations at the expense of
only 2.7% FLOPs overhead for recomputation of activations. For more details, please refer to the research paper
[Reducing Activation Recomputation in Large Transformer Models](https://arxiv.org/pdf/2205.05198.pdf).
[Reducing Activation Recomputation in Large Transformer Models](https://huggingface.co/papers/2205.05198).
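
The idea can be sketched with vanilla `torch.utils.checkpoint`; Megatron-LM's selective recomputation lives inside its fused attention kernels, so treat this made-up toy module only as an illustration of the memory-versus-recompute trade.

```python
import torch
from torch.utils.checkpoint import checkpoint

class ToyAttention(torch.nn.Module):
    """Toy attention block where only the large softmax(QK^T) intermediate is recomputed."""

    def __init__(self, hidden: int = 256):
        super().__init__()
        self.qkv = torch.nn.Linear(hidden, 3 * hidden)
        self.out = torch.nn.Linear(hidden, hidden)

    def _scores(self, q, k, v):
        # The (seq, seq) attention matrix is memory-hungry but cheap to recompute.
        attn = torch.softmax(q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1)
        return attn @ v

    def forward(self, x):
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # checkpoint() discards the intermediates of _scores in the forward pass and
        # recomputes them during backward - the essence of selective recomputation.
        ctx = checkpoint(self._scores, q, k, v, use_reentrant=False)
        return self.out(ctx)

x = torch.randn(4, 128, 256, requires_grad=True)
ToyAttention()(x).sum().backward()
```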

f. **Fused Kernels**: Fused Softmax, Mixed Precision Fused Layer Norm, and fused gradient accumulation into the weight gradient computation of linear layers, plus
PyTorch JIT-compiled Fused GeLU and Fused Bias+Dropout+Residual addition.
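
In the spirit of the JIT-compiled fused GeLU mentioned above, here is a generic tanh-approximation bias+GeLU scripted with TorchScript; it is not Megatron-LM's exact kernel, just a sketch of why scripting the elementwise chain lets it be fused.

```python
import torch

@torch.jit.script
def bias_gelu(bias: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    # tanh-approximated GeLU applied to (y + bias); TorchScript can fuse this
    # elementwise chain into fewer kernels, which is the point of "fused" here.
    x = y + bias
    return x * 0.5 * (1.0 + torch.tanh(0.7978845608028654 * (x + 0.044715 * x * x * x)))

y = torch.randn(8, 1024)
bias = torch.randn(1024)
print(bias_gelu(bias, y).shape)  # torch.Size([8, 1024])
```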