2 changes: 1 addition & 1 deletion docs/source/concept_guides/fsdp1_vs_fsdp2.md
@@ -53,7 +53,7 @@ Each Parameter of the original `Layer` is sharded across the 0th dimension, and

`FSDP2` is a new and improved version of PyTorch's fully-sharded data parallel training API. Its main advantage is using `DTensor` to represent sharded parameters. Compared to `FSDP1`, it offers:
- Simpler internal implementation, where each `Parameter` is a separate `DTensor`
- Enables simple partial parameter freezing because of the above, which makes methods such as [`LORA`](https://arxiv.org/abs/2106.09685) work out of the box
- Enables simple partial parameter freezing because of the above, which makes methods such as [`LORA`](https://huggingface.co/papers/2106.09685) work out of the box
- With `DTensor`, `FSDP2` supports mixing `fp8` and other parameter types in the same model out of the box
- Faster and simpler checkpointing without extra communication across ranks using `SHARDED_STATE_DICT` and [`torch.distributed.checkpoint`](https://pytorch.org/docs/stable/distributed.checkpoint.html); this way, each rank only saves its own shard and the corresponding metadata
- For loading, it uses a `state_dict` of the sharded model to directly load the sharded parameters
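
To ground the partial-freezing bullet above, here is a minimal sketch (not part of the diffed guide) of freezing a submodule before applying `FSDP2`. It assumes PyTorch >= 2.6, where `fully_shard` is exported from `torch.distributed.fsdp`, a CUDA machine, and a `torchrun` launch; the two-layer toy model is made up.

```python
# Launch with: torchrun --nproc_per_node=2 fsdp2_freeze_sketch.py
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.fsdp import fully_shard  # FSDP2 entry point (PyTorch >= 2.6)

dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank())

model = nn.Sequential(nn.Linear(1024, 1024), nn.Linear(1024, 1024)).cuda()

# Freeze the first layer only, emulating adapter-style partial fine-tuning.
for param in model[0].parameters():
    param.requires_grad = False

fully_shard(model)  # each Parameter is now represented as a sharded DTensor

for name, param in model.named_parameters():
    # The per-parameter requires_grad flags survive sharding, which is what
    # lets LoRA-style methods work without extra bookkeeping.
    print(name, param.requires_grad)

dist.destroy_process_group()
```
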
14 changes: 7 additions & 7 deletions docs/source/usage_guides/deepspeed.md
@@ -15,7 +15,7 @@ rendered properly in your Markdown viewer.

# DeepSpeed

[DeepSpeed](https://github.com/deepspeedai/DeepSpeed) implements everything described in the [ZeRO paper](https://arxiv.org/abs/1910.02054). Some of the salient optimizations are:
[DeepSpeed](https://github.com/deepspeedai/DeepSpeed) implements everything described in the [ZeRO paper](https://huggingface.co/papers/1910.02054). Some of the salient optimizations are:

1. Optimizer state partitioning (ZeRO stage 1)
2. Gradient partitioning (ZeRO stage 2)
@@ -25,8 +25,8 @@ rendered properly in your Markdown viewer.
6. ZeRO-Offload to CPU and Disk/NVMe
7. Hierarchical partitioning of model parameters (ZeRO++)

ZeRO-Offload has its own dedicated paper: [ZeRO-Offload: Democratizing Billion-Scale Model Training](https://arxiv.org/abs/2101.06840). NVMe support is described in the paper [ZeRO-Infinity: Breaking the GPU
Memory Wall for Extreme Scale Deep Learning](https://arxiv.org/abs/2104.07857).
ZeRO-Offload has its own dedicated paper: [ZeRO-Offload: Democratizing Billion-Scale Model Training](https://huggingface.co/papers/2101.06840). NVMe support is described in the paper [ZeRO-Infinity: Breaking the GPU
Memory Wall for Extreme Scale Deep Learning](https://huggingface.co/papers/2104.07857).

DeepSpeed ZeRO-2 is primarily used for training, as its features are of no use for inference.
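
As a rough sketch of what ZeRO stage 2 looks like when configured from Python with Accelerate's `DeepSpeedPlugin` (rather than via `accelerate config`): the plugin arguments shown exist in recent `accelerate` releases, but the values and the toy model are illustrative only.

```python
# Run with: accelerate launch deepspeed_zero2_sketch.py  (requires the `deepspeed` package)
import torch
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator, DeepSpeedPlugin

# ZeRO stage 2: optimizer states and gradients are partitioned across ranks.
# Setting offload_optimizer_device="cpu" would additionally enable ZeRO-Offload.
deepspeed_plugin = DeepSpeedPlugin(
    zero_stage=2,
    gradient_accumulation_steps=1,
    gradient_clipping=1.0,
)
accelerator = Accelerator(deepspeed_plugin=deepspeed_plugin)

model = torch.nn.Linear(128, 2)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
dataset = TensorDataset(torch.randn(64, 128), torch.randint(0, 2, (64,)))
dataloader = DataLoader(dataset, batch_size=8)

model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

for inputs, labels in dataloader:
    loss = torch.nn.functional.cross_entropy(model(inputs), labels)
    accelerator.backward(loss)  # routed through the DeepSpeed engine
    optimizer.step()
    optimizer.zero_grad()
```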

@@ -728,10 +728,10 @@ The documentation for the internals related to deepspeed can be found [here](../

Papers:

- [ZeRO: Memory Optimizations Toward Training Trillion Parameter Models](https://arxiv.org/abs/1910.02054)
- [ZeRO-Offload: Democratizing Billion-Scale Model Training](https://arxiv.org/abs/2101.06840)
- [ZeRO-Infinity: Breaking the GPU Memory Wall for Extreme Scale Deep Learning](https://arxiv.org/abs/2104.07857)
- [ZeRO++: Extremely Efficient Collective Communication for Giant Model Training](https://arxiv.org/abs/2306.10209)
- [ZeRO: Memory Optimizations Toward Training Trillion Parameter Models](https://huggingface.co/papers/1910.02054)
- [ZeRO-Offload: Democratizing Billion-Scale Model Training](https://huggingface.co/papers/2101.06840)
- [ZeRO-Infinity: Breaking the GPU Memory Wall for Extreme Scale Deep Learning](https://huggingface.co/papers/2104.07857)
- [ZeRO++: Extremely Efficient Collective Communication for Giant Model Training](https://huggingface.co/papers/2306.10209)


Finally, please remember that `Accelerate` only integrates DeepSpeed; therefore, if you
4 changes: 2 additions & 2 deletions docs/source/usage_guides/local_sgd.md
@@ -100,9 +100,9 @@ The current implementation works only with basic multi-GPU (or multi-CPU) traini
back to at least:

Zhang, J., De Sa, C., Mitliagkas, I., & Ré, C. (2016). [Parallel SGD: When does averaging help?. arXiv preprint
arXiv:1606.07365.](https://arxiv.org/abs/1606.07365)
arXiv:1606.07365.](https://huggingface.co/papers/1606.07365)

We credit the term Local SGD to the following paper (but there might be earlier references we are not aware of).

Stich, Sebastian Urban. ["Local SGD Converges Fast and Communicates Little." ICLR 2019-International Conference on
Learning Representations. No. CONF. 2019.](https://arxiv.org/abs/1805.09767)
Learning Representations. No. CONF. 2019.](https://huggingface.co/papers/1805.09767)
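
For completeness, this is roughly how the `LocalSGD` helper documented in this guide is used; the toy model and data and the choice of `local_sgd_steps=8` are illustrative, and the helper only has an effect when launched with more than one process.

```python
# Launch with: accelerate launch --multi_gpu local_sgd_sketch.py (or a multi-CPU setup)
import torch
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator
from accelerate.local_sgd import LocalSGD

accelerator = Accelerator()
model = torch.nn.Linear(32, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
dataset = TensorDataset(torch.randn(256, 32), torch.randint(0, 2, (256,)))
dataloader = DataLoader(dataset, batch_size=16)
model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

# Gradients are not synced on every step; instead, model parameters are
# averaged across ranks every `local_sgd_steps` optimizer steps.
with LocalSGD(accelerator=accelerator, model=model, local_sgd_steps=8, enabled=True) as local_sgd:
    for inputs, labels in dataloader:
        loss = torch.nn.functional.cross_entropy(model(inputs), labels)
        accelerator.backward(loss)
        optimizer.step()
        optimizer.zero_grad()
        local_sgd.step()  # triggers the periodic parameter averaging
```
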
12 changes: 6 additions & 6 deletions docs/source/usage_guides/megatron_lm.md
@@ -18,7 +18,7 @@ rendered properly in your Markdown viewer.

[Megatron-LM](https://github.com/NVIDIA/Megatron-LM) enables training large transformer language models at scale.
It provides efficient tensor, pipeline, and sequence-based model parallelism for pre-training transformer-based
Language Models such as [GPT](https://arxiv.org/abs/2005.14165) (Decoder Only), [BERT](https://arxiv.org/pdf/1810.04805.pdf) (Encoder Only) and [T5](https://arxiv.org/abs/1910.10683) (Encoder-Decoder).
Language Models such as [GPT](https://huggingface.co/papers/2005.14165) (Decoder Only), [BERT](https://huggingface.co/papers/1810.04805) (Encoder Only) and [T5](https://huggingface.co/papers/1910.10683) (Encoder-Decoder).
For detailed information and how things work behind the scenes, please refer to the GitHub [repo](https://github.com/NVIDIA/Megatron-LM).

## What is integrated?
@@ -31,7 +31,7 @@ Each tensor is split into multiple chunks with each shard residing on separate G
independently and in parallel by each shard followed by syncing across all GPUs (`all-reduce` operation).
In a simple transformer layer, this leads to 2 `all-reduces` in the forward path and 2 in the backward path.
For more details, please refer to the research paper [Megatron-LM: Training Multi-Billion Parameter Language Models Using
Model Parallelism](https://arxiv.org/pdf/1909.08053.pdf) and
Model Parallelism](https://huggingface.co/papers/1909.08053) and
this section of the blog post [The Technology Behind BLOOM Training](https://huggingface.co/blog/bloom-megatron-deepspeed#tensor-parallelism).
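
To make the split-then-`all-reduce` idea concrete, here is a single-process toy (plain tensors, no GPUs or process groups, not from the Megatron-LM code base) that mimics a row-parallel linear layer: each simulated rank holds one slice of the input and of the weight, computes a partial product independently, and summing the partials plays the role of the `all-reduce`.

```python
import torch

torch.manual_seed(0)
tp_size = 4
x = torch.randn(8, 1024)          # (batch, hidden)
weight = torch.randn(1024, 1024)  # full weight of a linear layer

# Row-parallel split: each simulated rank owns one slice of the hidden dimension.
x_shards = x.chunk(tp_size, dim=1)       # each (8, 256)
w_shards = weight.chunk(tp_size, dim=0)  # each (256, 1024)

# Each rank computes a partial result independently...
partials = [xs @ ws for xs, ws in zip(x_shards, w_shards)]
# ...and summing the partials is exactly what the all-reduce does.
y_parallel = torch.stack(partials).sum(dim=0)

y_reference = x @ weight
print(torch.allclose(y_parallel, y_reference, atol=1e-4))  # True
```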


@@ -40,7 +40,7 @@ Reduces the bubble of naive PP via PipeDream-Flush schedule/1F1B schedule and In
Layers are distributed uniformly across PP stages. For example, if a model has `24` layers and we have `4` GPUs for
pipeline parallelism, each GPU will have `6` layers (24/4). For more details on schedules to reduce the idle time of PP,
please refer to the research paper [Efficient Large-Scale Language Model Training on GPU Clusters
Using Megatron-LM](https://arxiv.org/pdf/2104.04473.pdf) and
Using Megatron-LM](https://huggingface.co/papers/2104.04473) and
this section of the blog post [The Technology Behind BLOOM Training](https://huggingface.co/blog/bloom-megatron-deepspeed#pipeline-parallelism).
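
The uniform layer-to-stage assignment described above is simple enough to spell out directly; this toy snippet shows only the placement arithmetic, not the 1F1B schedule itself.

```python
# Uniform layer-to-stage assignment: 24 layers across 4 pipeline stages.
num_layers, pp_size = 24, 4
layers_per_stage = num_layers // pp_size  # 6 layers per GPU

assignment = {
    stage: list(range(stage * layers_per_stage, (stage + 1) * layers_per_stage))
    for stage in range(pp_size)
}
print(assignment)  # stage 0 holds layers 0-5, stage 1 holds 6-11, and so on
```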

c. **Sequence Parallelism (SP)**: Reduces memory footprint without any additional communication. Only applicable when using TP.
@@ -50,21 +50,21 @@ As `all-reduce = reduce-scatter + all-gather`, this saves a ton of activation me
To put it simply, it shards the outputs of each transformer layer along the sequence dimension, e.g.,
if the sequence length is `1024` and the TP size is `4`, each GPU will have `256` tokens (1024/4) for each sample.
This increases the batch size that can be supported for training. For more details, please refer to the research paper
[Reducing Activation Recomputation in Large Transformer Models](https://arxiv.org/pdf/2205.05198.pdf).
[Reducing Activation Recomputation in Large Transformer Models](https://huggingface.co/papers/2205.05198).
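
A toy illustration of the sharding described above, using plain tensors in a single process (not Megatron-LM code): the activations of one sample are split along the sequence dimension, which is the layout a `reduce-scatter` produces and an `all-gather` later undoes.

```python
import torch

seq_len, tp_size, hidden = 1024, 4, 768
activations = torch.randn(seq_len, hidden)  # one sample's layer output

# Sequence parallelism keeps only a slice of the sequence on each rank.
shards = activations.chunk(tp_size, dim=0)
print([s.shape for s in shards])  # four tensors of shape (256, 768)

# all-gather along the sequence dimension reassembles the full activation when needed.
reassembled = torch.cat(shards, dim=0)
print(torch.equal(reassembled, activations))  # True
```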

d. **Data Parallelism (DP)** via Distributed Optimizer: Reduces the memory footprint by sharding optimizer states and gradients across DP ranks
(versus the traditional method of replicating the optimizer state across data parallel ranks).
For example, when using the Adam optimizer with mixed-precision training, each parameter accounts for 12 bytes of memory.
This gets distributed equally across the GPUs, i.e., each parameter would account for 3 bytes (12/4) if we have 4 GPUs.
For more details, please refer to the research paper [ZeRO: Memory Optimizations Toward Training Trillion
Parameter Models](https://arxiv.org/pdf/1910.02054.pdf) and the following section of the blog post
Parameter Models](https://huggingface.co/papers/1910.02054) and the following section of the blog post
[The Technology Behind BLOOM Training](https://huggingface.co/blog/bloom-megatron-deepspeed#zero-data-parallelism).
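
The 12-bytes-per-parameter figure above is the fp32 master weight plus the two Adam moments; a quick back-of-the-envelope check (the 7B-parameter model at the end is purely hypothetical, to show the scale):

```python
# Per-parameter optimizer/master state for Adam with mixed precision:
#   fp32 master copy of the weight: 4 bytes
#   fp32 Adam momentum:             4 bytes
#   fp32 Adam variance:             4 bytes
bytes_per_param = 4 + 4 + 4   # 12 bytes
dp_size = 4
print(bytes_per_param / dp_size)  # 3.0 bytes per parameter per GPU once sharded

# For a hypothetical 7B-parameter model:
params = 7e9
print(params * bytes_per_param / dp_size / 2**30, "GiB of optimizer state per GPU")
```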

e. **Selective Activation Recomputation**: Reduces the memory footprint of activations significantly via smart activation checkpointing.
It doesn't store activations that occupy a lot of memory yet are fast to recompute, thereby achieving a good tradeoff between memory and recomputation.
For example, for GPT-3, this leads to a 70% reduction in the memory required for activations at the expense of
only 2.7% FLOPs overhead for recomputation of activations. For more details, please refer to the research paper
[Reducing Activation Recomputation in Large Transformer Models](https://arxiv.org/pdf/2205.05198.pdf).
[Reducing Activation Recomputation in Large Transformer Models](https://huggingface.co/papers/2205.05198).
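
The idea can be sketched with vanilla `torch.utils.checkpoint`; Megatron-LM's selective recomputation lives inside its fused attention kernels, so treat this made-up toy module only as an illustration of the memory-versus-recompute trade.

```python
import torch
from torch.utils.checkpoint import checkpoint

class ToyAttention(torch.nn.Module):
    """Toy attention block where only the large softmax(QK^T) intermediate is recomputed."""

    def __init__(self, hidden: int = 256):
        super().__init__()
        self.qkv = torch.nn.Linear(hidden, 3 * hidden)
        self.out = torch.nn.Linear(hidden, hidden)

    def _scores(self, q, k, v):
        # The (seq, seq) attention matrix is memory-hungry but cheap to recompute.
        attn = torch.softmax(q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1)
        return attn @ v

    def forward(self, x):
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # checkpoint() discards the intermediates of _scores in the forward pass and
        # recomputes them during backward - the essence of selective recomputation.
        ctx = checkpoint(self._scores, q, k, v, use_reentrant=False)
        return self.out(ctx)

x = torch.randn(4, 128, 256, requires_grad=True)
ToyAttention()(x).sum().backward()
```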

f. **Fused Kernels**: Fused Softmax, Mixed Precision Fused Layer Norm, and fused gradient accumulation into the weight gradient computation of linear layers, plus
PyTorch JIT-compiled Fused GeLU and Fused Bias+Dropout+Residual addition.
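
In the spirit of the JIT-compiled fused GeLU mentioned above, here is a generic tanh-approximation bias+GeLU scripted with TorchScript; it is not Megatron-LM's exact kernel, just a sketch of why scripting the elementwise chain lets it be fused.

```python
import torch

@torch.jit.script
def bias_gelu(bias: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    # tanh-approximated GeLU applied to (y + bias); TorchScript can fuse this
    # elementwise chain into fewer kernels, which is the point of "fused" here.
    x = y + bias
    return x * 0.5 * (1.0 + torch.tanh(0.7978845608028654 * (x + 0.044715 * x * x * x)))

y = torch.randn(8, 1024)
bias = torch.randn(1024)
print(bias_gelu(bias, y).shape)  # torch.Size([8, 1024])
```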