Commit c7a511d

Merge branch 'main' into 2.8-RC-Tutorial-Test
2 parents: 71e1c45 + 2f4e5c3

File tree

2 files changed (+34, -37 lines)

intermediate_source/ddp_series_minGPT.rst

Lines changed: 11 additions & 12 deletions
@@ -26,10 +26,11 @@ Authors: `Suraj Subramanian <https://github.com/subramen>`__
 .. grid-item-card:: :octicon:`list-unordered;1em;` Prerequisites
 :class-card: card-prerequisites
 
-- Familiarity with `multi-GPU training <../beginner/ddp_series_multigpu.html>`__ and `torchrun <../beginner/ddp_series_fault_tolerance.html>`__
-- [Optional] Familiarity with `multinode training <ddp_series_multinode.html>`__
-- 2 or more TCP-reachable GPU machines (this tutorial uses AWS p3.2xlarge instances)
 - PyTorch `installed <https://pytorch.org/get-started/locally/>`__ with CUDA on all machines
+- Familiarity with `multi-GPU training <../beginner/ddp_series_multigpu.html>`__ and `torchrun <../beginner/ddp_series_fault_tolerance.html>`__
+- [Optional] Familiarity with `multinode training <ddp_series_multinode.html>`__
+- 2 or more TCP-reachable GPU machines for multi-node training (this tutorial uses AWS p3.2xlarge instances)
+
 
 Follow along with the video below or on `youtube <https://www.youtube.com/watch/XFsFDGKZHh4>`__.
 
@@ -63,25 +64,23 @@ from any node that has access to the cloud bucket.
 
 Using Mixed Precision
 ~~~~~~~~~~~~~~~~~~~~~~~~
-To speed things up, you might be able to use `Mixed Precision <https://pytorch.org/docs/stable/amp.html>`__ to train your models.
-In Mixed Precision, some parts of the training process are carried out in reduced precision, while other steps
-that are more sensitive to precision drops are maintained in FP32 precision.
+To speed things up, you might be able to use `Mixed Precision <https://pytorch.org/docs/stable/amp.html>`__ to train your models.
+In Mixed Precision, some parts of the training process are carried out in reduced precision, while other steps
+that are more sensitive to precision drops are maintained in FP32 precision.
 
 
 When is DDP not enough?
 ~~~~~~~~~~~~~~~~~~~~~~~~
 A typical training run's memory footprint consists of model weights, activations, gradients, the input batch, and the optimizer state.
-Since DDP replicates the model on each GPU, it only works when GPUs have sufficient capacity to accomodate the full footprint.
+Since DDP replicates the model on each GPU, it only works when GPUs have sufficient capacity to accommodate the full footprint.
 When models grow larger, more aggressive techniques might be useful:
 
-- `activation checkpointing <https://pytorch.org/docs/stable/checkpoint.html>`__: Instead of saving intermediate activations during the forward pass, the activations are recomputed during the backward pass. In this approach, we run more compute but save on memory footprint.
-- `Fully-Sharded Data Parallel <https://pytorch.org/docs/stable/fsdp.html>`__: Here the model is not replicated but "sharded" across all the GPUs, and computation is overlapped with communication in the forward and backward passes. Read our `blog <https://medium.com/pytorch/training-a-1-trillion-parameter-model-with-pytorch-fully-sharded-data-parallel-on-aws-3ac13aa96cff>`__ to learn how we trained a 1 Trillion parameter model with FSDP.
-
+- `Activation checkpointing <https://pytorch.org/docs/stable/checkpoint.html>`__: Instead of saving intermediate activations during the forward pass, the activations are recomputed during the backward pass. In this approach, we run more compute but save on memory footprint.
+- `Fully-Sharded Data Parallel <https://docs.pytorch.org/docs/stable/distributed.fsdp.fully_shard.html>`__: Here the model is not replicated but "sharded" across all the GPUs, and computation is overlapped with communication in the forward and backward passes. Read our `blog <https://medium.com/pytorch/training-a-1-trillion-parameter-model-with-pytorch-fully-sharded-data-parallel-on-aws-3ac13aa96cff>`__ to learn how we trained a 1 Trillion parameter model with FSDP.
 
 Further Reading
 ---------------
 - `Multi-Node training with DDP <ddp_series_multinode.html>`__ (previous tutorial in this series)
 - `Mixed Precision training <https://pytorch.org/docs/stable/amp.html>`__
-- `Fully-Sharded Data Parallel <https://pytorch.org/docs/stable/fsdp.html>`__
+- `Fully-Sharded Data Parallel tutorial <https://docs.pytorch.org/tutorials/intermediate/FSDP_tutorial.html>`__
 - `Training a 1T parameter model with FSDP <https://medium.com/pytorch/training-a-1-trillion-parameter-model-with-pytorch-fully-sharded-data-parallel-on-aws-3ac13aa96cff>`__
-- `FSDP Video Tutorial Series <https://www.youtube.com/playlist?list=PL_lsbAsL_o2BT6aerEKgIoufVD_fodnuT>`__
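
The Mixed Precision paragraph touched in this hunk is brief, so here is a minimal sketch of what an AMP training step can look like. It assumes a CUDA GPU; the toy model, optimizer, loss, and train_step helper are placeholders for illustration, not part of the minGPT tutorial code.

    import torch

    model = torch.nn.Linear(512, 512).cuda()                    # placeholder model
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)  # placeholder optimizer
    scaler = torch.cuda.amp.GradScaler()  # scales the loss so FP16 gradients don't underflow

    def train_step(inputs, targets):
        optimizer.zero_grad(set_to_none=True)
        # Ops inside autocast run in reduced precision where that is safe;
        # precision-sensitive ops stay in FP32, as described in the text.
        with torch.cuda.amp.autocast():
            loss = torch.nn.functional.mse_loss(model(inputs), targets)
        scaler.scale(loss).backward()   # backward pass on the scaled loss
        scaler.step(optimizer)          # unscale gradients, then take the optimizer step
        scaler.update()                 # adjust the scale factor for the next iteration
        return loss.detach()

    inputs = torch.randn(64, 512, device="cuda")
    targets = torch.randn(64, 512, device="cuda")
    loss = train_step(inputs, targets)

The gradient scaler is only needed for FP16 autocast; BF16 autocast can usually run without it.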
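For the two techniques listed under "When is DDP not enough?", a rough sketch of how they are invoked. It assumes a GPU and, for the FSDP line, that torch.distributed has already been initialized (for example via torchrun). The small Sequential model is a stand-in, and the wrapper-style FullyShardedDataParallel class is used here even though the updated link above points to the newer fully_shard API.

    import torch
    from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
    from torch.utils.checkpoint import checkpoint

    model = torch.nn.Sequential(
        torch.nn.Linear(1024, 1024),
        torch.nn.ReLU(),
        torch.nn.Linear(1024, 1024),
    ).cuda()

    # Activation checkpointing: intermediate activations are not stored during
    # the forward pass; they are recomputed in backward, trading compute for memory.
    x = torch.randn(8, 1024, device="cuda", requires_grad=True)
    out = checkpoint(model, x, use_reentrant=False)
    out.sum().backward()

    # FSDP: shard parameters, gradients, and optimizer state across ranks
    # instead of replicating the whole model on every GPU as DDP does.
    # Requires an initialized process group (e.g. launched with torchrun).
    if torch.distributed.is_initialized():
        sharded_model = FSDP(model)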

recipes_source/recipes/profiler_recipe.py

Lines changed: 23 additions & 25 deletions
@@ -5,31 +5,29 @@
 """
 
 ######################################################################
-"""
-This recipe explains how to use PyTorch profiler and measure the time and
-memory consumption of the model's operators.
-
-Introduction
-------------
-PyTorch includes a simple profiler API that is useful when user needs
-to determine the most expensive operators in the model.
-
-In this recipe, we will use a simple Resnet model to demonstrate how to
-use profiler to analyze model performance.
-
-Prerequisites
----------------
-- ``torch >= 1.9``
-
-Setup
------
-To install ``torch`` and ``torchvision`` use the following command:
-
-.. code-block:: sh
-
-   pip install torch torchvision
-
-"""
+# This recipe explains how to use PyTorch profiler and measure the time and
+# memory consumption of the model's operators.
+#
+# Introduction
+# ------------
+# PyTorch includes a simple profiler API that is useful when the user needs
+# to determine the most expensive operators in the model.
+#
+# In this recipe, we will use a simple Resnet model to demonstrate how to
+# use the profiler to analyze model performance.
+#
+# Prerequisites
+# ---------------
+# - ``torch >= 2.3.0``
+#
+# Setup
+# -----
+# To install ``torch`` and ``torchvision`` use the following command:
+#
+# .. code-block:: sh
+#
+#    pip install torch torchvision
+#
 
 ######################################################################
 # Steps
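
The recipe body that follows this header walks through the profiler API step by step; the condensed sketch below shows roughly where it ends up, profiling one ResNet-18 forward pass on CPU and printing the most expensive operators (the exact code in the recipe may differ).

    import torch
    import torchvision.models as models
    from torch.profiler import profile, record_function, ProfilerActivity

    model = models.resnet18()
    inputs = torch.randn(5, 3, 224, 224)

    # Record CPU activity (and input shapes) for one inference pass.
    with profile(activities=[ProfilerActivity.CPU], record_shapes=True) as prof:
        with record_function("model_inference"):
            model(inputs)

    # Sort operators by total CPU time to surface the most expensive ones.
    print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))

Sorting by self_cpu_time_total instead isolates the time spent in an operator itself, excluding its children.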
