
Commit 5bff40a

Merge branch 'master' into master
2 parents 992de00 + 1fc077b commit 5bff40a


44 files changed: +1039 −224 lines changed

.github/workflows/_legacy-checkpoints.yml

Lines changed: 10 additions & 6 deletions
```diff
@@ -57,28 +57,32 @@ jobs:
     steps:
       - uses: actions/checkout@v5
 
-      - uses: actions/setup-python@v5
+      - name: Install uv and set Python version
+        uses: astral-sh/setup-uv@v6
         with:
-          # Python version here needs to be supported by all PL versions listed in back-compatible-versions.txt.
           python-version: "3.9"
+          # TODO: Avoid activating environment like this
+          # see: https://github.com/astral-sh/setup-uv/tree/v6/?tab=readme-ov-file#activate-environment
+          activate-environment: true
+          enable-cache: true
 
       - name: Install PL from source
         env:
           PACKAGE_NAME: pytorch
           FREEZE_REQUIREMENTS: 1
         timeout-minutes: 20
-        run: pip install . --extra-index-url="${TORCH_URL}"
+        run: uv pip install . --extra-index-url="${TORCH_URL}"
         if: inputs.pl_version == ''
 
       - name: Install PL version
         timeout-minutes: 20
-        run: pip install "pytorch-lightning==${{ inputs.pl_version }}" --extra-index-url="${TORCH_URL}"
+        run: uv pip install "pytorch-lightning==${{ inputs.pl_version }}" --extra-index-url="${TORCH_URL}"
         if: inputs.pl_version != ''
 
       - name: Adjust tests -> PL
         if: ${{ matrix.pkg-name != 'lightning' }}
         run: |
-          pip install -q -r .actions/requirements.txt
+          uv pip install -q -r .actions/requirements.txt
           python .actions/assistant.py copy_replace_imports --source_dir="./tests" \
             --source_import="lightning.fabric,lightning.pytorch" \
             --target_import="lightning_fabric,pytorch_lightning"
@@ -115,7 +119,7 @@ jobs:
         # export to env bool if secrets.AWS_REGION is not empty
         run: echo "WITH_SECRETS=$([ -n '${{ secrets.AWS_REGION }}' ] && echo 1 || echo 0)" >> $GITHUB_ENV
 
-      - run: pip install -r requirements/ci.txt
+      - run: uv pip install -r requirements/ci.txt
       - name: Upload checkpoints to S3
         if: ${{ env.WITH_SECRETS == '1' }}
         working-directory: ${{ env.LEGACY_FOLDER }}
```
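The `WITH_SECRETS` export in the hunk above packs a small shell idiom: it writes `1` or `0` into the environment depending on whether the `AWS_REGION` secret is non-empty. A minimal sketch of that logic, with a hypothetical region value standing in for the `${{ secrets.AWS_REGION }}` expression (which GitHub substitutes before the shell ever runs):

```shell
# Sketch of the WITH_SECRETS step; AWS_REGION value is hypothetical.
AWS_REGION="us-east-1"
WITH_SECRETS=$([ -n "$AWS_REGION" ] && echo 1 || echo 0)
echo "WITH_SECRETS=$WITH_SECRETS"   # prints WITH_SECRETS=1

# An empty (missing) secret yields 0, so the S3 upload step is skipped.
AWS_REGION=""
WITH_SECRETS=$([ -n "$AWS_REGION" ] && echo 1 || echo 0)
echo "WITH_SECRETS=$WITH_SECRETS"   # prints WITH_SECRETS=0
```

In the workflow the result lands in `$GITHUB_ENV` instead of stdout, so later steps can gate on `env.WITH_SECRETS == '1'`.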

.github/workflows/ci-tests-fabric.yml

Lines changed: 34 additions & 33 deletions
```diff
@@ -62,49 +62,57 @@ jobs:
     env:
       PACKAGE_NAME: ${{ matrix.config.pkg-name }}
       FREEZE_REQUIREMENTS: ${{ ! (github.ref == 'refs/heads/master' || startsWith(github.ref, 'refs/heads/release/')) }}
-      PYPI_CACHE_DIR: "_pip-wheels"
       TORCH_URL_STABLE: "https://download.pytorch.org/whl/cpu/"
       TORCH_URL_TEST: "https://download.pytorch.org/whl/test/cpu/"
       # TODO: Remove this - Enable running MPS tests on this platform
       DISABLE_MPS: ${{ matrix.os == 'macOS-14' && '1' || '0' }}
     steps:
       - uses: actions/checkout@v5
 
-      - name: Set up Python ${{ matrix.config.python-version }}
-        uses: actions/setup-python@v5
+      - name: Install uv and set Python version
+        uses: astral-sh/setup-uv@v6
         with:
           python-version: ${{ matrix.config.python-version || '3.9' }}
+          # TODO: Avoid activating environment like this
+          # see: https://github.com/astral-sh/setup-uv/tree/v6/?tab=readme-ov-file#activate-environment
+          activate-environment: true
+          enable-cache: true
 
-      - name: basic setup
-        run: pip install -q -r .actions/requirements.txt
+      - name: Basic setup
+        run: uv pip install -q -r .actions/requirements.txt
+
+      - name: Append Env. vars for Linux
+        if: ${{ runner.os == 'Linux' }}
+        run: echo "GLOO_SOCKET_IFNAME=eth0" >> $GITHUB_ENV
+
+      - name: Append Env. vars for MacOS
+        if: ${{ runner.os == 'macOS' }}
+        run: echo "GLOO_SOCKET_IFNAME=lo0" >> $GITHUB_ENV
+
+      - name: Append Env. vars for Windows
+        if: ${{ runner.os == 'windows' }}
+        run: |
+          # Avoid issue on Windows with PyTorch 2.4: "RuntimeError: use_libuv was requested but PyTorch was build without libuv support"
+          echo "USE_LIBUV=0" >> $GITHUB_ENV
 
       - name: Set min. dependencies
         if: ${{ matrix.config.requires == 'oldest' }}
         run: |
           cd requirements/fabric
-          pip install -U "lightning-utilities[cli]"
+          uv pip install -U "lightning-utilities[cli]"
           python -m lightning_utilities.cli requirements set-oldest --req_files "['base.txt', 'strategies.txt', 'test.txt']"
-          pip install "cython<3.0" wheel
-          pip install "pyyaml==5.4" --no-build-isolation
+          uv pip install "cython<3.0" wheel
+          uv pip install "pyyaml==5.4" --no-build-isolation
 
       - name: Adjust PyTorch versions in requirements files
         if: ${{ matrix.config.requires != 'oldest' }}
         run: |
-          pip install -q -r requirements/ci.txt
+          uv pip install -q -r requirements/ci.txt
           python -m wget https://raw.githubusercontent.com/Lightning-AI/utilities/main/scripts/adjust-torch-versions.py
           for fpath in `ls requirements/**/*.txt`; do \
             python ./adjust-torch-versions.py $fpath ${{ matrix.config.pytorch-version }}; \
           done
 
-      - name: pip wheels cache
-        uses: actions/cache/restore@v4
-        with:
-          path: ${{ env.PYPI_CACHE_DIR }}
-          key: pypi_wheels
-      - run: |
-          mkdir -p $PYPI_CACHE_DIR
-          ls -lh $PYPI_CACHE_DIR
-
       - name: Expand Env. variables
         run: |
           # Switch PyTorch URL between stable and test/future
@@ -113,25 +121,15 @@ jobs:
           python -c "print('COVERAGE_SCOPE=' + str('lightning' if '${{matrix.config.pkg-name}}' == 'lightning' else 'lightning_fabric'))" >> $GITHUB_ENV
           # if you install mono-package set dependency only for this subpackage
           python -c "print('EXTRA_PREFIX=' + str('' if '${{matrix.config.pkg-name}}' != 'lightning' else 'fabric-'))" >> $GITHUB_ENV
-      - name: Append Env. vars for MacOS
-        if: ${{ runner.os == 'macOS' }}
-        run: |
-          # trying to avoid "gloo" issue with SIGABRT
-          echo "GLOO_SOCKET_IFNAME=lo0" >> $GITHUB_ENV
-      - name: Append Env. vars for Windows
-        if: ${{ runner.os == 'windows' }}
-        run: |
-          # Avoid issue on Windows with PyTorch 2.4: "RuntimeError: use_libuv was requested but PyTorch was build without libuv support"
-          echo "USE_LIBUV=0" >> $GITHUB_ENV
 
       - name: Install package & dependencies
         timeout-minutes: 20
         run: |
-          pip install -e ".[${EXTRA_PREFIX}test,${EXTRA_PREFIX}strategies]" \
-            -U --upgrade-strategy=eager --prefer-binary \
-            --extra-index-url="${TORCH_URL}" \
-            --find-links="${PYPI_CACHE_DIR}"
-          pip list
+          uv pip install ".[${EXTRA_PREFIX}test,${EXTRA_PREFIX}strategies]" \
+            --upgrade \
+            --find-links="${TORCH_URL}"
+          uv pip list
+
       - name: Dump handy wheels
         if: github.event_name == 'push' && github.ref == 'refs/heads/master'
         continue-on-error: true
@@ -179,6 +177,9 @@ jobs:
             name: CPU-coverage
             fail_ci_if_error: false
 
+      - name: Minimize uv cache
+        run: uv cache prune --ci
+
   fabric-cpu-guardian:
     runs-on: ubuntu-latest
     needs: fabric-cpu
```
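The three relocated "Append Env. vars" steps each map one runner OS to one environment tweak. Condensed into a single `case` for readability (RUNNER_OS is simulated here; on a real runner GitHub Actions sets it, and the workflow uses three separate `if:`-guarded steps rather than one script):

```shell
RUNNER_OS="Linux"   # simulated; GitHub provides this on real runners
case "$RUNNER_OS" in
  Linux)   echo "GLOO_SOCKET_IFNAME=eth0" ;;  # bind gloo to eth0
  macOS)   echo "GLOO_SOCKET_IFNAME=lo0"  ;;  # loopback, avoids the gloo SIGABRT issue
  Windows) echo "USE_LIBUV=0"             ;;  # PyTorch 2.4 wheels built without libuv
esac
```

In the workflow these `echo` lines are redirected to `$GITHUB_ENV` so the variables persist into later steps.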

.github/workflows/ci-tests-pytorch.yml

Lines changed: 32 additions & 25 deletions
```diff
@@ -69,48 +69,50 @@ jobs:
       TORCH_URL_STABLE: "https://download.pytorch.org/whl/cpu/"
       TORCH_URL_TEST: "https://download.pytorch.org/whl/test/cpu/"
       FREEZE_REQUIREMENTS: ${{ ! (github.ref == 'refs/heads/master' || startsWith(github.ref, 'refs/heads/release/')) }}
-      PYPI_CACHE_DIR: "_pip-wheels"
       # TODO: Remove this - Enable running MPS tests on this platform
       DISABLE_MPS: ${{ matrix.os == 'macOS-14' && '1' || '0' }}
     steps:
       - uses: actions/checkout@v5
 
-      - name: Set up Python ${{ matrix.config.python-version }}
-        uses: actions/setup-python@v5
+      - name: Install uv and set Python version
+        uses: astral-sh/setup-uv@v6
         with:
           python-version: ${{ matrix.config.python-version || '3.9' }}
+          # TODO: Avoid activating environment like this
+          # see: https://github.com/astral-sh/setup-uv/tree/v6/?tab=readme-ov-file#activate-environment
+          activate-environment: true
+          enable-cache: true
 
-      - name: basic setup
-        run: pip install -q -r .actions/requirements.txt
+      - name: Basic setup
+        run: uv pip install -q -r .actions/requirements.txt
+
+      - name: Append Env. vars for Linux
+        if: ${{ runner.os == 'Linux' }}
+        run: echo "GLOO_SOCKET_IFNAME=eth0" >> $GITHUB_ENV
+
+      - name: Append Env. vars for MacOS
+        if: ${{ runner.os == 'macOS' }}
+        run: echo "GLOO_SOCKET_IFNAME=lo0" >> $GITHUB_ENV
 
       - name: Set min. dependencies
         if: ${{ matrix.config.requires == 'oldest' }}
         run: |
           cd requirements/pytorch
-          pip install -U "lightning-utilities[cli]"
+          uv pip install -U "lightning-utilities[cli]"
           python -m lightning_utilities.cli requirements set-oldest --req_files "['base.txt', 'extra.txt', 'strategies.txt', 'examples.txt', 'test.txt']"
-          pip install "cython<3.0" wheel
-          pip install "pyyaml==5.4" --no-build-isolation
+          uv pip install "cython<3.0" wheel
+          uv pip install "pyyaml==5.4" --no-build-isolation
 
       - name: Adjust PyTorch versions in requirements files
         if: ${{ matrix.config.requires != 'oldest' }}
         run: |
-          pip install -q -r requirements/ci.txt
+          uv pip install -q -r requirements/ci.txt
           python -m wget https://raw.githubusercontent.com/Lightning-AI/utilities/main/scripts/adjust-torch-versions.py
           for fpath in `ls requirements/**/*.txt`; do \
             python ./adjust-torch-versions.py $fpath ${{ matrix.config.pytorch-version }}; \
           done
           cat requirements/pytorch/base.txt
 
-      - name: pip wheels cache
-        uses: actions/cache/restore@v4
-        with:
-          path: ${{ env.PYPI_CACHE_DIR }}
-          key: pypi_wheels
-      - run: |
-          mkdir -p $PYPI_CACHE_DIR
-          ls -lh $PYPI_CACHE_DIR
-
       - name: Env. variables
         run: |
           # Switch PyTorch URL between stable and test/future
@@ -125,20 +127,22 @@ jobs:
       - name: Install package & dependencies
         timeout-minutes: 20
         run: |
-          pip install ".[${EXTRA_PREFIX}extra,${EXTRA_PREFIX}test,${EXTRA_PREFIX}strategies]" \
-            -U --upgrade-strategy=eager --prefer-binary \
+          uv pip install ".[${EXTRA_PREFIX}extra,${EXTRA_PREFIX}test,${EXTRA_PREFIX}strategies]" \
+            --upgrade \
             -r requirements/_integrations/accelerators.txt \
-            --extra-index-url="${TORCH_URL}" \
-            --find-links="${PYPI_CACHE_DIR}" \
+            --find-links="${TORCH_URL}" \
             --find-links="https://download.pytorch.org/whl/torch-tensorrt"
-          pip list
+          uv pip list
+
       - name: Drop LAI from extensions
         if: ${{ matrix.config.pkg-name != 'lightning' }}
         # Lightning is dependency of Habana or other accelerators/integrations so in case we test PL we need to remove it
-        run: pip uninstall -y lightning
+        run: uv pip uninstall lightning
+
       - name: Drop PL for LAI
         if: ${{ matrix.config.pkg-name == 'lightning' }}
-        run: pip uninstall -y pytorch-lightning
+        run: uv pip uninstall pytorch-lightning
+
       - name: Dump handy wheels
         if: github.event_name == 'push' && github.ref == 'refs/heads/master'
         continue-on-error: true
@@ -215,6 +219,9 @@ jobs:
             name: CPU-coverage
             fail_ci_if_error: false
 
+      - name: Minimize uv cache
+        run: uv cache prune --ci
+
   pl-cpu-guardian:
     runs-on: ubuntu-latest
     needs: pl-cpu
```
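Both test workflows compute `FREEZE_REQUIREMENTS` from the same GitHub expression: requirements float (no freeze) only on `master` and `release/*` branches. A small Python sketch of that expression, to make the branch logic explicit (the function name is mine, not part of the workflow):

```python
def freeze_requirements(ref: str) -> bool:
    # Mirrors the workflow expression:
    #   ! (github.ref == 'refs/heads/master' || startsWith(github.ref, 'refs/heads/release/'))
    return not (ref == "refs/heads/master" or ref.startswith("refs/heads/release/"))


print(freeze_requirements("refs/heads/master"))         # False: master uses floating deps
print(freeze_requirements("refs/heads/release/2.4.x"))  # False: release branches too
print(freeze_requirements("refs/heads/my-feature"))     # True: other branches pin requirements
```

So PR builds run with pinned requirements, while the main and release branches pick up the latest compatible dependency versions.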

docs/source-fabric/guide/index.rst

Lines changed: 1 addition & 1 deletion
```diff
@@ -78,7 +78,7 @@ Build your own Trainer
    <div class="row">
 
 .. displayitem::
-   :header: Organize your model code with with LightningModule
+   :header: Organize your model code with LightningModule
    :description: Organize your code in a LightningModule and use it with Fabric
    :button_link: lightning_module.html
    :col_css: col-md-4
```

docs/source-fabric/levels/intermediate.rst

Lines changed: 1 addition & 1 deletion
```diff
@@ -19,7 +19,7 @@ Intermediate skills
    <div class="row">
 
 .. displayitem::
-   :header: Organize your model code with with LightningModule
+   :header: Organize your model code with LightningModule
    :description: Organize your code in a LightningModule and use it with Fabric
    :button_link: ../guide/lightning_module.html
    :col_css: col-md-4
```

docs/source-pytorch/accelerators/gpu_faq.rst

Lines changed: 54 additions & 14 deletions
```diff
@@ -5,31 +5,71 @@
 GPU training (FAQ)
 ==================
 
-******************************************************************
-How should I adjust the learning rate when using multiple devices?
-******************************************************************
+***************************************************************
+How should I adjust the batch size when using multiple devices?
+***************************************************************
 
-When using distributed training make sure to modify your learning rate according to your effective
-batch size.
+Lightning automatically shards your data across multiple GPUs, meaning that each device only sees a unique subset of your
+data, but the `batch_size` in your DataLoader remains the same. This means that the effective batch size e.g. the
+total number of samples processed in one forward/backward pass is
 
-Let's say you have a batch size of 7 in your dataloader.
+.. math::
 
-.. testcode::
+    \text{Effective Batch Size} = \text{DataLoader Batch Size} \times \text{Number of Devices} \times \text{Number of Nodes}
 
-    class LitModel(LightningModule):
-        def train_dataloader(self):
-            return Dataset(..., batch_size=7)
-
-Whenever you use multiple devices and/or nodes, your effective batch size will be 7 * devices * num_nodes.
+A couple of examples to illustrate this:
 
 .. code-block:: python
 
-    # effective batch size = 7 * 8
+    dataloader = DataLoader(..., batch_size=7)
+
+    # Single GPU: effective batch size = 7
+    Trainer(accelerator="gpu", devices=1)
+
+    # Multi-GPU: effective batch size = 7 * 8 = 56
     Trainer(accelerator="gpu", devices=8, strategy=...)
 
-    # effective batch size = 7 * 8 * 10
+    # Multi-node: effective batch size = 7 * 8 * 10 = 560
     Trainer(accelerator="gpu", devices=8, num_nodes=10, strategy=...)
 
+In general you should be able to use the same `batch_size` in your DataLoader regardless of the number of devices you are
+using.
+
+.. note::
+
+    If you want distributed training to work exactly the same as single GPU training, you need to set the `batch_size`
+    in your DataLoader to `original_batch_size / num_devices` to maintain the same effective batch size. However, this
+    can lead to poor GPU utilization.
+
+----
+
+******************************************************************
+How should I adjust the learning rate when using multiple devices?
+******************************************************************
+
+Because the effective batch size is larger when using multiple devices, you need to adjust your learning rate
+accordingly. Because the learning rate is a hyperparameter that controls how much to change the model in response to
+the estimated error each time the model weights are updated, it is important to scale it with the effective batch size.
+
+In general, there are two common scaling rules:
+
+1. **Linear scaling**: Increase the learning rate linearly with the number of devices.
+
+.. code-block:: python
+
+    # Example: Linear scaling
+    base_lr = 1e-3
+    num_devices = 8
+    scaled_lr = base_lr * num_devices  # 8e-3
+
+2. **Square root scaling**: Increase the learning rate by the square root of the number of devices.
+
+.. code-block:: python
+
+    # Example: Square root scaling
+    base_lr = 1e-3
+    num_devices = 8
+    scaled_lr = base_lr * (num_devices ** 0.5)  # 2.83e-3
 
 .. note:: Huge batch sizes are actually really bad for convergence. Check out:
     `Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour <https://arxiv.org/abs/1706.02677>`_
```
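The rewritten FAQ boils down to two formulas: effective batch size is the DataLoader batch size times devices times nodes, and the learning rate is scaled either linearly or by the square root of the device count. A quick numeric check of the examples used in the diff (the helper function is mine, for illustration):

```python
import math


def effective_batch_size(loader_bs: int, devices: int, nodes: int = 1) -> int:
    # Effective Batch Size = DataLoader Batch Size x Number of Devices x Number of Nodes
    return loader_bs * devices * nodes


print(effective_batch_size(7, 1))      # 7   (single GPU)
print(effective_batch_size(7, 8))      # 56  (multi-GPU)
print(effective_batch_size(7, 8, 10))  # 560 (multi-node)

# Learning-rate scaling rules from the FAQ
base_lr, num_devices = 1e-3, 8
print(base_lr * num_devices)             # 0.008: linear scaling
print(base_lr * math.sqrt(num_devices))  # ~0.00283: square-root scaling
```

These match the figures the new docs quote (56, 560, 8e-3, and 2.83e-3 respectively).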
