
Commit 005209c

Merge branch 'master' into bug/18727_reset_trainer_should_stop_on_fit
2 parents 6a62f66 + 2a827f3 commit 005209c

64 files changed: +807 −318 lines changed


.azure/gpu-tests-pytorch.yml

Lines changed: 1 addition & 8 deletions
@@ -105,16 +105,9 @@ jobs:
         done
       displayName: "Adjust dependencies"

-  - bash: |
-      pip install -q -r .actions/requirements.txt
-      python .actions/assistant.py requirements_prune_pkgs \
-        --packages="[lightning-colossalai]" \
-        --req_files="[requirements/_integrations/strategies.txt]"
-    displayName: "Prune packages" # these have installation issues
-
   - bash: |
       extra=$(python -c "print({'lightning': 'pytorch-'}.get('$(PACKAGE_NAME)', ''))")
-      pip install -e ".[${extra}dev]" -r requirements/_integrations/strategies.txt pytest-timeout -U --find-links="${TORCH_URL}"
+      pip install -e ".[${extra}dev]" pytest-timeout -U --find-links="${TORCH_URL}"
     displayName: "Install package & dependencies"

   - bash: pip uninstall -y lightning

.github/workflows/ci-examples-app.yml

Lines changed: 1 addition & 1 deletion
@@ -67,7 +67,7 @@ jobs:
         run: python .actions/assistant.py replace_oldest_ver

       - name: pip wheels cache
-        uses: actions/cache/restore@v3
+        uses: actions/cache/restore@v4
         with:
           path: ${{ env.PYPI_CACHE_DIR }}
           key: pypi_wheels

.github/workflows/ci-tests-app.yml

Lines changed: 1 addition & 1 deletion
@@ -73,7 +73,7 @@ jobs:
         run: python .actions/assistant.py replace_oldest_ver

       - name: pip wheels cache
-        uses: actions/cache/restore@v3
+        uses: actions/cache/restore@v4
         with:
           path: ${{ env.PYPI_CACHE_DIR }}
           key: pypi_wheels

.github/workflows/ci-tests-fabric.yml

Lines changed: 1 addition & 1 deletion
@@ -114,7 +114,7 @@ jobs:
           done

       - name: pip wheels cache
-        uses: actions/cache/restore@v3
+        uses: actions/cache/restore@v4
         with:
           path: ${{ env.PYPI_CACHE_DIR }}
           key: pypi_wheels

.github/workflows/ci-tests-pytorch.yml

Lines changed: 2 additions & 2 deletions
@@ -120,7 +120,7 @@ jobs:
           cat requirements/pytorch/base.txt

       - name: pip wheels cache
-        uses: actions/cache/restore@v3
+        uses: actions/cache/restore@v4
         with:
           path: ${{ env.PYPI_CACHE_DIR }}
           key: pypi_wheels
@@ -161,7 +161,7 @@ jobs:
           cache-key: "pypi_wheels"

       - name: Cache datasets
-        uses: actions/cache@v3
+        uses: actions/cache@v4
         with:
           path: Datasets
           key: pl-dataset

.github/workflows/code-checks.yml

Lines changed: 1 addition & 1 deletion
@@ -34,7 +34,7 @@ jobs:
           python-version: "3.10.6"

       - name: Mypy cache
-        uses: actions/cache@v3
+        uses: actions/cache@v4
         with:
           path: .mypy_cache
           key: mypy-${{ hashFiles('requirements/typing.txt') }}

.github/workflows/docs-build.yml

Lines changed: 1 addition & 1 deletion
@@ -80,7 +80,7 @@ jobs:
           pip install lai-sphinx-theme -U -f ${PYPI_LOCAL_DIR}

       - name: pip wheels cache
-        uses: actions/cache/restore@v3
+        uses: actions/cache/restore@v4
         with:
           path: ${{ env.PYPI_CACHE_DIR }}
           key: pypi_wheels

docs/source-fabric/fundamentals/convert.rst

Lines changed: 15 additions & 0 deletions
@@ -90,6 +90,21 @@ Check out our before-and-after example for `image classification <https://github
 ----


+****************
+Optional changes
+****************
+
+Here are a few optional upgrades you can make to your code, if applicable:
+
+- Replace ``torch.save()`` and ``torch.load()`` with Fabric's :doc:`save and load methods <../guide/checkpoint/checkpoint>`.
+- Replace collective operations from ``torch.distributed`` (barrier, broadcast, etc.) with Fabric's :doc:`collective methods <../advanced/distributed_communication>`.
+- Use Fabric's :doc:`no_backward_sync() context manager <../advanced/gradient_accumulation>` if you implemented gradient accumulation.
+- Initialize your model under the :doc:`init_module() <../advanced/model_init>` context manager.
+
+
+----
+
+
 **********
 Next steps
 **********
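
For orientation, here is a minimal sketch of how these optional upgrades could look together in a Fabric script. The single-process settings, the toy ``torch.nn.Linear`` model, the random batches, and the accumulation interval of 4 are placeholder assumptions, not part of the diff above:

    import torch
    from lightning.fabric import Fabric

    fabric = Fabric(accelerator="cpu", devices=1)  # placeholder settings for the sketch
    fabric.launch()

    # init_module(): create the model directly on the target device and precision
    with fabric.init_module():
        model = torch.nn.Linear(32, 2)

    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    model, optimizer = fabric.setup(model, optimizer)

    accumulate = 4
    for step in range(8):
        batch = torch.randn(16, 32, device=fabric.device)
        # no_backward_sync(): skip gradient synchronization on all but every 4th step
        with fabric.no_backward_sync(model, enabled=(step + 1) % accumulate != 0):
            loss = model(batch).sum()
            fabric.backward(loss)
        if (step + 1) % accumulate == 0:
            optimizer.step()
            optimizer.zero_grad()

    # Collective call instead of torch.distributed.barrier()
    fabric.barrier()

    # Strategy-aware checkpointing instead of torch.save()
    fabric.save("checkpoint.ckpt", {"model": model, "optimizer": optimizer})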

docs/source-pytorch/cli/lightning_cli_advanced_3.rst

Lines changed: 10 additions & 0 deletions
@@ -197,6 +197,7 @@ Since the init parameters of the model have as a type hint a class, in the confi
             decoder: Instance of a module for decoding
         """
         super().__init__()
+        self.save_hyperparameters()
         self.encoder = encoder
         self.decoder = decoder

@@ -216,6 +217,13 @@ If the CLI is implemented as ``LightningCLI(MyMainModel)`` the configuration wou

 It is also possible to combine ``subclass_mode_model=True`` and submodules, thereby having two levels of ``class_path``.

+.. tip::
+
+    By having ``self.save_hyperparameters()`` it becomes possible to load the model from a checkpoint. Simply do
+    ``ModelClass.load_from_checkpoint("path/to/checkpoint.ckpt")``. In the case of using ``subclass_mode_model=True``,
+    then load it like ``LightningModule.load_from_checkpoint("path/to/checkpoint.ckpt")``. ``save_hyperparameters`` is
+    optional and can be safely removed if there is no need to load from a checkpoint.
+

 Fixed optimizer and scheduler
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
@@ -279,6 +287,7 @@ An example of a model that uses two optimizers is the following:
     class MyModel(LightningModule):
         def __init__(self, optimizer1: OptimizerCallable, optimizer2: OptimizerCallable):
             super().__init__()
+            self.save_hyperparameters()
             self.optimizer1 = optimizer1
             self.optimizer2 = optimizer2

@@ -318,6 +327,7 @@ that uses dependency injection for an optimizer and a learning scheduler is:
             scheduler: LRSchedulerCallable = torch.optim.lr_scheduler.ConstantLR,
         ):
             super().__init__()
+            self.save_hyperparameters()
             self.optimizer = optimizer
             self.scheduler = scheduler

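
For context, a minimal sketch of the pattern these additions enable; the module, its hyperparameters, and the checkpoint path are illustrative placeholders. Calling ``self.save_hyperparameters()`` in ``__init__`` records the init arguments in the checkpoint, so the model can later be rebuilt with ``load_from_checkpoint``:

    import torch
    from lightning.pytorch import LightningModule


    class MyModel(LightningModule):
        def __init__(self, hidden_dim: int = 64, lr: float = 1e-3):
            super().__init__()
            # Stores hidden_dim and lr in the checkpoint under "hyper_parameters"
            self.save_hyperparameters()
            self.layer = torch.nn.Linear(hidden_dim, 2)

        def configure_optimizers(self):
            return torch.optim.Adam(self.parameters(), lr=self.hparams.lr)


    # Once training has produced a checkpoint, the saved hyperparameters let
    # Lightning rebuild the module without repeating the init arguments:
    # model = MyModel.load_from_checkpoint("path/to/checkpoint.ckpt")
    # With subclass_mode_model=True, loading via the base class also works:
    # model = LightningModule.load_from_checkpoint("path/to/checkpoint.ckpt")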

docs/source-pytorch/common/checkpointing_intermediate.rst

Lines changed: 5 additions & 3 deletions
@@ -167,9 +167,11 @@ In distributed training cases where a model is running across many machines, Lig
     trainer = Trainer(strategy="ddp")
     model = MyLightningModule(hparams)
     trainer.fit(model)
+
     # Saves only on the main process
+    # Handles strategy-specific saving logic like XLA, FSDP, DeepSpeed etc.
     trainer.save_checkpoint("example.ckpt")

-Not using :meth:`~lightning.pytorch.trainer.trainer.Trainer.save_checkpoint` can lead to unexpected behavior and potential deadlock. Using other saving functions will result in all devices attempting to save the checkpoint. As a result, we highly recommend using the Trainer's save functionality.
-If using custom saving functions cannot be avoided, we recommend using the :func:`~lightning.pytorch.utilities.rank_zero.rank_zero_only` decorator to ensure saving occurs only on the main process. Note that this will only work if all ranks hold the exact same state and won't work when using
-model parallel distributed strategies such as deepspeed or sharded training.
+
+By using :meth:`~lightning.pytorch.trainer.trainer.Trainer.save_checkpoint` instead of ``torch.save``, you make your code agnostic to the distributed training strategy being used.
+It will ensure that checkpoints are saved correctly in a multi-process setting, avoiding race conditions, deadlocks and other common issues that normally require boilerplate code to handle properly.
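
For context, a minimal sketch of the recommended pattern; the ``BoringModel`` demo class, the two-device DDP setting, and the output path are illustrative assumptions rather than part of the diff:

    from lightning.pytorch import Trainer
    from lightning.pytorch.demos.boring_classes import BoringModel  # stand-in model for the sketch

    trainer = Trainer(strategy="ddp", devices=2, max_epochs=1)
    model = BoringModel()
    trainer.fit(model)

    # Every rank calls save_checkpoint(); the Trainer delegates to the active strategy,
    # so only the appropriate process(es) write, and sharded strategies (FSDP, DeepSpeed)
    # handle their state correctly.
    trainer.save_checkpoint("example.ckpt")

    # The manual alternative, torch.save(model.state_dict(), "example.ckpt"), would run on
    # every rank and needs rank-zero guards plus strategy-specific handling.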
