
Commit a775804

kaushikb11 and rohitgr7 authored
Update Plugins doc (#12440)
Co-authored-by: rohitgr7 <[email protected]>
1 parent 71e25f3 commit a775804

14 files changed: +70 −69 lines

docs/source/advanced/model_parallel.rst
Lines changed: 2 additions & 8 deletions

@@ -296,7 +296,6 @@ Below we show an example of running `ZeRO-Offload <https://www.deepspeed.ai/tuto
 .. code-block:: python
 
     from pytorch_lightning import Trainer
-    from pytorch_lightning.strategies import DeepSpeedStrategy
 
     model = MyModel()
     trainer = Trainer(accelerator="gpu", devices=4, strategy="deepspeed_stage_2_offload", precision=16)
@@ -341,7 +340,6 @@ For even more speed benefit, DeepSpeed offers an optimized CPU version of ADAM c
 
     import pytorch_lightning
     from pytorch_lightning import Trainer
-    from pytorch_lightning.strategies import DeepSpeedStrategy
     from deepspeed.ops.adam import DeepSpeedCPUAdam
 
 
@@ -385,7 +383,6 @@ Also please have a look at our :ref:`deepspeed-zero-stage-3-tips` which contains
 .. code-block:: python
 
     from pytorch_lightning import Trainer
-    from pytorch_lightning.strategies import DeepSpeedStrategy
     from deepspeed.ops.adam import FusedAdam
 
 
@@ -409,7 +406,6 @@ You can also use the Lightning Trainer to run predict or evaluate with DeepSpeed
 .. code-block:: python
 
     from pytorch_lightning import Trainer
-    from pytorch_lightning.strategies import DeepSpeedStrategy
 
 
     class MyModel(pl.LightningModule):
@@ -435,7 +431,6 @@ This reduces the time taken to initialize very large models, as well as ensure w
 
     import torch.nn as nn
     from pytorch_lightning import Trainer
-    from pytorch_lightning.strategies import DeepSpeedStrategy
     from deepspeed.ops.adam import FusedAdam
 
 
@@ -549,7 +544,6 @@ This saves memory when training larger models, however requires using a checkpoi
 .. code-block:: python
 
     from pytorch_lightning import Trainer
-    from pytorch_lightning.strategies import DeepSpeedStrategy
     import deepspeed
 
 
@@ -686,7 +680,7 @@ In some cases you may want to define your own DeepSpeed Config, to access all pa
     }
 
     model = MyModel()
-    trainer = Trainer(accelerator="gpu", devices=4, strategy=DeepSpeedStrategy(deepspeed_config), precision=16)
+    trainer = Trainer(accelerator="gpu", devices=4, strategy=DeepSpeedStrategy(config=deepspeed_config), precision=16)
     trainer.fit(model)
 
 
@@ -699,7 +693,7 @@ We support taking the config as a json formatted file:
 
     model = MyModel()
     trainer = Trainer(
-        accelerator="gpu", devices=4, strategy=DeepSpeedStrategy("/path/to/deepspeed_config.json"), precision=16
+        accelerator="gpu", devices=4, strategy=DeepSpeedStrategy(config="/path/to/deepspeed_config.json"), precision=16
     )
     trainer.fit(model)
 
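For readers following along, the sketch below assembles the updated pattern from this diff (passing the config through the explicit ``config=`` keyword) into one self-contained example. ``MyModel`` is a placeholder LightningModule and the DeepSpeed config keys are only illustrative, not part of the change; running it assumes DeepSpeed and multiple GPUs are available.

    import torch
    from pytorch_lightning import LightningModule, Trainer
    from pytorch_lightning.strategies import DeepSpeedStrategy


    class MyModel(LightningModule):  # placeholder model for illustration
        def __init__(self):
            super().__init__()
            self.layer = torch.nn.Linear(32, 2)

        def training_step(self, batch, batch_idx):
            # trivial loss just to make the example self-contained
            return self.layer(batch).sum()

        def configure_optimizers(self):
            return torch.optim.Adam(self.parameters())


    # Illustrative config dict; pass it via the `config=` keyword as the doc now shows.
    deepspeed_config = {"zero_optimization": {"stage": 2, "offload_optimizer": {"device": "cpu"}}}

    model = MyModel()
    trainer = Trainer(accelerator="gpu", devices=4, strategy=DeepSpeedStrategy(config=deepspeed_config), precision=16)
    # trainer.fit(model, train_dataloaders=...)  # requires DeepSpeed and multiple GPUs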

docs/source/advanced/training_tricks.rst
Lines changed: 1 addition & 2 deletions

@@ -331,8 +331,7 @@ However, for in-memory datasets, that means that each process will hold a (redun
 For example, when training Graph Neural Networks, a common strategy is to load the entire graph into CPU memory for fast access to the entire graph structure and its features, and to then perform neighbor sampling to obtain mini-batches that fit onto the GPU.
 
 A simple way to prevent redundant dataset replicas is to rely on :obj:`torch.multiprocessing` to share the `data automatically between spawned processes via shared memory <https://pytorch.org/docs/stable/notes/multiprocessing.html>`_.
-For this, all data pre-loading should be done on the main process inside :meth:`DataModule.__init__`. As a result, all tensor-data will get automatically shared when using the :class:`~pytorch_lightning.plugins.strategies.ddp_spawn.DDPSpawnStrategy`
-training type strategy:
+For this, all data pre-loading should be done on the main process inside :meth:`DataModule.__init__`. As a result, all tensor-data will get automatically shared when using the :class:`~pytorch_lightning.plugins.strategies.ddp_spawn.DDPSpawnStrategy` strategy.
 
 .. warning::
 
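As a hedged illustration of the pattern this hunk describes (pre-loading tensors once in ``DataModule.__init__`` so ``torch.multiprocessing`` can share them across ``ddp_spawn`` workers), a minimal sketch might look like the following; the dataset contents and sizes are made up for the example.

    import torch
    from torch.utils.data import DataLoader, TensorDataset
    from pytorch_lightning import LightningDataModule, Trainer


    class InMemoryDataModule(LightningDataModule):
        def __init__(self):
            super().__init__()
            # Load/pre-process the full dataset once on the main process.
            # These tensors are later shared with spawned workers via shared
            # memory instead of being re-loaded in every process.
            self.features = torch.randn(10_000, 128)
            self.targets = torch.randint(0, 2, (10_000,))

        def train_dataloader(self):
            return DataLoader(TensorDataset(self.features, self.targets), batch_size=64)


    datamodule = InMemoryDataModule()
    trainer = Trainer(strategy="ddp_spawn", accelerator="cpu", devices=4)
    # trainer.fit(model, datamodule=datamodule)  # `model` would be your LightningModule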

docs/source/common/checkpointing.rst
Lines changed: 2 additions & 1 deletion

@@ -315,6 +315,7 @@ and the Lightning Team will be happy to integrate/help integrate it.
 
 -----------
 
+.. _customize_checkpointing:
 
 ***********************
 Customize Checkpointing
@@ -392,7 +393,7 @@ Custom Checkpoint IO Plugin
 
 .. note::
 
-    Some ``TrainingTypePlugins`` like ``DeepSpeedStrategy`` do not support custom ``CheckpointIO`` as checkpointing logic is not modifiable.
+    Some strategies like :class:`~pytorch_lightning.strategies.deepspeed.DeepSpeedStrategy` do not support custom :class:`~pytorch_lightning.plugins.io.checkpoint_plugin.CheckpointIO` as checkpointing logic is not modifiable.
 
 -----------
 
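To make the custom ``CheckpointIO`` note above concrete, here is a minimal sketch of a custom plugin. It assumes the ``CheckpointIO`` base class with ``save_checkpoint``/``load_checkpoint``/``remove_checkpoint`` hooks; the plain ``torch.save``/``torch.load`` bodies and the class name ``SimpleCheckpointIO`` are illustrative only.

    import os

    import torch
    from pytorch_lightning import Trainer
    from pytorch_lightning.plugins.io import CheckpointIO


    class SimpleCheckpointIO(CheckpointIO):
        """Illustrative CheckpointIO that delegates to torch.save/torch.load."""

        def save_checkpoint(self, checkpoint, path, storage_options=None):
            torch.save(checkpoint, path)

        def load_checkpoint(self, path, storage_options=None):
            return torch.load(path)

        def remove_checkpoint(self, path):
            if os.path.exists(path):
                os.remove(path)


    # The plugin is passed through the Trainer's `plugins` argument.
    trainer = Trainer(plugins=[SimpleCheckpointIO()])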

docs/source/common/lightning_module.rst
Lines changed: 1 addition & 1 deletion

@@ -1056,7 +1056,7 @@ automatic_optimization
 When set to ``False``, Lightning does not automate the optimization process. This means you are responsible for handling
 your optimizers. However, we do take care of precision and any accelerators used.
 
-See :ref:`manual optimization<common/optimization:Manual optimization>` for details.
+See :ref:`manual optimization <common/optimization:Manual optimization>` for details.
 
 .. code-block:: python
 
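Since the touched passage is about ``automatic_optimization = False``, a brief manual-optimization sketch may help; it uses the standard ``self.optimizers()`` / ``self.manual_backward()`` hooks, and the single-optimizer, single-linear-layer setup is an assumption for the example.

    import torch
    from pytorch_lightning import LightningModule


    class ManualOptimModel(LightningModule):
        def __init__(self):
            super().__init__()
            self.layer = torch.nn.Linear(32, 2)
            # Opt out of Lightning's automatic optimization.
            self.automatic_optimization = False

        def training_step(self, batch, batch_idx):
            opt = self.optimizers()
            loss = self.layer(batch).sum()
            opt.zero_grad()
            # manual_backward() keeps precision/accelerator handling intact.
            self.manual_backward(loss)
            opt.step()

        def configure_optimizers(self):
            return torch.optim.SGD(self.parameters(), lr=0.1)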

docs/source/common/trainer.rst
Lines changed: 2 additions & 2 deletions

@@ -1445,7 +1445,7 @@ checkpoint, training will start from the beginning of the next epoch.
 strategy
 ^^^^^^^^
 
-Supports passing different training strategies with aliases (ddp, ddp_spawn, etc) as well as custom training type plugins.
+Supports passing different training strategies with aliases (ddp, ddp_spawn, etc) as well as custom strategies.
 
 .. code-block:: python
 
@@ -1455,7 +1455,7 @@ Supports passing different training strategies with aliases (ddp, ddp_spawn, etc
     # Training with the DDP Spawn strategy using 4 cpu processes
     trainer = Trainer(strategy="ddp_spawn", accelerator="cpu", devices=4)
 
-.. note:: Additionally, you can pass your custom training type plugins to the ``strategy`` argument.
+.. note:: Additionally, you can pass your custom strategy to the ``strategy`` argument.
 
 .. code-block:: python
 
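As a concrete companion to the reworded note, a strategy object can be passed directly to ``strategy``; the ``DDPStrategy(find_unused_parameters=False)`` configuration and the 4-GPU setup below are just one plausible example, not part of this diff.

    from pytorch_lightning import Trainer
    from pytorch_lightning.strategies import DDPStrategy

    # Pass a configured strategy instance instead of a string alias such as "ddp".
    trainer = Trainer(
        strategy=DDPStrategy(find_unused_parameters=False),
        accelerator="gpu",
        devices=4,
    )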

docs/source/extensions/plugins.rst
Lines changed: 49 additions & 41 deletions

@@ -6,54 +6,32 @@ Plugins
 
 .. include:: ../links.rst
 
-Plugins allow custom integrations to the internals of the Trainer such as a custom precision or
-distributed implementation.
+Plugins allow custom integrations to the internals of the Trainer such as custom precision, checkpointing or
+cluster environment implementation.
 
 Under the hood, the Lightning Trainer is using plugins in the training routine, added automatically
-depending on the provided Trainer arguments. For example:
+depending on the provided Trainer arguments.
 
-.. code-block:: python
-
-    # accelerator: GPUAccelerator
-    # training strategy: DDPStrategy
-    # precision: NativeMixedPrecisionPlugin
-    trainer = Trainer(accelerator="gpu", devices=4, precision=16)
-
-
-We expose Accelerators and Plugins mainly for expert users that want to extend Lightning for:
-
-- New hardware (like TPU plugin)
-- Distributed backends (e.g. a backend not yet supported by
-  `PyTorch <https://pytorch.org/docs/stable/distributed.html#backends>`_ itself)
-- Clusters (e.g. customized access to the cluster's environment interface)
-
-There are two types of Plugins in Lightning with different responsibilities:
-
-Strategy
---------
-
-- Launching and teardown of training processes (if applicable)
-- Setup communication between processes (NCCL, GLOO, MPI, ...)
-- Provide a unified communication interface for reduction, broadcast, etc.
-- Provide access to the wrapped LightningModule
+There are three types of Plugins in Lightning with different responsibilities:
 
+- Precision Plugins
+- CheckpointIO Plugins
+- Cluster Environments
 
-Furthermore, for multi-node training Lightning provides cluster environment plugins that allow the advanced user
-to configure Lightning to integrate with a :ref:`custom-cluster`.
 
+*****************
+Precision Plugins
+*****************
 
-.. image:: ../_static/images/accelerator/overview.svg
-
-
-The full list of built-in plugins is listed below.
-
+We provide precision plugins for you to benefit from numerical representations with lower precision than
+32-bit floating-point or higher precision, such as 64-bit floating-point.
 
-.. warning:: The Plugin API is in beta and subject to change.
-    For help setting up custom plugins/accelerators, please reach out to us at **[email protected]**
+.. code-block:: python
 
+    # Training with 16-bit precision
+    trainer = Trainer(precision=16)
 
-Precision Plugins
------------------
+The full list of built-in precision plugins is listed below.
 
 .. currentmodule:: pytorch_lightning.plugins.precision
 
@@ -74,9 +52,39 @@ Precision Plugins
     TPUBf16PrecisionPlugin
     TPUPrecisionPlugin
 
+More information regarding precision with Lightning can be found :doc:`here <../advanced/precision>`
+
+-----------
+
+********************
+CheckpointIO Plugins
+********************
 
+As part of our commitment to extensibility, we have abstracted Lightning's checkpointing logic into the :class:`~pytorch_lightning.plugins.io.CheckpointIO` plugin.
+With this, you have the ability to customize the checkpointing logic to match the needs of your infrastructure.
+
+Below is a list of built-in plugins for checkpointing.
+
+.. currentmodule:: pytorch_lightning.plugins.io
+
+.. autosummary::
+    :nosignatures:
+    :template: classtemplate.rst
+
+    CheckpointIO
+    HPUCheckpointIO
+    TorchCheckpointIO
+    XLACheckpointIO
+
+You could learn more about custom checkpointing with Lightning :ref:`here <customize_checkpointing>`.
+
+-----------
+
+********************
 Cluster Environments
---------------------
+********************
+
+You can define the interface of your own cluster environment based on the requirements of your infrastructure.
 
 .. currentmodule:: pytorch_lightning.plugins.environments
 
@@ -85,8 +93,8 @@ Cluster Environments
     :template: classtemplate.rst
 
     ClusterEnvironment
+    KubeflowEnvironment
     LightningEnvironment
     LSFEnvironment
-    TorchElasticEnvironment
-    KubeflowEnvironment
     SLURMEnvironment
+    TorchElasticEnvironment
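Tying the three plugin categories from the rewritten page together, a sketch of passing built-in plugins through the Trainer might look like the following; the SLURM-managed cluster and four-GPU DDP setup are assumptions made for illustration.

    from pytorch_lightning import Trainer
    from pytorch_lightning.plugins.environments import SLURMEnvironment
    from pytorch_lightning.plugins.io import TorchCheckpointIO

    # CheckpointIO and ClusterEnvironment plugins go through `plugins`,
    # while precision is selected with the `precision` flag shown above.
    trainer = Trainer(
        accelerator="gpu",
        devices=4,
        strategy="ddp",
        precision=16,
        plugins=[TorchCheckpointIO(), SLURMEnvironment()],
    )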

docs/source/starter/lightning_lite.rst
Lines changed: 1 addition & 1 deletion

@@ -387,7 +387,7 @@ Choose a training strategy: ``"dp"``, ``"ddp"``, ``"ddp_spawn"``, ``"tpu_spawn"`
     lite = Lite(strategy="ddp_spawn", accelerator="cpu", devices=4)
 
 
-Additionally, you can pass in your custom training type strategy by configuring additional parameters.
+Additionally, you can pass in your custom strategy by configuring additional parameters.
 
 .. code-block:: python
 
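For the Lite wording change, a hedged sketch of passing a configured strategy object to Lite could look like this; ``DDPSpawnStrategy(find_unused_parameters=False)`` and the CPU/4-process setup are chosen only for illustration and are not taken from the diff.

    from pytorch_lightning.lite import LightningLite
    from pytorch_lightning.strategies import DDPSpawnStrategy


    class Lite(LightningLite):
        def run(self):
            # set up models/optimizers with self.setup(...) and write the training loop here
            ...


    # A strategy instance can be passed instead of a string alias like "ddp_spawn".
    lite = Lite(strategy=DDPSpawnStrategy(find_unused_parameters=False), accelerator="cpu", devices=4)
    lite.run()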

pytorch_lightning/loops/optimization/optimizer_loop.py
Lines changed: 1 addition & 1 deletion

@@ -235,7 +235,7 @@ def _run_optimization(
         closure = self._make_closure(split_batch, batch_idx, opt_idx, optimizer)
 
         if (
-            # when the training type plugin handles accumulation, we want to always call the optimizer step
+            # when the strategy handles accumulation, we want to always call the optimizer step
             not self.trainer.strategy.handles_gradient_accumulation
             and self.trainer.fit_loop._should_accumulate()
         ):

pytorch_lightning/strategies/strategy.py
Lines changed: 1 addition & 2 deletions

@@ -40,8 +40,7 @@
 
 
 class Strategy(ABC):
-    """Base class for all training type plugins that change the behaviour of the training, validation and test-
-    loop."""
+    """Base class for all strategies that change the behaviour of the training, validation and test- loop."""
 
     def __init__(
         self,

pytorch_lightning/trainer/trainer.py
Lines changed: 5 additions & 5 deletions

@@ -401,7 +401,7 @@ def __init__(
                 Please pass the path to ``Trainer.fit(..., ckpt_path=...)`` instead.
 
             strategy: Supports different training strategies with aliases
-                as well custom training type plugins.
+                as well custom strategies.
                 Default: ``None``.
 
             sync_batchnorm: Synchronize batch norm layers between process groups/whole world.
@@ -1152,7 +1152,7 @@ def _run(
         if hasattr(model, "hparams"):
             parsing.clean_namespace(model.hparams)
 
-        # attach model to the training type plugin
+        # attach model to the strategy
         self.strategy.connect(model)
 
         self._callback_connector._attach_model_callbacks()
@@ -2035,17 +2035,17 @@ def global_rank(self) -> int:
 
     @property
     def local_rank(self) -> int:
-        # some training types define a local rank
+        # some strategies define a local rank
         return getattr(self.strategy, "local_rank", 0)
 
     @property
    def node_rank(self) -> int:
-        # some training types define a node rank
+        # some strategies define a node rank
        return getattr(self.strategy, "node_rank", 0)
 
     @property
     def world_size(self) -> int:
-        # some training types define a world size
+        # some strategies define a world size
         return getattr(self.strategy, "world_size", 1)
 
     @property
