Commit 281efc6

Update data docs (#16839)
Co-authored-by: Justus Schock <[email protected]>
1 parent 67b94ef commit 281efc6

File tree

8 files changed: +237 −410 lines changed
File renamed without changes.
Lines changed: 43 additions & 0 deletions
@@ -0,0 +1,43 @@
+:orphan:
+
+Accessing DataLoaders
+=====================
+
+In the case that you require access to the :class:`torch.utils.data.DataLoader` or :class:`torch.utils.data.Dataset` objects, DataLoaders for each step can be accessed
+via the trainer properties :meth:`~lightning.pytorch.trainer.trainer.Trainer.train_dataloader`,
+:meth:`~lightning.pytorch.trainer.trainer.Trainer.val_dataloaders`,
+:meth:`~lightning.pytorch.trainer.trainer.Trainer.test_dataloaders`, and
+:meth:`~lightning.pytorch.trainer.trainer.Trainer.predict_dataloaders`.
+
+.. code-block:: python
+
+    dataloaders = trainer.train_dataloader
+    dataloaders = trainer.val_dataloaders
+    dataloaders = trainer.test_dataloaders
+    dataloaders = trainer.predict_dataloaders
+
+These properties will match exactly what was returned in your ``*_dataloader`` hooks or passed to the ``Trainer``,
+meaning that if you returned a dictionary of dataloaders, these will return a dictionary of dataloaders.
+
+Replacing DataLoaders
+---------------------
+
+If you are using a :class:`~lightning.pytorch.utilities.CombinedLoader`, a flattened list of DataLoaders can be accessed by doing:
+
+.. code-block:: python
+
+    from lightning.pytorch.utilities import CombinedLoader
+
+    iterables = {"dl1": dl1, "dl2": dl2}
+    combined_loader = CombinedLoader(iterables)
+    # access the original iterables
+    assert combined_loader.iterables is iterables
+    # the `.flattened` property can be convenient
+    assert combined_loader.flattened == [dl1, dl2]
+    # for example, to do a simple loop
+    updated = []
+    for dl in combined_loader.flattened:
+        new_dl = apply_some_transformation_to(dl)
+        updated.append(new_dl)
+    # it also allows you to easily replace the dataloaders
+    combined_loader.flattened = updated
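``apply_some_transformation_to`` above is a hypothetical helper, not part of the Lightning API. A minimal sketch of one possible transformation, assuming each loader was built with a plain ``batch_size`` (no custom ``batch_sampler``):

.. code-block:: python

    from torch.utils.data import DataLoader


    def apply_some_transformation_to(dl: DataLoader) -> DataLoader:
        # hypothetical example: rebuild the loader with a doubled batch size,
        # keeping the original dataset and worker settings
        return DataLoader(dl.dataset, batch_size=2 * dl.batch_size, num_workers=dl.num_workers)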

docs/source-pytorch/data/custom_data_iterables.rst renamed to docs/source-pytorch/data/alternatives.rst

Lines changed: 37 additions & 18 deletions
@@ -1,16 +1,17 @@
+:orphan:
+
 .. _dataiters:
 
-##################################
-Injecting 3rd Party Data Iterables
-##################################
+Using 3rd Party Data Iterables
+==============================
 
 When training a model on a specific task, data loading and preprocessing might become a bottleneck.
 Lightning does not enforce a specific data loading approach nor does it try to control it.
-The only assumption Lightning makes is that the data is returned as an iterable of batches.
+The only assumption Lightning makes is that a valid iterable is provided.
 
 For PyTorch-based programs, these iterables are typically instances of :class:`~torch.utils.data.DataLoader`.
-
-However, Lightning also supports other data types such as plain list of batches, generators or other custom iterables.
+However, Lightning also supports other data types such as a list of batches, generators, or other custom iterables or
+collections of the former.
 
 .. code-block:: python
 
@@ -20,13 +21,24 @@ However, Lightning also supports other data types such as plain list of batches,
     trainer = Trainer()
     trainer.fit(model, data)
 
-Examples for custom iterables include `NVIDIA DALI <https://github.com/NVIDIA/DALI>`__ or `FFCV <https://github.com/libffcv/ffcv>`__ for computer vision.
-Both libraries offer support for custom data loading and preprocessing (also hardware accelerated) and can be used with Lightning.
+Below we showcase Lightning examples with packages that compete with the generic PyTorch DataLoader and might be
+faster depending on your use case. They might require custom data serialization, loading, and preprocessing that
+is often hardware accelerated.
+
+.. TODO(carmocca)
+    StreamingDataset
+    ^^^^^^^^^^^^^^^^
+
+    The `StreamingDataset <https://github.com/mosaicml/streaming>`__
 
+FFCV
+^^^^
 
-For example, taking the example from FFCV's readme, we can use it with Lightning by just removing the hardcoded ``ToDevice(0)``
-as Lightning takes care of GPU placement. In case you want to use some data transformations on GPUs, change the
-``ToDevice(0)`` to ``ToDevice(self.trainer.local_rank)`` to correctly map to the desired GPU in your pipeline.
+Taking the example from the `FFCV <https://github.com/libffcv/ffcv>`__ readme, we can use it with Lightning
+by just removing the hardcoded ``ToDevice(0)`` as Lightning takes care of GPU placement. In case you want to use some
+data transformations on GPUs, change the ``ToDevice(0)`` to ``ToDevice(self.trainer.local_rank)`` to correctly map to
+the desired GPU in your pipeline. When moving data to a specific device, you can always refer to
+``self.trainer.local_rank`` to get the accelerator used by the current process.
 
 .. code-block:: python
 
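The FFCV ``Loader`` construction itself is elided between these hunks. As a rough sketch of the adjusted pipeline, with names taken from the FFCV readme and the ``.beton`` path, batch size, and field names assumed for illustration:

.. code-block:: python

    from ffcv.fields.decoders import IntDecoder, RandomResizedCropRGBImageDecoder
    from ffcv.loader import Loader, OrderOption
    from ffcv.transforms import ToDevice, ToTensor, ToTorchImage


    def train_dataloader(self):
        # map each pipeline to the GPU of the current process instead of a hardcoded 0
        device = self.trainer.local_rank
        image_pipeline = [
            RandomResizedCropRGBImageDecoder((224, 224)),
            ToTensor(),
            ToTorchImage(),
            ToDevice(device),
        ]
        label_pipeline = [IntDecoder(), ToTensor(), ToDevice(device)]
        return Loader(
            "/path/to/dataset.beton",  # assumed path to a pre-written FFCV dataset
            batch_size=32,
            num_workers=8,
            order=OrderOption.RANDOM,
            pipelines={"image": image_pipeline, "label": label_pipeline},
        )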
@@ -54,8 +66,15 @@ as Lightning takes care of GPU placement. In case you want to use some data tran
 
         return loader
 
-When moving data to a specific device, you can always refer to ``self.trainer.local_rank`` to get the accelerator
-used by the current process.
+
+.. TODO(carmocca)
+    WebDataset
+    ^^^^^^^^^^
+
+    The `WebDataset <https://webdataset.github.io/webdataset>`__
+
+NVIDIA DALI
+^^^^^^^^^^^
 
 By just changing ``device_id=0`` to ``device_id=self.trainer.local_rank`` we can also leverage DALI's GPU decoding:
 
@@ -107,8 +126,8 @@ Lightning works with all kinds of custom data iterables as shown above. There ar
 be supported this way. These restrictions come from the fact that for their support,
 Lightning needs to know a lot about the internals of these iterables.
 
-- In a distributed multi-GPU setting (ddp),
-  Lightning automatically replaces the DataLoader's sampler with its distributed counterpart.
-  This makes sure that each GPU sees a different part of the dataset.
-  As sampling can be implemented in arbitrary ways with custom iterables,
-  there is no way for Lightning to know how to replace the sampler.
+- In a distributed multi-GPU setting (ddp), Lightning wraps the DataLoader's sampler with a wrapper for distributed
+  support. This makes sure that each GPU sees a different part of the dataset. As sampling can be implemented in
+  arbitrary ways with custom iterables, Lightning might not be able to do this for you. If this is the case, you can use
+  the :paramref:`~lightning.pytorch.trainer.trainer.Trainer.use_distributed_sampler` argument to disable this logic and
+  set the distributed sampler yourself.
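To illustrate the bullet above, a minimal sketch of providing your own distributed sampler; ``self.train_dataset`` is an assumed attribute, and the hook runs after the process group is initialized, so :class:`~torch.utils.data.distributed.DistributedSampler` can infer rank and world size:

.. code-block:: python

    from torch.utils.data import DataLoader
    from torch.utils.data.distributed import DistributedSampler


    class LitModel(LightningModule):
        def train_dataloader(self):
            # Lightning will not replace or wrap this sampler because
            # the Trainer was created with use_distributed_sampler=False
            sampler = DistributedSampler(self.train_dataset, shuffle=True)
            return DataLoader(self.train_dataset, batch_size=32, sampler=sampler)


    trainer = Trainer(use_distributed_sampler=False)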

docs/source-pytorch/data/data.rst

Lines changed: 54 additions & 0 deletions
@@ -0,0 +1,54 @@
+.. _data:
+
+Complex data uses
+=================
+
+.. raw:: html
+
+    <div class="display-card-container">
+        <div class="row">
+
+.. displayitem::
+    :header: LightningDataModules
+    :description: Introduction to the LightningDataModule
+    :col_css: col-md-4
+    :button_link: datamodule.html
+    :height: 150
+    :tag: basic
+
+.. displayitem::
+    :header: Iterables
+    :description: What is an iterable? How do I use them?
+    :col_css: col-md-4
+    :button_link: iterables.html
+    :height: 150
+    :tag: basic
+
+.. displayitem::
+    :header: Access your data
+    :description: How to access your dataloaders
+    :col_css: col-md-4
+    :button_link: access.html
+    :height: 150
+    :tag: basic
+
+.. displayitem::
+    :header: Streaming datasets
+    :description: Using iterable-style datasets with Lightning
+    :col_css: col-md-4
+    :button_link: streaming.html
+    :height: 150
+    :tag: intermediate
+
+.. displayitem::
+    :header: Faster DataLoaders
+    :description: How alternative dataloader projects can be used with Lightning
+    :col_css: col-md-4
+    :button_link: alternatives.html
+    :height: 150
+    :tag: advanced
+
+.. raw:: html
+
+        </div>
+    </div>

docs/source-pytorch/data/datamodule.rst

Lines changed: 9 additions & 5 deletions
@@ -25,8 +25,6 @@ This class can then be shared and used anywhere:
 
 .. code-block:: python
 
-    from pl_bolts.datamodules import CIFAR10DataModule, ImagenetDataModule
-
     model = LitClassifier()
     trainer = Trainer()
 
@@ -56,8 +54,11 @@ Datamodules are for you if you ever asked the questions:
 *********************
 What is a DataModule?
 *********************
-A DataModule is simply a collection of a train_dataloader(s), val_dataloader(s), test_dataloader(s) and
-predict_dataloader(s) along with the matching transforms and data processing/downloads steps required.
+
+The :class:`~lightning.pytorch.core.datamodule.LightningDataModule` is a convenient way to manage data in PyTorch Lightning.
+It encapsulates training, validation, testing, and prediction dataloaders, as well as any necessary steps for data processing,
+downloads, and transformations. By using a :class:`~lightning.pytorch.core.datamodule.LightningDataModule`, you can
+easily develop dataset-agnostic models, hot-swap different datasets, and share data splits and transformations across projects.
 
 Here's a simple PyTorch example:
 
@@ -411,7 +412,10 @@ the method runs on the correct devices).
     trainer.test(datamodule=dm)
 
 You can access the currently used datamodule of a trainer via ``trainer.datamodule`` and the currently used
-dataloaders via ``trainer.train_dataloader``, ``trainer.val_dataloaders`` and ``trainer.test_dataloaders``.
+dataloaders via the trainer properties :meth:`~lightning.pytorch.trainer.trainer.Trainer.train_dataloader`,
+:meth:`~lightning.pytorch.trainer.trainer.Trainer.val_dataloaders`,
+:meth:`~lightning.pytorch.trainer.trainer.Trainer.test_dataloaders`, and
+:meth:`~lightning.pytorch.trainer.trainer.Trainer.predict_dataloaders`.
 
 
 ----------------
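A minimal sketch of such a module, using MNIST via torchvision purely for illustration (the dataset, split sizes, and paths are assumptions, not part of this diff):

.. code-block:: python

    import lightning.pytorch as pl
    from torch.utils.data import DataLoader, random_split
    from torchvision import datasets, transforms


    class MNISTDataModule(pl.LightningDataModule):
        def __init__(self, data_dir: str = "./data", batch_size: int = 32):
            super().__init__()
            self.data_dir = data_dir
            self.batch_size = batch_size

        def prepare_data(self):
            # called on a single process: a safe place to download
            datasets.MNIST(self.data_dir, train=True, download=True)

        def setup(self, stage: str):
            # called on every process: build the splits and transforms
            full = datasets.MNIST(self.data_dir, train=True, transform=transforms.ToTensor())
            self.train_set, self.val_set = random_split(full, [55000, 5000])

        def train_dataloader(self):
            return DataLoader(self.train_set, batch_size=self.batch_size)

        def val_dataloader(self):
            return DataLoader(self.val_set, batch_size=self.batch_size)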
Lines changed: 93 additions & 0 deletions
@@ -0,0 +1,93 @@
+:orphan:
+
+Arbitrary iterable support
+==========================
+
+Python iterables are objects that can be iterated or looped over. Examples of iterables in Python include lists and dictionaries.
+In PyTorch, a :class:`torch.utils.data.DataLoader` is also an iterable which typically retrieves data from a :class:`torch.utils.data.Dataset` or :class:`torch.utils.data.IterableDataset`.
+
+The :class:`~lightning.pytorch.trainer.trainer.Trainer` works with arbitrary iterables, but most people will use a :class:`torch.utils.data.DataLoader` as the iterable to feed data to the model.
+
+.. _multiple-dataloaders:
+
+Multiple Iterables
+------------------
+
+In addition to supporting arbitrary iterables, the ``Trainer`` also supports arbitrary collections of iterables. Some examples of this are:
+
+.. code-block:: python
+
+    return DataLoader(...)
+    return list(range(1000))
+
+    # pass loaders as a dict. This will create batches like this:
+    # {'a': batch_from_loader_a, 'b': batch_from_loader_b}
+    return {"a": DataLoader(...), "b": DataLoader(...)}
+
+    # pass loaders as list. This will create batches like this:
+    # [batch_from_dl_1, batch_from_dl_2]
+    return [DataLoader(...), DataLoader(...)]
+
+    # {'a': [batch_from_dl_1, batch_from_dl_2], 'b': [batch_from_dl_3, batch_from_dl_4]}
+    return {"a": [dl1, dl2], "b": [dl3, dl4]}
+
+Lightning automatically collates the batches from multiple iterables based on a "mode". This is done with our
+:class:`~lightning.pytorch.utilities.combined_loader.CombinedLoader` class.
+The list of modes available can be found by looking at the :paramref:`~lightning.pytorch.utilities.combined_loader.CombinedLoader.mode` documentation.
+
+By default, the ``"max_size_cycle"`` mode is used during training and the ``"sequential"`` mode is used during validation, testing, and prediction.
+To choose a different mode, you can use the :class:`~lightning.pytorch.utilities.combined_loader.CombinedLoader` class directly with your mode of choice:
+
+.. code-block:: python
+
+    from lightning.pytorch.utilities import CombinedLoader
+
+    iterables = {"a": DataLoader(), "b": DataLoader()}
+    combined_loader = CombinedLoader(iterables, mode="min_size")
+    model = ...
+    trainer = Trainer()
+    trainer.fit(model, combined_loader)
+
+
+Currently, the ``trainer.validate``, ``trainer.test``, and ``trainer.predict`` methods only support the ``"sequential"`` mode, while ``trainer.fit`` does not support it.
+Support for this feature is tracked in this `issue <https://github.com/Lightning-AI/lightning/issues/16830>`__.
+
+Note that when using the ``"sequential"`` mode, you need to add an additional argument ``dataloader_idx`` to some specific hooks.
+Lightning will `raise an error <https://github.com/Lightning-AI/lightning/pull/16837>`__ informing you of this requirement.
+
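As a sketch of what that looks like in practice with two validation iterables in ``"sequential"`` mode (the loss computation is a placeholder):

.. code-block:: python

    class LitModel(LightningModule):
        def validation_step(self, batch, batch_idx, dataloader_idx=0):
            # each iterable is consumed one after the other in "sequential" mode;
            # `dataloader_idx` identifies which iterable produced this batch
            loss = self._shared_step(batch)  # placeholder for your own logic
            self.log(f"val_loss/dl{dataloader_idx}", loss)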
+Using LightningDataModule
+-------------------------
+
+You can set more than one :class:`~torch.utils.data.DataLoader` in your :class:`~lightning.pytorch.core.datamodule.LightningDataModule` using its DataLoader hooks
+and Lightning will use the correct one.
+
+.. testcode::
+
+    class DataModule(LightningDataModule):
+        def train_dataloader(self):
+            # any iterable or collection of iterables
+            return DataLoader(self.train_dataset)
+
+        def val_dataloader(self):
+            # any iterable or collection of iterables
+            return [DataLoader(self.val_dataset_1), DataLoader(self.val_dataset_2)]
+
+        def test_dataloader(self):
+            # any iterable or collection of iterables
+            return DataLoader(self.test_dataset)
+
+        def predict_dataloader(self):
+            # any iterable or collection of iterables
+            return DataLoader(self.predict_dataset)
+
+Using LightningModule Hooks
+---------------------------
+
+The exact same code as above works when overriding :class:`~lightning.pytorch.core.module.LightningModule`.
+
+Passing the iterables to the Trainer
+------------------------------------
+
+The same support for arbitrary iterables, or collections of iterables, applies to the dataloader arguments of
+:meth:`~lightning.pytorch.trainer.trainer.Trainer.fit`, :meth:`~lightning.pytorch.trainer.trainer.Trainer.validate`,
+:meth:`~lightning.pytorch.trainer.trainer.Trainer.test`, and :meth:`~lightning.pytorch.trainer.trainer.Trainer.predict`.
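For example, a short usage sketch where ``model`` and the ``dl_*`` loaders are placeholders:

.. code-block:: python

    trainer = Trainer()
    # any iterable or collection of iterables works for each dataloader argument
    trainer.fit(model, train_dataloaders={"a": dl_a, "b": dl_b}, val_dataloaders=[dl_v1, dl_v2])
    trainer.validate(model, dataloaders=[dl_v1, dl_v2])
    trainer.test(model, dataloaders=dl_test)
    trainer.predict(model, dataloaders=dl_pred)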
