README.md: 30 additions & 26 deletions
@@ -1,11 +1,11 @@
<!--$UNCOMMENT(ray-lightning)=-->

# Distributed PyTorch Lightning Training on Ray
- This library adds new PyTorch Lightning plugins for distributed training using the Ray distributed computing framework.
+ This library adds new PyTorch Lightning strategies for distributed training using the Ray distributed computing framework.

- These PyTorch Lightning Plugins on Ray enable quick and easy parallel training while still leveraging all the benefits of PyTorch Lightning and using your desired training protocol, either [PyTorch Distributed Data Parallel](https://pytorch.org/tutorials/intermediate/ddp_tutorial.html) or [Horovod](https://github.com/horovod/horovod).
+ These PyTorch Lightning strategies on Ray enable quick and easy parallel training while still leveraging all the benefits of PyTorch Lightning and using your desired training protocol, either [PyTorch Distributed Data Parallel](https://pytorch.org/tutorials/intermediate/ddp_tutorial.html) or [Horovod](https://github.com/horovod/horovod).

- Once you add your plugin to the PyTorch Lightning Trainer, you can parallelize training to all the cores in your laptop, or across a massive multi-node, multi-GPU cluster with no additional code changes.
+ Once you add your strategy to the PyTorch Lightning Trainer, you can parallelize training to all the cores in your laptop, or across a massive multi-node, multi-GPU cluster with no additional code changes.

This library also comes with an integration with <!--$UNCOMMENT{ref}`Ray Tune <tune-main>`--><!--$REMOVE-->[Ray Tune](https://tune.io)<!--$END_REMOVE--> for distributed hyperparameter tuning experiments.
@@ -39,29 +39,30 @@ Here are the supported PyTorch Lightning versions:
|---|---|
| 0.1 | 1.4 |
| 0.2 | 1.5 |
- | master | 1.5 |
+ | 0.3 | 1.6 |
+ | master | 1.6 |

- ## PyTorch Distributed Data Parallel Plugin on Ray
- The `RayPlugin` provides Distributed Data Parallel training on a Ray cluster. PyTorch DDP is used as the distributed training protocol, and Ray is used to launch and manage the training worker processes.
+ ## PyTorch Distributed Data Parallel Strategy on Ray
+ The `RayStrategy` provides Distributed Data Parallel training on a Ray cluster. PyTorch DDP is used as the distributed training protocol, and Ray is used to launch and manage the training worker processes.

[...]

# The actual number of GPUs is determined by ``num_workers``.
- trainer = pl.Trainer(..., plugins=[plugin])
+ trainer = pl.Trainer(..., strategy=strategy)
trainer.fit(ptl_model)
```

- Because Ray is used to launch processes, instead of the same script being called multiple times, you CAN use this plugin even in cases when you cannot use the standard `DDPPlugin` such as
+ Because Ray is used to launch processes, instead of the same script being called multiple times, you CAN use this strategy even in cases when you cannot use the standard `DDPStrategy` such as

- Jupyter Notebooks, Google Colab, Kaggle
- Calling `fit` or `test` multiple times in the same script
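For context on the renamed API in the hunk above, here is a minimal, self-contained sketch of how `RayStrategy` would be wired into a Trainer. The constructor arguments (`num_workers`, `num_cpus_per_worker`, `use_gpu`) and the tiny `BoringModel` are illustrative assumptions, not lines taken from the README:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

import pytorch_lightning as pl
import ray
from ray_lightning import RayStrategy


class BoringModel(pl.LightningModule):
    """Tiny stand-in model so the example runs end to end."""

    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return torch.nn.functional.mse_loss(self.layer(x), y)

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.01)


if __name__ == "__main__":
    ray.init()  # or ray.init(address="auto") to connect to an existing cluster

    train_loader = DataLoader(
        TensorDataset(torch.randn(64, 32), torch.randn(64, 1)), batch_size=8
    )

    # Two Ray workers, CPU only; set use_gpu=True to give each worker a GPU.
    strategy = RayStrategy(num_workers=2, num_cpus_per_worker=1, use_gpu=False)
    trainer = pl.Trainer(max_epochs=1, strategy=strategy)
    trainer.fit(BoringModel(), train_loader)
```

Passing the strategy object to `strategy=` (rather than `plugins=[...]`) is the change this diff makes throughout.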
@@ -94,40 +95,40 @@ Now you can run your training script on the laptop, but have it execute as if you...

**Note:** When using with Ray Client, you must disable checkpointing and logging for your Trainer by setting `checkpoint_callback` and `logger` to `False`.

- ## Horovod Plugin on Ray
- Or if you prefer to use Horovod as the distributed training protocol, use the `HorovodRayPlugin` instead.
+ ## Horovod Strategy on Ray
+ Or if you prefer to use Horovod as the distributed training protocol, use the `HorovodRayStrategy` instead.

[...]

# The actual number of GPUs is determined by ``num_workers``.
- trainer = pl.Trainer(..., plugins=[plugin])
+ trainer = pl.Trainer(..., strategy=strategy)
trainer.fit(ptl_model)
```

See the [Pytorch Lightning docs](https://pytorch-lightning.readthedocs.io/en/stable/advanced/model_parallel.html#sharded-training) for more information on sharded training.
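A minimal sketch of the Horovod path introduced in this hunk; `HorovodRayStrategy`'s arguments are assumed to mirror `RayStrategy`'s, and `MyLightningModule` / `train_loader` are hypothetical placeholders for your own module and data loader:

```python
import pytorch_lightning as pl
import ray
from ray_lightning import HorovodRayStrategy

ray.init()  # connect to or start a local Ray cluster

# One Horovod training process per Ray worker (assumed arguments: num_workers, use_gpu).
strategy = HorovodRayStrategy(num_workers=2, use_gpu=False)
trainer = pl.Trainer(max_epochs=1, strategy=strategy)
trainer.fit(MyLightningModule(), train_loader)  # hypothetical model and DataLoader
```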
@@ -140,7 +141,7 @@ Example using `ray_lightning` with Tune:
```python
from ray import tune

- from ray_lightning import RayPlugin
+ from ray_lightning import RayStrategy
from ray_lightning.examples.ray_ddp_example import MNISTClassifier
from ray_lightning.tune import TuneReportCallback, get_tune_resources
# ...
```
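The rest of the Tune example is elided in this hunk, so here is a rough sketch of how these imports typically fit together. The search space, the metric name, `MyLightningModule`, and the `TuneReportCallback(metrics, on="validation_end")` signature are illustrative assumptions, not the README's actual example:

```python
import pytorch_lightning as pl
from ray import tune
from ray_lightning import RayStrategy
from ray_lightning.tune import TuneReportCallback, get_tune_resources

NUM_WORKERS = 2


def train_fn(config):
    # `MyLightningModule` is a hypothetical LightningModule that logs "val_loss".
    model = MyLightningModule(lr=config["lr"])
    trainer = pl.Trainer(
        max_epochs=2,
        strategy=RayStrategy(num_workers=NUM_WORKERS, use_gpu=False),
        callbacks=[TuneReportCallback({"loss": "val_loss"}, on="validation_end")],
    )
    trainer.fit(model)


analysis = tune.run(
    train_fn,
    config={"lr": tune.loguniform(1e-4, 1e-1)},
    num_samples=4,
    # Each trial reserves num_workers * num_cpus_per_worker + 1 CPUs
    # (the extra CPU is for the Trainable driver).
    resources_per_trial=get_tune_resources(num_workers=NUM_WORKERS),
)
print("Best hyperparameters found were: ", analysis.best_config)
```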
@@ -184,26 +185,29 @@ print("Best hyperparameters found were: ", analysis.best_config)
**Note:** Ray Tune requires 1 additional CPU per trial to use for the Trainable driver. So the actual number of resources each trial requires is `num_workers * num_cpus_per_worker + 1`.

## FAQ
- > I see that `RayPlugin` is based off of Pytorch Lightning's `DDPSpawnPlugin`. However, doesn't the PTL team discourage the use of spawn?
+ > I see that `RayStrategy` is based off of Pytorch Lightning's `DDPSpawnStrategy`. However, doesn't the PTL team discourage the use of spawn?

As discussed [here](https://github.com/pytorch/pytorch/issues/51688#issuecomment-773539003), using a spawn approach instead of launch is not all that detrimental. The original factors for discouraging spawn were:
1. not being able to use 'spawn' in a Jupyter or Colab notebook, and
2. not being able to use multiple workers for data loading.

- Neither of these should be an issue with the `RayPlugin` due to Ray's serialization mechanisms. The only thing to keep in mind is that when using this plugin, your model does have to be serializable/pickleable.
+ Neither of these should be an issue with the `RayStrategy` due to Ray's serialization mechanisms. The only thing to keep in mind is that when using this strategy, your model does have to be serializable/pickleable.
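A quick way to check the pickleability requirement mentioned above before launching a long run (a sketch; `MyLightningModule` is a hypothetical placeholder, and `ray.cloudpickle` is the serializer bundled with Ray):

```python
import ray.cloudpickle as cloudpickle

# `MyLightningModule` is a hypothetical placeholder for your own LightningModule.
model = MyLightningModule()

# Raises (e.g. TypeError) if the model holds state that cannot be serialized
# and shipped to the Ray worker processes.
cloudpickle.dumps(model)
```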
[...]

+ > Extension horovod.torch has not been built: /home/ubuntu/anaconda3/envs/tensorflow2_p38/lib/python3.8/site-packages/horovod/torch/mpi_lib/_mpi_lib.cpython-38-x86_64-linux-gnu.so not found
+ > If this is not expected, reinstall Horovod with HOROVOD_WITH_PYTORCH=1 to debug the build error.
+ > Warning! MPI libs are missing, but python applications are still avaiable.
+ > ```
+
+ One might fix this issue by:
+ ```bash
+ $ pip uninstall -y horovod
+ $ conda install gcc_linux-64 gxx_linux-64
+ $ [flags] pip install --no-cache-dir horovod
+ ```
+
+ (from [here](https://github.com/horovod/horovod/issues/656), [here](https://github.com/tlkh/ai-lab/issues/27) and [here](https://horovod.readthedocs.io/en/stable/install_include.html))
+
+ - Install Horovod from scratch with torch:
+
+ ```bash
+ conda create -n hd python=3.8 scipy numpy pandas -y
+ # [...]
+ ```
+ [reference 1](https://stackoverflow.com/questions/54948216/usr-lib-x86-64-linux-gnu-libstdc-so-6-version-glibcxx-3-4-21-not-found-req), [reference 2](https://github.com/horovod/horovod/issues/401), [reference 3](https://github.com/Lightning-AI/lightning/issues/4472), [reference 4](https://github.com/horovod/horovod/issues/2276), [reference 5](https://github.com/Lightning-AI/lightning/blob/master/dockers/base-cuda/Dockerfile#L105-L121), [reference 6](https://horovod.readthedocs.io/en/stable/gpus_include.html), [reference 7](https://horovod.readthedocs.io/en/stable/conda_include.html), [reference 8](https://github.com/horovod/horovod/issues/3545), [reference 9](https://github.com/KAUST-CTL/horovod-gpu-data-science-project) and [reference 10](https://kose-y.github.io/blog/2017/12/installing-cuda-aware-mpi/)
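After reinstalling Horovod with either approach above, a quick sanity check (not part of the README) that the PyTorch extension was actually built:

```python
import horovod.torch as hvd  # fails with "Extension horovod.torch has not been built" if the build is still broken

hvd.init()
print(f"Horovod initialized: rank {hvd.rank()} of {hvd.size()}")
```

Alternatively, `horovodrun --check-build` prints which frameworks and controllers Horovod was compiled with.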