This repository was archived by the owner on Nov 3, 2023. It is now read-only.

Commit 299a776: Support PyTorch Lightning 1.6 (#163)
1 parent 6aed848

27 files changed: +1193 / -760 lines

.github/workflows/test.yaml

Lines changed: 3 additions & 3 deletions
@@ -35,7 +35,7 @@ jobs:
           python -m pip install --upgrade pip
           python -m pip install --upgrade setuptools
           python -m pip install codecov
-          python -m pip install -U https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-2.0.0.dev0-cp37-cp37m-manylinux2014_x86_64.whl
+          python -m pip install -U https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-3.0.0.dev0-cp37-cp37m-manylinux2014_x86_64.whl
           if [ -f requirements-test.txt ]; then python -m pip install -r requirements-test.txt; fi
       - name: Install package
         run: |
@@ -60,7 +60,7 @@ jobs:
           python -m pip install --upgrade pip
           python -m pip install --upgrade setuptools
           python -m pip install codecov
-          python -m pip install -U https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-2.0.0.dev0-cp37-cp37m-manylinux2014_x86_64.whl
+          python -m pip install -U https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-3.0.0.dev0-cp37-cp37m-manylinux2014_x86_64.whl
           if [ -f requirements-test.txt ]; then python -m pip install -r requirements-test.txt; fi
           HOROVOD_WITH_GLOO=1 HOROVOD_WITHOUT_MPI=1 HOROVOD_WITHOUT_MXNET=1 pip install git+https://github.com/horovod/horovod.git
       - name: Install package
@@ -86,7 +86,7 @@ jobs:
           python -m pip install --upgrade pip
           python -m pip install --upgrade setuptools
           python -m pip install codecov
-          python -m pip install -U https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-2.0.0.dev0-cp37-cp37m-manylinux2014_x86_64.whl
+          python -m pip install -U https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-3.0.0.dev0-cp37-cp37m-manylinux2014_x86_64.whl
           if [ -f requirements-test.txt ]; then python -m pip install -r requirements-test.txt; fi
           HOROVOD_WITH_GLOO=1 HOROVOD_WITHOUT_MPI=1 HOROVOD_WITHOUT_MXNET=1 pip install git+https://github.com/horovod/horovod.git
       - name: Install package

README.md

Lines changed: 30 additions & 26 deletions
@@ -1,11 +1,11 @@
 <!--$UNCOMMENT(ray-lightning)=-->

 # Distributed PyTorch Lightning Training on Ray
-This library adds new PyTorch Lightning plugins for distributed training using the Ray distributed computing framework.
+This library adds new PyTorch Lightning strategies for distributed training using the Ray distributed computing framework.

-These PyTorch Lightning Plugins on Ray enable quick and easy parallel training while still leveraging all the benefits of PyTorch Lightning and using your desired training protocol, either [PyTorch Distributed Data Parallel](https://pytorch.org/tutorials/intermediate/ddp_tutorial.html) or [Horovod](https://github.com/horovod/horovod).
+These PyTorch Lightning strategies on Ray enable quick and easy parallel training while still leveraging all the benefits of PyTorch Lightning and using your desired training protocol, either [PyTorch Distributed Data Parallel](https://pytorch.org/tutorials/intermediate/ddp_tutorial.html) or [Horovod](https://github.com/horovod/horovod).

-Once you add your plugin to the PyTorch Lightning Trainer, you can parallelize training to all the cores in your laptop, or across a massive multi-node, multi-GPU cluster with no additional code changes.
+Once you add your strategy to the PyTorch Lightning Trainer, you can parallelize training to all the cores in your laptop, or across a massive multi-node, multi-GPU cluster with no additional code changes.

 This library also comes with an integration with <!--$UNCOMMENT{ref}`Ray Tune <tune-main>`--><!--$REMOVE-->[Ray Tune](https://tune.io)<!--$END_REMOVE--> for distributed hyperparameter tuning experiments.

@@ -39,29 +39,30 @@ Here are the supported PyTorch Lightning versions:
 |---|---|
 | 0.1 | 1.4 |
 | 0.2 | 1.5 |
-| master | 1.5 |
+| 0.3 | 1.6 |
+| master | 1.6 |


-## PyTorch Distributed Data Parallel Plugin on Ray
-The `RayPlugin` provides Distributed Data Parallel training on a Ray cluster. PyTorch DDP is used as the distributed training protocol, and Ray is used to launch and manage the training worker processes.
+## PyTorch Distributed Data Parallel Strategy on Ray
+The `RayStrategy` provides Distributed Data Parallel training on a Ray cluster. PyTorch DDP is used as the distributed training protocol, and Ray is used to launch and manage the training worker processes.

 Here is a simplified example:

 ```python
 import pytorch_lightning as pl
-from ray_lightning import RayPlugin
+from ray_lightning import RayStrategy

 # Create your PyTorch Lightning model here.
 ptl_model = MNISTClassifier(...)
-plugin = RayPlugin(num_workers=4, num_cpus_per_worker=1, use_gpu=True)
+strategy = RayStrategy(num_workers=4, num_cpus_per_worker=1, use_gpu=True)

 # Don't set ``gpus`` in the ``Trainer``.
 # The actual number of GPUs is determined by ``num_workers``.
-trainer = pl.Trainer(..., plugins=[plugin])
+trainer = pl.Trainer(..., strategy=strategy)
 trainer.fit(ptl_model)
 ```

-Because Ray is used to launch processes, instead of the same script being called multiple times, you CAN use this plugin even in cases when you cannot use the standard `DDPPlugin` such as
+Because Ray is used to launch processes, instead of the same script being called multiple times, you CAN use this strategy even in cases when you cannot use the standard `DDPStrategy` such as
 - Jupyter Notebooks, Google Colab, Kaggle
 - Calling `fit` or `test` multiple times in the same script
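To make the second bullet above concrete, here is a minimal sketch (an annotation, not part of the diff) of calling `fit` and then `test` in one interpreter session; `MNISTClassifier` is the same placeholder model used in the README examples:

```python
import pytorch_lightning as pl
from ray_lightning import RayStrategy

# Create your PyTorch Lightning model here (as in the README example).
ptl_model = MNISTClassifier(...)

strategy = RayStrategy(num_workers=4, num_cpus_per_worker=1, use_gpu=False)
trainer = pl.Trainer(max_epochs=1, strategy=strategy)

# Both calls run in the same process (for example, one notebook session),
# because Ray launches and manages the worker processes instead of
# re-invoking the training script.
trainer.fit(ptl_model)
trainer.test(ptl_model)
```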

@@ -94,40 +95,40 @@ Now you can run your training script on the laptop, but have it execute as if yo

 **Note:** When using with Ray Client, you must disable checkpointing and logging for your Trainer by setting `checkpoint_callback` and `logger` to `False`.

-## Horovod Plugin on Ray
-Or if you prefer to use Horovod as the distributed training protocol, use the `HorovodRayPlugin` instead.
+## Horovod Strategy on Ray
+Or if you prefer to use Horovod as the distributed training protocol, use the `HorovodRayStrategy` instead.

 ```python
 import pytorch_lightning as pl
-from ray_lightning import HorovodRayPlugin
+from ray_lightning import HorovodRayStrategy

 # Create your PyTorch Lightning model here.
 ptl_model = MNISTClassifier(...)

 # 2 workers, 1 CPU and 1 GPU each.
-plugin = HorovodRayPlugin(num_workers=2, use_gpu=True)
+strategy = HorovodRayStrategy(num_workers=2, use_gpu=True)

 # Don't set ``gpus`` in the ``Trainer``.
 # The actual number of GPUs is determined by ``num_workers``.
-trainer = pl.Trainer(..., plugins=[plugin])
+trainer = pl.Trainer(..., strategy=strategy)
 trainer.fit(ptl_model)
 ```

 ## Model Parallel Sharded Training on Ray
-The `RayShardedPlugin` integrates with [FairScale](https://github.com/facebookresearch/fairscale) to provide sharded DDP training on a Ray cluster.
+The `RayShardedStrategy` integrates with [FairScale](https://github.com/facebookresearch/fairscale) to provide sharded DDP training on a Ray cluster.
 With sharded training, leverage the scalability of data parallel training while drastically reducing memory usage when training large models.

 ```python
 import pytorch_lightning as pl
-from ray_lightning import RayShardedPlugin
+from ray_lightning import RayShardedStrategy

 # Create your PyTorch Lightning model here.
 ptl_model = MNISTClassifier(...)
-plugin = RayShardedPlugin(num_workers=4, num_cpus_per_worker=1, use_gpu=True)
+strategy = RayShardedStrategy(num_workers=4, num_cpus_per_worker=1, use_gpu=True)

 # Don't set ``gpus`` in the ``Trainer``.
 # The actual number of GPUs is determined by ``num_workers``.
-trainer = pl.Trainer(..., plugins=[plugin])
+trainer = pl.Trainer(..., strategy=strategy)
 trainer.fit(ptl_model)
 ```
 See the [Pytorch Lightning docs](https://pytorch-lightning.readthedocs.io/en/stable/advanced/model_parallel.html#sharded-training) for more information on sharded training.
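As a sketch of the Ray Client note above (an annotation, not part of the diff; the cluster address is a placeholder and `MNISTClassifier` follows the README's pseudocode), training through Ray Client with checkpointing and logging disabled might look like this:

```python
import pytorch_lightning as pl
import ray

from ray_lightning import RayStrategy

# Placeholder address; point this at your own Ray cluster's Client endpoint.
ray.init("ray://<head_node_host>:10001")

# Create your PyTorch Lightning model here.
ptl_model = MNISTClassifier(...)
strategy = RayStrategy(num_workers=4, use_gpu=False)

# With Ray Client, checkpointing and logging must be disabled on the Trainer.
trainer = pl.Trainer(
    max_epochs=4,
    strategy=strategy,
    checkpoint_callback=False,
    logger=False,
)
trainer.fit(ptl_model)
```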
@@ -140,7 +141,7 @@ Example using `ray_lightning` with Tune:
 ```python
 from ray import tune

-from ray_lightning import RayPlugin
+from ray_lightning import RayStrategy
 from ray_lightning.examples.ray_ddp_example import MNISTClassifier
 from ray_lightning.tune import TuneReportCallback, get_tune_resources


@@ -159,7 +160,7 @@ def train_mnist(config):
     trainer = pl.Trainer(
         max_epochs=4,
         callbacks=callbacks,
-        plugins=[RayPlugin(num_workers=4, use_gpu=False)])
+        strategy=RayStrategy(num_workers=4, use_gpu=False))
     trainer.fit(model)

 config = {
@@ -184,26 +185,29 @@ print("Best hyperparameters found were: ", analysis.best_config)

 **Note:** Ray Tune requires 1 additional CPU per trial to use for the Trainable driver. So the actual number of resources each trial requires is `num_workers * num_cpus_per_worker + 1`.

 ## FAQ
-> I see that `RayPlugin` is based off of Pytorch Lightning's `DDPSpawnPlugin`. However, doesn't the PTL team discourage the use of spawn?
+> I see that `RayStrategy` is based off of PyTorch Lightning's `DDPSpawnStrategy`. However, doesn't the PTL team discourage the use of spawn?

 As discussed [here](https://github.com/pytorch/pytorch/issues/51688#issuecomment-773539003), using a spawn approach instead of launch is not all that detrimental. The original factors for discouraging spawn were:
 1. not being able to use 'spawn' in a Jupyter or Colab notebook, and
 2. not being able to use multiple workers for data loading.

-Neither of these should be an issue with the `RayPlugin` due to Ray's serialization mechanisms. The only thing to keep in mind is that when using this plugin, your model does have to be serializable/pickleable.
+Neither of these should be an issue with the `RayStrategy` due to Ray's serialization mechanisms. The only thing to keep in mind is that when using this strategy, your model does have to be serializable/pickleable.
+
+> Horovod installation issue
+Please see the [details](./docs/horovod_faq.md).

 <!--$UNCOMMENT## API Reference

 ```{eval-rst}
-.. autoclass:: ray_lightning.RayPlugin
+.. autoclass:: ray_lightning.RayStrategy
 ```

 ```{eval-rst}
-.. autoclass:: ray_lightning.HorovodRayPlugin
+.. autoclass:: ray_lightning.HorovodRayStrategy
 ```

 ```{eval-rst}
-.. autoclass:: ray_lightning.RayShardedPlugin
+.. autoclass:: ray_lightning.RayShardedStrategy
 ```
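To illustrate the resource note above (a sketch, not part of the diff; it assumes `get_tune_resources` accepts the same `num_workers`, `num_cpus_per_worker`, and `use_gpu` arguments as `RayStrategy`, and it reuses `train_mnist` and `config` from the Tune example):

```python
from ray import tune

from ray_lightning.tune import get_tune_resources

# 4 workers x 1 CPU each, plus 1 CPU for the Trainable driver = 5 CPUs per trial.
analysis = tune.run(
    train_mnist,
    config=config,
    num_samples=4,
    resources_per_trial=get_tune_resources(
        num_workers=4, num_cpus_per_worker=1, use_gpu=False))
print("Best hyperparameters found were: ", analysis.best_config)
```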

docs/horovod_faq.md

Lines changed: 49 additions & 0 deletions
@@ -0,0 +1,49 @@
+# Horovod installation issue
+
+> ```
+> Extension horovod.torch has not been built: /home/ubuntu/anaconda3/envs/tensorflow2_p38/lib/python3.8/site-packages/horovod/torch/mpi_lib/_mpi_lib.cpython-38-x86_64-linux-gnu.so not found
+> If this is not expected, reinstall Horovod with HOROVOD_WITH_PYTORCH=1 to debug the build error.
+> Warning! MPI libs are missing, but python applications are still avaiable.
+> ```
+
+One might fix this issue by:
+```bash
+$ pip uninstall -y horovod
+$ conda install gcc_linux-64 gxx_linux-64
+$ [flags] pip install --no-cache-dir horovod
+```
+
+These steps are adapted from [here](https://github.com/horovod/horovod/issues/656), [here](https://github.com/tlkh/ai-lab/issues/27) and [here](https://horovod.readthedocs.io/en/stable/install_include.html).
+
+- Install Horovod from scratch with PyTorch:
+
+```bash
+conda create -n hd python=3.8 scipy numpy pandas -y
+conda activate hd
+conda install pytorch=1.11 torchvision torchaudio cudatoolkit=11.3 -c pytorch -y
+sudo rm -rf /usr/local/cuda
+sudo ln -s /usr/local/cuda-11.3 /usr/local/cuda
+conda install gxx_linux-64 -y
+conda install cxx-compiler=1.0 -y
+export TORCH_CUDA_ARCH_LIST="3.7;5.0;6.0;7.0;7.5;8.0"
+echo $TORCH_CUDA_ARCH_LIST
+sudo apt-get purge -y cmake
+wget -q https://github.com/Kitware/CMake/releases/download/v3.20.2/cmake-3.20.2.tar.gz
+tar -zxvf cmake-3.20.2.tar.gz
+cd cmake-3.20.2
+./bootstrap -- -DCMAKE_USE_OPENSSL=OFF
+make -j10
+sudo make install
+cmake --version
+export CUDA_TOOLKIT_ROOT_DIR=/usr/local/cuda
+export HOROVOD_NCCL_HOME=/usr/local/cuda/
+export HOROVOD_NCCL_INCLUDE=/usr/local/cuda/include
+export TORCH_CUDA_ARCH_LIST=${TORCH_CUDA_ARCH_LIST//";8.0"/}
+export HOROVOD_BUILD_CUDA_CC_LIST=${TORCH_CUDA_ARCH_LIST//";"/","}
+export HOROVOD_BUILD_CUDA_CC_LIST=${HOROVOD_BUILD_CUDA_CC_LIST//"."/""}
+export PATH=/usr/local/cuda/bin/:$PATH
+export HOROVOD_NCCL_LIB=/usr/local/cuda/lib/
+HOROVOD_NCCL_HOME=/usr/local/cuda HOROVOD_GPU_OPERATIONS=NCCL HOROVOD_WITH_PYTORCH=1 HOROVOD_WITHOUT_TENSORFLOW=1 HOROVOD_WITHOUT_MXNET=1 HOROVOD_WITHOUT_GLOO=1 pip install --no-cache-dir horovod
+```
+
+References: [1](https://stackoverflow.com/questions/54948216/usr-lib-x86-64-linux-gnu-libstdc-so-6-version-glibcxx-3-4-21-not-found-req), [2](https://github.com/horovod/horovod/issues/401), [3](https://github.com/Lightning-AI/lightning/issues/4472), [4](https://github.com/horovod/horovod/issues/2276), [5](https://github.com/Lightning-AI/lightning/blob/master/dockers/base-cuda/Dockerfile#L105-L121), [6](https://horovod.readthedocs.io/en/stable/gpus_include.html), [7](https://horovod.readthedocs.io/en/stable/conda_include.html), [8](https://github.com/horovod/horovod/issues/3545), [9](https://github.com/KAUST-CTL/horovod-gpu-data-science-project), [10](https://kose-y.github.io/blog/2017/12/installing-cuda-aware-mpi/)
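After rebuilding, a quick way to confirm that the Horovod PyTorch extension actually loads is a minimal import check (a sketch, not part of the committed file). Running it as `horovodrun -np 2 python horovod_check.py` should print one line per worker:

```python
# horovod_check.py: if the extension was built correctly, this imports and
# initializes without the "Extension horovod.torch has not been built" error.
import horovod.torch as hvd

hvd.init()
print(f"Horovod OK: rank {hvd.rank()} of {hvd.size()}")
```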

ray_lightning/__init__.py

Lines changed: 4 additions & 4 deletions
@@ -1,5 +1,5 @@
-from ray_lightning.ray_ddp import RayPlugin
-from ray_lightning.ray_horovod import HorovodRayPlugin
-from ray_lightning.ray_ddp_sharded import RayShardedPlugin
+from ray_lightning.ray_ddp import RayStrategy
+from ray_lightning.ray_horovod import HorovodRayStrategy
+from ray_lightning.ray_ddp_sharded import RayShardedStrategy

-__all__ = ["RayPlugin", "HorovodRayPlugin", "RayShardedPlugin"]
+__all__ = ["RayStrategy", "HorovodRayStrategy", "RayShardedStrategy"]
ray_lightning/accelerators/__init__.py

Lines changed: 21 additions & 0 deletions
@@ -0,0 +1,21 @@
+# Copyright The PyTorch Lightning team.
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+from pytorch_lightning.accelerators.registry import \
+    call_register_accelerators  # noqa: F401
+from ray_lightning.accelerators.delayed_gpu_accelerator import _GPUAccelerator
+
+# These lines register the delayed GPU accelerator as `_gpu`.
+ACCELERATORS_BASE_MODULE = "ray_lightning.accelerators"
+call_register_accelerators(ACCELERATORS_BASE_MODULE)
+
+__all__ = ["_GPUAccelerator"]
ray_lightning/accelerators/delayed_gpu_accelerator.py

Lines changed: 60 additions & 0 deletions
@@ -0,0 +1,60 @@
+# Copyright The PyTorch Lightning team.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+from typing import Dict, List
+
+import torch
+
+from pytorch_lightning.accelerators import Accelerator, \
+    GPUAccelerator
+
+
+class _GPUAccelerator(GPUAccelerator):
+    """Accelerator for GPU devices.
+
+    Adapted from:
+    https://github.com/Lightning-AI/lightning/blob/master/src/pytorch_lightning/accelerators/gpu.py#L43
+    but removes `torch.cuda.set_device(root_device)` in `setup_environment`.
+    """
+
+    def setup_environment(self, root_device: torch.device) -> None:
+        """
+        Modified: remove `torch.cuda.set_device(root_device)` here and call
+        `torch.cuda.set_device(self.device)` at a later time inside the
+        `ray_launcher` or `horovod_launcher`.
+        """
+        Accelerator.setup_environment(self, root_device)
+
+    @staticmethod
+    def get_parallel_devices(devices: List[int]) -> List[torch.device]:
+        """Gets parallel devices for the Accelerator."""
+        # Modified: return None when no devices are available.
+        if devices:
+            return [torch.device("cuda", i) for i in devices]
+        else:
+            return None
+
+    @staticmethod
+    def is_available() -> bool:
+        # Modified to always return True.
+        return True
+
+    @classmethod
+    def register_accelerators(cls, accelerator_registry: Dict) -> None:
+        # The delayed GPU accelerator is registered as `_gpu`
+        # in the accelerator registry.
+        accelerator_registry.register(
+            "_gpu",
+            cls,
+            description=f"{cls.__class__.__name__}",
+        )
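A short sketch of how the overridden methods above behave (an annotation, not part of the diff; it only exercises code shown in this file and in the package `__init__`):

```python
from ray_lightning.accelerators import _GPUAccelerator

# The delayed accelerator always reports availability, even on a CPU-only
# driver; the actual torch.cuda.set_device call happens later in the launcher.
assert _GPUAccelerator.is_available()

# Explicit device indices are turned into torch.device objects as usual...
print(_GPUAccelerator.get_parallel_devices([0, 1]))
# [device(type='cuda', index=0), device(type='cuda', index=1)]

# ...while an empty device list yields None instead of an empty list.
print(_GPUAccelerator.get_parallel_devices([]))  # None
```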

ray_lightning/examples/ray_ddp_example.py

Lines changed: 2 additions & 2 deletions
@@ -11,7 +11,7 @@
 import ray
 from ray import tune
 from ray_lightning.tune import TuneReportCallback, get_tune_resources
-from ray_lightning import RayPlugin
+from ray_lightning import RayStrategy
 from ray_lightning.tests.utils import LightningMNISTClassifier


@@ -73,7 +73,7 @@ def train_mnist(config,
     trainer = pl.Trainer(
         max_epochs=num_epochs,
         callbacks=callbacks,
-        plugins=[RayPlugin(num_workers=num_workers, use_gpu=use_gpu)],
+        strategy=RayStrategy(num_workers=num_workers, use_gpu=use_gpu),
         **trainer_kwargs)
     trainer.fit(model)

ray_lightning/examples/ray_ddp_sharded_example.py

Lines changed: 3 additions & 3 deletions
@@ -10,7 +10,7 @@
 import pytorch_lightning as pl
 from pytorch_lightning import Callback

-from ray_lightning import RayShardedPlugin
+from ray_lightning import RayShardedStrategy


 class CUDACallback(Callback):
@@ -53,7 +53,7 @@ def download_data():
     with FileLock(os.path.join(data_dir, ".lock")):
         MNISTDataModule(data_dir=data_dir).prepare_data()

-    plugin = RayShardedPlugin(
+    strategy = RayShardedStrategy(
         num_workers=num_workers, use_gpu=use_gpu, init_hook=download_data)

     dm = MNISTDataModule(data_dir, batch_size=batch_size)
@@ -65,7 +65,7 @@ def download_data():
         max_epochs=max_epochs,
         precision=16 if use_gpu else 32,
         callbacks=[CUDACallback()] if use_gpu else [],
-        plugins=plugin,
+        strategy=strategy,
         max_steps=max_steps)

     trainer.fit(model, dm)

ray_lightning/examples/ray_ddp_tune.py

Lines changed: 3 additions & 7 deletions
@@ -8,7 +8,7 @@
 import ray
 from ray import tune
 from ray_lightning.tune import TuneReportCallback, get_tune_resources
-from ray_lightning import RayPlugin
+from ray_lightning import RayStrategy
 from ray_lightning.tests.utils import LightningMNISTClassifier


@@ -32,12 +32,8 @@ def download_data():
         max_epochs=num_epochs,
         callbacks=callbacks,
         progress_bar_refresh_rate=0,
-        plugins=[
-            RayPlugin(
-                num_workers=num_workers,
-                use_gpu=use_gpu,
-                init_hook=download_data)
-        ])
+        strategy=RayStrategy(
+            num_workers=num_workers, use_gpu=use_gpu, init_hook=download_data))
     dm = MNISTDataModule(
         data_dir=data_dir, num_workers=1, batch_size=config["batch_size"])
     trainer.fit(model, dm)
