<!--$UNCOMMENT(ray-lightning)=-->
# Distributed PyTorch Lightning Training on Ray
This library adds new PyTorch Lightning plugins for distributed training using the Ray distributed computing framework.
These PyTorch Lightning Plugins on Ray enable quick and easy parallel training while still leveraging all the benefits of PyTorch Lightning and using your desired training protocol, either [PyTorch Distributed Data Parallel](https://pytorch.org/tutorials/intermediate/ddp_tutorial.html) or [Horovod](https://github.com/horovod/horovod).
Once you add your plugin to the PyTorch Lightning Trainer, you can parallelize training to all the cores in your laptop, or across a massive multi-node, multi-GPU cluster with no additional code changes.
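
For example, here is a minimal sketch of what this looks like with the DDP-style plugin. The toy model, data, and worker counts below are placeholders for illustration, not part of the library:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl

from ray_lightning import RayPlugin


class ToyModel(pl.LightningModule):
    """Placeholder LightningModule; substitute your own model."""

    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(8, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return torch.nn.functional.mse_loss(self.layer(x), y)

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.01)

    def train_dataloader(self):
        return DataLoader(TensorDataset(torch.randn(64, 8), torch.randn(64, 1)), batch_size=16)


# The plugin launches training workers as Ray actors; the values here are illustrative.
plugin = RayPlugin(num_workers=4, num_cpus_per_worker=1, use_gpu=False)

# The standard PyTorch Lightning Trainer is used; only the plugin is new.
trainer = pl.Trainer(max_epochs=2, plugins=[plugin])
trainer.fit(ToyModel())
```

The Horovod and sharded variants follow the same pattern with their own plugin classes (`HorovodRayPlugin` and `RayShardedPlugin`).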
This library also comes with an integration with <!--$UNCOMMENT{ref}`Ray Tune <tune-main>`--><!--$REMOVE-->[Ray Tune](https://tune.io)<!--$END_REMOVE--> for distributed hyperparameter tuning experiments.

6. [Model Parallel Sharded Training on Ray](#model-parallel-sharded-training-on-ray)
7. [Hyperparameter Tuning with Ray Tune](#hyperparameter-tuning-with-ray-tune)
8. [FAQ](#faq)
<!--$END_REMOVE-->
## Installation
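
A typical installation is from PyPI (assuming the published package name `ray_lightning`):

```bash
pip install ray_lightning
```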
Because Ray is used to launch processes, instead of the same script being called multiple times, you can also do things like:

- Calling `fit` or `test` multiple times in the same script
## Multi-node Distributed Training
Using the same examples above, you can run distributed training on a multi-node cluster with just a couple of simple steps.

First, use Ray's <!--$UNCOMMENT{ref}`Cluster launcher <ref-cluster-quick-start>`--><!--$REMOVE-->[Cluster launcher](https://docs.ray.io/en/latest/cluster/quickstart.html)<!--$END_REMOVE--> to start a Ray cluster:

```bash
ray up my_cluster_config.yaml
```

Then, run your Ray script using one of the following options:
1. on the head node of the cluster (``python train_script.py``)
2. via ``ray job submit`` (<!--$UNCOMMENT{ref}`docs <jobs-overview>`--><!--$REMOVE-->[docs](https://docs.ray.io/en/latest/cluster/job-submission.html)<!--$END_REMOVE-->) from your laptop (``ray job submit -- python train.py``)
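
Putting the steps together, the end-to-end flow looks roughly like this; the `--working-dir` flag, dashboard port, and `train.py` name are illustrative, so check the Ray Jobs docs linked above for the full set of options:

```bash
# Start the Ray cluster described by your cluster config.
ray up my_cluster_config.yaml

# Point job submission at the cluster's dashboard (or pass --address directly).
export RAY_ADDRESS="http://<head_node_host>:8265"

# Upload the current directory and run the training script on the cluster.
ray job submit --working-dir . -- python train.py
```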
## Multi-node Training from your Laptop
Ray provides capabilities to run multi-node and GPU training all from your laptop through [Ray Client](https://docs.ray.io/en/master/cluster/ray-client.html).

You can follow the instructions [here](https://docs.ray.io/en/master/cluster/ray-client.html) and use Ray's <!--$UNCOMMENT{ref}`Cluster launcher <ref-cluster-quick-start>`--><!--$REMOVE-->[Cluster launcher](https://docs.ray.io/en/latest/cluster/quickstart.html)<!--$END_REMOVE--> to set up the cluster.

Then, add this line to the beginning of your script to connect to the cluster:
```python
import ray
# replace with the appropriate host and port
ray.init("ray://<head_node_host>:10001")
```
Example using `ray_lightning` with Tune:

```python
from ray import tune

from ray_lightning import RayPlugin
from ray_lightning.examples.ray_ddp_example import MNISTClassifier
from ray_lightning.tune import TuneReportCallback, get_tune_resources
```
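
The body of the example is not reproduced here; the sketch below shows how these pieces typically fit together. The config keys, metric names, and resource values are illustrative, and the exact `MNISTClassifier` constructor may differ from what is shown:

```python
import pytorch_lightning as pl
from ray import tune

from ray_lightning import RayPlugin
from ray_lightning.examples.ray_ddp_example import MNISTClassifier
from ray_lightning.tune import TuneReportCallback, get_tune_resources

num_workers = 4  # Ray training workers per trial (illustrative)


def train_mnist(config):
    # Report the validation loss back to Tune at the end of each validation epoch.
    callbacks = [TuneReportCallback({"loss": "ptl/val_loss"}, on="validation_end")]
    trainer = pl.Trainer(
        max_epochs=4,
        callbacks=callbacks,
        plugins=[RayPlugin(num_workers=num_workers, use_gpu=False)],
    )
    trainer.fit(MNISTClassifier(config))


analysis = tune.run(
    train_mnist,
    config={"lr": tune.loguniform(1e-4, 1e-1), "batch_size": tune.choice([32, 64, 128])},
    num_samples=4,
    metric="loss",
    mode="min",
    # Each trial reserves num_workers * num_cpus_per_worker + 1 CPUs (see the note below).
    resources_per_trial=get_tune_resources(num_workers=num_workers),
)

print("Best hyperparameters found were: ", analysis.best_config)
```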
**Note:** Ray Tune requires 1 additional CPU per trial to use for the Trainable driver. So the actual number of resources each trial requires is `num_workers * num_cpus_per_worker + 1`.
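For example, a trial configured with `num_workers=4` and `num_cpus_per_worker=1` reserves `4 * 1 + 1 = 5` CPUs.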
## FAQ
> RaySGD already has a [PyTorch Lightning integration](https://docs.ray.io/en/master/raysgd/raysgd_ptl.html). What's the difference between this integration and that?

The key difference is which `Trainer` you'll be interacting with. In this library, you will still be using PyTorch Lightning's `Trainer`. You'll be able to leverage all the features of PyTorch Lightning, and Ray is used just as the backend to handle distributed training.

With RaySGD's integration, you'll be converting your `LightningModule` to be RaySGD-compatible, and will be interacting with RaySGD's `TorchTrainer`. RaySGD's `TorchTrainer` is not as feature-rich nor as easy to use as PyTorch Lightning's `Trainer` (no built-in support for logging, early stopping, etc.). However, it does have built-in support for fault-tolerant and elastic training. If these are hard requirements for you, then RaySGD's integration with PTL might be a better option.

> I see that `RayPlugin` is based on PyTorch Lightning's `DDPSpawnPlugin`. However, doesn't the PTL team discourage the use of spawn?

As discussed [here](https://github.com/pytorch/pytorch/issues/51688#issuecomment-773539003), using a spawn approach instead of launch is not all that detrimental. The original factors for discouraging spawn were:
1. not being able to use 'spawn' in a Jupyter or Colab notebook, and
2. not being able to use multiple workers for data loading.
Neither of these should be an issue with the `RayPlugin` due to Ray's serialization mechanisms. The only thing to keep in mind is that when using this plugin, your model does have to be serializable/pickleable.