
Commit afc7fb4

Update README to render on Ray docs (#135)

Adjust the readme file to take advantage of ray-project/ray#23505

1 parent 3adb809


README.md

Lines changed: 53 additions & 14 deletions
@@ -1,12 +1,15 @@
+<!--$UNCOMMENT(ray-lightning)=-->
+
 # Distributed PyTorch Lightning Training on Ray
 This library adds new PyTorch Lightning plugins for distributed training using the Ray distributed computing framework.
 
 These PyTorch Lightning Plugins on Ray enable quick and easy parallel training while still leveraging all the benefits of PyTorch Lightning and using your desired training protocol, either [PyTorch Distributed Data Parallel](https://pytorch.org/tutorials/intermediate/ddp_tutorial.html) or [Horovod](https://github.com/horovod/horovod).
 
 Once you add your plugin to the PyTorch Lightning Trainer, you can parallelize training to all the cores in your laptop, or across a massive multi-node, multi-GPU cluster with no additional code changes.
 
-This library also comes with an integration with [Ray Tune](tune.io) for distributed hyperparameter tuning experiments.
+This library also comes with an integration with <!--$UNCOMMENT{ref}`Ray Tune <tune-main>`--><!--$REMOVE-->[Ray Tune](https://tune.io)<!--$END_REMOVE--> for distributed hyperparameter tuning experiments.
 
+<!--$REMOVE-->
 # Table of Contents
 1. [Installation](#installation)
 2. [PyTorch Lightning Compatibility](#pytorch-lightning-compatibility)
@@ -17,6 +20,7 @@ This library also comes with an integration with [Ray Tune](tune.io) for distrib
 6. [Model Parallel Sharded Training on Ray](#model-parallel-sharded-training-on-ray)
 7. [Hyperparameter Tuning with Ray Tune](#hyperparameter-tuning-with-ray-tune)
 8. [FAQ](#faq)
+<!--$END_REMOVE-->
 
 
 ## Installation
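
For orientation, here is a minimal sketch of the pattern the introduction describes: a regular PyTorch Lightning `Trainer` with the Ray plugin passed in. The toy `BoringModel`, the synthetic data, and the worker counts are illustrative assumptions rather than code from this README; `RayPlugin` and its `num_workers`/`num_cpus_per_worker` arguments do appear in the Tune example and resource note further down.

```python
# Minimal sketch (assumed setup): a regular PyTorch Lightning Trainer with a
# Ray plugin handed to it. The toy model, data, and worker counts are
# illustrative assumptions.
import torch
import pytorch_lightning as pl
from torch.utils.data import DataLoader, TensorDataset

from ray_lightning import RayPlugin


class BoringModel(pl.LightningModule):
    """Tiny stand-in for a real LightningModule."""

    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 1)

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return torch.nn.functional.mse_loss(self(x), y)

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.01)


if __name__ == "__main__":
    train_loader = DataLoader(
        TensorDataset(torch.randn(256, 32), torch.randn(256, 1)), batch_size=32)

    # Four CPU workers; the Trainer API itself is unchanged.
    plugin = RayPlugin(num_workers=4, num_cpus_per_worker=1, use_gpu=False)
    trainer = pl.Trainer(max_epochs=1, plugins=[plugin])
    trainer.fit(BoringModel(), train_loader)
```
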
@@ -62,18 +66,27 @@ Because Ray is used to launch processes, instead of the same script being called
 - Calling `fit` or `test` multiple times in the same script
 
 ## Multi-node Distributed Training
-Using the same examples above, you can run distributed training on a multi-node cluster with just 2 simple steps.
-1) [Use Ray's cluster launcher](https://docs.ray.io/en/master/cluster/launcher.html) to start a Ray cluster- `ray up my_cluster_config.yaml`.
-2) [Execute your Python script on the Ray cluster](https://docs.ray.io/en/master/cluster/commands.html#running-ray-scripts-on-the-cluster-ray-submit)- `ray submit my_cluster_config.yaml train.py`. This will `rsync` your training script to the head node, and execute it on the Ray cluster.
+Using the same examples above, you can run distributed training on a multi-node cluster with just a couple of simple steps.
+
+First, use Ray's <!--$UNCOMMENT{ref}`Cluster launcher <ref-cluster-quick-start>`--><!--$REMOVE-->[Cluster launcher](https://docs.ray.io/en/latest/cluster/quickstart.html)<!--$END_REMOVE--> to start a Ray cluster:
+
+.. code-block:: bash
 
-You no longer have to set environment variables or configurations and run your training script on every single node.
+    ray up my_cluster_config.yaml
+
+Then, run your Ray script using one of the following options:
+
+1. on the head node of the cluster (``python train_script.py``)
+2. via ``ray job submit`` (<!--$UNCOMMENT{ref}`docs <jobs-overview>`--><!--$REMOVE-->[docs](https://docs.ray.io/en/latest/cluster/job-submission.html)<!--$END_REMOVE-->) from your laptop (``ray job submit -- python train.py``)
 
 ## Multi-node Training from your Laptop
-Ray provides capabilities to run multi-node and GPU training all from your laptop through [Ray Client](https://docs.ray.io/en/master/cluster/ray-client.html)
+Ray provides capabilities to run multi-node and GPU training all from your laptop through
+<!--$UNCOMMENT{ref}`Ray Client <ray-client>`--><!--$REMOVE-->[Ray Client](https://docs.ray.io/en/master/cluster/ray-client.html)<!--$END_REMOVE-->.
 
-You can follow the instructions [here](https://docs.ray.io/en/master/cluster/ray-client.html) to setup the cluster.
+Use Ray's <!--$UNCOMMENT{ref}`Cluster launcher <ref-cluster-quick-start>`--><!--$REMOVE-->[Cluster launcher](https://docs.ray.io/en/latest/cluster/quickstart.html)<!--$END_REMOVE--> to set up the cluster.
 Then, add this line to the beginning of your script to connect to the cluster:
 ```python
+import ray
 # replace with the appropriate host and port
 ray.init("ray://<head_node_host>:10001")
 ```
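
The snippet above shows only the connection call. A short sketch of where it sits relative to the Trainer follows, reusing the illustrative worker count from the earlier sketch; the head-node address remains a placeholder exactly as in the snippet.

```python
# Sketch: a laptop-driven script connects via Ray Client first, then builds
# the Trainer with the Ray plugin as usual. Worker count and use_gpu value
# are assumptions for illustration.
import ray
import pytorch_lightning as pl
from ray_lightning import RayPlugin

# replace with the appropriate host and port
ray.init("ray://<head_node_host>:10001")

# The Trainer itself is unchanged; workers are scheduled on the remote cluster.
trainer = pl.Trainer(
    max_epochs=1,
    plugins=[RayPlugin(num_workers=4, use_gpu=False)],
)
# trainer.fit(model, train_loader)  # model/dataloader defined as in the earlier sketch
```
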
@@ -128,8 +141,12 @@ Example using `ray_lightning` with Tune:
 from ray import tune
 
 from ray_lightning import RayPlugin
+from ray_lightning.examples.ray_ddp_example import MNISTClassifier
 from ray_lightning.tune import TuneReportCallback, get_tune_resources
 
+import pytorch_lightning as pl
+
+
 def train_mnist(config):
 
     # Create your PTL model.
@@ -158,7 +175,7 @@ analysis = tune.run(
     metric="loss",
     mode="min",
     config=config,
-    num_samples=num_samples,
+    num_samples=2,
     resources_per_trial=get_tune_resources(num_workers=4),
     name="tune_mnist")
 
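The two hunks above show only fragments of the Tune example. A compact sketch of how those fragments typically fit together follows; the search space, metric name, epoch budget, and the `MNISTClassifier` constructor arguments are assumptions for illustration rather than the exact code from the README.

```python
# Sketch (assumed details) of the full Tune + ray_lightning flow that the
# fragments above come from.
from ray import tune

import pytorch_lightning as pl
from ray_lightning import RayPlugin
from ray_lightning.examples.ray_ddp_example import MNISTClassifier
from ray_lightning.tune import TuneReportCallback, get_tune_resources


def train_mnist(config):
    # Create the PTL model; constructor arguments are assumed here.
    model = MNISTClassifier(config)
    trainer = pl.Trainer(
        max_epochs=4,  # assumed epoch budget
        # metric name assumed to match what the model logs at validation time
        callbacks=[TuneReportCallback(["loss"], on="validation_end")],
        plugins=[RayPlugin(num_workers=4)])
    trainer.fit(model)


# Assumed search space; the real example's config may differ.
config = {
    "layer_1": tune.choice([32, 64, 128]),
    "lr": tune.loguniform(1e-4, 1e-1),
    "batch_size": tune.choice([32, 64]),
}

analysis = tune.run(
    train_mnist,
    metric="loss",
    mode="min",
    config=config,
    num_samples=2,
    resources_per_trial=get_tune_resources(num_workers=4),
    name="tune_mnist")

print("Best hyperparameters found were: ", analysis.best_config)
```
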

@@ -167,16 +184,38 @@ print("Best hyperparameters found were: ", analysis.best_config)
 **Note:** Ray Tune requires 1 additional CPU per trial to use for the Trainable driver. So the actual number of resources each trial requires is `num_workers * num_cpus_per_worker + 1`.
 
 ## FAQ
-> RaySGD already has a [Pytorch Lightning integration](https://docs.ray.io/en/master/raysgd/raysgd_ptl.html). What's the difference between this integration and that?
-
-The key difference is which Trainer you'll be interacting with. In this library, you will still be using Pytorch Lightning's `Trainer`. You'll be able to leverage all the features of Pytorch Lightning, and Ray is used just as a backend to handle distributed training.
-
-With RaySGD's integration, you'll be converting your `LightningModule` to be RaySGD compatible, and will be interacting with RaySGD's `TorchTrainer`. RaySGD's `TorchTrainer` is not as feature rich nor as easy to use as Pytorch Lightning's `Trainer` (no built in support for logging, early stopping, etc.). However, it does have built in support for fault-tolerant and elastic training. If these are hard requirements for you, then RaySGD's integration with PTL might be a better option.
-
 > I see that `RayPlugin` is based off of Pytorch Lightning's `DDPSpawnPlugin`. However, doesn't the PTL team discourage the use of spawn?
 
 As discussed [here](https://github.com/pytorch/pytorch/issues/51688#issuecomment-773539003), using a spawn approach instead of launch is not all that detrimental. The original factors for discouraging spawn were:
 1. not being able to use 'spawn' in a Jupyter or Colab notebook, and
 2. not being able to use multiple workers for data loading.
 
 Neither of these should be an issue with the `RayPlugin` due to Ray's serialization mechanisms. The only thing to keep in mind is that when using this plugin, your model does have to be serializable/pickleable.
+
+<!--$UNCOMMENT## API Reference
+
+```{eval-rst}
+.. autoclass:: ray_lightning.RayPlugin
+```
+
+```{eval-rst}
+.. autoclass:: ray_lightning.HorovodRayPlugin
+```
+
+```{eval-rst}
+.. autoclass:: ray_lightning.RayShardedPlugin
+```
+
+
+### Tune Integration
+```{eval-rst}
+.. autoclass:: ray_lightning.tune.TuneReportCallback
+```
+
+```{eval-rst}
+.. autoclass:: ray_lightning.tune.TuneReportCheckpointCallback
+```
+
+```{eval-rst}
+.. autofunction:: ray_lightning.tune.get_tune_resources
+```-->
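
As a quick worked example of the resource note in the hunk above, using the worker count from the Tune example and assuming one CPU per worker:

```python
# Per-trial CPU count from the note above: num_workers * num_cpus_per_worker + 1.
num_workers = 4          # workers per trial, as in the Tune example
num_cpus_per_worker = 1  # assumed default
cpus_per_trial = num_workers * num_cpus_per_worker + 1  # +1 CPU for the Trainable driver
print(cpus_per_trial)  # 5
```
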
