Commit 3290ec3

aivanou authored and facebook-github-bot committed
Address train and dist docs comments (#292)
Summary:
Pull Request resolved: #292

Address train and dist docs comments

Reviewed By: kiukchung

Differential Revision: D31775323

fbshipit-source-id: aff35aee04d9675fd5ab81afc81e2d52c032b134
1 parent e94b27e commit 3290ec3

File tree: 3 files changed, +93 −106 lines changed


torchx/components/dist.py

Lines changed: 87 additions & 67 deletions
@@ -5,12 +5,12 @@
 # LICENSE file in the root directory of this source tree.
 
 """
-Torchx helps you to run your distributed trainer jobs. Check out :py:mod:`torchx.components.train`
+TorchX helps you to run your distributed trainer jobs. Check out :py:mod:`torchx.components.train`
 for an example of running a single trainer job. Here we will be using
 the same :ref:`examples_apps/lightning_classy_vision/train:Trainer App Example`,
 but will run it in a distributed manner.
 
-Torchx uses `Torch distributed run <https://pytorch.org/docs/stable/elastic/run.html>`_ to launch user processes
+TorchX uses `Torch distributed run <https://pytorch.org/docs/stable/elastic/run.html>`_ to launch user processes
 and expects that user applications will be written in a
 `Distributed data parallel <https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html>`_
 manner.
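
As context for the docstring above: "written in a Distributed data parallel manner" amounts to joining the
process group and wrapping the model in DDP. A minimal sketch (illustrative only, not code from this commit),
assuming a launch via ``torch.distributed.run``, which sets the ``RANK``/``WORLD_SIZE`` environment variables:

.. code:: python

   # Illustrative sketch: the shape of a DDP-style user application.
   import torch
   import torch.distributed as dist
   from torch.nn.parallel import DistributedDataParallel as DDP

   dist.init_process_group(backend="gloo")  # use "nccl" on GPU hosts

   model = torch.nn.Linear(10, 1)
   ddp_model = DDP(model)  # gradients are all-reduced across workers

   loss = ddp_model(torch.randn(8, 10)).sum()
   loss.backward()  # backward() triggers the gradient synchronization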
@@ -26,53 +26,96 @@
 The component defines how the user application is launched and torchx will take care of translating this into
 scheduler-specific definitions.
 
-.. note:
+.. note:: Follow :ref:`examples_apps/lightning_classy_vision/component:Prerequisites of running examples`
+   before running the examples.
 
-   Follow :ref:`examples_apps/lightning_classy_vision/component:Prerequisites of running examples` to
-   before running the examples
-
-Single node, multiple trainers (desktop)
+Single node, multiple trainers
 =========================================
 
-Try launching a single node, multiple trainers example on your desktop:
+Try launching a single node, multiple trainers example:
 
-.. code:: bash
+.. code:: shell-session
+
+   $ torchx run \\
+       -s local_cwd \\
+       ./torchx/examples/apps/lightning_classy_vision/component.py:trainer_dist \\
+       --nnodes 1 \\
+       --nproc_per_node 2 \\
+       --rdzv_backend c10d --rdzv_endpoint localhost:29500
 
-   torchx run -s local_cwd \
-     ./torchx/examples/apps/lightning_classy_vision/component.py:trainer_dist \
-     --nnodes 1 --nproc_per_node 2 \
-     --rdzv_backend c10d --rdzv_endpoint localhost:29500
 
-The `./torchx/examples/apps/lightning_classy_vision/component.py:trainer_dist` is reference to the component
+.. note:: Use ``torchx runopts`` to see the available schedulers.
+
+``./torchx/examples/apps/lightning_classy_vision/component.py:trainer_dist`` is a reference to the component
 function: :ref:`examples_apps/lightning_classy_vision/component:Distributed Trainer Component`
 
 
-Single node, multiple trainers (kubernetes)
-============================================
+.. note:: TorchX supports the docker scheduler via ``-s local_docker``, which currently works only for single
+   node, multiple processes due to issues `286 <https://github.com/pytorch/torchx/issues/286>`_ and
+   `287 <https://github.com/pytorch/torchx/issues/287>`_.
 
 
-Now lets launch the same component on the kubernetes cluster.
-Check out :py:mod:`torchx.schedulers.kubernetes_scheduler` on dependencies that needs to be installed
-before running using `kubernetes` scheduler.
 
+Multiple nodes, multiple trainers
+===================================
 
-We can use the following cmd to launch application on kubernetes:
+It is simple to launch and manage a distributed trainer with torchx. Let's try launching a distributed
+trainer using docker. The following command launches a distributed job on docker:
 
-.. code:: bash
+.. code:: shell-session
+
+   # Launch the trainer using torchx
+   $ torchx run \\
+       -s local_docker \\
+       ./torchx/examples/apps/lightning_classy_vision/component.py:trainer_dist \\
+       --nnodes 2 \\
+       --nproc_per_node 2 \\
+       --rdzv_backend c10d \\
+       --rdzv_endpoint localhost:29500
+
+
+.. note:: The command above will only work on hosts without GPUs!
+
+
+This will run 4 trainers in two docker containers. ``local_docker`` assigns a ``hostname``
+to each container of a role using the pattern ``${APP_NAME}-${ROLE_NAME}-${REPLICA_ID}``.
+
+
+TorchX also supports the ``kubernetes`` scheduler, which allows you to execute distributed jobs on your kubernetes cluster.
+
+
+.. note:: Make sure that you install the necessary dependencies listed in :ref:`schedulers/kubernetes:Prerequisites`
+   before executing the job.
+
+
+The following command runs 2 pods on the kubernetes cluster; each pod occupies a single GPU.
+
+
+.. code:: shell-session
 
-   torchx run -s kubernetes --scheduler_args namespace=default,queue=default \
-     ./torchx/examples/apps/lightning_classy_vision/component.py:trainer_dist \
-     --nnodes 1 --nproc_per_node 2
+   $ torchx run -s kubernetes \\
+       --scheduler_args namespace=default,queue=default \\
+       ./torchx/examples/apps/lightning_classy_vision/component.py:trainer_dist \\
+       --nnodes 2 \\
+       --nproc_per_node 1 \\
+       --rdzv_endpoint etcd-server.default.svc.cluster.local:2379
 
-The `namespaces` arg corresponds to the kubernetes namespace that you want to launch.
-The `queue` arg is the volcano `queue <https://volcano.sh/en/docs/queue/>`_.
 
+The command above launches a distributed train job in the kubernetes ``default`` namespace using the volcano
+``default`` queue. In this example we used the ``etcd`` rendezvous rather than the ``c10d`` rendezvous.
+An ``etcd`` rendezvous backed by an ``etcd server`` is a best practice for performing
+peer discovery in distributed jobs. Read more about
+`rendezvous <https://pytorch.org/docs/stable/elastic/rendezvous.html>`_.
 
-Example of output:
+
+.. note:: For GPU training, keep ``nproc_per_node`` equal to the number of GPUs on the host, and
+   change the resource requirements in the ``torchx/examples/apps/lightning_classy_vision/component.py:trainer_dist``
+   function. Modify ``resource_def`` to match the number of GPUs that your host has.
+
+The command should produce the following output:
 
 .. code:: bash
 
-   kubernetes://torchx/default:cv-trainer-pa2a7qgee9zng
    torchx 2021-10-18 18:46:55 INFO Launched app: kubernetes://torchx/default:cv-trainer-pa2a7qgee9zng
    torchx 2021-10-18 18:46:55 INFO AppStatus:
      msg: <NONE>
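
The rendezvous flags in the commands above are passed through to `Torch distributed run`. As a rough mental
model (a sketch of the underlying launcher invocation, not output from this commit), the two-node c10d example
corresponds to:

.. code:: shell-session

   $ python -m torch.distributed.run \
       --nnodes 2 \
       --nproc_per_node 2 \
       --rdzv_backend c10d \
       --rdzv_endpoint localhost:29500 \
       main.py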
@@ -87,48 +130,24 @@
 
 You can use the job url to query the status or logs of the job:
 
-.. code:: bash
+.. code:: shell-session
 
-   torchx status kubernetes://torchx/default:cv-trainer-pa2a7qgee9zng
+   # Change the value to your unique app handle
+   $ export APP_HANDLE=kubernetes://torchx/default:cv-trainer-pa2a7qgee9zng
+
+   $ torchx status $APP_HANDLE
 
    torchx 2021-10-18 18:47:44 INFO AppDef:
      State: SUCCEEDED
      Num Restarts: -1
      Roles:
       *worker[0]:SUCCEEDED
 
-Try running `torchx log kubernetes://torchx/default:cv-trainer-pa2a7qgee9zng`.
-
-
-Multiple nodes, multiple trainers (kubernetes)
-===============================================
-
-It is simple to launch multiple nodes trainer in kubernetes:
-
-.. code:: bash
-
-   torchx run -s kubernetes --scheduler_args namespace=default,queue=default \
-     ./torchx/examples/apps/lightning_classy_vision/component.py:trainer_dist \
-     --nnodes 2 --nproc_per_node 2
-
-The command above will launch distributed train job on kubernetes `default` namespace using volcano
-`default` queue. It will use etcd service accessible on `etcd-server:2379` to perform
-`etcd rendezvous <https://pytorch.org/docs/stable/elastic/rendezvous.html>`_.
-
-You can overwrite rendezvous endpoint:
-
-.. code:: bash
+Try running:
 
-   torchx run -s kubernetes --scheduler_args namespace=default,queue=default \
-     ./torchx/examples/apps/lightning_classy_vision/component.py:trainer_dist \
-     --nnodes 2 --nproc_per_node 1 \
-     --rdzv_endpoint etcd-server.default.svc.cluster.local:2379
+.. code:: shell-session
 
-.. note:: For GPU training, keep `nproc_per_node` equal to the amount of GPUs the instace has.
-
-The command above will launch distributed train job on kubernetes `default` namespace using volcano
-`default` queue. It will use etcd service accessible on `etcd-server:2379` to perform
-`etcd rendezvous <https://pytorch.org/docs/stable/elastic/rendezvous.html>`_.
+   $ torchx log $APP_HANDLE
 
 
 Builtin distributed components
@@ -137,7 +156,7 @@
 In the examples above we used custom components to launch user applications. It is not always the case that
 users need to write their own components.
 
-Torchx comes with set of builtin component that describe typical execution patterns.
+TorchX comes with a set of builtin components that describe typical execution patterns.
 
 
 dist.ddp
@@ -186,24 +205,25 @@ def compute_world_size():
 
 We can run this example on a desktop with four processes using the following command:
 
-.. code:: bash
+.. code:: shell-session
 
-   torchx run -s local_cwd dist.ddp --entrypoint main.py --nproc_per_node 4
+   $ torchx run -s local_cwd dist.ddp --entrypoint main.py --nproc_per_node 4
 
 
 Single trainer on kubernetes cluster
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
 We can execute it on the kubernetes cluster:
 
-.. code:: bash
-
-   torchx run -s kubernetes --scheduler_args namespace=default,queue=default\
-     dist.ddp --entrypoint main.py --nproc_per_node 4
+.. code:: shell-session
 
+   $ torchx run -s kubernetes \\
+       --scheduler_args namespace=default,queue=default \\
+       dist.ddp --entrypoint main.py --nproc_per_node 4
 
 
 Components APIs
+-----------------
 """
 
 from typing import Any, Dict, Optional
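
The ``dist.ddp --entrypoint main.py`` examples above assume a user-supplied ``main.py``. A minimal sketch of
such an entrypoint, in the spirit of the ``compute_world_size()`` example referenced in the hunk header above
(illustrative; the actual script ships with the torchx examples):

.. code:: python

   # main.py -- illustrative entrypoint, runnable under dist.ddp / torch.distributed.run
   import os

   import torch
   import torch.distributed as dist


   def compute_world_size() -> int:
       # The launcher sets RANK, WORLD_SIZE, MASTER_ADDR and MASTER_PORT.
       dist.init_process_group(backend="gloo")
       t = torch.ones(1)
       dist.all_reduce(t)  # each worker contributes 1.0, so the sum equals the world size
       world_size = int(t.item())
       print(f"rank {os.environ['RANK']}: world_size={world_size}")
       return world_size


   if __name__ == "__main__":
       compute_world_size()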

torchx/components/train.py

Lines changed: 3 additions & 38 deletions
@@ -11,44 +11,9 @@
 generic components you can use to run your custom training app.
 
 
-.. note::
-
-   Follow :ref:`examples_apps/lightning_classy_vision/component:Prerequisites of running examples`
-   before running the examples
-
-
-Check out the code for :ref:`examples_apps/lightning_classy_vision/train:Trainer App Example`.
-You can try it out by running a single trainer example on your desktop:
-
-
-.. code:: bash
-
-   python torchx/examples/apps/lightning_classy_vision/train.py
-
-
-Torchx simplifies application execution by providing a simple to use APIs that standardize
-application execution on local or remote environments. It does this by introducing a concept of a
-Component.
-
-Each user application should be accompanied with the corresponding component.
-Check out the single node trainer code:
-:ref:`examples_apps/lightning_classy_vision/component:Trainer Component`
-
-Try it out yourself:
-
-.. code:: bash
-
-   torchx run -s local_cwd \
-     ./torchx/examples/apps/lightning_classy_vision/component.py:trainer
-
-
-The code above will execute a single trainer on a user desktop.
-If you have docker installed on your laptop you can running the same single trainer via the following cmd:
-
-.. code:: bash
-
-   torchx run -s local_docker \
-     ./torchx/examples/apps/lightning_classy_vision/component.py:trainer
+1. :ref:`examples_apps/lightning_classy_vision/train:Trainer App Example`
+2. :ref:`examples_apps/lightning_classy_vision/component:Trainer Component`
+3. :ref:`component_best_practices:Component Best Practices`
 
 
 You can learn more about authoring your own components: :py:mod:`torchx.components`

torchx/schedulers/kubernetes_scheduler.py

Lines changed: 3 additions & 1 deletion
@@ -7,6 +7,9 @@
 
 """
 
+Prerequisites
+==============
+
 TorchX kubernetes scheduler depends on volcano and requires etcd installed for distributed job execution.
 
 Install volcano version 1.4.0:
@@ -23,7 +26,6 @@
    kubectl apply -f https://github.com/pytorch/torchx/blob/main/resources/etcd.yaml
 
 
-
 Learn more about running distributed trainers: :py:mod:`torchx.components.dist`
 
 """
