Commit ddcae5c

Kiuk Chung authored and facebook-github-bot committed

(torchx/docs) fix invalid references to docker when running with local_cwd in the components.dist docstring (#294)

Summary: Pull Request resolved: #294

Fixes incorrect docs that claim to run on docker containers when the command is `-s local_cwd` (see screenshot below) {F672188530}

Reviewed By: aivanou

Differential Revision: D31828767

fbshipit-source-id: 7ce86d4034907d0323bd21cc1eddea67efcfb98c

1 parent 3290ec3 commit ddcae5c

File tree

2 files changed: +80 -180 lines changed

torchx/components/dist.py

Lines changed: 78 additions & 174 deletions
@@ -5,117 +5,106 @@
 # LICENSE file in the root directory of this source tree.

 """
-TorchX helps you to run your distributed trainer jobs. Check out :py:mod:`torchx.components.train`
-on the example of running single trainer job. Here we will be using
-the same :ref:`examples_apps/lightning_classy_vision/train:Trainer App Example`.
-but will run it in a distributed manner.
+For distributed training, TorchX relies on the scheduler's gang scheduling
+capabilities to schedule ``n`` copies of nodes. Once launched, the application
+is expected to be written in a way that leverages this topology, for instance,
+with PyTorch's
+`DDP <https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html>`_.
+You can express a variety of node topologies with TorchX by specifying multiple
+:py:class:`torchx.specs.Role` in your component's AppDef. Each role maps to
+a homogeneous group of nodes that performs a "role" (function) in the overall
+training. Scheduling-wise, TorchX launches each role as a sub-gang.

-TorchX uses `Torch distributed run <https://pytorch.org/docs/stable/elastic/run.html>`_ to launch user processes
-and expects that user applications will be written in
-`Distributed data parallel <https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html>`_
-manner

+A DDP-style training job has a single role: trainers. A training job that
+uses parameter servers, on the other hand, has two roles: parameter server and trainer.
+You can specify a different entrypoint (executable), number of replicas,
+resource requirements, and more for each role.
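To make the role-per-sub-gang idea above concrete, here is a minimal, hypothetical sketch of a two-role (parameter server + trainer) topology expressed as plain data. The field names mirror, but are not guaranteed to match, the real ``torchx.specs`` API; treat this as an illustration of the structure, not as working TorchX code.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Hypothetical stand-ins for torchx.specs.Role / AppDef; field names are
# illustrative only -- consult the real torchx.specs API before relying on them.
@dataclass
class Role:
    name: str             # function of this gang, e.g. "trainer"
    entrypoint: str       # executable launched on every replica
    num_replicas: int     # size of this homogeneous sub-gang
    resource: Dict[str, int] = field(default_factory=dict)

@dataclass
class AppDef:
    name: str
    roles: List[Role] = field(default_factory=list)

# A parameter-server style job: two sub-gangs with different shapes.
app = AppDef(
    name="ps_training",
    roles=[
        Role("ps", "ps_server.py", num_replicas=2, resource={"cpu": 8}),
        Role("trainer", "main.py", num_replicas=4, resource={"gpu": 1}),
    ],
)

# Each role is gang-scheduled as its own sub-gang; the job spans all of them.
total_nodes = sum(r.num_replicas for r in app.roles)
print(total_nodes)  # 6
```

A DDP-only job would simply have a single ``Role`` entry in ``roles``.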

-Distributed Trainer Execution
-------------------------------

-In order to run your trainer on a single or multiple processes, remotely or locally, all you need to do is to
-write a distributed torchx component. The example that we will be using here is
-:ref:`examples_apps/lightning_classy_vision/component:Distributed Trainer Component`
+DDP Builtin
+----------------

-The component defines how the user application is launched and torchx will take care of translating this into
-scheduler-specific definitions.
-
-.. note:: Follow :ref:`examples_apps/lightning_classy_vision/component:Prerequisites of running examples`
-   before running the examples
-
-Single node, multiple trainers
-=========================================
-
-Try launching a single node, multiple trainers example:
+DDP-style trainers are common and easy to templatize since they are homogeneous
+single-role AppDefs, so there is a builtin: ``dist.ddp``. Assuming your DDP
+training script is called ``main.py``, launch it as:

 .. code:: shell-session

-    $ torchx run \\
-        -s local_cwd \\
-        ./torchx/examples/apps/lightning_classy_vision/component.py:trainer_dist \\
-        --nnodes 1 \\
-        --nproc_per_node 2 \\
-        --rdzv_backend c10d --rdzv_endpoint localhost:29500
+    # locally, 1 node x 4 workers
+    $ torchx run -s local_cwd dist.ddp --entrypoint main.py --nproc_per_node 4

+    # locally, 2 nodes x 4 workers (8 total)
+    $ torchx run -s local_cwd dist.ddp --entrypoint main.py \\
+        --rdzv_backend c10d \\
+        --nnodes 2 \\
+        --nproc_per_node 4

-.. note:: Use ``torchx runopts`` to see available schedulers
+    # remote (needs you to set up an etcd server first!)
+    $ torchx run -s kubernetes -cfg queue=default dist.ddp \\
+        --entrypoint main.py \\
+        --rdzv_backend etcd \\
+        --rdzv_endpoint etcd-server.default.svc.cluster.local:2379 \\
+        --nnodes 2 \\
+        --nproc_per_node 4

-The ``./torchx/examples/apps/lightning_classy_vision/component.py:trainer_dist`` is reference to the component
-function: :ref:`examples_apps/lightning_classy_vision/component:Distributed Trainer Component`

+There is a lot happening under the hood, so we strongly encourage you
+to continue reading the rest of this section
+to get an understanding of how everything works. Also note that
+while ``dist.ddp`` is convenient, you'll find that authoring your own
+distributed component is not only easy (the simplest way is to just copy
+``dist.ddp``!) but also leads to better flexibility and maintainability
+down the road, since builtin APIs are subject to more changes than
+the more stable specs API. However, the choice is yours: feel free to rely on
+the builtins if they meet your needs.

-.. note:: TorchX supports docker scheduler via ``-s local_docker`` command that currently works only for single
-   node multiple processes due to issues `286 <https://github.com/pytorch/torchx/issues/286>`_ and
-   `287 <https://github.com/pytorch/torchx/issues/287>`_
+Distributed Training
+-----------------------

+Local Testing
+===================

+.. note:: Please follow :ref:`examples_apps/lightning_classy_vision/component:Prerequisites of running examples` first.

-Multiple nodes, multiple trainers
-===================================

-It is simple to launch and manage distributed trainer with torchx. Lets try out and launch distributed
-trainer using docker. The following cmd to launch distributed job on docker:
+Running distributed training locally is a quick way to validate your
+training script. TorchX's local scheduler will create a process per
+replica (``--nnodes``). The example below uses `torchelastic <https://pytorch.org/docs/stable/elastic/run.html>`_
+as the main entrypoint of each node, which in turn spawns ``--nproc_per_node`` number
+of trainers. In total you'll see ``nnodes * nproc_per_node`` trainer processes and ``nnodes``
+elastic agent processes created on your local host.
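As a sanity check on the process math above, here is a short illustrative sketch. The helper functions are this sketch's own, not part of TorchX; the rank formula assumes homogeneous nodes, matching how torchelastic assigns each worker a global rank from its node's group rank and its local rank.

```python
# Illustrative process-count math for an elastic launch (not TorchX code).
def process_counts(nnodes: int, nproc_per_node: int):
    trainers = nnodes * nproc_per_node  # one trainer per (node, local_rank) pair
    agents = nnodes                     # one elastic agent process per node
    return trainers, agents

# With homogeneous nodes, a worker's global rank is derived from the node's
# group rank and the worker's local rank on that node.
def global_rank(group_rank: int, local_rank: int, nproc_per_node: int) -> int:
    return group_rank * nproc_per_node + local_rank

trainers, agents = process_counts(nnodes=2, nproc_per_node=4)
print(trainers, agents)  # 8 2
print(global_rank(group_rank=1, local_rank=3, nproc_per_node=4))  # 7
```

Inside each trainer, torchelastic exposes these values through environment variables such as ``RANK``, ``LOCAL_RANK``, ``GROUP_RANK``, and ``WORLD_SIZE``.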

 .. code:: shell-session

-    Launch the trainer using torchx
-    $ torchx run \\
-        -s local_cwd \\
-        ./torchx/examples/apps/lightning_classy_vision/component.py:trainer_dist \\
-        --nnodes 2 \\
-        --nproc_per_node 2 \\
-        --rdzv_backend c10d \\
-        --rdzv_endpoint localhost:29500
-
-
-.. note:: The command above will only work for hosts without GPU!
-
-
-This will run 4 trainers on two docker containers. ``local_docker`` assigns ``hostname``
-to each of the container role using the pattern ``${$APP_NAME}-${ROLE_NAME}-${RELICA_ID}``.
+    $ torchx run -s local_cwd ./torchx/examples/apps/lightning_classy_vision/component.py:trainer_dist \\
+        --nnodes 2 \\
+        --nproc_per_node 2 \\
+        --rdzv_backend c10d \\
+        --rdzv_endpoint localhost:29500


-TorchX supports ``kubernetes`` scheduler that allows you to execute distributed jobs on your kubernetes cluster.
+.. warning:: There is a known issue with ``local_docker`` (the default scheduler when no ``-s``
+   argument is supplied), hence we use ``-s local_cwd`` instead. Please track
+   the progress of the fix in `issue-286 <https://github.com/pytorch/torchx/issues/286>`_ and
+   `issue-287 <https://github.com/pytorch/torchx/issues/287>`_.

-.. note:: Make sure that you install necessary dependencies on :ref:`schedulers/kubernetes:Prerequisites`
-   before executing job
+Remote Launching
+====================

+.. note:: Please follow the :ref:`schedulers/kubernetes:Prerequisites` first.

-The following command runs 2 pods on kubernetes cluster, each of the pods will occupy a single gpu.
-
+The following example demonstrates launching the same job remotely on Kubernetes.

 .. code:: shell-session

-    $ torchx run -s kubernetes \\
-        --scheduler_args namespace=default,queue=default \\
-        ./torchx/examples/apps/lightning_classy_vision/component.py:trainer_dist \\
-        --nnodes 2 \\
-        --nproc_per_node 1 \\
-        --rdzv_endpoint etcd-server.default.svc.cluster.local:2379
-
-
-The command above will launch distributed train job on kubernetes ``default`` namespace using volcano
-``default`` queue. In this example we used ``etcd`` rendezvous in comparison to the ``c10d`` rendezvous.
-It is important to use ``etcd`` rendezvous that uses ``etcd server`` since it is a best practice to perform
-peer discovery for distributed jobs. Read more about
-`rendezvous <https://pytorch.org/docs/stable/elastic/rendezvous.html>`_.
-
-
-.. note:: For GPU training, keep ``nproc_per_node`` equal to the amount of GPUs on the host and
-   change the resource requirements in ``torchx/examples/apps/lightning_classy_vision/component.py:trainer_dist``
-   method. Modify ``resource_def`` to the number of GPUs that your host has.
-
-The command should produce the following output:
-
-.. code:: bash
-
+    $ torchx run -s kubernetes -cfg queue=default \\
+        ./torchx/examples/apps/lightning_classy_vision/component.py:trainer_dist \\
+        --nnodes 2 \\
+        --nproc_per_node 2 \\
+        --rdzv_backend etcd \\
+        --rdzv_endpoint etcd-server.default.svc.cluster.local:2379

     torchx 2021-10-18 18:46:55 INFO Launched app: kubernetes://torchx/default:cv-trainer-pa2a7qgee9zng
     torchx 2021-10-18 18:46:55 INFO AppStatus:
       msg: <NONE>
@@ -127,99 +116,14 @@

     torchx 2021-10-18 18:46:55 INFO Job URL: None

+Note that the only difference compared to the local launch is the scheduler (``-s``)
+and ``--rdzv_backend``. etcd will also work in the local case, but we used ``c10d``
+since it does not require additional setup. Note that this is a torchelastic requirement,
+not a TorchX one. Read more about rendezvous `here <https://pytorch.org/docs/stable/elastic/rendezvous.html>`_.
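On the ``-cfg queue=default`` flag used above: scheduler run options are passed as comma-delimited ``KEY=VALUE`` pairs. The parser below is a rough sketch of how such a string decomposes; it is this sketch's own code, not TorchX's actual option handling, which may differ (e.g. in type casting of values).

```python
# Hypothetical parser for "-cfg"-style strings ("KEY=VALUE" pairs separated
# by commas). TorchX's real implementation may differ; this only illustrates
# the expected input format.
def parse_cfg(cfg: str) -> dict:
    out = {}
    for pair in cfg.split(","):
        key, _, value = pair.partition("=")  # split on the first "=" only
        out[key.strip()] = value.strip()
    return out

print(parse_cfg("queue=default"))
print(parse_cfg("namespace=default,queue=default"))
```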

-You can use the job url to query the status or logs of the job:
-
-.. code:: shell-session
-
-    Change value to your unique app handle
-    $ export APP_HANDLE=kubernetes://torchx/default:cv-trainer-pa2a7qgee9zng
-
-    $ torchx status $APP_HANDLE
-
-    torchx 2021-10-18 18:47:44 INFO AppDef:
-      State: SUCCEEDED
-      Num Restarts: -1
-      Roles:
-        *worker[0]:SUCCEEDED
-
-Try running
-
-.. code:: shell-session
-
-    $ torchx log $APP_HANDLE
-
-
-Builtin distributed components
----------------------------------
-
-In the examples above we used custom components to launch user applications. It is not always the case that
-users need to write their own components.
-
-TorchX comes with set of builtin component that describe typical execution patterns.
-
-
-dist.ddp
-=========
-
-``dist.ddp`` is a component for applications that run as distributed jobs in a DDP manner.
-You can use it to quickly iterate over your application without the need of authoring your own component.
-
-.. note::
-
-   ``dist.ddp`` is a generic component, as a result it is good for quick iterations, but not production usage.
-   It is recommended to author your own component if you want to put your application in production.
-   Learn more :ref:`components/overview:Authoring` about how to author your component.
-
-We will be using ``dist.ddp`` to execute the following example:
-
-.. code:: python
-
-   # main.py
-   import os
-
-   import torch
-   import torch.distributed as dist
-   import torch.nn.functional as F
-   import torch.distributed.run
-
-   def compute_world_size():
-       rank = int(os.getenv("RANK", "0"))
-       world_size = int(os.getenv("WORLD_SIZE", "1"))
-       dist.init_process_group()
-       print("successfully initialized process group")
-
-       t = F.one_hot(torch.tensor(rank), num_classes=world_size)
-       dist.all_reduce(t)
-       computed_world_size = int(torch.sum(t).item())
-       print(
-           f"rank: {rank}, actual world_size: {world_size}, computed world_size: {computed_world_size}"
-       )
-
-   if __name__ == "__main__":
-       compute_world_size()
-
-
-Single trainer on desktop
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-
-We can run this example on desktop on four processes using the following cmd:
-
-.. code:: shell-session
-
-   $ torchx run -s local_cwd dist.ddp --entrypoint main.py --nproc_per_node 4
-
-
-Single trainer on kubernetes cluster
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-
-We can execute it on the kubernetes cluster
-
-.. code:: shell-session
-
-   $ torchx run -s kubernetes \\
-       --scheduler_args namespace=default,queue=default \\
-       dist.ddp --entrypoint main.py --nproc_per_node 4
+.. note:: For GPU training, keep ``nproc_per_node`` equal to the amount of GPUs on the host and
+   change the resource requirements in ``torchx/examples/apps/lightning_classy_vision/component.py:trainer_dist``
+   method. Modify ``resource_def`` to the number of GPUs that your host has.


 Components APIs
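Following up on the GPU note in the docstring above, one illustrative way to pick ``nproc_per_node`` to match the host's GPUs: when PyTorch is available, ``torch.cuda.device_count()`` is the usual call; the helper below is a simplified, torch-free stand-in that counts devices listed in ``CUDA_VISIBLE_DEVICES`` instead, and is this note's own sketch rather than anything TorchX provides.

```python
import os

# Simplified helper: choose nproc_per_node by counting visible GPUs, falling
# back to 1 process when none are exposed. With PyTorch installed you would
# typically call torch.cuda.device_count() instead of inspecting the env var.
def pick_nproc_per_node() -> int:
    visible = os.environ.get("CUDA_VISIBLE_DEVICES", "")
    gpus = [d for d in visible.split(",") if d.strip() != ""]
    return max(len(gpus), 1)

os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2,3"
print(pick_nproc_per_node())  # 4
```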

torchx/components/train.py

Lines changed: 2 additions & 6 deletions

@@ -14,11 +14,7 @@
 1. :ref:`examples_apps/lightning_classy_vision/train:Trainer App Example`
 2. :ref:`examples_apps/lightning_classy_vision/component:Trainer Component`
 3. :ref:`component_best_practices:Component Best Practices`
-
-
-You can learn more about authoring your own components: :py:mod:`torchx.components`
-
-Torchx has great support for simplifying execution of distributed jobs, that you can learn more
-:py:mod:`torchx.components.dist`
+4. See :py:mod:`torchx.components` to learn more about authoring components
+5. For more information on distributed training see :py:mod:`torchx.components.dist`.


 """
