|
5 | 5 | # LICENSE file in the root directory of this source tree. |
6 | 6 |
|
7 | 7 | """ |
8 | | -TorchX helps you to run your distributed trainer jobs. Check out :py:mod:`torchx.components.train` |
9 | | -on the example of running single trainer job. Here we will be using |
10 | | -the same :ref:`examples_apps/lightning_classy_vision/train:Trainer App Example`. |
11 | | -but will run it in a distributed manner. |
| 8 | +For distributed training, TorchX relies on the scheduler's gang scheduling |
| 9 | +capabilities to schedule ``n`` copies of nodes. Once launched, the application |
| 10 | +is expected to be written in a way that leverages this topology, for instance, |
| 11 | +with PyTorch's |
| 12 | +`DDP <https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html>`_. |
| 13 | +You can express a variety of node topologies with TorchX by specifying multiple |
| 14 | +:py:class:`torchx.specs.Role` in your component's AppDef. Each role maps to |
| 15 | +a homogeneous group of nodes that performs a "role" (function) in the overall |
| 16 | +training. Scheduling-wise, TorchX launches each role as a sub-gang. |
12 | 17 |
|
13 | | -TorchX uses `Torch distributed run <https://pytorch.org/docs/stable/elastic/run.html>`_ to launch user processes |
14 | | -and expects that user applications will be written in |
15 | | -`Distributed data parallel <https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html>`_ |
16 | | -manner |
17 | 18 |
|
| 19 | +A DDP-style training job has a single role: trainers. A training job that |
| 20 | +uses parameter servers, on the other hand, has two roles: parameter server |
| 21 | +and trainer. You can specify a different entrypoint (executable), number of |
| 22 | +replicas, resource requirements, and more for each role. |
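| | +
| | +For illustration, a minimal (hypothetical) two-role component might look like |
| | +the sketch below. The image name, module paths, and resource values are |
| | +placeholders, and the exact :py:class:`torchx.specs.Role` fields may vary |
| | +slightly between TorchX versions: |
| | +
| | +.. code:: python |
| | +
| | +    import torchx.specs as specs |
| | +
| | +    def ps_trainer(image: str = "my/train-image:latest") -> specs.AppDef: |
| | +        # hypothetical two-role AppDef: a parameter server gang and a trainer |
| | +        # gang; each Role is scheduled as its own sub-gang of homogeneous replicas |
| | +        return specs.AppDef( |
| | +            name="ps_style_train", |
| | +            roles=[ |
| | +                specs.Role( |
| | +                    name="ps",  # parameter servers |
| | +                    image=image, |
| | +                    entrypoint="python", |
| | +                    args=["-m", "my_project.ps"],  # placeholder module |
| | +                    num_replicas=1, |
| | +                    resource=specs.Resource(cpu=4, gpu=0, memMB=16384), |
| | +                ), |
| | +                specs.Role( |
| | +                    name="trainer",  # trainers |
| | +                    image=image, |
| | +                    entrypoint="python", |
| | +                    args=["-m", "my_project.train"],  # placeholder module |
| | +                    num_replicas=4, |
| | +                    resource=specs.Resource(cpu=8, gpu=1, memMB=32768), |
| | +                ), |
| | +            ], |
| | +        ) |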
18 | 23 |
|
19 | | -Distributed Trainer Execution |
20 | | ------------------------------- |
21 | 24 |
|
22 | | -In order to run your trainer on a single or multiple processes, remotely or locally, all you need to do is to |
23 | | -write a distributed torchx component. The example that we will be using here is |
24 | | -:ref:`examples_apps/lightning_classy_vision/component:Distributed Trainer Component` |
| 25 | +DDP Builtin |
| 26 | +---------------- |
25 | 27 |
|
26 | | -The component defines how the user application is launched and torchx will take care of translating this into |
27 | | -scheduler-specific definitions. |
28 | | -
|
29 | | -.. note:: Follow :ref:`examples_apps/lightning_classy_vision/component:Prerequisites of running examples` |
30 | | - before running the examples |
31 | | -
|
32 | | -Single node, multiple trainers |
33 | | -========================================= |
34 | | -
|
35 | | -Try launching a single node, multiple trainers example: |
| 28 | +DDP-style trainers are common and easy to templatize since they are homogeneous, |
| 29 | +single-role AppDefs, so there is a builtin: ``dist.ddp``. Assuming your DDP |
| 30 | +training script is called ``main.py``, launch it as: |
36 | 31 |
|
37 | 32 | .. code:: shell-session |
38 | 33 |
|
39 | | - $ torchx run \\ |
40 | | - -s local_cwd \\ |
41 | | - ./torchx/examples/apps/lightning_classy_vision/component.py:trainer_dist \\ |
42 | | - --nnodes 1 \\ |
43 | | - --nproc_per_node 2 \\ |
44 | | - --rdzv_backend c10d --rdzv_endpoint localhost:29500 |
| 34 | + # locally, 1 node x 4 workers |
| 35 | + $ torchx run -s local_cwd dist.ddp --entrypoint main.py --nproc_per_node 4 |
45 | 36 |
|
| 37 | + # locally, 2 nodes x 4 workers (8 total) |
| 38 | + $ torchx run -s local_cwd dist.ddp --entrypoint main.py \\ |
| 39 | + --rdzv_backend c10d \\ |
| 40 | + --nnodes 2 \\ |
| 41 | + --nproc_per_node 4 |
46 | 42 |
|
47 | | -.. note:: Use ``torchx runopts`` to see available schedulers |
| 43 | + # remote (needs you to set up an etcd server first!) |
| 44 | + $ torchx run -s kubernetes -cfg queue=default dist.ddp \\ |
| 45 | + --entrypoint main.py \\ |
| 46 | + --rdzv_backend etcd \\ |
| 47 | + --rdzv_endpoint etcd-server.default.svc.cluster.local:2379 \\ |
| 48 | + --nnodes 2 \\ |
| 49 | + --nproc_per_node 4 |
48 | 50 |
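| | +For reference, the commands above assume a DDP script named ``main.py``. A |
| | +minimal sketch (hypothetical; it simply all-reduces a one-hot tensor to verify |
| | +the world size, and uses the ``gloo`` backend so it also runs on CPU-only |
| | +hosts) could look like this: |
| | +
| | +.. code:: python |
| | +
| | +    # main.py |
| | +    import os |
| | +
| | +    import torch |
| | +    import torch.distributed as dist |
| | +    import torch.nn.functional as F |
| | +
| | +    def compute_world_size() -> None: |
| | +        # RANK, WORLD_SIZE, MASTER_ADDR, MASTER_PORT are set by torchelastic |
| | +        rank = int(os.getenv("RANK", "0")) |
| | +        world_size = int(os.getenv("WORLD_SIZE", "1")) |
| | +        dist.init_process_group(backend="gloo")  # use "nccl" for GPU training |
| | +        print("successfully initialized process group") |
| | +
| | +        # the sum of all one-hot rank vectors equals the world size |
| | +        t = F.one_hot(torch.tensor(rank), num_classes=world_size) |
| | +        dist.all_reduce(t) |
| | +        computed_world_size = int(torch.sum(t).item()) |
| | +        print( |
| | +            f"rank: {rank}, actual world_size: {world_size}, " |
| | +            f"computed world_size: {computed_world_size}" |
| | +        ) |
| | +
| | +    if __name__ == "__main__": |
| | +        compute_world_size() |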
|
49 | | -The ``./torchx/examples/apps/lightning_classy_vision/component.py:trainer_dist`` is reference to the component |
50 | | -function: :ref:`examples_apps/lightning_classy_vision/component:Distributed Trainer Component` |
51 | 51 |
|
| 52 | +There is a lot happening under the hood, so we strongly encourage you |
| 53 | +to continue reading the rest of this section |
| 54 | +to understand how everything works. Also note that |
| 55 | +while ``dist.ddp`` is convenient, you'll find that authoring your own |
| 56 | +distributed component is not only easy (the simplest way is to just copy |
| 57 | +``dist.ddp``!) but also leads to better flexibility and maintainability |
| 58 | +down the road, since builtin APIs are subject to more changes than |
| 59 | +the more stable specs API. However, the choice is yours; feel free to rely on |
| 60 | +the builtins if they meet your needs. |
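| | +
| | +As a rough sketch of what authoring such a component involves, the following |
| | +hypothetical single-role component launches ``main.py`` through torchelastic, |
| | +similar in spirit to ``dist.ddp`` (this is not its actual implementation; the |
| | +image name and rendezvous settings are illustrative only): |
| | +
| | +.. code:: python |
| | +
| | +    import torchx.specs as specs |
| | +
| | +    def ddp_trainer( |
| | +        script: str = "main.py", |
| | +        nnodes: int = 2, |
| | +        nproc_per_node: int = 2, |
| | +        image: str = "my/train-image:latest",  # placeholder image |
| | +    ) -> specs.AppDef: |
| | +        # single-role AppDef: every replica runs a torchelastic agent, which |
| | +        # in turn spawns nproc_per_node trainer processes |
| | +        return specs.AppDef( |
| | +            name="ddp_trainer", |
| | +            roles=[ |
| | +                specs.Role( |
| | +                    name="trainer", |
| | +                    image=image, |
| | +                    entrypoint="python", |
| | +                    args=[ |
| | +                        "-m", "torch.distributed.run", |
| | +                        "--rdzv_backend", "c10d", |
| | +                        # point the endpoint at a host reachable by all nodes |
| | +                        # when running multi-node |
| | +                        "--rdzv_endpoint", "localhost:29500", |
| | +                        "--nnodes", str(nnodes), |
| | +                        "--nproc_per_node", str(nproc_per_node), |
| | +                        script, |
| | +                    ], |
| | +                    num_replicas=nnodes, |
| | +                ), |
| | +            ], |
| | +        ) |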
52 | 61 |
|
53 | | -.. note:: TorchX supports docker scheduler via ``-s local_docker`` command that currently works only for single |
54 | | - node multiple processes due to issues `286 <https://github.com/pytorch/torchx/issues/286>`_ and |
55 | | - `287 <https://github.com/pytorch/torchx/issues/287>`_ |
| 62 | +Distributed Training |
| 63 | +----------------------- |
56 | 64 |
|
| 65 | +Local Testing |
| 66 | +=================== |
57 | 67 |
|
| 68 | +.. note:: Please follow :ref:`examples_apps/lightning_classy_vision/component:Prerequisites of running examples` first. |
58 | 69 |
|
59 | | -Multiple nodes, multiple trainers |
60 | | -=================================== |
61 | 70 |
|
62 | | -It is simple to launch and manage distributed trainer with torchx. Lets try out and launch distributed |
63 | | -trainer using docker. The following cmd to launch distributed job on docker: |
| 71 | +Running distributed training locally is a quick way to validate your |
| 72 | +training script. TorchX's local scheduler will create a process per |
| 73 | +replica (``--nnodes``). The example below uses `torchelastic <https://pytorch.org/docs/stable/elastic/run.html>`_ |
| 74 | +as the main entrypoint of each node, which in turn spawns ``--nproc_per_node`` |
| 75 | +trainers. In total you'll see ``nnodes * nproc_per_node`` trainer processes and ``nnodes`` |
| 76 | +elastic agent processes created on your local host. |
64 | 77 |
|
65 | 78 | .. code:: shell-session |
66 | 79 |
|
67 | | - Launch the trainer using torchx |
68 | | - $ torchx run \\ |
69 | | - -s local_cwd \\ |
70 | | - ./torchx/examples/apps/lightning_classy_vision/component.py:trainer_dist \\ |
71 | | - --nnodes 2 \\ |
72 | | - --nproc_per_node 2 \\ |
73 | | - --rdzv_backend c10d \\ |
74 | | - --rdzv_endpoint localhost:29500 |
75 | | -
|
76 | | -
|
77 | | -.. note:: The command above will only work for hosts without GPU! |
78 | | -
|
79 | | -
|
80 | | -This will run 4 trainers on two docker containers. ``local_docker`` assigns ``hostname`` |
81 | | -to each of the container role using the pattern ``${$APP_NAME}-${ROLE_NAME}-${RELICA_ID}``. |
| 80 | + $ torchx run -s local_cwd ./torchx/examples/apps/lightning_classy_vision/component.py:trainer_dist \\ |
| 81 | + --nnodes 2 \\ |
| 82 | + --nproc_per_node 2 \\ |
| 83 | + --rdzv_backend c10d \\ |
| 84 | + --rdzv_endpoint localhost:29500 |
82 | 85 |
|
83 | 86 |
|
84 | | -TorchX supports ``kubernetes`` scheduler that allows you to execute distributed jobs on your kubernetes cluster. |
| 87 | +.. warning:: There is a known issue with ``local_docker`` (the default scheduler when no ``-s`` |
| 88 | + argument is supplied), hence we use ``-s local_cwd`` instead. Please track |
| 89 | + the progress of the fix on `issue-286 <https://github.com/pytorch/torchx/issues/286>`_, |
| 90 | + `issue-287 <https://github.com/pytorch/torchx/issues/287>`_. |
85 | 91 |
|
86 | 92 |
|
87 | | -.. note:: Make sure that you install necessary dependencies on :ref:`schedulers/kubernetes:Prerequisites` |
88 | | - before executing job |
| 93 | +Remote Launching |
| 94 | +==================== |
89 | 95 |
|
| 96 | +.. note:: Please follow the :ref:`schedulers/kubernetes:Prerequisites` first. |
90 | 97 |
|
91 | | -The following command runs 2 pods on kubernetes cluster, each of the pods will occupy a single gpu. |
92 | | -
|
| 98 | +The following example demonstrates launching the same job remotely on Kubernetes. |
93 | 99 |
|
94 | 100 | .. code:: shell-session |
95 | 101 |
|
96 | | - $ torchx run -s kubernetes \\ |
97 | | - --scheduler_args namespace=default,queue=default \\ |
98 | | - ./torchx/examples/apps/lightning_classy_vision/component.py:trainer_dist \\ |
99 | | - --nnodes 2 \\ |
100 | | - --nproc_per_node 1 \\ |
101 | | - --rdzv_endpoint etcd-server.default.svc.cluster.local:2379 |
102 | | -
|
103 | | -
|
104 | | -The command above will launch distributed train job on kubernetes ``default`` namespace using volcano |
105 | | -``default`` queue. In this example we used ``etcd`` rendezvous in comparison to the ``c10d`` rendezvous. |
106 | | -It is important to use ``etcd`` rendezvous that uses ``etcd server`` since it is a best practice to perform |
107 | | -peer discovery for distributed jobs. Read more about |
108 | | -`rendezvous <https://pytorch.org/docs/stable/elastic/rendezvous.html>`_. |
109 | | -
|
110 | | -
|
111 | | -.. note:: For GPU training, keep ``nproc_per_node`` equal to the amount of GPUs on the host and |
112 | | - change the resource requirements in ``torchx/examples/apps/lightning_classy_vision/component.py:trainer_dist`` |
113 | | - method. Modify ``resource_def`` to the number of GPUs that your host has. |
114 | | -
|
115 | | -The command should produce the following output: |
116 | | -
|
117 | | -.. code:: bash |
118 | | -
|
| 102 | + $ torchx run -s kubernetes -cfg queue=default \\ |
| 103 | + ./torchx/examples/apps/lightning_classy_vision/component.py:trainer_dist \\ |
| 104 | + --nnodes 2 \\ |
| 105 | + --nproc_per_node 2 \\ |
| 106 | + --rdzv_backend etcd \\ |
| 107 | + --rdzv_endpoint etcd-server.default.svc.cluster.local:2379 |
119 | 108 | torchx 2021-10-18 18:46:55 INFO Launched app: kubernetes://torchx/default:cv-trainer-pa2a7qgee9zng |
120 | 109 | torchx 2021-10-18 18:46:55 INFO AppStatus: |
121 | 110 | msg: <NONE> |
|
127 | 116 |
|
128 | 117 | torchx 2021-10-18 18:46:55 INFO Job URL: None |
129 | 118 |
|
| 119 | +Note that the only difference compared to the local launch is the scheduler (``-s``) |
| 120 | +and ``--rdzv_backend``. ``etcd`` also works in the local case, but we used ``c10d`` |
| 121 | +since it does not require additional setup. This is a torchelastic requirement, |
| 122 | +not a TorchX one. Read more about rendezvous `here <https://pytorch.org/docs/stable/elastic/rendezvous.html>`_. |
130 | 123 |
|
131 | | -You can use the job url to query the status or logs of the job: |
132 | | -
|
133 | | -.. code:: shell-session |
134 | | -
|
135 | | - Change value to your unique app handle |
136 | | - $ export APP_HANDLE=kubernetes://torchx/default:cv-trainer-pa2a7qgee9zng |
137 | | -
|
138 | | - $ torchx status $APP_HANDLE |
139 | | -
|
140 | | - torchx 2021-10-18 18:47:44 INFO AppDef: |
141 | | - State: SUCCEEDED |
142 | | - Num Restarts: -1 |
143 | | - Roles: |
144 | | - *worker[0]:SUCCEEDED |
145 | | -
|
146 | | -Try running |
147 | | -
|
148 | | -.. code:: shell-session |
149 | | -
|
150 | | - $ torchx log $APP_HANDLE |
151 | | -
|
152 | | -
|
153 | | -Builtin distributed components |
154 | | ---------------------------------- |
155 | | -
|
156 | | -In the examples above we used custom components to launch user applications. It is not always the case that |
157 | | -users need to write their own components. |
158 | | -
|
159 | | -TorchX comes with set of builtin component that describe typical execution patterns. |
160 | | -
|
161 | | -
|
162 | | -dist.ddp |
163 | | -========= |
164 | | -
|
165 | | -``dist.ddp`` is a component for applications that run as distributed jobs in a DDP manner. |
166 | | -You can use it to quickly iterate over your application without the need of authoring your own component. |
167 | | -
|
168 | | -.. note:: |
169 | | -
|
170 | | - ``dist.ddp`` is a generic component, as a result it is good for quick iterations, but not production usage. |
171 | | - It is recommended to author your own component if you want to put your application in production. |
172 | | - Learn more :ref:`components/overview:Authoring` about how to author your component. |
173 | | -
|
174 | | -We will be using ``dist.ddp`` to execute the following example: |
175 | | -
|
176 | | -.. code:: python |
177 | | -
|
178 | | - # main.py |
179 | | - import os |
180 | | -
|
181 | | - import torch |
182 | | - import torch.distributed as dist |
183 | | - import torch.nn.functional as F |
184 | | - import torch.distributed.run |
185 | | -
|
186 | | - def compute_world_size(): |
187 | | - rank = int(os.getenv("RANK", "0")) |
188 | | - world_size = int(os.getenv("WORLD_SIZE", "1")) |
189 | | - dist.init_process_group() |
190 | | - print("successfully initialized process group") |
191 | | -
|
192 | | - t = F.one_hot(torch.tensor(rank), num_classes=world_size) |
193 | | - dist.all_reduce(t) |
194 | | - computed_world_size = int(torch.sum(t).item()) |
195 | | - print( |
196 | | - f"rank: {rank}, actual world_size: {world_size}, computed world_size: {computed_world_size}" |
197 | | - ) |
198 | | -
|
199 | | - if __name__ == "__main__": |
200 | | - compute_world_size() |
201 | | -
|
202 | | -
|
203 | | -Single trainer on desktop |
204 | | -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ |
205 | | -
|
206 | | -We can run this example on desktop on four processes using the following cmd: |
207 | | -
|
208 | | -.. code:: shell-session |
209 | | -
|
210 | | - $ torchx run -s local_cwd dist.ddp --entrypoint main.py --nproc_per_node 4 |
211 | | -
|
212 | | -
|
213 | | -Single trainer on kubernetes cluster |
214 | | -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ |
215 | | -
|
216 | | -We can execute it on the kubernetes cluster |
217 | | -
|
218 | | -.. code:: shell-session |
219 | | -
|
220 | | - $ torchx run -s kubernetes \\ |
221 | | - --scheduler_args namespace=default,queue=default \\ |
222 | | - dist.ddp --entrypoint main.py --nproc_per_node 4 |
| 124 | +.. note:: For GPU training, keep ``nproc_per_node`` equal to the number of GPUs on the host and |
| 125 | + change the resource requirements in the ``torchx/examples/apps/lightning_classy_vision/component.py:trainer_dist`` |
| 126 | + component: modify ``resource_def`` to match the number of GPUs on your host. |
223 | 127 |
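| | +For example, a resource definition for hosts with 2 GPUs might look like the |
| | +following hypothetical sketch (the CPU and memory values are placeholders; |
| | +set ``gpu`` to match your hosts): |
| | +
| | +.. code:: python |
| | +
| | +    import torchx.specs as specs |
| | +
| | +    # request 8 vCPUs, 2 GPUs, and 32GB of memory per replica |
| | +    resource_def = specs.Resource(cpu=8, gpu=2, memMB=32768) |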
|
224 | 128 |
|
225 | 129 | Components APIs |
|