# LICENSE file in the root directory of this source tree.

"""
For distributed training, TorchX relies on the scheduler's gang scheduling
capabilities to schedule ``n`` copies of nodes. Once launched, the application
is expected to be written in a way that leverages this topology, for instance,
with PyTorch's
`DDP <https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html>`_.
You can express a variety of node topologies with TorchX by specifying multiple
:py:class:`torchx.specs.Role` in your component's AppDef. Each role maps to
a homogeneous group of nodes that performs a "role" (function) in the overall
training. Scheduling-wise, TorchX launches each role as a sub-gang.

A DDP-style training job has a single role: trainers, whereas a training job
that uses parameter servers has two roles: parameter server and trainer. You can
specify a different entrypoint (executable), number of replicas, resource
requirements, and more for each role.

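To make the multi-role idea concrete, the sketch below shows what a two-role
(parameter server + trainer) AppDef could look like. It is a minimal illustration,
not a ready-to-run component: the image name, module paths, replica counts, and
resource values are all placeholders.

.. code:: python

    # Sketch of a two-role AppDef (parameter server + trainer).
    # Image, module paths, and resources are placeholders.
    import torchx.specs as specs


    def ps_trainer(num_ps: int = 1, num_trainers: int = 4) -> specs.AppDef:
        return specs.AppDef(
            name="ps-trainer",
            roles=[
                specs.Role(
                    name="ps",  # parameter server sub-gang
                    image="example.com/my_trainer:latest",
                    entrypoint="python",
                    args=["-m", "my_app.ps"],
                    num_replicas=num_ps,
                    resource=specs.Resource(cpu=4, gpu=0, memMB=16384),
                ),
                specs.Role(
                    name="trainer",  # trainer sub-gang
                    image="example.com/my_trainer:latest",
                    entrypoint="python",
                    args=["-m", "my_app.train"],
                    num_replicas=num_trainers,
                    resource=specs.Resource(cpu=8, gpu=1, memMB=32768),
                ),
            ],
        )
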
DDP Builtin
-----------

DDP-style trainers are common and easy to templatize since they are homogeneous,
single-role AppDefs, so there is a builtin: ``dist.ddp``. Assuming your DDP
training script is called ``main.py``, launch it as:

.. code:: shell-session

    # locally, 1 node x 4 workers
    $ torchx run -s local_cwd dist.ddp --entrypoint main.py --nproc_per_node 4

    # locally, 2 nodes x 4 workers (8 total)
    $ torchx run -s local_cwd dist.ddp --entrypoint main.py \\
        --rdzv_backend c10d \\
        --nnodes 2 \\
        --nproc_per_node 4

    # remote (needs you to set up an etcd server first!)
    $ torchx run -s kubernetes -cfg queue=default dist.ddp \\
        --entrypoint main.py \\
        --rdzv_backend etcd \\
        --rdzv_endpoint etcd-server.default.svc.cluster.local:2379 \\
        --nnodes 2 \\
        --nproc_per_node 4

.. note:: Use ``torchx runopts`` to see the available schedulers.

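If you do not have a training script handy yet, a minimal ``main.py`` along the
lines of the sketch below is enough to smoke-test the launch commands above: it
initializes the process group (``gloo`` here; switch to ``nccl`` for GPU training)
and all-reduces a one-hot tensor to verify the world size.

.. code:: python

    # main.py -- minimal DDP-style smoke test: initializes the process group
    # and all-reduces a one-hot tensor to verify the world size.
    import os

    import torch
    import torch.distributed as dist
    import torch.nn.functional as F


    def compute_world_size() -> None:
        rank = int(os.getenv("RANK", "0"))
        world_size = int(os.getenv("WORLD_SIZE", "1"))
        dist.init_process_group(backend="gloo")  # use "nccl" for GPU training
        print("successfully initialized process group")

        t = F.one_hot(torch.tensor(rank), num_classes=world_size)
        dist.all_reduce(t)
        computed_world_size = int(torch.sum(t).item())
        print(
            f"rank: {rank}, actual world_size: {world_size}, "
            f"computed world_size: {computed_world_size}"
        )


    if __name__ == "__main__":
        compute_world_size()
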
There is a lot happening under the hood, so we strongly encourage you to continue
reading the rest of this section to get an understanding of how everything works.
Also note that while ``dist.ddp`` is convenient, you'll find that authoring your
own distributed component is not only easy (the simplest way is to just copy
``dist.ddp``!) but also leads to better flexibility and maintainability down the
road, since builtin APIs are subject to more changes than the more stable specs
API. See :ref:`components/overview:Authoring` to learn more about authoring your
own components. That said, the choice is yours; feel free to rely on the builtins
if they meet your needs.

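To give a flavor of what authoring a component involves, here is a rough
single-role sketch in the spirit of ``dist.ddp``: a plain Python function that
returns an AppDef whose trainer role launches your script under
``torch.distributed.run``. It is illustrative only; the image, resource values,
and defaults are placeholders, and the real builtin handles more options.

.. code:: python

    # Sketch of a custom single-role distributed component in the spirit of
    # ``dist.ddp``. Image and resource values are placeholders.
    import torchx.specs as specs


    def my_ddp_trainer(
        script: str = "main.py",
        image: str = "example.com/my_trainer:latest",
        nnodes: int = 2,
        nproc_per_node: int = 2,
        rdzv_backend: str = "c10d",
        rdzv_endpoint: str = "localhost:29500",
    ) -> specs.AppDef:
        return specs.AppDef(
            name="my-ddp-trainer",
            roles=[
                specs.Role(
                    name="trainer",
                    image=image,
                    entrypoint="python",
                    args=[
                        "-m",
                        "torch.distributed.run",
                        f"--nnodes={nnodes}",
                        f"--nproc_per_node={nproc_per_node}",
                        f"--rdzv_backend={rdzv_backend}",
                        f"--rdzv_endpoint={rdzv_endpoint}",
                        script,
                    ],
                    num_replicas=nnodes,
                    resource=specs.Resource(cpu=4, gpu=0, memMB=8192),
                )
            ],
        )

Saved as, say, ``my_component.py``, such a component could then be launched with
``torchx run -s local_cwd ./my_component.py:my_ddp_trainer``, following the same
``file.py:function`` pattern used in the examples below.
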
Distributed Training
--------------------

Local Testing
=============

.. note:: Please follow the
          :ref:`examples_apps/lightning_classy_vision/component:Prerequisites of running examples`
          first.

Running distributed training locally is a quick way to validate your training
script. TorchX's local scheduler creates a process per replica (``--nnodes``).
The example below uses `torchelastic <https://pytorch.org/docs/stable/elastic/run.html>`_
as the main entrypoint of each node, which in turn spawns ``--nproc_per_node``
trainers. In total you'll see ``nnodes * nproc_per_node`` trainer processes and
``nnodes`` elastic agent processes created on your local host.

.. code:: shell-session

    $ torchx run -s local_cwd ./torchx/examples/apps/lightning_classy_vision/component.py:trainer_dist \\
        --nnodes 2 \\
        --nproc_per_node 2 \\
        --rdzv_backend c10d \\
        --rdzv_endpoint localhost:29500

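To see this topology from inside the job, note that torchelastic exposes it to
every trainer through environment variables such as ``RANK``, ``LOCAL_RANK``,
``GROUP_RANK``, and ``WORLD_SIZE``. A throwaway snippet like the following
(illustrative; drop it into your trainer's entrypoint) prints one line per
trainer process:

.. code:: python

    # Print the rank-related environment variables set by torchelastic.
    # With --nnodes 2 --nproc_per_node 2 you should see 4 distinct lines.
    import os

    print(
        f"group_rank={os.environ.get('GROUP_RANK')} "  # which node (elastic agent) this trainer belongs to
        f"local_rank={os.environ.get('LOCAL_RANK')} "  # trainer index within the node
        f"rank={os.environ.get('RANK')} "              # global trainer rank
        f"world_size={os.environ.get('WORLD_SIZE')}"   # nnodes * nproc_per_node
    )
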
.. warning:: There is a known issue with ``local_docker`` (the default scheduler when no ``-s``
             argument is supplied), hence we use ``-s local_cwd`` instead. Please track the
             progress of the fix in `issue-286 <https://github.com/pytorch/torchx/issues/286>`_
             and `issue-287 <https://github.com/pytorch/torchx/issues/287>`_.

Remote Launching
================

.. note:: Please follow the :ref:`schedulers/kubernetes:Prerequisites` first.

The following example demonstrates launching the same job remotely on Kubernetes
(``queue=default`` refers to the Volcano queue).

.. code:: shell-session

    $ torchx run -s kubernetes -cfg queue=default \\
        ./torchx/examples/apps/lightning_classy_vision/component.py:trainer_dist \\
        --nnodes 2 \\
        --nproc_per_node 2 \\
        --rdzv_backend etcd \\
        --rdzv_endpoint etcd-server.default.svc.cluster.local:2379
    torchx 2021-10-18 18:46:55 INFO Launched app: kubernetes://torchx/default:cv-trainer-pa2a7qgee9zng
    torchx 2021-10-18 18:46:55 INFO AppStatus:
      msg: <NONE>
      ...
    torchx 2021-10-18 18:46:55 INFO Job URL: None

Note that the only difference compared to the local launch is the scheduler (``-s``)
and the ``--rdzv_backend``. etcd also works in the local case, but we used ``c10d``
since it does not require any additional setup. Note that this is a torchelastic
requirement, not a TorchX one. Read more about rendezvous
`here <https://pytorch.org/docs/stable/elastic/rendezvous.html>`_.

You can use the app handle to query the status or the logs of the job:

.. code:: shell-session

    # change the value to your unique app handle
    $ export APP_HANDLE=kubernetes://torchx/default:cv-trainer-pa2a7qgee9zng

    $ torchx status $APP_HANDLE

    torchx 2021-10-18 18:47:44 INFO AppDef:
      State: SUCCEEDED
      Num Restarts: -1
      Roles:
       *worker[0]:SUCCEEDED

To fetch the logs, run:

.. code:: shell-session

    $ torchx log $APP_HANDLE

.. note:: For GPU training, keep ``nproc_per_node`` equal to the number of GPUs on the host
          and change the resource requirements in the
          ``torchx/examples/apps/lightning_classy_vision/component.py:trainer_dist`` component.
          Modify ``resource_def`` to match the number of GPUs on your host.

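For illustration, on a host with 4 GPUs the resource definition might look roughly
like the following. The values are placeholders; check the actual ``resource_def``
in the example component for the fields it uses and pick values that match your host.

.. code:: python

    # Illustrative only: adjust cpu, gpu, and memory to your host.
    import torchx.specs as specs

    resource_def = specs.Resource(cpu=16, gpu=4, memMB=64 * 1024)
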
Components APIs