# LICENSE file in the root directory of this source tree.

"""
TorchX helps you run your distributed trainer jobs. Check out :py:mod:`torchx.components.train`
for an example of running a single trainer job. Here we will be using
the same :ref:`examples_apps/lightning_classy_vision/train:Trainer App Example`,
but will run it in a distributed manner.

TorchX uses `Torch distributed run <https://pytorch.org/docs/stable/elastic/run.html>`_ to launch user processes
and expects that user applications are written in a
`Distributed data parallel <https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html>`_
manner.
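
Concretely, a DDP-style application initializes the default process group and wraps its model in
``DistributedDataParallel``; ``torch.distributed.run`` (launched for you by TorchX) provides the rank and
rendezvous environment variables. Below is a minimal, illustrative sketch only, not the Lightning trainer
used in this example; the model, backend, and data are placeholders:

.. code:: python

    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP


    def main() -> None:
        # torch.distributed.run sets RANK, LOCAL_RANK, WORLD_SIZE, MASTER_ADDR
        # and MASTER_PORT for every worker, so env:// initialization just works.
        dist.init_process_group(backend="gloo")  # use "nccl" and cuda devices on GPU hosts

        model = torch.nn.Linear(10, 10)  # placeholder model
        ddp_model = DDP(model)           # gradients are all-reduced across workers

        optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)
        inputs, targets = torch.randn(8, 10), torch.randn(8, 10)  # placeholder batch

        loss = torch.nn.functional.mse_loss(ddp_model(inputs), targets)
        loss.backward()
        optimizer.step()

        dist.destroy_process_group()


    if __name__ == "__main__":
        main()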

The component defines how the user application is launched and torchx will take care of translating this into
scheduler-specific definitions.

.. note:: Follow :ref:`examples_apps/lightning_classy_vision/component:Prerequisites of running examples`
   before running the examples.


Single node, multiple trainers
==============================

Try launching a single node, multiple trainers example:

.. code:: shell-session

    $ torchx run \\
      -s local_cwd \\
      ./torchx/examples/apps/lightning_classy_vision/component.py:trainer_dist \\
      --nnodes 1 \\
      --nproc_per_node 2 \\
      --rdzv_backend c10d --rdzv_endpoint localhost:29500


.. note:: Use ``torchx runopts`` to see the available schedulers.

``./torchx/examples/apps/lightning_classy_vision/component.py:trainer_dist`` is a reference to the component
function :ref:`examples_apps/lightning_classy_vision/component:Distributed Trainer Component`.
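
If you are curious what a component function looks like under the hood, here is a rough, hypothetical sketch
built on :py:mod:`torchx.specs` (this is not the actual ``trainer_dist`` implementation; the image and script
names are placeholders, and ``Role``/``Resource`` arguments may differ slightly between TorchX versions):

.. code:: python

    import torchx.specs as specs


    def trainer_dist_sketch(
        image: str = "<your trainer image>",  # placeholder image name
        nnodes: int = 1,
        nproc_per_node: int = 1,
        rdzv_backend: str = "c10d",
        rdzv_endpoint: str = "localhost:29500",
    ) -> specs.AppDef:
        # A component is just a function that returns an AppDef describing
        # what to run; the scheduler-specific translation is done by TorchX.
        return specs.AppDef(
            name="cv-trainer",
            roles=[
                specs.Role(
                    name="worker",
                    image=image,
                    entrypoint="python",
                    args=[
                        "-m", "torch.distributed.run",
                        f"--nnodes={nnodes}",
                        f"--nproc_per_node={nproc_per_node}",
                        f"--rdzv_backend={rdzv_backend}",
                        f"--rdzv_endpoint={rdzv_endpoint}",
                        "<your training script>.py",  # placeholder entry script
                    ],
                    num_replicas=nnodes,
                    resource=specs.Resource(cpu=2, gpu=0, memMB=4096),
                )
            ],
        )

TorchX resolves ``component.py:trainer_dist`` to such a function, calls it with the CLI arguments, and submits
the resulting ``AppDef`` to the selected scheduler.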


.. note:: TorchX supports a Docker scheduler via ``-s local_docker``, which currently works only for a single
   node with multiple processes due to issues `286 <https://github.com/pytorch/torchx/issues/286>`_ and
   `287 <https://github.com/pytorch/torchx/issues/287>`_.


Multiple nodes, multiple trainers
=================================

It is simple to launch and manage a distributed trainer with torchx. Let's try launching a distributed
trainer with two nodes and two workers per node:

.. code:: shell-session

    # Launch the trainer using torchx
    $ torchx run \\
      -s local_cwd \\
      ./torchx/examples/apps/lightning_classy_vision/component.py:trainer_dist \\
      --nnodes 2 \\
      --nproc_per_node 2 \\
      --rdzv_backend c10d \\
      --rdzv_endpoint localhost:29500


.. note:: The command above will only work on hosts without GPUs!


This will run 4 trainers in total. When the ``local_docker`` scheduler is used, each replica runs in its own
container, and ``local_docker`` assigns a ``hostname`` to each container using the pattern
``${APP_NAME}-${ROLE_NAME}-${REPLICA_ID}``.


TorchX also supports the ``kubernetes`` scheduler, which allows you to execute distributed jobs on your kubernetes cluster.


.. note:: Make sure that you install the necessary dependencies listed under :ref:`schedulers/kubernetes:Prerequisites`
   before executing the job.


The following command runs 2 pods on the kubernetes cluster; each pod will occupy a single GPU:


.. code:: shell-session

    $ torchx run -s kubernetes \\
      --scheduler_args namespace=default,queue=default \\
      ./torchx/examples/apps/lightning_classy_vision/component.py:trainer_dist \\
      --nnodes 2 \\
      --nproc_per_node 1 \\
      --rdzv_endpoint etcd-server.default.svc.cluster.local:2379


The command above launches a distributed train job in the kubernetes ``default`` namespace using the volcano
``default`` queue. In this example we used the ``etcd`` rendezvous instead of the ``c10d`` rendezvous.
Using an ``etcd`` rendezvous backed by an ``etcd`` server is a best practice for peer discovery
in distributed jobs. Read more about
`rendezvous <https://pytorch.org/docs/stable/elastic/rendezvous.html>`_.


.. note:: For GPU training, keep ``nproc_per_node`` equal to the number of GPUs on the host and
   change the resource requirements in the ``torchx/examples/apps/lightning_classy_vision/component.py:trainer_dist``
   component. Modify ``resource_def`` to match the number of GPUs that your host has.
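
As a hedged illustration (the values below are examples only, not the defaults used by ``trainer_dist``),
a resource requirement in a component is expressed with :py:class:`torchx.specs.Resource`:

.. code:: python

    import torchx.specs as specs

    # Example only: request 8 CPUs, 2 GPUs and 16 GB of RAM per replica.
    # Set gpu= to the number of GPUs available on each of your hosts.
    resource_def = specs.Resource(cpu=8, gpu=2, memMB=16 * 1024)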

The command should produce the following output:

.. code:: bash

    torchx 2021-10-18 18:46:55 INFO Launched app: kubernetes://torchx/default:cv-trainer-pa2a7qgee9zng
    torchx 2021-10-18 18:46:55 INFO AppStatus:
      msg: <NONE>
      ...

You can use the job URL (the app handle) to query the status or logs of the job:

.. code:: shell-session

    # Change the value to your unique app handle
    $ export APP_HANDLE=kubernetes://torchx/default:cv-trainer-pa2a7qgee9zng

    $ torchx status $APP_HANDLE

    torchx 2021-10-18 18:47:44 INFO AppDef:
      State: SUCCEEDED
      Num Restarts: -1
      Roles:
       *worker[0]:SUCCEEDED

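The same status query can also be made from Python. This is a hedged sketch that assumes the ``torchx.runner``
API (``get_runner`` and ``Runner.status``); verify the calls against your installed TorchX version:

.. code:: python

    from torchx.runner import get_runner

    # Replace with the app handle printed by `torchx run`.
    app_handle = "kubernetes://torchx/default:cv-trainer-pa2a7qgee9zng"

    runner = get_runner()
    status = runner.status(app_handle)  # AppStatus of the submitted job
    print(status)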

Try fetching the job logs:

.. code:: shell-session

    $ torchx log $APP_HANDLE


Builtin distributed components
==============================

In the examples above we used custom components to launch user applications. It is not always the case that
users need to write their own components.

TorchX comes with a set of builtin components that describe typical execution patterns.


dist.ddp

We can run this example on the desktop with four processes using the following command:

.. code:: shell-session

    $ torchx run -s local_cwd dist.ddp --entrypoint main.py --nproc_per_node 4


Single trainer on kubernetes cluster
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

We can execute it on the kubernetes cluster:

.. code:: shell-session

    $ torchx run -s kubernetes \\
      --scheduler_args namespace=default,queue=default \\
      dist.ddp --entrypoint main.py --nproc_per_node 4


Components APIs
-----------------
"""

from typing import Any, Dict, Optional