Commit ddd42a6

Kiuk Chung authored and facebook-github-bot committed
cleanup duplicated builtins (ddp), export the correct builtins for fb (#60)
Summary:
Pull Request resolved: #60

As title states...

New docs: {F625095641}

Reviewed By: d4l3k

Differential Revision: D29146570

fbshipit-source-id: 95355c616ef5b889f9ceed190b34e325255a41da
1 parent aaeee71 commit ddd42a6

File tree

20 files changed: +230, −164 lines


docs/source/components/distributed.rst

Lines changed: 1 addition & 1 deletion
@@ -4,4 +4,4 @@ Distributed
 .. automodule:: torchx.components.dist
 .. currentmodule:: torchx.components.dist

-.. autofunction:: torchx.components.dist.ddp.get_app_spec
+.. autofunction:: torchx.components.dist.ddp

docs/source/components/serve.rst

Lines changed: 1 addition & 1 deletion
@@ -5,5 +5,5 @@ Serve
 .. currentmodule:: torchx.components.serve


-.. currentmodule:: torchx.components.serve.serve
+.. currentmodule:: torchx.components.serve
 .. autofunction:: torchserve

docs/source/components/utils.rst

Lines changed: 2 additions & 1 deletion
@@ -4,4 +4,5 @@ Utils
 .. automodule:: torchx.components.utils
 .. currentmodule:: torchx.components.utils

-.. autofunction:: torchx.components.utils.echo.get_app_spec
+.. autofunction:: torchx.components.utils.echo
+.. autofunction:: torchx.components.utils.touch

docs/source/quickstart.rst

Lines changed: 85 additions & 11 deletions
@@ -17,15 +17,16 @@ For now lets take a look at the builtins

    $ torchx builtins
    Found <n> builtin configs:
-      1. echo
-      2. touch
+    ...
+      i. utils.echo
+      j. utils.touch
    ...

-Echo looks familiar and simple. Lets understand how to run ``echo``.
+Echo looks familiar and simple. Lets understand how to run ``utils.echo``.

 .. code-block:: shell-session

-   $ torchx run --scheduler local echo --help
+   $ torchx run --scheduler local utils.echo --help
    usage: torchx run echo [-h] [--msg MSG]

    Echos a message

@@ -38,7 +39,7 @@ We can see that it takes a ``--msg`` argument. Lets try running it locally

 .. code-block:: shell-session

-   $ torchx run --scheduler local echo --msg "hello world"
+   $ torchx run --scheduler local utils.echo --msg "hello world"

 .. note:: ``echo`` in this context is just an app spec. It is not the application
    logic itself but rather just the "job definition" for running `/bin/echo`.

@@ -58,16 +59,16 @@ This is just a regular python file where we define the app spec.

 .. code-block:: shell-session

-   $ touch ~/echo_torchx.py
+   $ touch ~/test.py

-Now copy paste the following into echo_torchx.py
+Now copy paste the following into test.py

 ::

    import torchx.specs as specs


-   def get_app_spec(num_replicas: int, msg: str = "hello world") -> specs.AppDef:
+   def echo(num_replicas: int, msg: str = "hello world") -> specs.AppDef:
       """
       Echos a message to stdout (calls /bin/echo)

@@ -83,8 +84,8 @@ Now copy paste the following into echo_torchx.py
             name="echo",
             entrypoint="/bin/echo",
             image="/tmp",
-            args=[f"replica #{specs.macros.replica_id}: msg"],
-            num_replicas=1,
+            args=[f"replica #{specs.macros.replica_id}: {msg}"],
+            num_replicas=num_replicas,
          )
       ],
    )

@@ -103,13 +104,86 @@ Now lets try running our custom ``echo``

 .. code-block:: shell-session

-   $ torchx run --scheduler local ~/echo_torchx.py --num_replicas 4 --msg "foobar"
+   $ torchx run --scheduler local ~/test.py:echo --num_replicas 4 --msg "foobar"

    replica #0: foobar
    replica #1: foobar
    replica #2: foobar
    replica #3: foobar

+Running on Other Images
+-----------------------------
+So far we've run ``utils.echo`` with ``image=/tmp``. This means that the
+``entrypoint`` we specified is relative to ``/tmp``. That did not matter for us
+since we specified an absolute path as the entrypoint (``entrypoint=/bin/echo``).
+Had we specified ``entrypoint=echo`` the local scheduler would have tried to invoke
+``/tmp/echo``.
+
+If you have a pre-built application binary, setting the image to a local directory is a
+quick way to validate the application and the ``specs.AppDef``. But its not all
+that useful if you want to run the application on a remote scheduler
+(see :ref:`Running On Other Schedulers`).
+
+.. note:: The ``image`` string in ``specs.Role`` is an identifier to a container image
+   supported by the scheduler. Refer to the scheduler documentation to find out
+   what container image is supported by the scheduler you want to use.
+
+For ``local`` scheduler we can see that it supports both a local directory
+and docker as the image:
+
+.. code-block:: shell-session
+
+   $ torchx runopts local
+
+   { 'image_type': { 'default': 'dir',
+                     'help': 'image type. One of [dir, docker]',
+                     'type': 'str'},
+   ... <omitted for brevity> ...
+
+
+.. note:: Before proceeding, you will need docker installed. If you have not done so already
+   follow the install instructions on: https://docs.docker.com/get-docker/
+
+Now lets try running ``echo`` from a docker container. Modify echo's ``AppDef``
+in ``~/test.py`` you created in the previous section to make the ``image="ubuntu:latest"``.
+
+::
+
+   import torchx.specs as specs
+
+
+   def echo(num_replicas: int, msg: str = "hello world") -> specs.AppDef:
+      """
+      Echos a message to stdout (calls /bin/echo)
+
+      Args:
+         num_replicas: number of copies (in parallel) to run
+         msg: message to echo
+
+      """
+      return specs.AppDef(
+         name="echo",
+         roles=[
+            specs.Role(
+               name="echo",
+               entrypoint="/bin/echo",
+               image="ubuntu:latest",  # IMAGE NOW POINTS TO THE UBUNTU DOCKER IMAGE
+               args=[f"replica #{specs.macros.replica_id}: {msg}"],
+               num_replicas=num_replicas,
+            )
+         ],
+      )
+
+Try running the echo app
+
+.. code-block:: shell-session
+
+   $ torchx run --scheduler local \
+       --scheduler_args image_type=docker \
+       ~/test.py:echo \
+       --num_replicas 4 \
+       --msg "foobar from docker!"
+
 Running On Other Schedulers
 -----------------------------
 So far we've launched components locally. Lets take a look at how to run this on
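The quickstart output above (``replica #0: foobar`` through ``replica #3: foobar``) comes from the scheduler substituting the ``specs.macros.replica_id`` macro into each replica's args before launch. A minimal, self-contained sketch of that substitution step; the macro token and helper name here are illustrative stand-ins, not TorchX internals:

```python
# Hypothetical sketch of per-replica macro substitution. The macro string and
# helper name are made up for illustration; they are not the TorchX implementation.
REPLICA_ID_MACRO = "${replica_id}"  # stand-in for specs.macros.replica_id


def materialize_args(args: list, num_replicas: int) -> list:
    """Expand the arg template once per replica, filling in the replica id."""
    return [
        [a.replace(REPLICA_ID_MACRO, str(rid)) for a in args]
        for rid in range(num_replicas)
    ]


for resolved in materialize_args([f"replica #{REPLICA_ID_MACRO}: foobar"], 4):
    print(resolved[0])  # replica #0: foobar ... replica #3: foobar
```

This is why every replica in the quickstart prints a distinct index even though the ``AppDef`` declares a single arg template.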

examples/pipelines/kfp/kfp_pipeline.py

Lines changed: 1 addition & 3 deletions
@@ -16,7 +16,6 @@
    something that can be used within KFP.
    """

-
 # %%
 # Input Arguments
 # ###############

@@ -105,7 +104,6 @@

 args: argparse.Namespace = parser.parse_args(sys.argv[1:])

-
 # %%
 # Creating the Components
 # #######################

@@ -166,7 +164,7 @@

 import os.path

-from torchx.components.serve.serve import torchserve
+from torchx.components.serve import torchserve

 serve_app: specs.AppDef = torchserve(
     model_path=os.path.join(args.output_path, "model.mar"),

torchx/cli/cmd_run.py

Lines changed: 1 addition & 0 deletions
@@ -195,6 +195,7 @@ def add_arguments(self, subparser: argparse.ArgumentParser) -> None:
             "--scheduler",
             type=str,
             help="Name of the scheduler to use",
+            default="default",
         )
         subparser.add_argument(
             "--scheduler_args",
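The one-line change above gives ``--scheduler`` a default, so ``torchx run`` no longer requires the flag. A small self-contained argparse sketch of the same pattern (the flag name and help text mirror the diff; the surrounding CLI plumbing is omitted):

```python
import argparse

# Minimal stand-in for the cmd_run argument setup touched by this commit.
parser = argparse.ArgumentParser(prog="torchx run")
parser.add_argument(
    "--scheduler",
    type=str,
    help="Name of the scheduler to use",
    default="default",  # the line added by this commit
)

# Flag omitted: argparse falls back to the default value.
print(parser.parse_args([]).scheduler)  # default
# Flag given: the explicit value wins over the default.
print(parser.parse_args(["--scheduler", "local"]).scheduler)  # local
```

With the default in place, a bare ``torchx run`` can resolve a scheduler without the user passing ``--scheduler`` every time.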

torchx/components/dist.py

Lines changed: 65 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,65 @@
1+
# Copyright (c) Facebook, Inc. and its affiliates.
2+
# All rights reserved.
3+
#
4+
# This source code is licensed under the BSD-style license found in the
5+
# LICENSE file in the root directory of this source tree.
6+
7+
"""
8+
Components for applications that run as distributed jobs. Many of the
9+
components in this section are simply topological, meaning that they define
10+
the layout of the nodes in a distributed setting and take the actual
11+
binaries that each group of nodes (``specs.Role``) runs.
12+
"""
13+
14+
from typing import Dict, Optional
15+
16+
import torchx.specs as specs
17+
from torchx.components.base import torch_dist_role
18+
19+
20+
def ddp(
21+
image: str,
22+
entrypoint: str,
23+
resource: Optional[str] = None,
24+
nnodes: int = 1,
25+
nproc_per_node: int = 1,
26+
base_image: Optional[str] = None,
27+
name: str = "test_name",
28+
role: str = "worker",
29+
env: Optional[Dict[str, str]] = None,
30+
*script_args: str,
31+
) -> specs.AppDef:
32+
"""
33+
Distributed data parallel style application (one role, multi-replica).
34+
35+
Args:
36+
image: container image.
37+
entrypoint: script or binary to run within the image.
38+
resource: Registered named resource.
39+
nnodes: Number of nodes.
40+
nproc_per_node: Number of processes per node.
41+
name: Name of the application.
42+
base_image: container base image (not required) .
43+
role: Name of the ddp role.
44+
script: Main script.
45+
env: Env variables.
46+
script_args: Script arguments.
47+
48+
Returns:
49+
specs.AppDef: Torchx AppDef
50+
"""
51+
52+
ddp_role = torch_dist_role(
53+
name=role,
54+
image=image,
55+
base_image=base_image,
56+
entrypoint=entrypoint,
57+
resource=resource or specs.NULL_RESOURCE,
58+
script_args=list(script_args),
59+
script_envs=env,
60+
nproc_per_node=nproc_per_node,
61+
nnodes=nnodes,
62+
max_restarts=0,
63+
).replicas(nnodes)
64+
65+
return specs.AppDef(name).of(ddp_role)
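The new ``ddp`` component is purely topological: it builds a single role and replicates it ``nnodes`` times. A pure-Python sketch of that shape, using illustrative stand-in dataclasses rather than the real ``torchx.specs`` types:

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Role:  # illustrative stand-in for specs.Role
    name: str
    image: str
    entrypoint: str
    num_replicas: int = 1


@dataclass
class AppDef:  # illustrative stand-in for specs.AppDef
    name: str
    roles: List[Role] = field(default_factory=list)


def ddp(image: str, entrypoint: str, nnodes: int = 1,
        name: str = "test_name", role: str = "worker") -> AppDef:
    """Mirror the component's layout: one role, replicated nnodes times."""
    worker = Role(name=role, image=image, entrypoint=entrypoint,
                  num_replicas=nnodes)
    return AppDef(name=name, roles=[worker])


app = ddp("ubuntu:latest", "/bin/torchrun_script", nnodes=2)
print(len(app.roles), app.roles[0].num_replicas)  # 1 2
```

The component itself carries no training logic; as the module docstring says, it only defines the node layout and takes the binary (``entrypoint``) that each replica of the role runs.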

torchx/components/dist/__init__.py

Lines changed: 0 additions & 12 deletions
This file was deleted.

torchx/components/dist/ddp.py

Lines changed: 0 additions & 15 deletions
This file was deleted.

torchx/components/distributed.py

Lines changed: 0 additions & 51 deletions
This file was deleted.
