Combining Distributed DataParallel with Distributed RPC Framework
==================================================================
**Authors**: `Pritam Damania <https://github.com/pritamdamania87>`_ and `Yi Wang <https://github.com/SciPioneer>`_

**Translator**: `dajeongPark-dev <https://github.com/dajeongPark-dev>`_


This tutorial uses a simple example to demonstrate how you can combine
`DistributedDataParallel <https://pytorch.org/docs/stable/nn.html#torch.nn.parallel.DistributedDataParallel>`__ (DDP)
with the `Distributed RPC framework <https://pytorch.org/docs/master/rpc.html>`__
to train a simple model that uses both distributed data parallelism and
distributed model parallelism. Source code of the example can be found `here <https://github.com/pytorch/examples/tree/master/distributed/rpc/ddp_rpc>`__.

The previous tutorials,
`Getting Started With Distributed Data Parallel <https://tutorials.pytorch.kr/intermediate/ddp_tutorial.html>`__
and `Getting Started with Distributed RPC Framework <https://tutorials.pytorch.kr/intermediate/rpc_tutorial.html>`__,
described how to perform distributed data parallel and distributed model
parallel training respectively. However, there are several training paradigms
where you might want to combine these two techniques. For example:

1) If we have a model with a sparse part (a large embedding table) and a dense
   part (FC layers), we might want to put the embedding table on a parameter
   server and replicate the FC layers across multiple trainers using `DistributedDataParallel <https://pytorch.org/docs/stable/nn.html#torch.nn.parallel.DistributedDataParallel>`__.
   The `Distributed RPC framework <https://pytorch.org/docs/master/rpc.html>`__
   can then be used to perform embedding lookups on the parameter server.
2) Enable hybrid parallelism as described in the `PipeDream <https://arxiv.org/abs/1806.03377>`__ paper.
   We can use the `Distributed RPC framework <https://pytorch.org/docs/master/rpc.html>`__
   to pipeline stages of the model across multiple workers and replicate each
   stage (if needed) using `DistributedDataParallel <https://pytorch.org/docs/stable/nn.html#torch.nn.parallel.DistributedDataParallel>`__.

|
In this tutorial we will cover case 1 mentioned above. We have a total of 4
workers in our setup as follows:


1) 1 Master, which is responsible for creating an embedding table
   (nn.EmbeddingBag) on the parameter server. The master also drives the
   training loop on the two trainers.
2) 1 Parameter Server, which basically holds the embedding table in memory and
   responds to RPCs from the Master and Trainers.
3) 2 Trainers, which store an FC layer (nn.Linear) that is replicated amongst
   themselves using `DistributedDataParallel <https://pytorch.org/docs/stable/nn.html#torch.nn.parallel.DistributedDataParallel>`__.
   The trainers are also responsible for executing the forward pass, backward
   pass and optimizer step.

|
The entire training process is executed as follows:

1) The master creates a `RemoteModule <https://pytorch.org/docs/master/rpc.html#remotemodule>`__
   that holds an embedding table on the Parameter Server.
2) The master then kicks off the training loop on the trainers and passes the
   remote module to the trainers.
3) The trainers create a ``HybridModel`` which first performs an embedding lookup
   using the remote module provided by the master and then executes the
   FC layer which is wrapped inside DDP.
4) The trainer executes the forward pass of the model and uses the loss to
   execute the backward pass using `Distributed Autograd <https://pytorch.org/docs/master/rpc.html#distributed-autograd-framework>`__.
5) As part of the backward pass, the gradients for the FC layer are computed
   first and synced to all trainers via allreduce in DDP.
6) Next, Distributed Autograd propagates the gradients to the parameter server,
   where the gradients for the embedding table are updated.
7) Finally, the `Distributed Optimizer <https://pytorch.org/docs/master/rpc.html#module-torch.distributed.optim>`__ is used to update all the parameters.


.. attention::

    You should always use `Distributed Autograd <https://pytorch.org/docs/master/rpc.html#distributed-autograd-framework>`__
    for the backward pass if you're combining DDP and RPC.

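
To make the attention note concrete, here is a minimal sketch of that pattern.
It is not the example's actual code: ``model``, ``criterion``, ``opt`` (a
``DistributedOptimizer``) and the input tensors are assumed to be set up as shown
in the trainer sections later in this tutorial, and the snippet must run on a
worker whose RPC framework has already been initialized.

.. code:: python

    import torch.distributed.autograd as dist_autograd

    # Run forward and backward inside a distributed autograd context so that
    # gradients can flow across RPC boundaries (trainer -> parameter server)
    # as well as through DDP's allreduce.
    with dist_autograd.context() as context_id:
        output = model(indices, offsets)             # remote embedding lookup + DDP-wrapped FC
        loss = criterion(output, target)
        dist_autograd.backward(context_id, [loss])   # instead of loss.backward()
        opt.step(context_id)                         # DistributedOptimizer uses the same context
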
Now, let's go through each part in detail. First, we need to set up all of our
workers before we can perform any training. We create 4 processes such that
ranks 0 and 1 are our trainers, rank 2 is the master and rank 3 is the
parameter server.

We initialize the RPC framework on all 4 workers using the TCP init_method.
Once RPC initialization is done, the master creates a remote module that holds an `EmbeddingBag <https://pytorch.org/docs/master/generated/torch.nn.EmbeddingBag.html>`__
layer on the Parameter Server using `RemoteModule <https://pytorch.org/docs/master/rpc.html#torch.distributed.nn.api.remote_module.RemoteModule>`__.
The master then loops through each trainer and kicks off the training loop by
calling ``_run_trainer`` on each trainer using `rpc_async <https://pytorch.org/docs/master/rpc.html#torch.distributed.rpc.rpc_async>`__.
Finally, the master waits for all training to finish before exiting.

The trainers first initialize a ``ProcessGroup`` for DDP with world_size=2
(for the two trainers) using `init_process_group <https://pytorch.org/docs/stable/distributed.html#torch.distributed.init_process_group>`__.
Next, they initialize the RPC framework using the TCP init_method. Note that
the ports are different in RPC initialization and ProcessGroup initialization.
This is to avoid port conflicts between the initialization of both frameworks.
Once the initialization is done, the trainers just wait for the ``_run_trainer``
RPC from the master.

The parameter server just initializes the RPC framework and waits for RPCs from
the trainers and master.


.. literalinclude:: ../advanced_source/rpc_ddp_tutorial/main.py
    :language: py
    :start-after: BEGIN run_worker
    :end-before: END run_worker

Before we discuss details of the Trainer, let's introduce the ``HybridModel`` that
the trainer uses. As described below, the ``HybridModel`` is initialized using a
remote module that holds an embedding table (``remote_emb_module``) on the parameter server and the ``device``
to use for DDP. The initialization of the model wraps an
`nn.Linear <https://pytorch.org/docs/master/generated/torch.nn.Linear.html>`__
layer inside DDP to replicate and synchronize this layer across all trainers.

The forward method of the model is pretty straightforward. It performs an
embedding lookup on the parameter server using RemoteModule's ``forward``
and passes its output onto the FC layer.


.. literalinclude:: ../advanced_source/rpc_ddp_tutorial/main.py
    :language: py
    :start-after: BEGIN hybrid_model
    :end-before: END hybrid_model

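
As a purely illustrative usage sketch of the class above: ``remote_emb_module``
is the remote module created by the master, ``rank`` is the trainer's rank (used
as its device), and the input values are made up, following ``nn.EmbeddingBag``'s
indices/offsets convention.

.. code:: python

    import torch

    # Remote embedding lookup + locally replicated, DDP-wrapped FC layer.
    model = HybridModel(remote_emb_module, rank)

    indices = torch.LongTensor([1, 2, 4, 5, 4, 3, 2, 9])  # 8 lookups into the embedding table
    offsets = torch.LongTensor([0, 4])                     # two bags: indices[0:4] and indices[4:]
    output = model(indices, offsets)                       # RPC to the parameter server, then local FC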
|
Next, let's look at the setup on the Trainer. The trainer first creates the
``HybridModel`` described above using a remote module that holds the embedding table on the
parameter server and its own rank.

Now, we need to retrieve a list of RRefs to all the parameters that we would
like to optimize with `DistributedOptimizer <https://pytorch.org/docs/master/rpc.html#module-torch.distributed.optim>`__.
To retrieve the parameters for the embedding table from the parameter server,
we can call RemoteModule's `remote_parameters <https://pytorch.org/docs/master/rpc.html#torch.distributed.nn.api.remote_module.RemoteModule.remote_parameters>`__,
which basically walks through all the parameters for the embedding table and returns
a list of RRefs. The trainer calls this method on the parameter server via RPC
to receive a list of RRefs to the desired parameters. Since the
DistributedOptimizer always takes a list of RRefs to parameters that need to
be optimized, we need to create RRefs even for the local parameters of our
FC layers. This is done by walking ``model.fc.parameters()``, creating an RRef for
each parameter and appending it to the list returned from ``remote_parameters()``.
Note that we cannot use ``model.parameters()``,
because it will recursively call ``model.remote_emb_module.parameters()``,
which is not supported by ``RemoteModule``.

Finally, we create our DistributedOptimizer using all the RRefs and define a
CrossEntropyLoss function.

.. literalinclude:: ../advanced_source/rpc_ddp_tutorial/main.py
    :language: py
    :start-after: BEGIN setup_trainer
    :end-before: END setup_trainer

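
The central pattern in the block above is gathering a single flat list of
parameter RRefs. The sketch below isolates it; the optimizer class and learning
rate are arbitrary choices for illustration, and ``model`` is the trainer's
``HybridModel``.

.. code:: python

    import torch
    from torch.distributed.optim import DistributedOptimizer
    from torch.distributed.rpc import RRef

    # RRefs to the embedding table's parameters, which live on the parameter server.
    model_parameter_rrefs = model.remote_emb_module.remote_parameters()

    # DistributedOptimizer only accepts RRefs, so wrap the local FC parameters too.
    for param in model.fc.parameters():
        model_parameter_rrefs.append(RRef(param))

    opt = DistributedOptimizer(torch.optim.SGD, model_parameter_rrefs, lr=0.05)
    criterion = torch.nn.CrossEntropyLoss()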
|
Now we're ready to introduce the main training loop that is run on each trainer.
``get_next_batch`` is just a helper function to generate random inputs and
targets for training (a possible shape for such a helper is sketched after the
list below). We run the training loop for multiple epochs and for each
batch:

1) Set up a `Distributed Autograd Context <https://pytorch.org/docs/master/rpc.html#torch.distributed.autograd.context>`__
   for Distributed Autograd.
2) Run the forward pass of the model and retrieve its output.
3) Compute the loss based on our outputs and targets using the loss function.
4) Use Distributed Autograd to execute a distributed backward pass using the loss.
5) Finally, run a Distributed Optimizer step to optimize all the parameters.
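
The helper itself is not reproduced here; the version below is only an
illustrative sketch with made-up sizes (a 10-row embedding table, two bags of
four indices each, 16 output classes), not the function defined in ``main.py``.

.. code:: python

    import torch

    def get_next_batch(device):
        # Yield a few random (indices, offsets, target) batches for nn.EmbeddingBag.
        for _ in range(10):
            indices = torch.randint(0, 10, (8,))            # 8 random rows of the embedding table
            offsets = torch.LongTensor([0, 4])              # two bags: indices[0:4] and indices[4:]
            target = torch.randint(0, 16, (2,)).to(device)  # one class label per bag
            yield indices, offsets, target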

.. literalinclude:: ../advanced_source/rpc_ddp_tutorial/main.py
    :language: py
    :start-after: BEGIN run_trainer
    :end-before: END run_trainer

Source code for the entire example can be found `here <https://github.com/pytorch/examples/tree/master/distributed/rpc/ddp_rpc>`__.