Commit 7df74a5

Merge develop
2 parents: 1083e99 + d82453f

4 files changed (+230 -16 lines)
Lines changed: 6 additions & 6 deletions
@@ -1,22 +1,22 @@
 # Distributed Training with NCCL2

 We design a pattern that can enable training with `ParallelExecutor` and
-using [NCCL2](https://developer.nvidia.com/nccl) as it's collective
+use [NCCL2](https://developer.nvidia.com/nccl) as it's collective
 communication library.

 In `ParallelExecutor` we can use `AllReduce` or `Reduce` and `Broadcast`
 to do multi GPU training. And if we initialize NCCL2 communicators as
 ranks in a distributed environment, we can simply run the `ParallelExecutor`
 as a distributed program! The only thing that may be different than in
 the single node version is that we need to broadcast the NCCL unique ID
-to all the nodes, and initialize communicators using that ID, so NCCL2
-will know each other as ranks.
+to all the nodes and initialize communicators using that ID, so NCCL2
+can know each other as ranks.

 To achieve this feature, we introduce a new operator: `gen_nccl_id` op,
 so we are ***not*** "bind to" running NCCL2 with MPI, we can run it in
-what ever platform you like.
+whatever platform you like.

-It have two running modes:
+It has two running modes:

 1. Generate and broadcast mode, which should be used on trainer 0;
 1. Listen and fetch mode, which should be used on trainers other than 0.
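The two modes above amount to a small out-of-band exchange of the NCCL unique ID before any communicator is built. A minimal sketch of that exchange, using plain Python sockets with illustrative names (the actual `gen_nccl_id` op performs this step inside the Fluid runtime):

```python
import pickle
import socket


def broadcast_unique_id(unique_id, trainer_endpoints):
    # Trainer 0, "generate and broadcast" mode: push the ID to every peer.
    for endpoint in trainer_endpoints:
        host, port = endpoint.split(":")
        with socket.create_connection((host, int(port))) as conn:
            conn.sendall(pickle.dumps(unique_id))


def fetch_unique_id(listen_port):
    # Trainers other than 0, "listen and fetch" mode: wait for trainer 0's ID.
    # Assumes the pickled ID fits in a single recv() call.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as srv:
        srv.bind(("0.0.0.0", listen_port))
        srv.listen(1)
        conn, _ = srv.accept()
        with conn:
            return pickle.loads(conn.recv(4096))
```

Every rank then initializes its communicators from that shared ID together with its global rank number.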
@@ -29,7 +29,7 @@ initialize NCCL communicator objects.
 <img src="src/ncc2_design.png">

 The above figure indicates the general process when training with NCCL2
-distributed. Each trainer have the number of communicators equal to the
+distributed. Each trainer has the number of communicators equal to the
 number of GPUs, but the ranks should match the global ranks number: here
 we have total 8 GPUs, so `nranks==8`, for each trainer, the ranks should
 be from 0 ~ 3 on trainer 0 and 4 ~ 7 on trainer 1.
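A worked example of that rank layout, for two trainers with four GPUs each (variable names are illustrative, not framework API):

```python
gpus_per_trainer = 4
ntrainers = 2
nranks = ntrainers * gpus_per_trainer  # 8, as in the figure above

for trainer_id in range(ntrainers):
    ranks = [trainer_id * gpus_per_trainer + local_gpu
             for local_gpu in range(gpus_per_trainer)]
    print("trainer %d -> ranks %s" % (trainer_id, ranks))
# trainer 0 -> ranks [0, 1, 2, 3]
# trainer 1 -> ranks [4, 5, 6, 7]
```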

doc/fluid/howto/cluster/nccl2_rdma_training.md

Lines changed: 10 additions & 10 deletions
@@ -1,12 +1,12 @@
 # Distributed Training with NCCL2 and RDMA

-When doing distributed multi-GPU training, network bandwith often becomes the
-bottle neck. We introduce a way to use NCCL2 to do such training job to
-achieve best performace.
+When doing distributed multi-GPU training, network bandwidth often becomes the
+bottleneck. We introduce a way to use NCCL2 to do such training job to
+achieve best performance.

-## Prepare Hardwares with RDMA and Multiple GPUs
+## Prepare Hardware with RDMA and Multiple GPUs

-I'm using two Linux servers each of them is installed with 8 GPUs and
+I'm using two Linux servers each of them installed with 8 GPUs and
 one 100Gb RDMA card.
 Base environment is:

@@ -25,15 +25,15 @@ In general, the steps including:
 1. Use docker to run tests and make sure GPUs and RDMA can work inside
    the container.

-I'll ommit section "Install GPU drivers" because we can find it easily
+I'll omit the section "Install GPU drivers" because we can find it easily
 somewhere else.

 ### Install RDMA drivers

 For my case, I've got two machines with device
 "Mellanox Technologies MT27700 Family [ConnectX-4]" installed. The OS was
 "CentOS 7.4" and I updated the kernel to version 4.4 so that docker can
-work with latest overlay2 filesystem.
+work with the latest overlay2 filesystem.

 ***NOTE: before you start, make sure you have a way to get a console
 of the server other than ssh because we may need to re-configure the
@@ -45,22 +45,22 @@ network device.***
 1. Run `./mlnxofedinstall --add-kernel-support` in the software package.
 1. Run `/etc/init.d/openibd restart` to make everything work, note that
    this operation may cause the network goes down if you are using this
-   RDMA device as default network device and use ssh to login the server.
+   RDMA device as default network device and use ssh to log in the server.
 1. Re-configure the network interface, for example:
    `ifconfig eth2 192.168.16.30/20 up`, then add routes if needed:
    `ip route add default via 192.168.16.1 dev eth2`.
 1. Do the same thing on the other node.
 1. Use `ping` to test if the two nodes have typical ICMP connection.
 1. Use either `udaddy` or `ib_write_bw` to test the network connection is
-   ready and have the desired bandwith.
+   ready and have the desired bandwidth.

 ### Prepare Docker Image to Run RDMA Programs

 1. Build a docker image using cuda base image like: `nvidia/cuda:8.0-cudnn5-devel-ubuntu16.04` and install paddlepaddle whl
    package in it.
 1. Start a docker container and mount GPU driver libs into it (you can
    skip this step if you are using nvidia-docker).
-1. Mount RDMA dirvers and libs into the docker image (see below section),
+1. Mount RDMA drivers and libs into the docker image (see below section),
    also `udaddy` and `ib_write_bw` if needed.
 1. Mount GPU devices and RDMA devices into the container using `--device`
    or just use privileged mode `--privileged`.
Lines changed: 81 additions & 0 deletions
@@ -0,0 +1,81 @@
+# Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import os
+import time
+import unittest
+from multiprocessing import Process
+import signal
+
+import numpy
+
+import paddle.fluid as fluid
+import paddle.fluid.layers as layers
+from paddle.fluid.layers.io import ListenAndServ
+from paddle.fluid.layers.io import Recv
+from paddle.fluid.layers.io import Send
+
+from paddle.fluid.transpiler.details import program_to_code
+
+
+class TestProgram2Code(unittest.TestCase):
+    def test_print(self):
+        place = fluid.CPUPlace()
+        self.init_serv(place)
+        self.init_client(place, 9123)
+
+    def init_serv(self, place):
+        main = fluid.Program()
+
+        with fluid.program_guard(main):
+            serv = ListenAndServ("127.0.0.1:0", ["X"], optimizer_mode=False)
+            with serv.do():
+                out_var = main.global_block().create_var(
+                    name="scale_0.tmp_0",
+                    persistable=True,
+                    dtype="float32",
+                    shape=[32, 32])
+                x = layers.data(
+                    shape=[32, 32],
+                    dtype='float32',
+                    name="X",
+                    append_batch_size=False)
+                fluid.initializer.Constant(value=1.0)(x, main.global_block())
+                layers.scale(x=x, scale=10.0, out=out_var)
+
+        program_to_code(main)
+
+    def init_client(self, place, port):
+        main = fluid.Program()
+        with fluid.program_guard(main):
+            x = layers.data(
+                shape=[32, 32],
+                dtype='float32',
+                name='X',
+                append_batch_size=False)
+            fluid.initializer.Constant(value=2.3)(x, main.global_block())
+            get_var = main.global_block().create_var(
+                name="scale_0.tmp_0",  # server side var
+                dtype="float32",
+                persistable=False,
+                shape=[32, 32])
+            fluid.initializer.Constant(value=2.3)(get_var, main.global_block())
+            Send("127.0.0.1:%d" % port, [x])
+            o = Recv("127.0.0.1:%d" % port, [get_var])
+
+        program_to_code(main)
+
+
+if __name__ == "__main__":
+    unittest.main()

python/paddle/fluid/transpiler/details/program_utils.py

Lines changed: 133 additions & 0 deletions
@@ -16,6 +16,9 @@

 import six

+from paddle.fluid import core
+import paddle
+

 def delete_ops(block, ops):
     try:
@@ -39,3 +42,133 @@ def find_op_by_output_arg(block, arg_name):
         if arg_name in op.output_arg_names:
             return index
     return -1
+
+
+def get_indent_space(indent, space_num=4):
+    ret = ""
+    for i in range(0, indent * space_num):
+        ret += " "
+
+    return ret
+
+
+def variable_to_code(var):
+    """
+    Get readable codes of fluid variable.
+
+    Args:
+        var: A fluid variable.
+
+    Returns:
+        string: The formatted string.
+    """
+
+    var_str = "{name} : fluid.{type}.shape{shape}.astype({dtype})".\
+        format(i="{", e="}", name=var.name, type=var.type, shape=var.shape, dtype=var.dtype)
+
+    if type(var) == paddle.fluid.framework.Parameter:
+        if var.trainable:
+            var_str = "trainable parameter " + var_str
+        else:
+            var_str = "parameter " + var_str
+    else:
+        var_str = "var " + var_str
+
+    if var.persistable:
+        var_str = "persist " + var_str
+
+    return var_str
+
+
+def op_to_code(op):
+    """
+    Get readable codes of fluid operator.
+
+    Args:
+        op: A fluid operator.
+
+    Returns:
+        string: The formatted string.
+    """
+
+    outputs_str = "{"
+    for i in range(0, len(op.output_names)):
+        outputs_str += "{name}=".format(name=op.output_names[i])
+        o = op.output(op.output_names[i])
+        outputs_str += "{value}".format(value=o)
+        if i != len(op.output_names) - 1:
+            outputs_str += ", "
+    outputs_str += "}"
+
+    inputs_str = "{"
+    for i in range(0, len(op.input_names)):
+        inputs_str += "{name}=".format(name=op.input_names[i])
+        o = op.input(op.input_names[i])
+        inputs_str += "{value}".format(value=o)
+
+        if i != len(op.input_names) - 1:
+            inputs_str += ", "
+    inputs_str += "}"
+
+    attrs_str = ""
+    for i in range(0, len(op.attr_names)):
+        name = op.attr_names[i]
+
+        attr_type = op.desc.attr_type(name)
+        if attr_type == core.AttrType.BLOCK:
+            a = "{name} = block[{value}]".format(
+                name=name, type=attr_type, value=op.block_attr_id(name))
+            attrs_str += a
+            continue
+
+        if attr_type == core.AttrType.BLOCKS:
+            a = "{name} = blocks{value}".format(
+                name=name, type=attr_type, value=op.blocks_attr_ids(name))
+            attrs_str += a
+            continue
+
+        a = "{name} = {value}".format(
+            name=name, type=attr_type, value=op.desc.attr(name))
+        attrs_str += a
+        if i != len(op.attr_names) - 1:
+            attrs_str += ", "
+
+    if outputs_str != "{}":
+        op_str = "{outputs} = {op_type}(inputs={inputs}, {attrs})".\
+            format(outputs=outputs_str, op_type=op.type, inputs=inputs_str, attrs=attrs_str)
+    else:
+        op_str = "{op_type}(inputs={inputs}, {attrs})".\
+            format(op_type=op.type, inputs=inputs_str, attrs=attrs_str)
+    return op_str
+
+
+def program_to_code(prog):
+    """
+    Print readable codes of fluid program.
+
+    Args:
+        prog : A fluid program.
+
+    An example result like below:
+    https://github.com/PaddlePaddle/Paddle/pull/12673
+    """
+    indent = 0
+    block_idx = 0
+    for block in prog.blocks:
+        print("{0}{1} // block {2}".format(
+            get_indent_space(indent), '{', block_idx))
+        indent += 1
+        # sort all vars
+        all_vars = sorted(six.iteritems(block.vars), key=lambda x: x[0])
+        for var in all_vars:
+            print("{}{}".format(
+                get_indent_space(indent), variable_to_code(var[1])))
+
+        if len(all_vars) > 0:
+            print("")
+
+        for op in block.ops:
+            print("{}{}".format(get_indent_space(indent), op_to_code(op)))
+        indent -= 1
+        print("{0}{1}".format(get_indent_space(indent), '}'))
+        block_idx += 1
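For reference, a minimal way to exercise the new `program_to_code` helper outside the unit test above; the program built here mirrors the one in the test, and the scale value and tensor shape are arbitrary:

```python
import paddle.fluid as fluid
import paddle.fluid.layers as layers
from paddle.fluid.transpiler.details import program_to_code

prog = fluid.Program()
with fluid.program_guard(prog):
    x = layers.data(
        name="X", shape=[32, 32], dtype="float32", append_batch_size=False)
    layers.scale(x=x, scale=10.0)

# Prints each block wrapped in braces: variables first (parameters and
# persistable vars are labelled), then one line per op with its inputs,
# outputs, and attributes.
program_to_code(prog)
```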
