Commit a529d79
Merge branch 'develop' of https://github.com/PaddlePaddle/Paddle into gen_nccl_id_op
2 parents: 82c61db + b708ec0

File tree: 156 files changed (+4732, -2146 lines)


.travis.yml

Lines changed: 1 addition & 21 deletions

```diff
@@ -16,34 +16,14 @@ env:
     - JOB=check_style
     - JOB=build_android
 addons:
-  apt:
-    packages:
-      - gcc-4.8
-      - g++-4.8
-      - git
-      - build-essential
-      - python
-      - python-pip
-      - python2.7-dev
-      - python-wheel
-      - libboost-dev
-      - curl
-      - swig
-      - graphviz
-      - clang-format-3.8
-      - automake
-      - libtool
-      - ccache
   ssh_known_hosts: 13.229.163.131
 before_install:
-  - sudo pip install -r $TRAVIS_BUILD_DIR/python/requirements.txt
-  - sudo pip install wheel sphinx==1.5.6 recommonmark sphinx-rtd-theme==0.1.9 virtualenv pre-commit
   - |
     function timeout() { perl -e 'alarm shift; exec @ARGV' "$@"; }
 script:
   - |
     # 43min timeout
-    if [[ "$JOB" != "doc" ]]; then timeout 2580 paddle/scripts/paddle_docker_build.sh ${JOB}; else paddle/scripts/paddle_build.sh ${JOB}; fi;
+    paddle/scripts/paddle_docker_build.sh ${JOB}
     if [ $? -eq 0 ] || [ $? -eq 142 ]; then true; else exit 1; fi;
   - |
     if [[ "$JOB" != "doc" ]]; then exit 0; fi;
```

README.md

Lines changed: 4 additions & 4 deletions

```diff
@@ -75,19 +75,19 @@ We provide [English](http://www.paddlepaddle.org/docs/develop/documentation/en/g
 
   You might want to start from this online interactive book that can run in a Jupyter Notebook.
 
-- [Distributed Training](http://www.paddlepaddle.org/docs/develop/documentation/en/howto/usage/cluster/cluster_train_en.html)
+- [Distributed Training](http://www.paddlepaddle.org/docs/develop/documentation/en/howto/cluster/index_en.html)
 
   You can run distributed training jobs on MPI clusters.
 
-- [Distributed Training on Kubernetes](http://www.paddlepaddle.org/docs/develop/documentation/en/howto/usage/cluster/k8s_en.html)
+- [Distributed Training on Kubernetes](http://www.paddlepaddle.org/docs/develop/documentation/en/howto/cluster/multi_cluster/k8s_en.html)
 
   You can also run distributed training jobs on Kubernetes clusters.
 
-- [Python API](http://www.paddlepaddle.org/docs/develop/documentation/en/api/index_en.html)
+- [Python API](http://www.paddlepaddle.org/docs/develop/api/en/overview.html)
 
   Our new API enables much shorter programs.
 
-- [How to Contribute](http://www.paddlepaddle.org/docs/develop/documentation/en/howto/dev/contribute_to_paddle_en.html)
+- [How to Contribute](http://www.paddlepaddle.org/docs/develop/documentation/fluid/en/dev/contribute_to_paddle_en.html)
 
   We appreciate your contributions!
```
(new file; filename not shown in this view)

Lines changed: 21 additions & 0 deletions

```bash
#!/bin/bash

# Update to point to the source file.
VGG_SRC="vgg16_fluid.py"

export TRAINING_ROLE=PSERVER
export TRAINERS=2
export POD_IP=127.0.0.1
export PADDLE_INIT_PORT=6174
MKL_NUM_THREADS=1 python -u ${VGG_SRC} --local 0 --ps_host=127.0.0.1:6174 --trainer_hosts=127.0.0.1:6174 &

# Need to wait for the ps to start first.
sleep 10
echo "done start ps"

export TRAINING_ROLE=TRAINER
export TRAINERS=2
export POD_IP=127.0.0.1
export PADDLE_INIT_PORT=6174
CUDA_VISIBLE_DEVICES=4 MKL_NUM_THREADS=1 python -u ${VGG_SRC} --local 0 --ps_host=127.0.0.1:6174 --trainer_hosts=127.0.0.1:6174 --device=GPU --task_index=0 &
CUDA_VISIBLE_DEVICES=5 MKL_NUM_THREADS=1 python -u ${VGG_SRC} --local 0 --ps_host=127.0.0.1:6174 --trainer_hosts=127.0.0.1:6174 --device=GPU --task_index=1 &
```
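The script launches the same program three times with different environment variables: once as the parameter server and twice as trainers. A minimal, hypothetical sketch of how such a program can branch on those variables (the variable names are the ones the script exports; the branch bodies are placeholders, not Paddle's API):

```python
import os

def select_role():
    """Branch on the environment variables exported by the launcher script."""
    role = os.environ.get("TRAINING_ROLE", "TRAINER")
    ip = os.environ.get("POD_IP", "127.0.0.1")
    port = os.environ.get("PADDLE_INIT_PORT", "6174")
    trainers = int(os.environ.get("TRAINERS", "1"))
    if role == "PSERVER":
        # A real pserver would start serving parameters here.
        return "pserver listening on %s:%s" % (ip, port)
    # A real trainer would build and run the training program here.
    return "trainer, world size %d" % trainers

os.environ["TRAINING_ROLE"] = "PSERVER"
print(select_role())  # pserver listening on 127.0.0.1:6174
```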

benchmark/cluster/vgg16/vgg16_fluid.py

Lines changed: 8 additions & 9 deletions

```diff
@@ -200,18 +200,19 @@ def train_loop(exe, trainer_prog):
             num_samples += len(data)
             train_pass_acc.add(value=acc, weight=b_size)
             print(
-                "Pass = %d, Iters = %d, Loss = %f, Accuracy = %f, Speed = %.2f img/s"
-                % (pass_id, iters, loss, acc,
-                   len(data) / (time.time() - ts))
+                "Task:%d Pass = %d, Iters = %d, Loss = %f, Accuracy = %f, "
+                "Speed = %.2f img/s " % (args.task_index, pass_id, iters,
+                                         loss, acc,
+                                         len(data) / (time.time() - ts))
             )  # The accuracy is the accumulation of batches, but not the current batch.
 
         pass_elapsed = time.time() - start_time
         pass_train_acc = train_pass_acc.eval()
         pass_test_acc = test(exe)
-        print(
-            "Pass = %d, Training performance = %f imgs/s, Train accuracy = %f, Test accuracy = %f\n"
-            % (pass_id, num_samples / pass_elapsed, pass_train_acc,
-               pass_test_acc))
+        print("Task:%d Pass = %d, Training performance = %f imgs/s, "
+              "Train accuracy = %f, Test accuracy = %f\n" %
+              (args.task_index, pass_id, num_samples / pass_elapsed,
+               pass_train_acc, pass_test_acc))
 
     if args.local:
         # Parameter initialization
@@ -239,8 +240,6 @@ def train_loop(exe, trainer_prog):
 
         t = fluid.DistributeTranspiler()
         t.transpile(
-            optimize_ops,
-            params_grads,
             trainer_id=args.task_index,
             pservers=args.ps_hosts,
             trainers=trainers)
```
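The rewritten print calls rely on Python's implicit concatenation of adjacent string literals, which lets one long format string be split across lines before the `%` operator is applied. A standalone illustration with made-up values:

```python
# Adjacent string literals merge at compile time, so the two quoted pieces
# below form a single format string with six placeholders.
msg = ("Task:%d Pass = %d, Iters = %d, Loss = %f, Accuracy = %f, "
       "Speed = %.2f img/s" % (0, 1, 100, 0.25, 0.9, 120.5))
print(msg)
# Task:0 Pass = 1, Iters = 100, Loss = 0.250000, Accuracy = 0.900000, Speed = 120.50 img/s
```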

contrib/float16/.gitignore

Lines changed: 1 addition & 0 deletions

```diff
@@ -0,0 +1 @@
+*.inference.model
```

contrib/float16/float16_benchmark.md

Lines changed: 97 additions & 0 deletions

# float16 benchmark

## Description
We compare float16 and float32 inference performance on the "image_classification" example on an Nvidia Tesla V100 GPU, where tensor core computation can be enabled in float16 mode. We test Vgg16 and Resnet50 on the imagenet data set, and Vgg16 and Resnet32 on the cifar10 data set. For completeness, we also add inference benchmarks of Vgg16 and Resnet50 on the imagenet data set tested on an Nvidia GeForce GTX 1080 Ti GPU.

For more details about tensor cores, please refer to https://devblogs.nvidia.com/programming-tensor-cores-cuda-9/

## Test environment
- GPU: single Nvidia Tesla V100 or single Nvidia GeForce GTX 1080 Ti
- cuDNN: 7.1.1
- CUDA: 9.0
- Code: https://github.com/PaddlePaddle/Paddle/pull/10331 (tensor cores are enabled in float16 mode)

## Benchmark on V100
All times are in ms (milliseconds), averaged over 1000 iterations on a single Nvidia V100 GPU, for different mini-batch (mb) sizes.

### Vgg16 on imagenet (flowers data set: image.shape = [3, 224, 224])

Total inference time for one batch:

|        |  mb=1 |  mb=2 |  mb=4 |  mb=8 | mb=16 | mb=32 |  mb=64 |
|--------|------:|------:|------:|------:|------:|------:|-------:|
|float32 | 14.01 |  9.70 | 22.99 | 28.26 | 53.87 | 84.42 | 178.95 |
|float16 |  3.32 |  4.11 |  5.88 |  9.41 | 16.54 | 30.47 |  60.23 |
|Speedup |  4.22 |  2.36 |  3.91 |  3.00 |  3.26 |  2.77 |   2.97 |

Total time spent on conv op for one batch:

|        |  mb=1 |  mb=2 |  mb=4 |  mb=8 | mb=16 | mb=32 |  mb=64 |
|--------|------:|------:|------:|------:|------:|------:|-------:|
|float32 | 11.95 |  6.96 | 18.65 | 21.42 | 41.35 | 60.58 | 130.11 |
|float16 |  1.78 |  2.10 |  2.93 |  4.55 |  7.99 | 14.63 |  28.67 |
|Speedup |  6.71 |  3.31 |  6.37 |  4.71 |  5.18 |  4.14 |   4.54 |

### Resnet50 on imagenet (flowers data set: image.shape = [3, 224, 224])

Total inference time for one batch:

|        | mb=1 | mb=2 | mb=4 |  mb=8 | mb=16 | mb=32 | mb=64 | mb=128 |
|--------|-----:|-----:|-----:|------:|------:|------:|------:|-------:|
|float32 | 7.03 | 7.41 | 9.16 | 12.55 | 21.13 | 38.27 | 67.93 | 127.02 |
|float16 | 6.13 | 6.32 | 6.24 |  7.40 | 10.90 | 18.18 | 33.20 |  64.52 |
|Speedup | 1.15 | 1.17 | 1.47 |  1.70 |  1.94 |  2.11 |  2.05 |   1.97 |

Total time spent on conv op for one batch:

|        | mb=1 | mb=2 | mb=4 | mb=8 | mb=16 | mb=32 | mb=64 | mb=128 |
|--------|-----:|-----:|-----:|-----:|------:|------:|------:|-------:|
|float32 | 5.43 | 5.46 | 6.50 | 8.36 | 13.80 | 24.45 | 41.21 |  73.44 |
|float16 | 4.19 | 4.30 | 3.96 | 4.21 |  5.63 |  8.77 | 15.24 |  28.40 |
|Speedup | 1.30 | 1.27 | 1.64 | 1.99 |  2.45 |  2.79 |  2.70 |   2.59 |

### Vgg16 on cifar10 (image.shape = [3, 32, 32])

Total inference time for one batch:

|        | mb=1 | mb=2 | mb=4 | mb=8 | mb=16 | mb=32 | mb=64 | mb=128 | mb=256 | mb=512 |
|--------|-----:|-----:|-----:|-----:|------:|------:|------:|-------:|-------:|-------:|
|float32 | 3.13 | 3.17 | 3.19 | 3.58 |  3.98 |  6.23 |  8.42 |  13.44 |  24.19 |  44.97 |
|float16 | 2.72 | 2.77 | 2.76 | 2.88 |  2.96 |  3.24 |  4.01 |   5.78 |   9.65 |  17.37 |
|Speedup | 1.15 | 1.14 | 1.16 | 1.24 |  1.34 |  1.92 |  2.10 |   2.33 |   2.51 |   2.59 |

### Resnet32 on cifar10 (image.shape = [3, 32, 32])

Total inference time for one batch (float16 is slower than float32 at small batch sizes here, so those Speedup cells are left blank):

|        | mb=1 | mb=2 | mb=4 | mb=8 | mb=16 | mb=32 | mb=64 | mb=128 | mb=256 | mb=512 |
|--------|-----:|-----:|-----:|-----:|------:|------:|------:|-------:|-------:|-------:|
|float32 | 3.11 | 3.14 | 2.99 | 3.04 |  3.10 |  3.28 |  4.47 |   6.86 |  11.63 |  21.16 |
|float16 | 3.70 | 3.81 | 3.75 | 3.83 |  3.77 |  3.97 |  3.92 |   4.15 |   6.41 |  11.02 |
|Speedup |      |      |      |      |       |       |  1.14 |   1.65 |   1.81 |   1.92 |

## Benchmark on 1080 Ti
All times are in ms (milliseconds), averaged over 1000 iterations on a single Nvidia GeForce GTX 1080 Ti GPU, for different mini-batch (mb) sizes.

### Vgg16 on imagenet (flowers data set: image.shape = [3, 224, 224])

Total inference time for one batch:

|        | mb=1 | mb=2 |  mb=4 |  mb=8 | mb=16 |  mb=32 |
|--------|-----:|-----:|------:|------:|------:|-------:|
|float32 | 5.60 | 9.38 | 15.86 | 29.79 | 57.60 | 117.73 |
|float16 | 4.99 | 7.79 | 13.47 | 26.02 | 52.30 | 102.34 |
|Speedup | 1.12 | 1.20 |  1.18 |  1.15 |  1.10 |   1.15 |

### Resnet50 on imagenet (flowers data set: image.shape = [3, 224, 224])

Total inference time for one batch:

|        | mb=1 | mb=2 | mb=4 |  mb=8 | mb=16 | mb=32 |  mb=64 |
|--------|-----:|-----:|-----:|------:|------:|------:|-------:|
|float32 | 5.63 | 6.23 | 8.85 | 14.71 | 26.07 | 52.86 | 108.95 |
|float16 | 5.89 | 6.44 | 7.94 | 12.57 | 22.03 | 45.06 |  92.68 |
|Speedup |      |      | 1.12 |  1.17 |  1.18 |  1.17 |   1.18 |
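The Speedup rows in these tables are simply the float32 time divided by the float16 time for the same mini-batch size. For example, the Vgg16-on-V100 mb=1 column:

```python
# Speedup = float32 time / float16 time (both in ms, same batch size).
fp32_ms = 14.01   # Vgg16 on V100, mb=1, float32 total inference time
fp16_ms = 3.32    # same configuration in float16
print(round(fp32_ms / fp16_ms, 2))  # 4.22, as reported in the table
```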
