
Commit 5dbd537

Author: Yancey
Fluid distributed training benchmark (#7410)
* add cluster training benchmark design
* update by comment
* update by comment

1 file changed: +78 −0 lines

benchmark/cluster/README.md

# Cluster Training Benchmark

## Setup

- Platform
  - Kubernetes: v1.6.2
  - Linux Kernel: v3.10.0

- Resource
  - CPU: 10 cores per Pod
  - Memory: 5 GB per Pod

- Docker Image

  We use a different base Docker image for each framework to run the benchmark on Kubernetes:

  - PaddlePaddle v2: paddlepaddle/paddle:0.11.0
  - PaddlePaddle Fluid: paddlepaddle/paddle:[commit-id]
  - TensorFlow: tensorflow/tensorflow:1.5.0-rc0

- Model

  vgg16 is used in this benchmark; a sketch of the network follows this list.
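
Below is a minimal sketch of a VGG-16 network in Fluid, using the common `img_conv_group` idiom. This is an illustrative sketch, not the benchmark's exact script; it assumes the `paddle.v2.fluid` module layout of this era (later releases moved it to `paddle.fluid`):

```python
import paddle.v2.fluid as fluid  # later releases: import paddle.fluid as fluid

def vgg16(input, class_num=10):
    """A VGG-16-style network: five conv blocks followed by two FC layers."""
    def conv_block(input, num_filter, groups, dropouts):
        # Each block: `groups` 3x3 conv + BN + ReLU layers, then 2x2 max pooling.
        return fluid.nets.img_conv_group(
            input=input,
            conv_num_filter=[num_filter] * groups,
            conv_filter_size=3,
            conv_act='relu',
            conv_with_batchnorm=True,
            conv_batchnorm_drop_rate=dropouts,
            pool_size=2,
            pool_stride=2,
            pool_type='max')

    conv1 = conv_block(input, 64, 2, [0.3, 0])
    conv2 = conv_block(conv1, 128, 2, [0.4, 0])
    conv3 = conv_block(conv2, 256, 3, [0.4, 0.4, 0])
    conv4 = conv_block(conv3, 512, 3, [0.4, 0.4, 0])
    conv5 = conv_block(conv4, 512, 3, [0.4, 0.4, 0])

    drop = fluid.layers.dropout(x=conv5, dropout_prob=0.5)
    fc1 = fluid.layers.fc(input=drop, size=512, act=None)
    bn = fluid.layers.batch_norm(input=fc1, act='relu')
    drop2 = fluid.layers.dropout(x=bn, dropout_prob=0.5)
    fc2 = fluid.layers.fc(input=drop2, size=512, act=None)
    return fluid.layers.fc(input=fc2, size=class_num, act='softmax')
```
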
## Cases

- Variable
  - Batch size of the training data.
  - PServer count of the training job.
  - The number of trainers.

- Invariant
  - The resources of the trainer/pserver Pods.

### Measure the Performance for Different Batch Sizes

- PServer Count: 40
- Trainer Count: 100
- Metric: mini-batches / sec (measured as sketched after the table)

| Batch Size | 32 | 64 | 128 | 256 |
| -- | -- | -- | -- | -- |
| PaddlePaddle Fluid | - | - | - | - |
| PaddlePaddle v2 | - | - | - | - |
| TensorFlow | - | - | - | - |
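
The throughput metric is wall-clock based: train a fixed number of mini-batches and divide by the elapsed time. A minimal sketch, where `train_one_batch` is a hypothetical stand-in for a single training step of the framework under test:

```python
import time

def mini_batches_per_sec(train_one_batch, num_batches=100):
    """Return throughput in trained mini-batches per wall-clock second."""
    start = time.time()
    for _ in range(num_batches):
        train_one_batch()  # one forward/backward/update step
    return num_batches / (time.time() - start)
```
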
### Measure the Performance for Different PServer Counts

- Trainer Count: 100
- Batch Size: 64
- Metric: mini-batches / sec

| PServer Count | 10 | 20 | 40 | 60 |
| -- | -- | -- | -- | -- |
| PaddlePaddle Fluid | - | - | - | - |
| PaddlePaddle v2 | - | - | - | - |
| TensorFlow | - | - | - | - |

### Measure Parallel Efficiency by Increasing the Trainer Count

- PServer Count: 20
- Batch Size: 64
- Metrics:

$S = \frac{T_1}{T_N}$

where $S$ is the speedup: the ratio of $T_1$, the training time with 1 trainer, to $T_N$, the training time with $N$ trainers.
The parallel efficiency is then:

$E = \frac{S}{N}$
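
To make the definitions concrete, a small helper (hypothetical names) that computes both quantities from measured training times:

```python
def speedup(t_1, t_n):
    """S = T_1 / T_N: ratio of 1-trainer to N-trainer training time."""
    return t_1 / t_n

def parallel_efficiency(t_1, t_n, n):
    """E = S / N: speedup normalized by the trainer count."""
    return speedup(t_1, t_n) / n

# Example: 1 trainer finishes in 1000s, 10 trainers finish in 125s.
# S = 1000 / 125 = 8.0, E = 8.0 / 10 = 0.8 (80% parallel efficiency).
assert speedup(1000.0, 125.0) == 8.0
assert parallel_efficiency(1000.0, 125.0, 10) == 0.8
```
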
| Trainer Count | 1 | 10 | 20 | 30 | 40 | 50 | 60 | 70 | 80 | 90 | 100 |
| -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- |
| PaddlePaddle Fluid | - | - | - | - | - | - | - | - | - | - | - |
| PaddlePaddle v2 | - | - | - | - | - | - | - | - | - | - | - |
| TensorFlow | - | - | - | - | - | - | - | - | - | - | - |

## Reproduce the Benchmark

TODO
