# Cluster Training Benchmark

## Setup

- Platform
  - Kubernetes: v1.6.2
  - Linux Kernel: v3.10.0

- Resource
  - CPU: 10 cores per Pod
  - Memory: 5 GB per Pod

- Docker Images

  We use different base Docker images to run the benchmark on Kubernetes:
  - PaddlePaddle v2: paddlepaddle/paddle:0.11.0
  - PaddlePaddle Fluid: paddlepaddle/paddle:[commit-id]
  - TensorFlow: tensorflow/tensorflow:1.5.0-rc0

- Model

  VGG-16 is used in this benchmark.
## Cases

- Variables
  - Batch size of the training data.
  - PServer count of the training job.
  - Trainer count of the training job.

- Invariants
  - The resources of each trainer/pserver Pod.
### Measure the Performance for Different Batch Sizes

- PServer Count: 40
- Trainer Count: 100
- Metrics: mini-batch / sec

| Batch Size | 32 | 64 | 128 | 256 |
| -- | -- | -- | -- | -- |
| PaddlePaddle Fluid | - | - | - | - |
| PaddlePaddle v2 | - | - | - | - |
| TensorFlow | - | - | - | - |

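The mini-batch / sec metric can be measured by timing a fixed number of training steps. A minimal sketch, where `train_step` is a hypothetical stand-in for one synchronous mini-batch update (not part of the benchmark scripts):

```python
import time

def measure_throughput(train_step, num_batches=100):
    """Time `num_batches` training steps and return mini-batches per second."""
    start = time.time()
    for _ in range(num_batches):
        train_step()
    elapsed = time.time() - start
    return num_batches / elapsed
```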
### Measure the Performance for Different PServer Counts

- Trainer Count: 100
- Batch Size: 64
- Metrics: mini-batch / sec

| PServer Count | 10 | 20 | 40 | 60 |
| -- | -- | -- | -- | -- |
| PaddlePaddle Fluid | - | - | - | - |
| PaddlePaddle v2 | - | - | - | - |
| TensorFlow | - | - | - | - |

### Measure Parallel Efficiency by Increasing Trainer Count

- PServer Count: 20
- Batch Size: 64
- Metrics:

$S = \frac{T_1}{T_N}$

where $S$ is the speedup: the ratio of $T_1$, the training time with 1 trainer, to $T_N$, the training time with $N$ trainers. The parallel efficiency is then:

$E = \frac{S}{N}$

| Trainer Count | 1 | 10 | 20 | 30 | 40 | 50 | 60 | 70 | 80 | 90 | 100 |
| -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- |
| PaddlePaddle Fluid | - | - | - | - | - | - | - | - | - | - | - |
| PaddlePaddle v2 | - | - | - | - | - | - | - | - | - | - | - |
| TensorFlow | - | - | - | - | - | - | - | - | - | - | - |

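The speedup and efficiency formulas above can be sketched as plain functions (hypothetical helper names, not part of the benchmark scripts):

```python
def speedup(t1, tn):
    """S = T1 / TN: speedup of N trainers relative to 1 trainer."""
    return t1 / tn

def parallel_efficiency(t1, tn, n):
    """E = S / N: fraction of ideal linear scaling achieved by N trainers."""
    return speedup(t1, tn) / n

# Example: if 1 trainer takes 1000s and 10 trainers take 125s,
# speedup S = 8.0 and parallel efficiency E = 0.8.
```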
## Reproduce the Benchmark

TODO