
Commit 55d3951

Benchmark/Integrate benchmark scripts (#10707)
* wip integrate benchmark scripts
* testing nlp models
* k8s script to start dist benchmark job
* update script
* done support all models
* add README.md
* update by comment
* clean up
* follow comments
1 parent 530556d commit 55d3951

File tree

15 files changed (+1192, -1046 lines)


benchmark/fluid/README.md

Lines changed: 60 additions & 0 deletions
# Fluid Benchmark

This directory contains several model configurations and tools used to run Fluid benchmarks for local and distributed training.
## Run the Benchmark

To start, run the following command to get the full help message:

```bash
python fluid_benchmark.py --help
```
Currently supported `--model` arguments include:

* mnist
* resnet
  * you can choose a different dataset using `--data_set cifar10` or `--data_set flowers` (see the example after this list).
* vgg
* stacked_dynamic_lstm
* machine_translation
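For example, a single-GPU resnet run on the cifar10 dataset might look like the following; this sketch only combines flags documented above, and the particular combination is illustrative:

```bash
# Benchmark the resnet model on the cifar10 dataset on a single GPU.
python fluid_benchmark.py --model resnet --data_set cifar10 --parallel 0 --device GPU --with_test
```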
* Run the following command to start a benchmark job locally:

  ```bash
  python fluid_benchmark.py --model mnist --parallel 1 --device GPU --with_test
  ```

  You can choose either GPU or CPU training. With GPU training, you can specify `--parallel 1` to run multi-GPU training.
* Run distributed training with parameter servers (see the launcher sketch after this list):
  * start parameter servers:

    ```bash
    PADDLE_TRAINING_ROLE=PSERVER PADDLE_PSERVER_PORT=7164 PADDLE_PSERVER_IPS=127.0.0.1 PADDLE_TRAINERS=1 PADDLE_CURRENT_IP=127.0.0.1 PADDLE_TRAINER_ID=0 python fluid_benchmark.py --model mnist --parallel 0 --device GPU --update_method pserver
    ```

  * start trainers:

    ```bash
    PADDLE_TRAINING_ROLE=TRAINER PADDLE_PSERVER_PORT=7164 PADDLE_PSERVER_IPS=127.0.0.1 PADDLE_TRAINERS=1 PADDLE_CURRENT_IP=127.0.0.1 PADDLE_TRAINER_ID=0 python fluid_benchmark.py --model mnist --parallel 0 --device GPU --update_method pserver
    ```
* Run distributed training using NCCL2:

  ```bash
  PADDLE_PSERVER_PORT=7164 PADDLE_TRAINER_IPS=192.168.0.2,192.168.0.3 PADDLE_CURRENT_IP=127.0.0.1 PADDLE_TRAINER_ID=0 python fluid_benchmark.py --model mnist --parallel 0 --device GPU --update_method nccl2
  ```
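For a quick single-machine smoke test of the parameter-server mode, the two commands above can be wrapped in a small launcher script. This is only a sketch: the script name, the backgrounding scheme, and running both roles on localhost are assumptions, not part of the benchmark tooling.

```bash
#!/bin/bash
# run_local_pserver.sh (hypothetical name): start one parameter server and
# one trainer on localhost, using the environment variables documented above.
COMMON="PADDLE_PSERVER_PORT=7164 PADDLE_PSERVER_IPS=127.0.0.1 PADDLE_TRAINERS=1 PADDLE_CURRENT_IP=127.0.0.1 PADDLE_TRAINER_ID=0"

# Start the parameter server in the background; kill it when the script exits.
env $COMMON PADDLE_TRAINING_ROLE=PSERVER \
  python fluid_benchmark.py --model mnist --parallel 0 --device GPU --update_method pserver &
trap "kill $! 2>/dev/null" EXIT

# Run the trainer in the foreground.
env $COMMON PADDLE_TRAINING_ROLE=TRAINER \
  python fluid_benchmark.py --model mnist --parallel 0 --device GPU --update_method pserver
```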
## Run Distributed Benchmark on Kubernetes Cluster

We provide a script, `kube_gen_job.py`, that generates Kubernetes yaml files for submitting distributed benchmark jobs to your cluster. To generate a job yaml, just run:

```bash
python kube_gen_job.py --jobname myjob --pscpu 4 --cpu 8 --gpu 8 --psmemory 20 --memory 40 --pservers 4 --trainers 4 --entry "python fluid_benchmark.py --model mnist --parallel 1 --device GPU --update_method pserver --with_test" --disttype pserver
```

The yaml files are then generated under the `myjob` directory, and you can run:

```bash
kubectl create -f myjob/
```

to start the job.
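Once the resources are created, ordinary `kubectl` commands can be used to watch and tear down the job. The pod name below is illustrative; actual names depend on the yaml files `kube_gen_job.py` generates.

```bash
# Watch the benchmark pods come up.
kubectl get pods -w

# Tail the log of one pod (name is illustrative).
kubectl logs -f myjob-trainer-0

# Delete everything the job created when you are done.
kubectl delete -f myjob/
```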
