# Fluid Benchmark

This directory contains several model configurations and tools used to run
Fluid benchmarks for local and distributed training.

## Run the Benchmark

To start, run the following command to get the full help message:

```bash
python fluid_benchmark.py --help
```

Currently supported `--model` arguments include:

* mnist
* resnet
  * you can choose a different dataset using `--data_set cifar10` or
    `--data_set flowers` (see the example after this list).
* vgg
* stacked_dynamic_lstm
* machine_translation
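
For example, to benchmark resnet on the cifar10 dataset, combining the flags
documented above (the exact flag set you need may vary by model):

```bash
python fluid_benchmark.py --model resnet --data_set cifar10 --device GPU --with_test
```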

* Run the following command to start a benchmark job locally:
  ```bash
  python fluid_benchmark.py --model mnist --parallel 1 --device GPU --with_test
  ```
  You can choose between GPU and CPU training with `--device`. With GPU
  training, specify `--parallel 1` to run multi-GPU training.
* Run distributed training with parameter servers (a combined single-machine
  smoke test is sketched after this list):
  * start parameter servers:
    ```bash
    PADDLE_TRAINING_ROLE=PSERVER PADDLE_PSERVER_PORT=7164 PADDLE_PSERVER_IPS=127.0.0.1 PADDLE_TRAINERS=1 PADDLE_CURRENT_IP=127.0.0.1 PADDLE_TRAINER_ID=0 python fluid_benchmark.py --model mnist --parallel 0 --device GPU --update_method pserver
    ```
  * start trainers:
    ```bash
    PADDLE_TRAINING_ROLE=TRAINER PADDLE_PSERVER_PORT=7164 PADDLE_PSERVER_IPS=127.0.0.1 PADDLE_TRAINERS=1 PADDLE_CURRENT_IP=127.0.0.1 PADDLE_TRAINER_ID=0 python fluid_benchmark.py --model mnist --parallel 0 --device GPU --update_method pserver
    ```
* Run distributed training using NCCL2 (set `PADDLE_CURRENT_IP` to the IP of
  the current node, which must appear in `PADDLE_TRAINER_IPS`, and give each
  node a distinct `PADDLE_TRAINER_ID`):
  ```bash
  PADDLE_PSERVER_PORT=7164 PADDLE_TRAINER_IPS=192.168.0.2,192.168.0.3 PADDLE_CURRENT_IP=192.168.0.2 PADDLE_TRAINER_ID=0 python fluid_benchmark.py --model mnist --parallel 0 --device GPU --update_method nccl2
  ```
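
As a convenience, the two pserver-mode commands above can be combined into a
single-machine smoke test. This is a minimal sketch using only the flags and
environment variables shown above:

```bash
#!/bin/bash
# Shared cluster configuration: one pserver and one trainer, both on localhost.
export PADDLE_PSERVER_PORT=7164 PADDLE_PSERVER_IPS=127.0.0.1 \
       PADDLE_TRAINERS=1 PADDLE_CURRENT_IP=127.0.0.1 PADDLE_TRAINER_ID=0

# Start the parameter server in the background and remember its PID.
PADDLE_TRAINING_ROLE=PSERVER python fluid_benchmark.py --model mnist \
    --parallel 0 --device GPU --update_method pserver &
PSERVER_PID=$!

# Run the trainer in the foreground; it connects to the pserver above.
PADDLE_TRAINING_ROLE=TRAINER python fluid_benchmark.py --model mnist \
    --parallel 0 --device GPU --update_method pserver

# Shut the pserver down once training finishes.
kill $PSERVER_PID
```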

## Run Distributed Benchmark on Kubernetes Cluster

We provide a script, `kube_gen_job.py`, to generate Kubernetes yaml files for
submitting distributed benchmark jobs to your cluster. To generate a job yaml,
just run:

```bash
python kube_gen_job.py --jobname myjob --pscpu 4 --cpu 8 --gpu 8 --psmemory 20 --memory 40 --pservers 4 --trainers 4 --entry "python fluid_benchmark.py --model mnist --parallel 1 --device GPU --update_method pserver --with_test" --disttype pserver
```
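
The same generator can also target the NCCL2 mode. This variant is an
assumption on our part: it presumes your copy of `kube_gen_job.py` accepts
`--disttype nccl2` (check `python kube_gen_job.py --help` to confirm):

```bash
# Hypothetical NCCL2 job: no parameter servers, so the pserver flags are dropped.
python kube_gen_job.py --jobname myjob-nccl2 --cpu 8 --gpu 8 --memory 40 \
    --trainers 4 \
    --entry "python fluid_benchmark.py --model mnist --parallel 0 --device GPU --update_method nccl2 --with_test" \
    --disttype nccl2
```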

Then the yaml files are generated under the `myjob` directory. Submit them with:

```bash
kubectl create -f myjob/
```

The job will then start on your cluster.
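
To monitor progress, use standard `kubectl` commands. The pod names depend on
the generated yaml, so list them first; `<trainer-pod>` below is a placeholder:

```bash
kubectl get pods                # pods for the job are prefixed with the job name
kubectl logs -f <trainer-pod>   # follow a trainer's log using a name from the list above
```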