Commit 3b6d678

Merge pull request #9638 from putcn/aws-benchmark
aws benchmarking tool
2 parents 3fbe9c3 + 1e7c69f

File tree: 11 files changed, +1268 -0 lines changed

tools/aws_benchmarking/README.md

Lines changed: 160 additions & 0 deletions
# AWS benchmark testing tool

This is an automation tool for deploying PaddlePaddle benchmark tests to AWS.

## Features

- Creates a subnet sized to fit exactly the number of EC2 instances required.
- Allocates pserver and trainer EC2 instances and verifies their state.
- nvidia-docker ready for GPU training.
- Garbage-collects instances and network elements when a task completes or an error occurs.
- Collects test logs in real time.
- Exposes a web service for checking logs or tearing down the test setup.
- Requires no changes to your test code.
- Offers many optional configuration options.
## Usage

### Prerequisites

- You have a working AWS account.
- You have the [AWS Command Line Interface](https://aws.amazon.com/cli/) installed.
- Your AWS CLI is bound to an account with the `AmazonEC2FullAccess` permission, and that account is set as the default credential.
- You have a key pair created and its `.pem` file downloaded.
- You have a default VPC in the region where you want to run the test.
- You have a security group created for the VPC mentioned above which allows port 22 and the port on which you want to expose your control web service (5436 by default).
- If your test is supposed to run on a GPU machine, especially a multi-card GPU machine (p2 or p3 series), you might need to contact Amazon to raise your instance limit, which by default allows no more than one GPU instance at a time.
### Start a benchmark test

#### Create training image

*What to expect in this step:*

*Your training logic will be packed together with the PaddlePaddle runtime into a Docker image, which AWS instances can then pick up for training.*

The training Python script and the PaddlePaddle runtime are supposed to be packed into one Docker image. Use a PaddlePaddle production image as the base image and create the training image with a Dockerfile as follows:

```Dockerfile
FROM paddlepaddle/paddle:latest-gpu

ENV HOME /root
COPY ./ /root/
WORKDIR /root
RUN pip install -r /root/requirements.txt
ENTRYPOINT ["python", "my_training.py"]
```
***Please Note***
Training nodes will run your `ENTRYPOINT` script with the following environment variables:

- `TASK_NAME`: unique name identifying this training process.
- `TRAINING_ROLE`: the current node's role in this training process, either "PSERVER" or "TRAINER".
- `PSERVER_HOSTS`: comma-separated list of pserver endpoints, e.g. "192.168.1.2:5436,192.168.1.3:5436".
- `PSERVERS`: same as above.
- `TRAINERS`: trainer count.
- `SERVER_ENDPOINT`: the current server's endpoint if the node's role is pserver.
- `TRAINER_INDEX`: an integer identifying the index of the current trainer if the node's role is trainer.
- `PADDLE_INIT_TRAINER_ID`: same as above.
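The variables above can be consumed at the top of your training script. Below is a minimal, hypothetical sketch of how a training entrypoint such as `my_training.py` might read them; the helper name `load_cluster_config` is made up for illustration and is not part of this tool:

```python
# A minimal, hypothetical sketch (not this tool's actual code) of how a
# training entrypoint might consume the injected environment variables.
import os


def load_cluster_config(env=os.environ):
    """Collect the cluster layout this node was launched with."""
    return {
        "task_name": env.get("TASK_NAME", ""),
        "role": env.get("TRAINING_ROLE", "TRAINER"),  # "PSERVER" or "TRAINER"
        "pserver_hosts": env.get("PSERVER_HOSTS", "").split(","),
        "trainer_count": int(env.get("TRAINERS", "1")),
        "trainer_index": int(env.get("TRAINER_INDEX", "0")),
        "server_endpoint": env.get("SERVER_ENDPOINT", ""),
    }


if __name__ == "__main__":
    config = load_cluster_config()
    if config["role"] == "PSERVER":
        print("starting pserver at %s" % config["server_endpoint"])
    else:
        print("starting trainer %d of %d"
              % (config["trainer_index"], config["trainer_count"]))
```

A pserver node would then start listening on `SERVER_ENDPOINT`, while a trainer would connect to every address in `PSERVER_HOSTS`.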
Now that we have a working distributed training script which takes advantage of the node environment variables, and a Dockerfile to generate the training image, run the following command:

```bash
docker build -t myreponame/paddle_benchmark .
```

Now you have the image built and tagged as `myreponame/paddle_benchmark`. Let's push it to Docker Hub so that it can be picked up by our AWS instances:

```bash
docker push myreponame/paddle_benchmark
```
#### Create instances and start training

*What to expect in this step:*

*You will be asked to provide some basic settings to configure your training, and this tool will start and monitor the training for you.*

Now let's start the training process:

```bash
docker run -i -v $HOME/.aws:/root/.aws -v <full path to your pem file>:/root/<key pair name>.pem \
putcn/paddle_aws_client \
--action create \
--key_name <your key pair name> \
--security_group_id <your security group id> \
--docker_image myreponame/paddle_benchmark \
--pserver_count 2 \
--trainer_count 2
```

Now just wait until you see this:

```
master server finished init process, visit http://XXX:XXX/status to check master log
```

That means you can turn off your laptop: your cluster is creating instances, starting the training process, collecting logs, and will eventually shut all pservers and trainers down when training is finished.
#### Post creation operations

To access the master log:

```bash
docker run -i -v $HOME/.aws:/root/.aws \
putcn/paddle_aws_client \
--action status \
--master_server_public_ip <master ip> \
--master_server_port <master port>
```

To tear down the training setup:

```bash
docker run -i -v $HOME/.aws:/root/.aws \
putcn/paddle_aws_client \
--action cleanup \
--master_server_public_ip <master ip> \
--master_server_port <master port>
```

To retrieve training logs:
TBD
### Tech details

*What to expect in this section:*

*You will understand what is happening behind the scenes: how to check the training log, how to tear down the training on the fly, etc.*

Let's understand what is happening under the hood when you run the above command on your laptop.

![alt](diagram.png)

There are 4 roles in the figure above:

- client: your laptop
- master: asks the AWS API server to create/tear down instances, and monitors the training process
- AWS API server: the one who actually creates and manages instances
- pservers and trainers: training instances

When you run the `docker run` command above, it asks the AWS API service to create a subnet (step 1) and a master instance (step 2), and passes along all the parameters the client collected or generated (step 3). The master is kept at a minimal hardware configuration to keep the running cost low.
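To make step 1 concrete, here is a hypothetical sketch of the arithmetic involved in sizing a subnet to fit just the required number of instances. It relies on two AWS facts: every subnet has 5 reserved IP addresses, and the smallest subnet AWS allows is a /28. The function name and exact sizing policy are assumptions for illustration, not this tool's actual code:

```python
# Hypothetical sketch: pick the smallest CIDR prefix whose subnet can hold
# all pservers and trainers plus the master instance.
# AWS reserves 5 addresses per subnet (network, broadcast, VPC router,
# DNS, and one reserved for future use), and the smallest subnet AWS
# permits is a /28 (16 addresses).

AWS_RESERVED_IPS = 5
MAX_PREFIX = 28  # smallest subnet AWS allows


def subnet_prefix(instance_count):
    """Return the smallest CIDR prefix length that fits instance_count hosts."""
    needed = instance_count + AWS_RESERVED_IPS
    prefix = MAX_PREFIX
    while prefix > 16 and 2 ** (32 - prefix) < needed:
        prefix -= 1  # double the subnet size until everything fits
    return prefix


if __name__ == "__main__":
    # 2 pservers + 2 trainers + 1 master = 5 instances -> a /28 suffices
    print(subnet_prefix(5))
```

With the default 2 pservers and 2 trainers, a /28 (16 addresses, 11 usable) is already enough.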
Then, when the master is up and running, it asks the AWS API server to create the heavy-lifting training instances, which are expensive to run (step 4). The master starts the training process as soon as they are done initializing (step 5).

Meanwhile, the master exposes a web service the client can call to check the training log or even tear the whole training setup down.

If you are creating the training with the client Docker container while also monitoring your AWS dashboard, you will initially see an instance tagged with `ROLE=MASTER` and `TASK_NAME=<your task name>_master` start, then several instances tagged with `ROLE=PSERVER` and `ROLE=TRAINER` start.
When the training is finished, pservers and trainers are terminated. All their logs are kept in the master node's Docker environment.

The master exposes 4 major services:

- GET `/status`: return the master log
- GET `/logs`: return the list of log file names
- GET `/log/<logfile name>`: return a particular log by log file name
- POST `/cleanup`: tear down the whole setup
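For illustration only, the four endpoints above can be sketched as a stripped-down service using just the Python standard library; the handler below fakes log storage with an in-memory dict and is a hypothetical stand-in, not the master's real implementation:

```python
# Hypothetical sketch of the master's four endpoints using only the
# standard library; log contents are faked with an in-memory dict.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

LOGS = {"master.log": "master started\n"}  # stand-in for real log files


class MasterHandler(BaseHTTPRequestHandler):
    def _reply(self, body, code=200):
        data = body.encode("utf-8")
        self.send_response(code)
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)

    def do_GET(self):
        if self.path == "/status":
            self._reply(LOGS["master.log"])        # return master log
        elif self.path == "/logs":
            self._reply(json.dumps(sorted(LOGS)))  # list of log file names
        elif self.path.startswith("/log/"):
            name = self.path[len("/log/"):]        # a particular log by name
            if name in LOGS:
                self._reply(LOGS[name])
            else:
                self._reply("not found", code=404)
        else:
            self._reply("not found", code=404)

    def do_POST(self):
        if self.path == "/cleanup":
            # the real master would terminate instances and clean up here
            self._reply("tearing down")
        else:
            self._reply("not found", code=404)

    def log_message(self, fmt, *args):  # silence per-request logging
        pass


def serve(port=0):
    """Build the service; port 0 picks a free ephemeral port."""
    return HTTPServer(("127.0.0.1", port), MasterHandler)
```

The client `--action status` and `--action cleanup` commands shown earlier are thin wrappers over HTTP calls of this shape.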
### Parameters

TBD, please refer to client/cluster_launcher.py for now.

### Troubleshooting

TBD
Lines changed: 7 additions & 0 deletions

```Dockerfile
FROM python:2.7.14-stretch

ENV HOME /root
COPY ./ /root/
WORKDIR /root
RUN pip install -r /root/requirements.txt
ENTRYPOINT ["python", "cluster_launcher.py"]
```
