Commit e20a057

add parameter section and minor fixes
1 parent 504e60a commit e20a057

1 file changed

+27
-4
lines changed

tools/aws_benchmarking/README.md

Lines changed: 27 additions & 4 deletions
@@ -77,10 +77,10 @@ Training nodes will run your `ENTRYPOINT` script with the following environment
7777
Now let's start the training process:
7878

7979
```bash
80-
docker run -i -v $HOME/.aws:/root/.aws -v <full path to your pem file>:/root/<key pare name>.pem \
80+
docker run -i -v $HOME/.aws:/root/.aws -v <full path to your pem file>:/root/<key pair name>.pem \
8181
putcn/paddle_aws_client \
8282
--action create \
83-
--key_name <your key pare name> \
83+
--key_name <your key pair name> \
8484
--security_group_id <your security group id> \
8585
--docker_image myreponame/paddle_benchmark \
8686
--pserver_count 2 \
@@ -154,8 +154,31 @@ Master exposes 4 major services:
154154

155155
### Parameters
156156

157-
TBD, please refer to client/cluster_launcher.py for now
157+
- key_name: required, the AWS key pair name.
158+
- security_group_id: required, the security group id associated with your VPC
159+
- vpc_id: the VPC in which to run the test. If not provided, this tool uses your default VPC.
160+
- subnet_id: the subnet in which to run the test. If not provided, this tool creates a new subnet for the test.
161+
- pserver_instance_type: your pserver instance type, c5.2xlarge by default, which is a compute optimized machine.
162+
- trainer_instance_type: your trainer instance type, p2.8xlarge by default, which is a GPU machine with 8 cards.
163+
- task_name: the name used to identify your job. If not provided, this tool generates one for you.
164+
- pserver_image_id: AMI id of the system image for pserver instances. Note that although the default image has nvidia-docker installed, the pserver is always launched with `docker` rather than `nvidia-docker`, so DO NOT initialize your training program with a GPU place.
165+
- pserver_command: the pserver start command. Format example: `python,vgg.py,batch_size:128,is_local:no`, which is translated to `python vgg.py --batch_size 128 --is_local no` when starting the training on the pserver. `--device CPU` is passed by default.
166+
- trainer_image_id: AMI id of the system image for trainer instances. The default image has nvidia-docker ready.
167+
- trainer_command: the trainer start command. The format is the same as for pserver_command. `--device GPU` is passed by default.
168+
- availability_zone: the AWS availability zone in which to place EC2 instances, us-east-2a by default.
169+
- trainer_count: Trainer count, 1 by default.
170+
- pserver_count: Pserver count, 1 by default.
171+
- action: create|cleanup|status, "create" by default.
172+
- pserver_port: the port for pserver to open service, 5436 by default.
173+
- docker_image: the training docker image id.
174+
- master_service_port: the port for master to open service, 5436 by default.
175+
- master_server_public_ip: the master service IP. This is required when action is not "create".
176+
- master_docker_image: master's docker image id, "putcn/paddle_aws_master:latest" by default
177+
- no_clean_up: when set to "yes", instances are not terminated after training finishes or fails. This is for debugging purposes, so that you can inspect the instances after the process is finished.
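
The comma-separated format used by `pserver_command` and `trainer_command` can be illustrated with a short Python sketch. This is not the code in `client/cluster_launcher.py`; the function name `translate_command` is hypothetical, and the sketch assumes that every `key:value` token becomes a `--key value` flag while all other tokens pass through unchanged. It does not add the default `--device` flag.

```python
def translate_command(spec: str) -> str:
    """Expand a comma-separated command spec into a command line.

    Hypothetical helper for illustration only; the real launcher may
    handle quoting, defaults, and edge cases differently.
    """
    parts = []
    for token in spec.split(","):
        if ":" in token:
            # "batch_size:128" becomes "--batch_size 128"
            key, value = token.split(":", 1)
            parts.append(f"--{key} {value}")
        else:
            # Plain tokens such as "python" or "vgg.py" are kept as-is.
            parts.append(token)
    return " ".join(parts)


if __name__ == "__main__":
    print(translate_command("python,vgg.py,batch_size:128,is_local:no"))
    # prints: python vgg.py --batch_size 128 --is_local no
```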
178+
158179

159180
### Trouble shooting
160181

161-
TBD
182+
1. How to check logs
183+
184+
The master log is served at `http://<masterip>:<masterport>/status`. You can list all log files at `http://<masterip>:<masterport>/logs` and access any one of them at `http://<masterip>:<masterport>/log/<logfilename>`.
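
As a rough sketch, the three log endpoints above can be assembled in Python as follows. The helper names, the placeholder IP `203.0.113.10`, and the file name `example.log` are assumptions for illustration; only the URL paths come from the text above.

```python
def master_urls(master_ip: str, master_port: int) -> dict:
    """Build the status and log-listing URLs for a master node (hypothetical helper)."""
    base = f"http://{master_ip}:{master_port}"
    return {"status": f"{base}/status", "logs": f"{base}/logs"}


def log_file_url(master_ip: str, master_port: int, log_filename: str) -> str:
    """Build the URL of a single log file (hypothetical helper)."""
    return f"http://{master_ip}:{master_port}/log/{log_filename}"


if __name__ == "__main__":
    # Placeholder address; substitute your master's public IP and service port.
    urls = master_urls("203.0.113.10", 5436)
    print(urls["status"])
    print(urls["logs"])
    print(log_file_url("203.0.113.10", 5436, "example.log"))
```

The resulting URLs can be opened in a browser or fetched with `curl` or `urllib.request.urlopen` once the master is running.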
