First, on your local machine:
- Make sure Python 3 is installed on the local machine. Then install Ray version 1.3 and Boto3 with:

      pip install ray==1.3 boto3
- Configure your AWS credentials (aws_access_key_id and aws_secret_access_key) in ~/.aws/credentials as described here. Your ~/.aws/credentials should look like the following:

      [default]
      aws_access_key_id=XXXXXXXX
      aws_secret_access_key=YYYYYYYY

  Change the permission of this file:

      chmod 600 ~/.aws/credentials
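Before launching any nodes, it can help to confirm that the credentials file contains both keys and has the 600 permission set above. The following is a minimal sketch and not part of Hoplite; the function name check_aws_credentials is hypothetical, and it assumes the file layout shown in the example.

```shell
# Hypothetical helper: sanity-check an AWS credentials file.
# Usage: check_aws_credentials [path]   (defaults to ~/.aws/credentials)
check_aws_credentials() {
    local file="${1:-$HOME/.aws/credentials}"
    local key mode

    if [ ! -f "$file" ]; then
        echo "missing: $file" >&2
        return 1
    fi

    # Both keys from the example layout must be present.
    for key in aws_access_key_id aws_secret_access_key; do
        if ! grep -q "^[[:space:]]*${key}[[:space:]]*=" "$file"; then
            echo "no ${key} entry in $file" >&2
            return 1
        fi
    done

    # GNU stat uses -c, BSD/macOS stat uses -f.
    mode="$(stat -c '%a' "$file" 2>/dev/null || stat -f '%Lp' "$file")"
    if [ "$mode" != "600" ]; then
        echo "permissions are $mode, expected 600" >&2
        return 1
    fi
    echo "ok"
}
```

Calling check_aws_credentials with no argument checks ~/.aws/credentials and prints ok when both checks pass.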
Please contact Siyuan Zhuang (s.z@berkeley.edu) for a configured AMI. The following instructions set up the cluster from scratch:
Start an AWS node with initial.yaml and connect to the node:
    ray up initial.yaml
    ray attach initial.yaml # ssh into the AWS instance

Some experiments require a shared file system for proper logging. Here we use AWS EFS.
- Create an EFS on region us-east-1. This is used as an NFS for all nodes in the cluster.
- Check your created EFS on https://console.aws.amazon.com/efs/home?region=us-east-1#/file-systems/. You can see the EFS File system ID ("fs-********") on the page.
- Add the security group ID of the node you just started (it can be found on the AWS Management Console) to the EFS so that your node can access the EFS (link to manage EFS network access: https://console.aws.amazon.com/efs/home?region=us-east-1#/file-systems/{Your EFS file system ID}/network-access).
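The network-access link above depends on your own file system ID, so it can be convenient to generate it. The sketch below is illustrative only: the function name efs_network_access_url is hypothetical, and the check that an ID is "fs-" followed by lowercase hexadecimal digits is an assumption based on the "fs-********" form shown on the console.

```shell
# Hypothetical helper: print the EFS network-access console URL for a file system ID.
# Usage: efs_network_access_url <fs-id> [region]   (region defaults to us-east-1)
efs_network_access_url() {
    local fs_id="$1"
    local region="${2:-us-east-1}"

    # Assumption: IDs look like "fs-" followed by lowercase hex digits.
    if ! printf '%s\n' "$fs_id" | grep -Eq '^fs-[0-9a-f]+$'; then
        echo "not a valid-looking EFS file system ID: $fs_id" >&2
        return 1
    fi

    echo "https://console.aws.amazon.com/efs/home?region=${region}#/file-systems/${fs_id}/network-access"
}
```

The function rejects an unfilled placeholder such as {Your EFS file system ID}, which catches a common copy-and-paste mistake before it reaches the mount command.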
You should now be connected to the AWS instance over SSH. Run the following commands on the AWS instance:
-
Install the efs-utils to mount the EFS on the node:
    git clone https://github.com/aws/efs-utils
    cd efs-utils
    ./build-deb.sh
    sudo apt-get -y install ./build/amazon-efs-utils*deb
It is normal to see the following errors when installing the package:

    E: Could not get lock /var/lib/dpkg/lock-frontend - open (11: Resource temporarily unavailable)
    E: Unable to acquire the dpkg frontend lock (/var/lib/dpkg/lock-frontend), is another process using it?

This means the machine is still booting up and installing packages. Wait until the package manager is ready (usually 2-4 minutes) and run the install command again.
Then mount the EFS on the node by:
    mkdir -p ~/efs
    sudo mount -t efs {Your EFS file system ID}:/ ~/efs
    sudo chmod 777 ~/efs
If this hangs or the connection times out, make sure the security groups are configured correctly.
-
Clone Hoplite, install its dependencies, and then compile Hoplite:

    # You **must** clone the repo under EFS.
    cd ~/efs && git clone https://github.com/suquark/hoplite.git
    cd hoplite
    ./install_dependencies.sh
    mkdir build
    cd build
    cmake -DCMAKE_BUILD_TYPE=Release ..
    make -j
Note that Hoplite should be compiled before activating the conda environment; otherwise the Protobuf library in the conda environment will cause compilation errors.
-
Activate conda environment:
    conda activate
    echo "conda activate" >> ~/.bashrc
-
Install Python libraries:

    pip install 'ray[all]==1.3' 'ray[serve]==1.3' torchvision==0.8.2 mpi4py efficientnet_pytorch
-
Install the Hoplite Python library:

    cd ~/efs/hoplite
    pip install -e python
    cp build/notification python/hoplite/
    ./python/setup.sh
-
Configure SSH for MPI:

    echo -e "Host *\n StrictHostKeyChecking no" >> ~/.ssh/config
    sudo chmod 400 ~/.ssh/config
-
Set up the SSH key:

    ssh-keygen
    cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
-
Create an AMI on the AWS console. See EC2 -> Instances -> Actions -> Image and templates -> Create image. Set the image name (e.g. hoplite-artifact-ami) and then create the image.
-
Go to the AMIs tab on the AWS console. When the AMI is ready, turn off the instance via:

    ray down initial.yaml
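If you script the on-instance steps above, two of them benefit from small helpers: the efs-utils install may need to be repeated until the dpkg lock is released, and the lines appended to ~/.bashrc, ~/.ssh/config, and ~/.ssh/authorized_keys accumulate duplicates if the steps are run twice. The following is a minimal sketch under those assumptions; retry and append_once are hypothetical names, not part of Hoplite or Ray.

```shell
# Hypothetical helper: rerun a command until it succeeds.
# Usage: retry <max_attempts> <delay_seconds> <command> [args...]
retry() {
    local attempts="$1" delay="$2"
    shift 2
    local n=1
    until "$@"; do
        if [ "$n" -ge "$attempts" ]; then
            echo "giving up after $n attempts: $*" >&2
            return 1
        fi
        n=$((n + 1))
        sleep "$delay"
    done
}

# Hypothetical helper: append a line to a file only if that exact line is absent.
# Usage: append_once <line> <file>
append_once() {
    local line="$1" file="$2"
    touch "$file"
    grep -qxF -- "$line" "$file" || printf '%s\n' "$line" >> "$file"
}

# Example usage on the instance (commented out; the commands come from the steps above):
# retry 10 30 sudo apt-get -y install ./build/amazon-efs-utils*deb
# append_once "conda activate" ~/.bashrc
```

The attempt count and delay in the example (10 attempts, 30 seconds apart) are illustrative values chosen to cover the 2-4 minute wait mentioned earlier.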
- Create a placement group on the AWS Management Console. See EC2 -> Placement Groups. Choose the Cluster placement strategy. This ensures that the interconnection bandwidth among the nodes in the cluster is high.
- Replace the {image-id} in cluster.yaml with the ID of the AMI you just created and {group-name} with the name of the placement group you just created.
- Replace {efs-id} with your EFS file system ID.
- Replace SecurityGroupIds with the security group ID created by initial.yaml.
- Start the cluster and connect to the head node via:

      ray up cluster.yaml
      ray attach cluster.yaml

If the node fails to connect to EFS (or the cluster takes forever to spin up), check that the node's security group ID has been added to the EFS, as described earlier.
If everything works, take down the cluster with ray down cluster.yaml, and remember to save your cluster.yaml.
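The placeholder replacements in cluster.yaml described above can also be scripted. The sketch below is one possible approach, not the project's own tooling: fill_cluster_yaml is a hypothetical name, and it only handles the three brace placeholders ({image-id}, {group-name}, {efs-id}). The SecurityGroupIds value still has to be edited by hand, since the instructions do not define a placeholder for it.

```shell
# Hypothetical helper: substitute the three brace placeholders in a cluster file
# and print the result to standard output.
# Usage: fill_cluster_yaml <template> <image-id> <group-name> <efs-id>
fill_cluster_yaml() {
    local template="$1" image_id="$2" group_name="$3" efs_id="$4"

    # Assumes the values contain no "|", "&", or "\" characters,
    # which sed would treat specially in the replacement text.
    sed -e "s|{image-id}|${image_id}|g" \
        -e "s|{group-name}|${group_name}|g" \
        -e "s|{efs-id}|${efs_id}|g" \
        "$template"
}

# Example usage (commented out; all three values are made-up placeholders):
# fill_cluster_yaml cluster.yaml ami-0abc my-placement-group fs-0123abcd > cluster.filled.yaml
```

Writing to a separate output file keeps the original template intact, so the same cluster.yaml can be reused with a different AMI or placement group.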
Here is an example of a configured cluster file.