Setting up: Spark cluster
The Spark cluster for this project consists of 1 master node and 8 worker nodes (i.e., 9 AWS EC2 instances). Before starting this setup, it was helpful to create a personal list of the cluster nodes' public and private IPs for reference.
| Name | Instance type | Description | Resources |
|---|---|---|---|
| 1 Master | m5.4xlarge | Handles resource allocation | 16 vCPUs, 64GB memory, 8GB storage |
| 8 Workers | m5.2xlarge | Run executors that do the work | 8 vCPUs, 32GB memory, 16GB storage (per node) |
These notes largely follow this great tutorial, with adjustments for my particular computing architecture.
[Note] My PostgreSQL database setup occurred on this same instance, with an additional volume added.
Create a security group for the cluster, and add all nodes to the group. The security settings should
- Allow nodes in the cluster to communicate with each other
- Allow nodes in the cluster to download from the Internet (for updating packages and such)
- Allow us to SSH into them to do installations
- Open ports 8080 (for the web UI) and 7077 (for the master)
For the master node:

| Type | Protocol | Port Range | Source | Description |
|---|---|---|---|---|
| Custom TCP Rule | TCP | 8080 | < your IP > | access Spark admin UI |
| All traffic | all | all | sg-xxx (security group ID) | for node X to talk to node Y |
| Custom TCP Rule | TCP | 7077 | < your IP > | Spark port |
| Custom TCP Rule | TCP | 4040 | < your IP > | Spark context web UI |
| SSH | TCP | 22 | < your IP > | SSH into machine |
For the worker nodes:

| Type | Protocol | Port Range | Source | Description |
|---|---|---|---|---|
| All traffic | All | All | sg-xxx (security group ID) | for node X to receive data from node Y |
| All traffic | All | All | 0.0.0.0/0 | all download from external websites |
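The inbound rules in the tables above can also be expressed as AWS CLI calls. This is a minimal sketch, assuming the AWS CLI is installed and configured; `sg-xxx` and the `/32` address are placeholders, and the commands are written to a file for review rather than run directly:

```shell
# sketch: generate the inbound-rule commands from the tables above
# (placeholders: SG_ID and MY_IP must be replaced with real values)
SG_ID="sg-xxx"
MY_IP="203.0.113.10/32"   # example address; use your own IP with /32
{
  for port in 8080 7077 4040 22; do
    echo "aws ec2 authorize-security-group-ingress --group-id $SG_ID --protocol tcp --port $port --cidr $MY_IP"
  done
  # node-to-node: allow all traffic between members of the same group
  echo "aws ec2 authorize-security-group-ingress --group-id $SG_ID --protocol -1 --source-group $SG_ID"
} > open-spark-ports.sh
cat open-spark-ports.sh   # review, then apply with: sh open-spark-ports.sh
```

Reviewing the generated file before executing it guards against opening ports to the wrong source.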
- Install Java 8 (the version supported by Spark 2.4.4) and Scala.

```bash
# update package list
sudo apt update
# install Java 8
sudo apt install openjdk-8-jre-headless
# confirm the version
java -version
# install scala
sudo apt install scala
# check the version
scala -version
```

- Set up keyless SSH for easier communication with the workers, by running the following code on the master only.
```bash
sudo apt install openssh-server openssh-client
# create an RSA key pair
cd ~/.ssh
ssh-keygen -t rsa -P ""
# ...when asked for a file name, enter: id_rsa
```

- Install Spark 2.4.4.
```bash
# download Spark
wget http://apache.claz.org/spark/spark-2.4.4/spark-2.4.4-bin-hadoop2.7.tgz
# extract the files
tar xvf spark-2.4.4-bin-hadoop2.7.tgz
# move the files
sudo mv spark-2.4.4-bin-hadoop2.7/ /usr/local/spark
# edit the file `~/.bash_profile` and add the following text:
export PATH=/usr/local/spark/bin:$PATH
# load the new configuration
source ~/.bash_profile
```

This process should be repeated for each worker node.
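Once keyless SSH is working (completed later in this section), the per-worker installs below can also be pushed from the master instead of run by hand. A hedged sketch; the worker IPs are placeholders, and the generated commands are written to a file for review rather than executed:

```shell
# sketch: generate one install command per worker
# (IPs are placeholders; extend WORKERS with all eight private IPs)
WORKERS="10.0.1.11 10.0.1.12"
for ip in $WORKERS; do
  echo "ssh ubuntu@$ip 'sudo apt update && sudo apt install -y openjdk-8-jre-headless scala'"
done > install-workers.sh
cat install-workers.sh   # review, then run with: sh install-workers.sh
```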
- Install Java 8 (the version supported by Spark 2.4.4) and Scala.

```bash
# update package list
sudo apt update
# install Java 8
sudo apt install openjdk-8-jre-headless
# check the version
java -version
# install scala
sudo apt install scala
# check the version
scala -version
```

- Complete setup of keyless SSH.
- Manually copy the contents of the `~/.ssh/id_rsa.pub` file (located on the master node) into the `~/.ssh/authorized_keys` file in each worker.
  - The entire contents should be in one line.
  - These contents start with `ssh-rsa` and end with `ubuntu@some_ip`; run `cat ~/.ssh/id_rsa.pub` on the master node to see them.
  - If the `~/.ssh/authorized_keys` file already contains stuff, paste these new contents on top and separate with a line break.
- Test the connection by using the master to SSH into the worker, using its private IP, e.g. for Worker 1:

```bash
ssh -i ~/.ssh/id_rsa ubuntu@<worker-privateIP>
```

- Install Spark 2.4.4.
```bash
# download Spark
wget http://apache.claz.org/spark/spark-2.4.4/spark-2.4.4-bin-hadoop2.7.tgz
# extract the files
tar xvf spark-2.4.4-bin-hadoop2.7.tgz
# move the files
sudo mv spark-2.4.4-bin-hadoop2.7/ /usr/local/spark
# edit the file `~/.bash_profile` and add the following text:
export PATH=/usr/local/spark/bin:$PATH
# load the new configuration
source ~/.bash_profile
```

- Tell Spark where Java is installed.
- If the `spark-env.sh` file doesn't exist, copy from the template using `cp /usr/local/spark/conf/spark-env.sh.template /usr/local/spark/conf/spark-env.sh`
- Find out where Java is installed. I used `update-alternatives --config java` to find the installation path, and removed the `/bin/java` part.
- Specify where Java is installed, the master node's private IP address, and our preference for Python 3 within `/usr/local/spark/conf/spark-env.sh` by adding the following statements.
```bash
# add the following to conf/spark-env.sh
export SPARK_MASTER_HOST=<master-privateIP>
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/jre
export PYSPARK_PYTHON=/usr/bin/python3
export PYSPARK_DRIVER_PYTHON=/usr/bin/python3
```

- Specify the private IPs of the worker nodes, by listing them within `/usr/local/spark/conf/slaves`.
```
# contents of conf/slaves
<worker-1-privateIP>
<worker-2-privateIP>
<worker-3-privateIP>
<worker-4-privateIP>
<worker-5-privateIP>
<worker-6-privateIP>
<worker-7-privateIP>
<worker-8-privateIP>
```

Start the cluster by running

```bash
sh /usr/local/spark/sbin/start-all.sh
```

and monitor its status at `http://<master-public-ip>:8080`.
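Once the web UI shows all eight workers as ALIVE, the bundled SparkPi example makes a quick smoke test. A hedged sketch, assuming the examples path shipped in the Spark 2.4.4 tarball; the master IP is a placeholder, and the command is only written out for review here:

```shell
# sketch: build the spark-submit command for the bundled Pi example
# (MASTER_IP is a placeholder; run the generated script from the master node)
MASTER_IP="10.0.1.10"
SUBMIT="spark-submit --master spark://$MASTER_IP:7077 /usr/local/spark/examples/src/main/python/pi.py 10"
echo "$SUBMIT" > smoke-test.sh
cat smoke-test.sh   # review, then run with: sh smoke-test.sh
```

While the job runs, the Spark context web UI on port 4040 (opened in the security group above) becomes reachable for its duration.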
Stop the cluster by running

```bash
sh /usr/local/spark/sbin/stop-all.sh
```

📬 Create an issue if you have one.
(📝 instructions accurate as of Feb 2020)