Setting up: Spark cluster

Sarah Anoke edited this page May 5, 2020 · 4 revisions

The Spark cluster for this project consists of 1 master node and 8 worker nodes (i.e., 9 AWS EC2 instances). Before starting this setup, it was helpful to compile a personal list of the cluster nodes' public and private IPs for reference.

| Name | Instance type | Description | Resources (per node) |
| --- | --- | --- | --- |
| 1 master | m5.4xlarge | Handles resource allocation | 16 vCPUs, 64GB memory, 8GB storage |
| 8 workers | m5.2xlarge | Run executors that do the work | 8 vCPUs, 32GB memory, 16GB storage |

These notes largely follow this great tutorial, with adjustments to my particular computing architecture.

[Note] My PostgreSQL database setup occurred on this same instance, with an additional volume added.

1 - Security

Create a security group for the cluster, and add all nodes to the group. The security settings should

  • Allow nodes in the cluster to communicate with each other
  • Allow nodes in the cluster to download from the Internet (for updating packages and such)
  • Allow us to SSH into them to do installations
  • Open ports 8080 (for the web UI) and 7077 (for the master)

Inbound

| Type | Protocol | Port Range | Source | Description |
| --- | --- | --- | --- | --- |
| Custom TCP Rule | TCP | 8080 | < your IP > | access Spark admin UI |
| All traffic | All | All | sg-xxx (security group ID) | for node X to talk to node Y |
| Custom TCP Rule | TCP | 7077 | < your IP > | Spark port |
| Custom TCP Rule | TCP | 4040 | < your IP > | Spark context web UI |
| SSH | TCP | 22 | < your IP > | SSH into machine |

Outbound

| Type | Protocol | Port Range | Destination | Description |
| --- | --- | --- | --- | --- |
| All traffic | All | All | sg-xxx (security group ID) | for node X to receive data from node Y |
| All traffic | All | All | 0.0.0.0/0 | download from external websites |
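If you prefer scripting over the console, the rules above can also be created with the AWS CLI. This is a sketch, not the exact commands used for this project; the group name, VPC ID, `sg-xxx` ID, and `<your-ip>` are placeholders you would substitute.

```shell
# create the security group (note the group ID that is returned)
aws ec2 create-security-group \
  --group-name spark-cluster \
  --description "Spark cluster nodes" \
  --vpc-id vpc-xxx

# allow all traffic between members of the group (node X to node Y)
aws ec2 authorize-security-group-ingress \
  --group-id sg-xxx --protocol -1 --source-group sg-xxx

# open the Spark ports and SSH to your own IP only
for port in 8080 7077 4040 22; do
  aws ec2 authorize-security-group-ingress \
    --group-id sg-xxx --protocol tcp --port "$port" --cidr <your-ip>/32
done
```

Outbound traffic is allowed by default in a new security group, so no extra egress rules are usually needed.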

2 - Set up master node

1. Install Java 8 (the version supported by Spark 2.4.4) and Scala.

```shell
# update package list
sudo apt update

# install Java 8
sudo apt install openjdk-8-jre-headless

# confirm the version
java -version

# install Scala
sudo apt install scala

# check the version
scala -version
```
2. Set up keyless SSH for easier communication with the workers, by running the following on the master only.

```shell
sudo apt install openssh-server openssh-client

# create an RSA key pair
cd ~/.ssh
ssh-keygen -t rsa -P ""
# ...when asked for a file name, enter: id_rsa
```
3. Install Spark 2.4.4.

```shell
# download Spark
wget http://apache.claz.org/spark/spark-2.4.4/spark-2.4.4-bin-hadoop2.7.tgz

# extract the files
tar xvf spark-2.4.4-bin-hadoop2.7.tgz

# move the files
sudo mv spark-2.4.4-bin-hadoop2.7/ /usr/local/spark

# edit the file `~/.bash_profile` and add the following line:
export PATH=/usr/local/spark/bin:$PATH

# load the new configuration
source ~/.bash_profile
```
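As a quick sanity check that the `PATH` change took effect, the Spark launcher scripts can report their own version:

```shell
# should print the Spark 2.4.4 banner if /usr/local/spark/bin is on PATH
spark-submit --version
```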

3 - Create a worker node

This process should be repeated for each worker node.

1. Install Java 8 (the version supported by Spark 2.4.4) and Scala.

```shell
# update package list
sudo apt update

# install Java 8
sudo apt install openjdk-8-jre-headless

# check the version
java -version

# install Scala
sudo apt install scala

# check the version
scala -version
```
2. Complete the setup of keyless SSH.
  • Manually copy the contents of the ~/.ssh/id_rsa.pub file (located on the master node) into the ~/.ssh/authorized_keys file on each worker.
    • The entire contents should be on one line.
    • The contents start with ssh-rsa and end with ubuntu@some_ip; run cat ~/.ssh/id_rsa.pub on the master node to see them.
    • If the ~/.ssh/authorized_keys file already contains entries, add the new key on its own line, separated from the others by a line break.
  • Test the connection by SSHing from the master into the worker, using its private IP, e.g. for Worker 1:

```shell
ssh -i ~/.ssh/id_rsa ubuntu@<worker-privateIP>
```
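If the master can already reach the workers using the AWS keypair chosen at launch, the manual copy-and-paste can be replaced by a one-liner run from the master. The keypair path `~/aws-key.pem` here is a placeholder for whatever your launch key is called:

```shell
# append the master's public key to a worker's authorized_keys
# (~/aws-key.pem and <worker-privateIP> are placeholders)
cat ~/.ssh/id_rsa.pub | ssh -i ~/aws-key.pem ubuntu@<worker-privateIP> \
  'cat >> ~/.ssh/authorized_keys'
```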
3. Install Spark 2.4.4.

```shell
# download Spark
wget http://apache.claz.org/spark/spark-2.4.4/spark-2.4.4-bin-hadoop2.7.tgz

# extract the files
tar xvf spark-2.4.4-bin-hadoop2.7.tgz

# move the files
sudo mv spark-2.4.4-bin-hadoop2.7/ /usr/local/spark

# edit the file `~/.bash_profile` and add the following line:
export PATH=/usr/local/spark/bin:$PATH

# load the new configuration
source ~/.bash_profile
```

4 - Configure master to track workers

1. Tell Spark where Java is installed.
  • If the file /usr/local/spark/conf/spark-env.sh doesn't exist, copy it from the template:

```shell
cp /usr/local/spark/conf/spark-env.sh.template /usr/local/spark/conf/spark-env.sh
```

  • Find out where Java is installed. I used update-alternatives --config java to find the installation path, and removed the /bin/java part.
  • Specify where Java is installed, the master node's private IP address, and our preference for Python 3 by adding the following lines to /usr/local/spark/conf/spark-env.sh.

```shell
# add the following to conf/spark-env.sh
export SPARK_MASTER_HOST=<master-privateIP>
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/jre
export PYSPARK_PYTHON=/usr/bin/python3
export PYSPARK_DRIVER_PYTHON=/usr/bin/python3
```
2. Specify the private IPs of the worker nodes by listing them in /usr/local/spark/conf/slaves.

```
# contents of conf/slaves
<worker-1-privateIP>
<worker-2-privateIP>
<worker-3-privateIP>
<worker-4-privateIP>
<worker-5-privateIP>
<worker-6-privateIP>
<worker-7-privateIP>
<worker-8-privateIP>
```
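Each worker starts with its own local copy of the configuration, so it can help to push the finished spark-env.sh from the master out to every worker. A sketch, assuming keyless SSH is working and Spark lives at /usr/local/spark on every node:

```shell
# copy the Spark environment config from the master to each worker
# listed in conf/slaves (skipping comment lines)
for host in $(grep -v '^#' /usr/local/spark/conf/slaves); do
  scp /usr/local/spark/conf/spark-env.sh "ubuntu@${host}:/usr/local/spark/conf/"
done
```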

5 - Start the cluster

Start the cluster by running

```shell
sh /usr/local/spark/sbin/start-all.sh
```

and monitor its status at http://<master-public-ip>:8080.

Stop the cluster by running

```shell
sh /usr/local/spark/sbin/stop-all.sh
```
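To confirm the workers are actually doing work (and not just registered in the UI), one option is the SparkPi example that ships with the Spark 2.4.4 distribution; `<master-privateIP>` is a placeholder for the value set in spark-env.sh:

```shell
# run the bundled SparkPi example against the standalone master
/usr/local/spark/bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master spark://<master-privateIP>:7077 \
  /usr/local/spark/examples/jars/spark-examples_2.11-2.4.4.jar 100
```

While it runs, the job should appear under "Running Applications" in the web UI at port 8080, and the driver log ends with an approximation of pi.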
