Setting up: Spark cluster

Sarah Anoke edited this page May 5, 2020 · 4 revisions

The Spark cluster for this project consists of 1 master node and 8 worker nodes (i.e., 9 AWS EC2 instances). Before starting this setup, it was helpful to compile a personal list of the cluster nodes' public and private IPs for reference.

| Name | Instance type | Description | Resources (per node) |
| --- | --- | --- | --- |
| 1 master | m5.4xlarge | Handles resource allocation | 16 vCPUs, 64GB memory, 8GB storage |
| 8 workers | m5.2xlarge | Run executors that do the work | 8 vCPUs, 32GB memory, 16GB storage |

These notes largely follow this great tutorial, with adjustments to my particular computing architecture.

[Note] My PostgreSQL database setup occurred on this same instance, with an additional volume added.

1 - Security

Create a security group for the cluster, and add all nodes to the group. The security settings should

  • Allow nodes in the cluster to communicate with each other
  • Allow nodes in the cluster to download from the Internet (for updating packages and such)
  • Allow us to SSH into them to do installations
  • Open ports 8080 (for the web UI) and 7077 (for the master)

Inbound

| Type | Protocol | Port Range | Source | Description |
| --- | --- | --- | --- | --- |
| Custom TCP Rule | TCP | 8080 | < your IP > | access Spark admin UI |
| All traffic | All | All | sg-xxx (security group ID) | for node X to talk to node Y |
| Custom TCP Rule | TCP | 7077 | < your IP > | Spark port |
| Custom TCP Rule | TCP | 4040 | < your IP > | Spark context web UI |
| SSH | TCP | 22 | < your IP > | SSH into machine |

Outbound

| Type | Protocol | Port Range | Destination | Description |
| --- | --- | --- | --- | --- |
| All traffic | All | All | sg-xxx (security group ID) | for node X to receive data from node Y |
| All traffic | All | All | 0.0.0.0/0 | download from external websites |
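If you prefer scripting over the console, the rules above can also be created with the AWS CLI. This is a sketch, not the exact commands used for this project; the group name, VPC ID, `sg-xxx` ID, and `<your-ip>` are placeholders you would substitute.

```shell
# create the security group (note the group ID that is returned)
aws ec2 create-security-group \
  --group-name spark-cluster \
  --description "Spark cluster nodes" \
  --vpc-id vpc-xxx

# allow all traffic between members of the group (node X to node Y)
aws ec2 authorize-security-group-ingress \
  --group-id sg-xxx --protocol -1 --source-group sg-xxx

# open the Spark ports and SSH to your own IP only
for port in 8080 7077 4040 22; do
  aws ec2 authorize-security-group-ingress \
    --group-id sg-xxx --protocol tcp --port "$port" --cidr <your-ip>/32
done
```

Outbound traffic is allowed by default in a new security group, so no extra egress rules are usually needed.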

2 - Set up master node

1. Install Java 8 (the version supported by Spark 2.4.4) and Scala.

```shell
# update package list
sudo apt update

# install Java 8
sudo apt install openjdk-8-jre-headless

# confirm the version
java -version

# install Scala
sudo apt install scala

# check the version
scala -version
```
2. Set up keyless SSH for easier communication with the workers, by running the following on the master only.

```shell
sudo apt install openssh-server openssh-client

# create an RSA key pair
cd ~/.ssh
ssh-keygen -t rsa -P ""
# ...when asked for a file name, enter: id_rsa
```
3. Install Spark 2.4.4.

```shell
# download Spark
wget http://apache.claz.org/spark/spark-2.4.4/spark-2.4.4-bin-hadoop2.7.tgz

# extract the files
tar xvf spark-2.4.4-bin-hadoop2.7.tgz

# move the files
sudo mv spark-2.4.4-bin-hadoop2.7/ /usr/local/spark

# edit the file `~/.bash_profile` and add the following line:
export PATH=/usr/local/spark/bin:$PATH

# load the new configuration
source ~/.bash_profile
```
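As a quick sanity check that the `PATH` change took effect, the Spark launcher scripts can report their own version:

```shell
# should print the Spark 2.4.4 banner if /usr/local/spark/bin is on PATH
spark-submit --version
```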

3 - Create a worker node

This process should be repeated for each worker node.

1. Install Java 8 (the version supported by Spark 2.4.4) and Scala.

```shell
# update package list
sudo apt update

# install Java 8
sudo apt install openjdk-8-jre-headless

# check the version
java -version

# install Scala
sudo apt install scala

# check the version
scala -version
```
2. Complete the setup of keyless SSH.
  • Manually copy the contents of the ~/.ssh/id_rsa.pub file (located on the master node) into the ~/.ssh/authorized_keys file on each worker.
    • The entire contents should be on one line.
    • The contents start with ssh-rsa and end with ubuntu@some_ip; run cat ~/.ssh/id_rsa.pub on the master node to see them.
    • If the ~/.ssh/authorized_keys file already contains entries, add the new key on its own line, separated from the others by a line break.
  • Test the connection by SSHing from the master into the worker, using its private IP, e.g. for Worker 1:

```shell
ssh -i ~/.ssh/id_rsa ubuntu@<worker-privateIP>
```
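If the master can already reach the workers using the AWS keypair chosen at launch, the manual copy-and-paste can be replaced by a one-liner run from the master. The keypair path `~/aws-key.pem` here is a placeholder for whatever your launch key is called:

```shell
# append the master's public key to a worker's authorized_keys
# (~/aws-key.pem and <worker-privateIP> are placeholders)
cat ~/.ssh/id_rsa.pub | ssh -i ~/aws-key.pem ubuntu@<worker-privateIP> \
  'cat >> ~/.ssh/authorized_keys'
```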
3. Install Spark 2.4.4.

```shell
# download Spark
wget http://apache.claz.org/spark/spark-2.4.4/spark-2.4.4-bin-hadoop2.7.tgz

# extract the files
tar xvf spark-2.4.4-bin-hadoop2.7.tgz

# move the files
sudo mv spark-2.4.4-bin-hadoop2.7/ /usr/local/spark

# edit the file `~/.bash_profile` and add the following line:
export PATH=/usr/local/spark/bin:$PATH

# load the new configuration
source ~/.bash_profile
```

4 - Configure master to track workers

1. Tell Spark where Java is installed.
  • If the file /usr/local/spark/conf/spark-env.sh doesn't exist, copy it from the template:

```shell
cp /usr/local/spark/conf/spark-env.sh.template /usr/local/spark/conf/spark-env.sh
```

  • Find out where Java is installed. I used update-alternatives --config java to find the installation path, and removed the /bin/java part.
  • Specify where Java is installed, the master node's private IP address, and our preference for Python 3 by adding the following lines to /usr/local/spark/conf/spark-env.sh.

```shell
# add the following to conf/spark-env.sh
export SPARK_MASTER_HOST=<master-privateIP>
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/jre
export PYSPARK_PYTHON=/usr/bin/python3
export PYSPARK_DRIVER_PYTHON=/usr/bin/python3
```
2. Specify the private IPs of the worker nodes by listing them in /usr/local/spark/conf/slaves.

```
# contents of conf/slaves
<worker-1-privateIP>
<worker-2-privateIP>
<worker-3-privateIP>
<worker-4-privateIP>
<worker-5-privateIP>
<worker-6-privateIP>
<worker-7-privateIP>
<worker-8-privateIP>
```
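Each worker starts with its own local copy of the configuration, so it can help to push the finished spark-env.sh from the master out to every worker. A sketch, assuming keyless SSH is working and Spark lives at /usr/local/spark on every node:

```shell
# copy the Spark environment config from the master to each worker
# listed in conf/slaves (skipping comment lines)
for host in $(grep -v '^#' /usr/local/spark/conf/slaves); do
  scp /usr/local/spark/conf/spark-env.sh "ubuntu@${host}:/usr/local/spark/conf/"
done
```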

5 - Start the cluster

Start the cluster by running

```shell
sh /usr/local/spark/sbin/start-all.sh
```

and monitor its status at http://<master-public-ip>:8080.

Stop the cluster by running

```shell
sh /usr/local/spark/sbin/stop-all.sh
```
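To confirm the workers are actually doing work (and not just registered in the UI), one option is the SparkPi example that ships with the Spark 2.4.4 distribution; `<master-privateIP>` is a placeholder for the value set in spark-env.sh:

```shell
# run the bundled SparkPi example against the standalone master
/usr/local/spark/bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master spark://<master-privateIP>:7077 \
  /usr/local/spark/examples/jars/spark-examples_2.11-2.4.4.jar 100
```

While it runs, the job should appear under "Running Applications" in the web UI at port 8080, and the driver log ends with an approximation of pi.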
