Step-by-step guide for setting up a 3-node Apache Spark cluster on AWS EC2.
This lab provides hands-on experience with setting up and managing a distributed Apache Spark cluster on Amazon EC2. You will learn how to create, configure, and use a Spark cluster through three progressive problems:
- Manual Setup - Build the cluster step by step to understand all components
- Automated Setup - Use automation scripts for rapid deployment
- Data Analysis - Run real-world data processing jobs on your cluster
By completing this lab, you will gain practical experience with:
- Creating and configuring EC2 instances using AWS CLI
- Setting up network security groups for cluster communication
- Configuring SSH keys and passwordless access between nodes
- Installing and configuring Apache Spark in distributed mode
- Monitoring Spark jobs via Web UIs
- Running PySpark jobs on a distributed cluster
- Processing large-scale data stored in Amazon S3
- Proper cluster resource cleanup and cost management
Prerequisites:
- AWS account with appropriate IAM permissions
- Access to an EC2 instance (Linux) to run the setup commands
- Basic familiarity with terminal/command line
- AWS CLI configured with credentials
- Your laptop's public IP address (get from https://ipchicken.com/)
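If you prefer the command line, you can also look up your public IP with a quick request to a "what's my IP" endpoint. This is just an alternative to the site above; `checkip.amazonaws.com` is one such service:

```bash
# Print the public IP your requests appear to come from
curl -s https://checkip.amazonaws.com
```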
This lab is probably the most important one for your project. The skills you learn here - setting up a cluster, running code both locally and on a distributed cluster, cleaning up resources, viewing metrics in the Spark UI - are critical for using Spark in your final project.
Make sure you are very comfortable with all these operations and can easily do them repeatedly. In fact, it might be a good idea to repeat this lab once you have completed and submitted it so that you become super conversant with:
- Setting up a cluster from scratch
- Running code in distributed mode
- Monitoring jobs through the Spark Web UIs
- Troubleshooting issues
- Cleaning up resources properly
The step-by-step guide in SPARK_CLUSTER_SETUP.md gives you insight into what actually goes on when setting up a distributed cluster. All of these are important foundational concepts that apply to having multiple machines communicate and work together. While the implementation is AWS-specific, these concepts are cloud-agnostic and apply to all cloud providers (AWS, Azure, GCP, etc.).
Understanding these fundamentals will make you proficient in distributed computing, regardless of which cloud platform you use in your career.
Before diving into the cluster setup, let's review last week's NYC TLC problem - but this time in a Jupyter notebook. Sometimes it's easier to work with and experiment with code in a notebook environment rather than running standalone Python scripts.
Task: Run the NYC TLC Problem 1 analysis in a Jupyter notebook with Spark running locally.
All code has been provided for you in cluster-files/nyc_tlc_problem1.ipynb. Just open the notebook and run all the cells to confirm that Spark works correctly on your EC2 instance.
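If you want a quick sanity check before opening the notebook, a minimal snippet like the following (a sketch, not part of the provided notebook) confirms that a local SparkSession starts and can run a trivial job:

```python
from pyspark.sql import SparkSession

# Start Spark in local mode, using all cores on this EC2 instance
spark = SparkSession.builder \
    .appName("local-sanity-check") \
    .master("local[*]") \
    .getOrCreate()

# A trivial distributed computation: count 1,000 generated rows
print(spark.range(1000).count())  # expected output: 1000

spark.stop()
```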
Objective: Gain hands-on understanding of every component involved in creating a Spark cluster.
Follow the detailed instructions in SPARK_CLUSTER_SETUP.md to:
- Manually create EC2 instances
- Configure security groups and network rules
- Set up SSH keys and passwordless authentication
- Install Java, Python, and Spark on each node
- Configure master and worker nodes
- Start the Spark cluster
- Verify cluster operation through Web UIs
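As a preview of what SPARK_CLUSTER_SETUP.md walks you through, the first two manual steps boil down to AWS CLI calls like the ones sketched below. Treat this as an illustration only: the group and key names, the AMI ID, and the assumption of a default VPC are placeholders, and the guide's exact commands are authoritative.

```bash
# Create a security group and allow SSH from your laptop only (placeholder names/IPs)
aws ec2 create-security-group --group-name spark-cluster-sg --description "Spark cluster"
aws ec2 authorize-security-group-ingress --group-name spark-cluster-sg \
    --protocol tcp --port 22 --cidr <YOUR_LAPTOP_IP>/32

# Generate a key pair for SSH access to the nodes
aws ec2 create-key-pair --key-name spark-cluster-key \
    --query 'KeyMaterial' --output text > spark-cluster-key.pem
chmod 400 spark-cluster-key.pem

# Launch 3 t3.large instances (1 master + 2 workers); <AMI_ID> is a placeholder
aws ec2 run-instances --image-id <AMI_ID> --count 3 --instance-type t3.large \
    --key-name spark-cluster-key --security-groups spark-cluster-sg
```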
Deliverables:
- Working 3-node Spark cluster (1 master + 2 workers)
- Screenshots of:
- Spark Master Web UI showing connected workers (placeholder - to be provided)
- Spark Application UI showing a running job (placeholder - to be provided)
- Download result files from master node to your EC2 instance:
```bash
scp -i spark-cluster-key.pem ubuntu@$MASTER_PUBLIC_IP:~/spark-cluster/problem1_cluster.txt .
```
- Commit and push the following files:
  - `cluster-config.txt`
  - `cluster-ips.txt`
  - `problem1_cluster.txt` (cluster run output)
Cleanup: After completing Step 1, manually delete resources in this order:
- Terminate EC2 instances
- Delete key pair
- Delete security group
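For reference, the manual cleanup maps to AWS CLI calls roughly like the ones below. This is a sketch with placeholder IDs and names (and assumes the same names as the setup sketch above); adapt it to the resources you actually created.

```bash
# 1. Terminate the cluster instances (placeholder instance IDs)
aws ec2 terminate-instances --instance-ids i-xxxxxxxx i-yyyyyyyy i-zzzzzzzz

# 2. Delete the key pair (the local .pem file can then be removed as well)
aws ec2 delete-key-pair --key-name spark-cluster-key

# 3. Delete the security group (only after the instances are fully terminated)
aws ec2 delete-security-group --group-name spark-cluster-sg
```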
Objective: Learn how to automate cluster deployment for rapid provisioning.
Follow the instructions in AUTOMATION_README.md to:
- Run the automated setup script `setup-spark-cluster.sh` to create a 4-node cluster (1 master + 3 workers) automatically
- Understand the automation workflow
- Submit a Spark job to the cluster
- Use the automated cleanup script
Key Script:
```bash
./setup-spark-cluster.sh <YOUR_LAPTOP_IP>
```
This script automates everything from Step 1, including:
- Security group creation
- Key pair generation
- EC2 instance provisioning
- Software installation on all nodes
- Spark configuration
- Cluster startup
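To get a feel for how this kind of automation typically works (this is a simplified illustration, not the contents of setup-spark-cluster.sh), the core pattern is a loop that pushes the same install and configuration steps to every node over SSH:

```bash
#!/usr/bin/env bash
# Simplified illustration of the automation pattern (not the actual lab script).
# Assumes KEY_FILE and the node IP variables were captured during provisioning.
KEY_FILE=spark-cluster-key.pem
NODES=("$MASTER_PUBLIC_IP" "$WORKER1_PUBLIC_IP" "$WORKER2_PUBLIC_IP" "$WORKER3_PUBLIC_IP")

for ip in "${NODES[@]}"; do
  # Run the same install step on each node
  ssh -i "$KEY_FILE" -o StrictHostKeyChecking=no ubuntu@"$ip" \
    "sudo apt-get update -y && sudo apt-get install -y openjdk-17-jdk-headless"
done
```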
Deliverables:
- Download result files from master node to your EC2 instance:
```bash
# Use the SSH key from the setup (check cluster-config.txt for KEY_FILE name)
source cluster-config.txt
scp -i $KEY_FILE ubuntu@$MASTER_PUBLIC_IP:~/spark-cluster/*.txt .
```
- Commit and push the following files:
  - `cluster-config.txt`
  - `cluster-ips.txt`
  - `ssh_to_master_node.sh`
  - Output files from running the NYC TLC job (downloaded from master node)
Cleanup: Use the automated cleanup script:
```bash
./cleanup-spark-cluster.sh
```
Or manually delete resources as in Problem 1.
Objective: Apply your Spark cluster skills to analyze real-world social media data.
- Install Java (required for local PySpark development):
```bash
# Install Java 17 (required for PySpark 4.x)
sudo apt update
sudo apt install -y openjdk-17-jdk-headless

# Verify installation
java -version

# Set JAVA_HOME (add to ~/.bashrc for persistence)
export JAVA_HOME=/usr/lib/jvm/java-17-openjdk-amd64
export PATH=$PATH:$JAVA_HOME/bin
```
- Download a sample file for local development:
```bash
# Download one file to test locally first (saves to current directory)
aws s3 cp s3://dsan6000-datasets/reddit/parquet/comments/yyyy=2024/mm=01/comments_RC_2024-01.zst_97.parquet \
  reddit_sample.parquet \
  --request-payer requester
```
- Create S3 bucket for your data (after local testing works):
```bash
aws s3 mb s3://your-netid-spark-reddit
```
- Copy full Reddit data from shared dataset (after local testing works):
```bash
aws s3 cp s3://dsan6000-datasets/reddit/parquet/comments/yyyy=2024/mm=01/ \
  s3://your-netid-spark-reddit/reddit/comments/yyyy=2024/mm=01/ \
  --recursive \
  --request-payer requester \
  --metadata-directive REPLACE
```

Dataset: Reddit comments from January 2024 (Parquet format)
- Source: `s3://dsan6000-datasets/reddit/parquet/comments/yyyy=2024/mm=01/` (use `--request-payer requester` to access)
- Format: Parquet files with columns including `subreddit`, `body`, `author`, `created_utc`, etc.
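Before writing any analysis code, it can help to confirm the columns in the sample file you just downloaded. A minimal check (a sketch, assuming `reddit_sample.parquet` is in the current directory) looks like this:

```python
from pyspark.sql import SparkSession

# Local Spark session just for inspecting the sample file
spark = SparkSession.builder.appName("reddit-schema-check").getOrCreate()

df = spark.read.parquet("reddit_sample.parquet")
df.printSchema()          # confirm subreddit, body, author, created_utc are present
df.show(5, truncate=80)   # peek at a few rows
```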
Task: Analyze the Reddit comments data to answer the following questions:
- Dataset Statistics: Count the total number of unique comments and unique users (authors) in the dataset
- Most Popular Subreddits: Find the top 10 most active subreddits by comment count
- Temporal Analysis: Identify peak commenting hours (UTC) across all subreddits
Development Workflow:
- Local Development (Recommended Approach):
  - Download one sample file: `reddit_sample.parquet` (in current directory)
  - Create `reddit_analysis_local.py` that works with the local file (save in repo root)
  - Use a local Spark session (similar to the lab exercises):
    ```python
    spark = SparkSession.builder \
        .appName("Reddit_Local") \
        .getOrCreate()
    df = spark.read.parquet("reddit_sample.parquet")
    ```
  - Test your analysis logic on the sample file
  - Verify outputs are correct (see Deliverables section below for exact file names)
- Cluster Deployment:
  - Once the local version works, create `reddit_analysis_cluster.py`
  - Modify the code to use the cluster master URL (see nyc_tlc_problem1_cluster.py for reference; a sketch of these changes appears after this workflow):
    - Change `.getOrCreate()` to `.master(master_url).getOrCreate()`
    - Change the file path from local to S3A: `s3a://your-netid-spark-reddit/...`
    - Add the S3A configuration (copy from the NYC TLC example)
    - Accept the master URL from a command line argument
  - Copy the full dataset to your S3 bucket
  - Copy your script to the cluster master node:
    ```bash
    # Load cluster configuration
    source cluster-config.txt

    # Copy your cluster script to master node
    scp -i $KEY_FILE reddit_analysis_cluster.py ubuntu@$MASTER_PUBLIC_IP:~/
    ```
  - SSH to the master node and run on your Spark cluster:
    ```bash
    ssh -i $KEY_FILE ubuntu@$MASTER_PUBLIC_IP
    source cluster-ips.txt
    uv run python reddit_analysis_cluster.py spark://$MASTER_PRIVATE_IP:7077
    ```
⏱️ IMPORTANT - Expected Runtime: This job processes ~280 million Reddit comments and is expected to take 10-15 minutes to complete when running correctly on the cluster. This is normal! While the job is running:
- Monitor progress in the Spark Master UI at `http://$MASTER_PUBLIC_IP:8080` to see worker activity
- Watch the Spark Application UI at `http://$MASTER_PUBLIC_IP:4040` to see the job DAG, stages, and tasks
- You'll see three separate analyses run sequentially (dataset stats, top subreddits, peak hours)
- Each analysis will read the data from S3 and perform distributed computations across your worker nodes

Don't worry if it seems slow - you're processing hundreds of millions of records! Use the Web UIs to understand what Spark is doing behind the scenes.
- Download results from the cluster:
  ```bash
  # From your EC2 instance (not on the cluster)
  source cluster-config.txt

  # Download the cluster output CSV files (already have _cluster suffix)
  scp -i $KEY_FILE ubuntu@$MASTER_PUBLIC_IP:~/spark-cluster/dataset_stats_cluster.csv .
  scp -i $KEY_FILE ubuntu@$MASTER_PUBLIC_IP:~/spark-cluster/top_subreddits_cluster.csv .
  scp -i $KEY_FILE ubuntu@$MASTER_PUBLIC_IP:~/spark-cluster/peak_hours_cluster.csv .
  ```
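As referenced above, here is a sketch of what the cluster-mode changes to the Spark session typically look like. The exact S3A settings should be copied from nyc_tlc_problem1_cluster.py; the snippet below only marks where they go and uses a placeholder bucket name.

```python
import sys
from pyspark.sql import SparkSession

# Master URL comes from the command line, e.g. spark://<MASTER_PRIVATE_IP>:7077
master_url = sys.argv[1]

spark = (
    SparkSession.builder
    .appName("Reddit_Cluster")
    .master(master_url)   # connect to the standalone cluster instead of local mode
    # .config(...)        # S3A settings go here - copy them from nyc_tlc_problem1_cluster.py
    .getOrCreate()
)

# Read the full dataset directly from S3 via the s3a:// scheme (placeholder bucket name)
df = spark.read.parquet("s3a://your-netid-spark-reddit/reddit/comments/yyyy=2024/mm=01/")
```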
Deliverables:
- `reddit_analysis_local.py` (local development version)
- `reddit_analysis_cluster.py` (cluster version)
- Local output CSV files:
  - `dataset_stats_local.csv` (Unique comments and users)
  - `top_subreddits_local.csv` (Top 10 most popular subreddits)
  - `peak_hours_local.csv` (Hourly comment distribution)
- Cluster output CSV files:
  - `dataset_stats_cluster.csv` (Unique comments and users)
  - `top_subreddits_cluster.csv` (Top 10 most popular subreddits)
  - `peak_hours_cluster.csv` (Hourly comment distribution)
- Screenshots of:
- Spark Web UI showing the job execution
- Application UI showing the DAG (Directed Acyclic Graph)
Hints:
- Use `spark.read.parquet()` to load data
- Count unique values using `df.select("column").distinct().count()` or `countDistinct("column")`
- Extract hour from `created_utc` using `hour(from_unixtime(col("created_utc")))`
- Use `groupBy()` and `agg()` for aggregations
- Use `orderBy(desc("count"))` for sorting
- Save results using `.toPandas().to_csv()` or `df.write.csv()`

A short sketch combining these hints appears after the warning note below.
Expected Warning (Normal Behavior):
When reading parquet files from S3, you may see a FileNotFoundException warning about streaming metadata:
```
WARN FileStreamSink: Assume no metadata directory. Error while looking for metadata directory in the path: s3a://...
java.io.FileNotFoundException: No such file or directory: s3a://your-netid-spark-reddit/...
```
This is normal and harmless. Spark checks if the path contains streaming metadata (used for Spark Structured Streaming), and when it doesn't find any, it proceeds to read the files as regular batch data. Your job will continue successfully despite this warning.
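Putting the hints above together, here is a minimal sketch of the local analysis, shown against the sample file. It assumes the columns listed in the dataset description and treats "unique comments" as distinct rows, which is one reasonable interpretation; output file names match the Deliverables section.

```python
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, desc, from_unixtime, hour

spark = SparkSession.builder.appName("Reddit_Local").getOrCreate()
df = spark.read.parquet("reddit_sample.parquet")

# 1. Dataset statistics: unique comments (distinct rows) and unique users (distinct authors)
unique_comments = df.distinct().count()
unique_users = df.select("author").distinct().count()
pd.DataFrame([{"unique_comments": unique_comments, "unique_users": unique_users}]) \
  .to_csv("dataset_stats_local.csv", index=False)

# 2. Top 10 most active subreddits by comment count
(df.groupBy("subreddit").count().orderBy(desc("count")).limit(10)
   .toPandas().to_csv("top_subreddits_local.csv", index=False))

# 3. Peak commenting hours (UTC), busiest first
(df.withColumn("hour_utc", hour(from_unixtime(col("created_utc"))))
   .groupBy("hour_utc").count().orderBy(desc("count"))
   .toPandas().to_csv("peak_hours_local.csv", index=False))
```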
Important: Running this cluster will incur AWS costs.
- Instance type: t3.large
- Number of instances:
- Problem 1: 3 instances (1 master + 2 workers) ≈ $0.20/hour
- Problem 2: 4 instances (1 master + 3 workers) ≈ $0.25/hour
- Problem 3: 4 instances + S3 storage ≈ $0.25/hour + minimal storage costs
- Estimated total cost: $6-8 for completing all three problems
Always remember to terminate your instances and delete S3 data when done to avoid unnecessary charges.
```
lab-spark-cluster/
├── README.md                    # This file - overview and 3-problem guide
├── SPARK_CLUSTER_SETUP.md       # Problem 1: Manual setup instructions
├── AUTOMATION_README.md         # Problem 2: Automated setup guide
├── setup-spark-cluster.sh       # Automated cluster creation script
├── cleanup-spark-cluster.sh     # Automated cluster cleanup script
├── reddit_analysis_local.py     # Problem 3: Local analysis script
├── reddit_analysis_cluster.py   # Problem 3: Cluster analysis script
├── cluster-files/               # Configuration files for cluster nodes
│   └── nyc_tlc_problem1_cluster.py
├── pyproject.toml               # Python dependencies
└── .gitignore                   # Git ignore file (includes *.pem, *.parquet)
```
- Clone this repository (if in a git environment) or navigate to the project directory:
  ```bash
  cd /home/ubuntu/lab-spark-cluster
  ```
- Start with Step 1 - Follow SPARK_CLUSTER_SETUP.md for manual setup
- Progress to Step 2 - Follow AUTOMATION_README.md for automated setup
- Complete Step 3 - Work on the Reddit data analysis problem
If you encounter issues:
- Check the Troubleshooting sections in the respective guides
- Verify all environment variables are set correctly
- Check the Spark logs on master and worker nodes: `$SPARK_HOME/logs/` (see the example after this list)
- Ensure security group rules are properly configured
- Verify AWS credentials and S3 bucket permissions
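For example, when a worker fails to register with the master, the quickest signal is usually in the most recent log files. A small sketch, run on the affected node:

```bash
# List logs newest-first, then show the tail of the master/worker logs on this node
ls -lt $SPARK_HOME/logs/
tail -n 50 $SPARK_HOME/logs/*.out
```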
- Never commit `.pem` files - They are in `.gitignore`
- Limit security group access - Only allow your IP addresses
- Use IAM roles - Instances use `LabInstanceProfile` for S3 access
- Delete resources - Always clean up when done
MIT License