The SparkWinePredictor project focuses on building a parallel machine learning application to predict wine quality using Apache Spark's MLlib on Amazon AWS. This project involves training, validating, and testing a wine quality prediction model across multiple EC2 instances and deploying the model using Docker for simplified distribution and execution.
The project objectives are as follows:
- Parallel ML Model Training: Utilize Spark's distributed computing capabilities to train the model in parallel on 4 AWS EC2 instances using the provided TrainingDataset.csv.
- Model Validation and Optimization: Use the ValidationDataset.csv to validate and fine-tune the model, ensuring optimal performance.
- Model Testing: Evaluate the trained model's performance on unseen data using the F1 score as a key performance metric.
- Dockerized Deployment: Package the Spark application and trained model into a Docker container to enable seamless deployment on a single EC2 instance for prediction.
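The objectives above use the F1 score as the key metric for a multiclass problem. As a reference, one common variant (macro-averaged F1) can be sketched in plain Python; this is an illustrative helper only, not part of the project code, and note that MLlib's `MulticlassClassificationEvaluator` also supports weighted variants:

```python
def macro_f1(y_true, y_pred):
    """Macro-averaged F1: compute per-class F1, then average with equal weight."""
    labels = sorted(set(y_true) | set(y_pred))
    per_class = []
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        per_class.append(2 * precision * recall / (precision + recall)
                         if (precision + recall) else 0.0)
    return sum(per_class) / len(per_class)
```

Macro averaging treats rare quality scores the same as common ones, which makes it a strict metric on the imbalanced wine data.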
The project will employ Spark's MLlib to implement a simple logistic regression model for classification (with linear regression as a regression baseline), starting with basic models and exploring additional ML algorithms to improve performance. The application will predict wine quality scores (1 to 10) from the provided datasets.
This hands-on project showcases the power of Apache Spark for distributed machine learning, the versatility of MLlib for regression and classification, and the scalability of AWS cloud infrastructure for high-performance computing tasks.
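A minimal sketch of what the training and validation logic in app.py might look like. It assumes the datasets are semicolon-delimited CSVs with a `quality` label column (as in the common UCI wine data); the column names, paths, and hyperparameters here are assumptions, not the project's final implementation:

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

spark = SparkSession.builder.appName("WineQualityTraining").getOrCreate()

# Assumed layout: semicolon-delimited, header row, 'quality' label column
train = spark.read.csv("TrainingDataset.csv", header=True, inferSchema=True, sep=";")
valid = spark.read.csv("ValidationDataset.csv", header=True, inferSchema=True, sep=";")

# All non-label columns become the feature vector
features = [c for c in train.columns if c != "quality"]
assembler = VectorAssembler(inputCols=features, outputCol="features")

lr = LogisticRegression(labelCol="quality", featuresCol="features", maxIter=100)
model = lr.fit(assembler.transform(train))

predictions = model.transform(assembler.transform(valid))
evaluator = MulticlassClassificationEvaluator(labelCol="quality",
                                              predictionCol="prediction",
                                              metricName="f1")
print("Validation F1:", evaluator.evaluate(predictions))

model.save("wine-quality-model")  # reloaded later by the prediction container
spark.stop()
```

When submitted via spark-submit, the training stage is automatically parallelized across whichever executors the cluster provides.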
Description: Sequence diagram illustrating the execution model of Spark applications, showing the interaction between the driver, master, executors, and storage layer.
- Launch 4 EC2 instances on AWS to parallelize model training.
- SSH into the EC2 Instance
- Transfer Files to the Instance
- Install Java OpenJDK (required for Spark) and Apache Spark on all instances
- Configure the instances to run Spark on Ubuntu Linux.
- Submit the Spark Job
- Install Docker on the Instance
- Build the Docker Image on the Instance
- Run the Docker Container
- Submit the Spark Job
scp -i <your-key>.pem TrainingDataset.csv ubuntu@<master-public-ip>:/home/ubuntu/
scp -i <your-key>.pem ValidationDataset.csv ubuntu@<master-public-ip>:/home/ubuntu/
Use SSH to connect to each instance and bring the system up to date:
ssh -i <your-key>.pem ubuntu@<instance-public-ip>
sudo apt update && sudo apt upgrade -y
- Install Java
sudo apt install openjdk-11-jdk -y
java -version
- Install Spark
wget https://downloads.apache.org/spark/spark-<version>/spark-<version>-bin-hadoop3.tgz
- Extract and move Spark to /opt:
tar -xzf spark-<version>-bin-hadoop3.tgz
sudo mv spark-<version>-bin-hadoop3 /opt/spark
- Add Spark to your PATH by editing ~/.bashrc:
echo 'export SPARK_HOME=/opt/spark' >> ~/.bashrc
echo 'export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin' >> ~/.bashrc
source ~/.bashrc
- Install Scala (required for Spark)
sudo apt install scala -y && scala -version
Configure passwordless SSH access from the master node to each worker node to streamline the copying process.
1) Generate an SSH Key Pair on the Master Node:
ssh ubuntu@<MASTER_NODE_IP>
ssh-keygen -t rsa -b 2048 -f ~/.ssh/id_rsa -q -N ""
This creates a key pair:
- Private key: ~/.ssh/id_rsa
- Public key: ~/.ssh/id_rsa.pub
Confirm the key exists on the master node: ls ~/.ssh/id_rsa.pub
2) Manually Copy the Public Key to All Worker Nodes
On the master node, output the public key:
cat ~/.ssh/id_rsa.pub
For each worker node, copy the public key from the master node to the worker's ~/.ssh/authorized_keys. Repeat this step for all worker nodes:
echo "<PASTED_PUBLIC_KEY>" >> ~/.ssh/authorized_keys
Ensure proper permissions on the worker node:
chmod 600 ~/.ssh/authorized_keys
chmod 700 ~/.ssh
From the master node, test logging into each worker node without a password:
ssh ubuntu@<WORKER_NODE_IP>
The parallel training implementation leverages Apache Spark's distributed computing capabilities. This project sets up an Apache Spark standalone cluster with:
- 1 Master Node: Coordinates and schedules the execution of tasks.
- 3 Worker Nodes: Execute tasks in parallel, processing data distributed across them. Each worker node has a specified number of cores and memory assigned.
Visual representation of Spark cluster resource allocation, showing how CPU cores, memory, and executors are distributed across worker nodes in the cluster.
Start Spark master and worker instances
$SPARK_HOME/sbin/start-master.sh
$SPARK_HOME/sbin/start-worker.sh spark://172.31.25.1:7077 (start-slave.sh on older Spark releases)
Stop an existing worker process: $SPARK_HOME/sbin/stop-worker.sh
Check for Spark worker processes:
ps -ef | grep Worker
Verify the cluster in the Spark master web UI: http://<master-public-ip>:8080
1) Verify Spark Installation: /opt/spark/bin/spark-shell --version
2) Set environment variables (add to ~/.bashrc):
export SPARK_HOME=/opt/spark
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin
export JAVA_HOME=$(readlink -f /usr/bin/java | sed "s:bin/java::")
Apply changes: source ~/.bashrc
3) Configure $SPARK_HOME/conf/spark-env.sh
SPARK_MASTER_HOST=<master_node_private_ip>
JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
HADOOP_CONF_DIR=$SPARK_HOME/conf
SPARK_WORKER_CORES=6 # Adjust based on your instance resources
SPARK_WORKER_MEMORY=18G # Adjust based on your instance resources
SPARK_WORKER_INSTANCES=1
SPARK_EXECUTOR_INSTANCES=1
4) Distribute the spark-env.sh file to all worker nodes:
scp $SPARK_HOME/conf/spark-env.sh ubuntu@<worker_node_ip>:$SPARK_HOME/conf/spark-env.sh
5) Configure worker files
sudo nano $SPARK_HOME/conf/slaves (named conf/workers in newer Spark releases)
Add the private IPs or hostnames of all worker nodes, one per line:
<worker_node_1_private_ip>
<worker_node_2_private_ip>
<worker_node_3_private_ip>
Distribute the worker file to all worker nodes:
scp $SPARK_HOME/conf/slaves ubuntu@<worker_node_ip>:$SPARK_HOME/conf/slaves
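Steps 4 and 5 repeat the same scp command for every worker. Once passwordless SSH is in place, the fan-out can be scripted; a small stdlib sketch (worker IPs are placeholders, and the `runner` parameter exists only so the helper can be exercised without a live cluster):

```python
import subprocess

def push_to_workers(local_path, remote_path, hosts, runner=subprocess.run):
    """Copy one file to the same path on every worker via scp."""
    commands = []
    for host in hosts:
        cmd = ["scp", local_path, f"ubuntu@{host}:{remote_path}"]
        runner(cmd, check=True)  # raises CalledProcessError if any copy fails
        commands.append(cmd)
    return commands

# Example (placeholder IPs):
# push_to_workers("/opt/spark/conf/spark-env.sh", "/opt/spark/conf/spark-env.sh",
#                 ["<worker_node_1_private_ip>", "<worker_node_2_private_ip>"])
```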
6) Start Spark Cluster
On master node: $SPARK_HOME/sbin/start-master.sh
On worker node: $SPARK_HOME/sbin/start-worker.sh spark://<master_node_private_ip>:7077 (start-slave.sh on older Spark releases)
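After starting the master and workers, a quick TCP reachability check on the master's ports (7077 for cluster traffic, 8080 for the web UI) confirms the cluster came up; a minimal stdlib sketch, with the host address as a placeholder:

```python
import socket

def port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Example (placeholder address):
# assert port_open("<master_node_private_ip>", 7077), "Spark master not reachable"
# assert port_open("<master_node_private_ip>", 8080), "Web UI not reachable"
```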
0. Transfer Files to the Instance: Use scp to transfer all necessary files (e.g., Dockerfile, app.py, requirements.txt, TrainingDataset.csv) to the EC2 instance.
1. Build and Tag Your Docker Image Locally
docker build -t wine-quality-app .
verify using: docker images
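The build command above assumes a Dockerfile in the current directory. A hypothetical sketch for reference; the base image, package names, and entrypoint are assumptions, not the project's actual file:

```dockerfile
FROM python:3.10-slim

# PySpark needs a Java runtime inside the container
RUN apt-get update && \
    apt-get install -y --no-install-recommends default-jre && \
    rm -rf /var/lib/apt/lists/*

WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY app.py .
ENTRYPOINT ["python", "app.py"]
```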
2. Push the Docker Image to a Registry - Docker Hub
Login to Docker Hub: docker login
Tag and push the image:
docker tag wine-quality-app DOCKERHUB-USERNAME/wine-quality-app:latest
docker push DOCKERHUB-USERNAME/wine-quality-app:latest
3. Install Docker on All Nodes
ssh -i "your_key.pem" ubuntu@node-ip
sudo apt update
sudo apt install -y docker.io
sudo systemctl start docker
sudo systemctl enable docker
4. Configure Docker on EC2 Instance
- Pull the Docker Image on All Nodes
sudo docker pull DOCKERHUB-USERNAME/wine-quality-app:latest
- Add your user to the docker group to avoid needing sudo for Docker commands:
Check if your user (ubuntu) is part of the docker group:
groups
If not, add by using:
sudo usermod -aG docker $USER
newgrp docker
Verify Docker works without sudo: docker run hello-world
5. Build the Docker Image on the Instance
cd /home/ubuntu/code
docker build -t wine-quality-app .
6. Run the Docker Container
docker run -it --rm \
  --network="host" \
  wine-quality-app
7. Configure Spark to Use Docker
Edit the spark-env.sh file:
sudo nano $SPARK_HOME/conf/spark-env.sh
Add the following lines:
SPARK_EXECUTOR_OPTS="--conf spark.executor.docker.image=your-image-name"
SPARK_DRIVER_OPTS="--conf spark.driver.docker.image=your-image-name"
8. Submit the Spark Application
The published image is available on Docker Hub: https://hub.docker.com/repository/docker/chloecodes/wine-quality-app/general
Spark will use the Docker container to execute the tasks in parallel across all nodes.
Submit the Spark Application on master node:
$SPARK_HOME/bin/spark-submit \
--master spark://172.31.25.1:7077 \
--deploy-mode client \
--executor-memory 18G \
--total-executor-cores 6 \
/home/ubuntu/code/app.py
- With Docker: Simplifies setup and ensures consistency across all environments. Requires a properly built Docker image.
- Without Docker: Requires manual setup on each node, including installing dependencies and configuring Spark.

