The SparkWinePredictor project focuses on building a parallel machine learning application to predict wine quality using Apache Spark's MLlib on Amazon AWS. This project involves training, validating, and testing a wine quality prediction model across multiple EC2 instances and deploying the model using Docker for simplified distribution and execution.
The project objectives are as follows:
- Parallel ML Model Training: Utilize Spark's distributed computing capabilities to train the model in parallel on 4 AWS EC2 instances using the provided TrainingDataset.csv.
- Model Validation and Optimization: Use the ValidationDataset.csv to validate and fine-tune the model, ensuring optimal performance.
- Model Testing: Evaluate the trained model's performance on unseen data using the F1 score as a key performance metric.
- Dockerized Deployment: Package the Spark application and trained model into a Docker container to enable seamless deployment on a single EC2 instance for prediction.
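The objectives above use the F1 score as the key metric for a multiclass problem. As a reference, one common variant (macro-averaged F1) can be sketched in plain Python; this is an illustrative helper only, not part of the project code, and note that MLlib's `MulticlassClassificationEvaluator` also supports weighted variants:

```python
def macro_f1(y_true, y_pred):
    """Macro-averaged F1: compute per-class F1, then average with equal weight."""
    labels = sorted(set(y_true) | set(y_pred))
    per_class = []
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        per_class.append(2 * precision * recall / (precision + recall)
                         if (precision + recall) else 0.0)
    return sum(per_class) / len(per_class)
```

Macro averaging treats rare quality scores the same as common ones, which makes it a strict metric on the imbalanced wine data.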
The project will employ Spark's MLlib to implement a simple logistic regression model for classification (with linear regression as a regression baseline), starting with basic models and exploring additional ML algorithms to improve performance. The application will predict wine quality scores (1 to 10) from the provided datasets.
This hands-on project showcases the power of Apache Spark for distributed machine learning, the versatility of MLlib for regression and classification, and the scalability of AWS cloud infrastructure for high-performance computing tasks.
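A minimal sketch of what the training and validation logic in app.py might look like. It assumes the datasets are semicolon-delimited CSVs with a `quality` label column (as in the common UCI wine data); the column names, paths, and hyperparameters here are assumptions, not the project's final implementation:

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

spark = SparkSession.builder.appName("WineQualityTraining").getOrCreate()

# Assumed layout: semicolon-delimited, header row, 'quality' label column
train = spark.read.csv("TrainingDataset.csv", header=True, inferSchema=True, sep=";")
valid = spark.read.csv("ValidationDataset.csv", header=True, inferSchema=True, sep=";")

# All non-label columns become the feature vector
features = [c for c in train.columns if c != "quality"]
assembler = VectorAssembler(inputCols=features, outputCol="features")

lr = LogisticRegression(labelCol="quality", featuresCol="features", maxIter=100)
model = lr.fit(assembler.transform(train))

predictions = model.transform(assembler.transform(valid))
evaluator = MulticlassClassificationEvaluator(labelCol="quality",
                                              predictionCol="prediction",
                                              metricName="f1")
print("Validation F1:", evaluator.evaluate(predictions))

model.save("wine-quality-model")  # reloaded later by the prediction container
spark.stop()
```

When submitted via spark-submit, the training stage is automatically parallelized across whichever executors the cluster provides.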
Description: Sequence diagram illustrating the execution model of Spark applications, showing the interaction between the driver, master, executors, and storage layer.
- Launch 4 EC2 instances on AWS to parallelize model training.
- SSH into the EC2 Instance
- Transfer Files to the Instance
- Install Java OpenJDK (required for Spark) and Apache Spark on all instances
- Configure the instances to run Spark on Ubuntu Linux.
- Submit the Spark Job
- Install Docker on the Instance
- Build the Docker Image on the Instance
- Run the Docker Container
- Submit the Spark Job
scp -i <your-key>.pem TrainingDataset.csv ubuntu@<master-public-ip>:/home/ubuntu/
scp -i <your-key>.pem ValidationDataset.csv ubuntu@<master-public-ip>:/home/ubuntu/
Use SSH to connect to each instance and bring the system up to date:
ssh -i <your-key>.pem ubuntu@<instance-public-ip>
sudo apt update && sudo apt upgrade -y
- Install Java
sudo apt install openjdk-11-jdk -y
java -version
- Install Spark
wget https://downloads.apache.org/spark/spark-<version>/spark-<version>-bin-hadoop3.tgz
- Extract and move Spark to /opt:
tar -xzf spark-<version>-bin-hadoop3.tgz
sudo mv spark-<version>-bin-hadoop3 /opt/spark
- Add Spark to your PATH by editing ~/.bashrc:
echo 'export SPARK_HOME=/opt/spark' >> ~/.bashrc
echo 'export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin' >> ~/.bashrc
source ~/.bashrc
- Install Scala (required for Spark)
sudo apt install scala -y && scala -version
Configure passwordless SSH access from the master node to each worker node to streamline the copying process.
1) Generate an SSH Key Pair on the Master Node:
ssh ubuntu@<MASTER_NODE_IP>
ssh-keygen -t rsa -b 2048 -f ~/.ssh/id_rsa -q -N ""
This creates a key pair:
- Private key: ~/.ssh/id_rsa
- Public key: ~/.ssh/id_rsa.pub
Confirm the key exists on the master node: ls ~/.ssh/id_rsa.pub
2) Manually Copy the Public Key to All Worker Nodes
On the master node, output the public key:
cat ~/.ssh/id_rsa.pub
For each worker node, copy the public key from the master node to the worker's ~/.ssh/authorized_keys. Repeat this step for all worker nodes:
echo "<PASTED_PUBLIC_KEY>" >> ~/.ssh/authorized_keys
Ensure proper permissions on the worker node:
chmod 600 ~/.ssh/authorized_keys
chmod 700 ~/.ssh
From the master node, test logging into each worker node without a password:
ssh ubuntu@<WORKER_NODE_IP>
The parallel training implementation leverages Apache Spark's distributed computing capabilities. This project sets up an Apache Spark standalone cluster with:
- 1 Master Node: Coordinates and schedules the execution of tasks.
- 3 Worker Nodes: Execute tasks in parallel, processing data distributed across them. Each worker node has a specified number of cores and memory assigned.
Visual representation of Spark cluster resource allocation, showing how CPU cores, memory, and executors are distributed across worker nodes in the cluster.
Start Spark master and worker instances
$SPARK_HOME/sbin/start-master.sh
$SPARK_HOME/sbin/start-worker.sh spark://172.31.25.1:7077 (start-slave.sh on older Spark releases)
Stop an existing worker process: $SPARK_HOME/sbin/stop-worker.sh
Check for Spark worker processes:
ps -ef | grep Worker
Verify the cluster in the Spark master web UI: http://<master-public-ip>:8080
1) Verify Spark Installation: /opt/spark/bin/spark-shell --version
2) Set environment variables (add to ~/.bashrc):
export SPARK_HOME=/opt/spark
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin
export JAVA_HOME=$(readlink -f /usr/bin/java | sed "s:bin/java::")
Apply changes: source ~/.bashrc
3) Configure $SPARK_HOME/conf/spark-env.sh
SPARK_MASTER_HOST=<master_node_private_ip>
JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
HADOOP_CONF_DIR=$SPARK_HOME/conf
SPARK_WORKER_CORES=6 # Adjust based on your instance resources
SPARK_WORKER_MEMORY=18G # Adjust based on your instance resources
SPARK_WORKER_INSTANCES=1
SPARK_EXECUTOR_INSTANCES=1
4) Distribute the spark-env.sh file to all worker nodes:
scp $SPARK_HOME/conf/spark-env.sh ubuntu@<worker_node_ip>:$SPARK_HOME/conf/spark-env.sh
5) Configure worker files
sudo nano $SPARK_HOME/conf/slaves (named conf/workers in newer Spark releases)
Add the private IPs or hostnames of all worker nodes, one per line:
<worker_node_1_private_ip>
<worker_node_2_private_ip>
<worker_node_3_private_ip>
Distribute the worker file to all worker nodes:
scp $SPARK_HOME/conf/slaves ubuntu@<worker_node_ip>:$SPARK_HOME/conf/slaves
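Steps 4 and 5 repeat the same scp command for every worker. Once passwordless SSH is in place, the fan-out can be scripted; a small stdlib sketch (worker IPs are placeholders, and the `runner` parameter exists only so the helper can be exercised without a live cluster):

```python
import subprocess

def push_to_workers(local_path, remote_path, hosts, runner=subprocess.run):
    """Copy one file to the same path on every worker via scp."""
    commands = []
    for host in hosts:
        cmd = ["scp", local_path, f"ubuntu@{host}:{remote_path}"]
        runner(cmd, check=True)  # raises CalledProcessError if any copy fails
        commands.append(cmd)
    return commands

# Example (placeholder IPs):
# push_to_workers("/opt/spark/conf/spark-env.sh", "/opt/spark/conf/spark-env.sh",
#                 ["<worker_node_1_private_ip>", "<worker_node_2_private_ip>"])
```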
6) Start Spark Cluster
On master node: $SPARK_HOME/sbin/start-master.sh
On worker node: $SPARK_HOME/sbin/start-worker.sh spark://<master_node_private_ip>:7077 (start-slave.sh on older Spark releases)
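After starting the master and workers, a quick TCP reachability check on the master's ports (7077 for cluster traffic, 8080 for the web UI) confirms the cluster came up; a minimal stdlib sketch, with the host address as a placeholder:

```python
import socket

def port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Example (placeholder address):
# assert port_open("<master_node_private_ip>", 7077), "Spark master not reachable"
# assert port_open("<master_node_private_ip>", 8080), "Web UI not reachable"
```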
0. Transfer Files to the Instance: Use scp to transfer all necessary files (e.g., Dockerfile, app.py, requirements.txt, TrainingDataset.csv) to the EC2 instance.
1. Build and Tag Your Docker Image Locally
docker build -t wine-quality-app .
verify using: docker images
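The build command above assumes a Dockerfile in the current directory. A hypothetical sketch for reference; the base image, package names, and entrypoint are assumptions, not the project's actual file:

```dockerfile
FROM python:3.10-slim

# PySpark needs a Java runtime inside the container
RUN apt-get update && \
    apt-get install -y --no-install-recommends default-jre && \
    rm -rf /var/lib/apt/lists/*

WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY app.py .
ENTRYPOINT ["python", "app.py"]
```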
2. Push the Docker Image to a Registry - Docker Hub
Login to Docker Hub: docker login
Tag and push the image:
docker tag wine-quality-app DOCKERHUB-USERNAME/wine-quality-app:latest
docker push DOCKERHUB-USERNAME/wine-quality-app:latest
3. Install Docker on All Nodes
ssh -i "your_key.pem" ubuntu@node-ip
sudo apt update
sudo apt install -y docker.io
sudo systemctl start docker
sudo systemctl enable docker
4. Configure Docker on EC2 Instance
- Pull the Docker Image on All Nodes
sudo docker pull DOCKERHUB-USERNAME/wine-quality-app:latest
- Add your user to the docker group to avoid needing sudo for Docker commands:
Check if your user (ubuntu) is part of the docker group:
groups
If not, add by using:
sudo usermod -aG docker $USER
newgrp docker
Verify Docker works without sudo: docker run hello-world
5. Build the Docker Image on the Instance
cd /home/ubuntu/code
docker build -t wine-quality-app .
6. Run the Docker Container
docker run -it --rm \
  --network="host" \
  wine-quality-app
7. Configure Spark to Use Docker
Edit the spark-env.sh file:
sudo nano $SPARK_HOME/conf/spark-env.sh
Add the following lines:
SPARK_EXECUTOR_OPTS="--conf spark.executor.docker.image=your-image-name"
SPARK_DRIVER_OPTS="--conf spark.driver.docker.image=your-image-name"
8. Submit the Spark Application
The published image is available on Docker Hub: https://hub.docker.com/repository/docker/chloecodes/wine-quality-app/general
Spark will use the Docker container to execute the tasks in parallel across all nodes.
Submit the Spark Application on master node:
$SPARK_HOME/bin/spark-submit \
--master spark://172.31.25.1:7077 \
--deploy-mode client \
--executor-memory 18G \
--total-executor-cores 6 \
/home/ubuntu/code/app.py
- With Docker: Simplifies setup and ensures consistency across all environments. Requires a properly built Docker image.
- Without Docker: Requires manual setup on each node, including installing dependencies and configuring Spark.

