spark-minio-project

Builds a Spark Standalone Cluster on Docker locally, with MinIO integration.

This can serve as a base project for building and experimenting with different integrations, using Spark as the compute engine and MinIO as the object storage.

A few integration examples include:

  • Open Format Tables (Apache Iceberg, Apache Hudi and Delta Lake)
  • OLAP engines like StarRocks for analytics on tables populated from Spark and data present in MinIO
  • Airflow integration to run Spark Jobs from Airflow

This project was built by referring to the structure of this Repo for the Spark Standalone cluster, with modifications made on top of it.

Getting Started

To make getting started with the project easy, it includes a Makefile.

If you are on a Windows system, the make commands might not work; in that case, you can use the respective docker commands present in the Makefile.

Makefile Commands

Downloading jars needed by the Spark Standalone Cluster


To download the required jars from Maven repositories for use in the Spark Standalone Cluster:

make jars-download

The jars to be downloaded are listed in the Jar Downloader bash script.

This creates a jars folder in the repo and downloads all the listed jars there.

It currently includes jars to support MinIO integration and reading/writing Iceberg tables.

Note: To add more jars to be downloaded via the jars downloader, add them to JAR_MAPPING in jars-downloader.sh using the format jar_name|maven_path.

Building the Docker Images


make build-nc

Builds the images defined in the Dockerfile, which include:

  • Spark 3.5 (for master, worker and Spark History Server)
  • MinIO
  • JupyterLab

Running the containers


make run

Starts all the services once the images are built.

This creates a Spark Standalone Cluster with 1 worker node.

To create a multi worker cluster, run

make run-scaled

This creates a Spark Standalone Cluster with 3 worker nodes. To modify the number of worker nodes, update the number directly in the Makefile command.

Once all the services are up and running, they can be accessed locally at the URLs below:

  • Spark Master: localhost:9090
  • Spark History Server: localhost:18080
  • MinIO UI: localhost:9001
  • Jupyter Notebook Server: localhost:8888

Spark Master starts at spark://spark-master:7077
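
For example, a notebook in the JupyterLab container can attach to this cluster by pointing a SparkSession at the master URL. The snippet below is only a minimal sketch: the app name is arbitrary, and the MinIO/s3a settings are assumed to already be supplied by the cluster configuration rather than set here.

from pyspark.sql import SparkSession

# Minimal sketch: attach to the standalone master started by `make run`.
# The app name is illustrative; s3a/MinIO settings are assumed to come from the cluster config.
spark = (
    SparkSession.builder
    .master("spark://spark-master:7077")
    .appName("notebook-example")
    .getOrCreate()
)
print(spark.version)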

Integration Testing and Submitting PySpark Scripts


There are scripts in spark_apps to test whether the integration is working after everything is up and running.

  • To test the integration, you can run
make submit app=spark_minio_test.py

Once this runs successfully, go to the MinIO UI, log in with the ROOT_USER and ROOT_PASSWORD from the .env file, and you will be able to see the data within the warehouse bucket.

  • To submit a Spark job (a minimal example job is sketched below)
make submit app=pyfilename.py # file present in the spark_apps folder
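
As a rough illustration, a job file placed in spark_apps could look like the sketch below. The file name, bucket and output path are placeholders, and the MinIO endpoint/credentials are assumed to be configured by the cluster rather than inside the script.

# spark_apps/example_job.py  (hypothetical file name)
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("example-job").getOrCreate()

# Write a tiny DataFrame to MinIO via the s3a connector.
# "warehouse" is the bucket mentioned above; the output prefix is illustrative.
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
df.write.mode("overwrite").parquet("s3a://warehouse/example_job_output/")

spark.stop()

It would then be submitted with make submit app=example_job.py.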

To shut down everything


make down

Data Mapping


To keep the data that is written via Spark persistent:

  • MinIO bucket data: the minio_data folder in the repo is mapped to the MinIO container volume.

  • To read data files while running jobs on the Spark cluster, data can either be uploaded directly to MinIO buckets from the MinIO UI running at localhost:9001, or kept in the data folder in the repo.

    • The data folder is mapped to /opt/spark/data inside the Spark cluster.
    • To read data from there in Spark jobs: spark.read.parquet('/opt/spark/data/file.parquet')

It's recommended to upload files to MinIO directly, to keep reading data simple from both Spark jobs and Jupyter notebooks.
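
For instance, a file uploaded through the MinIO UI can be read with an s3a path; the bucket and object names below are illustrative and not part of the repo.

# Reading a file uploaded through the MinIO UI (bucket/object names are illustrative).
df = spark.read.parquet("s3a://warehouse/uploads/file.parquet")
df.show()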
