End-to-End Streaming Big Data Project makes big data processing easy with Airflow, Kafka, Spark, MinIO and much more!
- Streaming large volumes of data with Kafka and Spark Streaming.
- Managing Apache Kafka with Confluent Control Center, Apache Zookeeper and Schema Registry.
- Data lake processing with Delta Lake and MinIO object storage.
- ELT Pipeline:
  - Automated medallion architecture implementation on the dataset, orchestrated with Airflow.
  - Data modeling and data warehousing with PostgreSQL and dbt.
- Distributed querying with the Trino query engine (via DBeaver) for high performance.
- Data Visualization Tools with Superset.
- Project Report.
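The medallion flow above can be sketched as a minimal Airflow DAG. Everything here is illustrative: the DAG id, task ids and bash commands are placeholder assumptions, not the project's actual pipeline, and the Airflow imports only execute when `build_medallion_dag()` is called from a file in your `dags/` folder.

```python
# Standard medallion layer names; the real bronze/silver/gold tables live in MinIO/Delta Lake.
MEDALLION_LAYERS = ["bronze", "silver", "gold"]

def build_medallion_dag():
    # Requires apache-airflow (2.x); dag_id, schedule and commands are hypothetical.
    from datetime import datetime
    from airflow import DAG
    from airflow.operators.bash import BashOperator

    with DAG(
        dag_id="amazon_sales_medallion",  # assumed name
        start_date=datetime(2025, 1, 1),
        schedule="@daily",
        catchup=False,
    ) as dag:
        prev = None
        for layer in MEDALLION_LAYERS:
            task = BashOperator(
                task_id=f"build_{layer}",
                bash_command=f"echo building {layer} layer",  # placeholder command
            )
            if prev is not None:
                prev >> task  # chain bronze -> silver -> gold
            prev = task
    return dag
```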
This project uses the Amazon Sales Report dataset; you can find the data here: https://github.com/AshaoluV/Amazon-Sales-Project/blob/main/Amazon%20Sales.csv
- Streaming, Batching Data Process: Apache Kafka, Apache Spark.
- IDE: PyCharm.
- Programming Languages: Python.
- Data Orchestration Tool: Apache Airflow.
- Data Lake/Data Lakehouse: Delta Lake, MinIO.
- Data Visualization Tool: Superset.
- Containerization: Docker, Docker Compose.
- Query Engine: Trino (queried through the DBeaver SQL client).
- Data Transformation, Data Modeling and Data Warehousing: dbt, PostgreSQL.
- First, set up your PyCharm IDE, Docker, Apache Kafka, Apache Spark and Apache Airflow in your project.
- In your terminal, create a Python virtual environment to work with. If you are using Windows, run:
  - `python -m venv venv`
  - `venv\Scripts\activate`
  - `python -m pip install -r requirements.txt` (downloads all required libraries for the project)
- Launch Docker:
  - `docker compose up -d`
- Run the `event_streaming` Python file to stream events into Kafka.
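As a rough sketch of what an event-streaming producer can look like: the broker address, topic name and CSV path below are assumptions rather than the project's actual values, and the send loop only runs once `SEND_TO_KAFKA` is flipped on with the docker-compose broker up.

```python
import csv
import json

SEND_TO_KAFKA = False            # flip to True once the docker-compose Kafka broker is up
BROKER = "localhost:9092"        # assumed broker address
TOPIC = "amazon_sales"           # assumed topic name
CSV_PATH = "Amazon Sales.csv"    # assumed local path to the dataset

def rows_to_events(lines):
    """Turn CSV lines into JSON-serializable event dicts, one per row."""
    return [dict(row) for row in csv.DictReader(lines)]

if SEND_TO_KAFKA:
    # Requires kafka-python (pip install kafka-python).
    from kafka import KafkaProducer

    producer = KafkaProducer(
        bootstrap_servers=BROKER,
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )
    with open(CSV_PATH, newline="", encoding="utf-8") as f:
        for event in rows_to_events(f):
            producer.send(TOPIC, event)
    producer.flush()
```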
- Run the command: `python spark_streaming/sales_delta_spark_to_minio.py` (submitting the Spark job and streaming the data to MinIO).
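`sales_delta_spark_to_minio.py` is the project's actual job; the sketch below only illustrates the general shape of a Kafka-to-Delta stream. The MinIO endpoint, broker address, topic and bucket names are placeholder assumptions, and the Spark code runs only when `run_stream()` is called with the services up (requires `pyspark` and `delta-spark`).

```python
# Assumed S3A settings for a local MinIO; replace with your own endpoint and credentials.
MINIO_S3A_CONF = {
    "spark.hadoop.fs.s3a.endpoint": "http://localhost:9000",
    "spark.hadoop.fs.s3a.path.style.access": "true",
    "spark.hadoop.fs.s3a.connection.ssl.enabled": "false",
}

def run_stream(topic="amazon_sales", bucket="s3a://sales"):
    # Requires pyspark + delta-spark; topic and bucket names are hypothetical.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName("sales_delta_spark_to_minio")
    for key, value in MINIO_S3A_CONF.items():
        builder = builder.config(key, value)
    spark = builder.getOrCreate()

    (spark.readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "localhost:9092")  # assumed broker
        .option("subscribe", topic)
        .load()
        .selectExpr("CAST(value AS STRING) AS raw_event")      # raw JSON payload as text
        .writeStream
        .format("delta")
        .option("checkpointLocation", f"{bucket}/_checkpoints")
        .start(f"{bucket}/bronze")                             # bronze layer of the medallion
        .awaitTermination())
```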
- Access the services:
- Confluent Control Center for Kafka is accessible at
http://localhost:9021.
- MinIO is accessible at
http://localhost:9001.
- Trino is accessible at
http://localhost:8084.
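A quick stdlib check that the three web UIs above are reachable. The port numbers come straight from the list above; nothing connects at import time, only when you call `check_services()` yourself.

```python
from urllib.request import urlopen
from urllib.error import URLError

# URLs taken from the service list above.
SERVICES = {
    "Confluent Control Center": "http://localhost:9021",
    "MinIO": "http://localhost:9001",
    "Trino": "http://localhost:8084",
}

def check_services(timeout=3):
    """Return {service name: True/False} for each UI, without raising on failures."""
    status = {}
    for name, url in SERVICES.items():
        try:
            with urlopen(url, timeout=timeout) as resp:
                status[name] = resp.status < 500
        except (URLError, OSError):
            status[name] = False
    return status
```

Handy right after `docker compose up -d`, since the containers can take a minute to become responsive.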
A lot can still be done :)
- Choose managed infrastructure:
  - Managed services for Airflow, Kafka and Spark on AWS (e.g. Amazon MWAA, MSK and EMR), or Cloud Composer for Airflow on Google Cloud.
- Kafka streaming process monitoring with Prometheus and Grafana.
- Include CI/CD Operations.
- Write data quality tests.
- Storage Layer Deployment with AWS S3 and Terraform.
© 2025 Nguyen Dai



