A scalable, near-real-time data pipeline for ingesting, processing, and querying JSON data using Apache Iceberg, Spark Structured Streaming, and Kafka. Designed for ACID-compliant storage, efficient upserts, and seamless cloud deployment.
- Near-Real-Time Ingestion: API layer with FastAPI for JSON file uploads and Kafka for event streaming.
- ACID-Compliant Storage: Apache Iceberg tables managed by Nessie Catalog for versioning and schema enforcement.
- Distributed Processing: Spark Structured Streaming with micro-batches (0.1s intervals) for validation, deduplication, and merging.
- Optimized Querying: Trino SQL engine for low-latency analytics and time-travel queries.
- Cloud-Ready: Dockerized components (MinIO, Kafka, Spark, Nessie, Trino) with AWS deployment guidelines.
- API Layer: FastAPI endpoints ingest JSON files into MinIO and publish metadata to Kafka.
- Event Streaming: Kafka decouples ingestion from processing, ensuring fault tolerance.
- Spark Processing: Micro-batch jobs validate, clean, and merge data into Iceberg tables.
- Iceberg Storage: Partitioned, compressed tables with ZSTD and automated compaction.
- Trino Analytics: SQL queries on Iceberg tables with Nessie versioning.
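The ingestion hand-off between the API layer and Kafka can be sketched in plain Python. This is a minimal illustration of the metadata event the FastAPI service would publish after a file lands in MinIO; the field names and topic conventions here are assumptions, not the project's actual schema:

```python
import json
import uuid
from datetime import datetime, timezone

def build_upload_event(bucket: str, object_key: str, size_bytes: int) -> str:
    """Build the JSON metadata event published to Kafka once a raw
    JSON file has been stored in MinIO. Field names are illustrative
    assumptions, not the project's actual event schema."""
    event = {
        "event_id": str(uuid.uuid4()),
        "bucket": bucket,
        "object_key": object_key,
        "size_bytes": size_bytes,
        "uploaded_at": datetime.now(timezone.utc).isoformat(),
    }
    return json.dumps(event)

# Inside the FastAPI handler this payload would be handed to a Kafka
# producer (e.g. kafka-python's KafkaProducer.send), decoupling upload
# from the Spark processing that consumes the topic.
```

Because the Spark consumers only receive lightweight metadata rather than file contents, the Kafka topic stays small and the raw payloads remain in object storage.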
- Docker & Docker Compose
git clone https://github.com/Elkoumy/real_time_data_lake.git
cd real_time_data_lake
docker-compose up -d
Services included in the docker-compose file:
- MinIO: S3-compatible object storage for JSON files.
- Kafka: Distributed event streaming platform.
- Spark: Unified analytics engine for big data processing.
- Nessie: Git-like versioning for Iceberg tables.
- Trino: Distributed SQL query engine for Iceberg tables.
- FastAPI: Web API framework for JSON file uploads.
- Data Simulator: Python script for generating sample JSON data upload requests.
├── webservice/         # FastAPI upload service
├── simulator/          # Upload JSON data simulator
├── spark-jobs/         # Spark Structured Streaming jobs
├── trino/              # Trino configuration and queries
├── docker-compose.yml  # Orchestration
├── docs/               # Architecture diagrams and notes
└── data/               # Sample JSON data and schema
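The validate/deduplicate/merge step performed per micro-batch can be illustrated with a plain-Python sketch. The real jobs in spark-jobs/ operate on PySpark DataFrames and issue an Iceberg MERGE INTO; the helper names and the assumed `id`/`updated_at` columns below are illustrative only:

```python
from typing import Iterable

# Assumed key and ordering columns; the actual tables may differ.
REQUIRED_FIELDS = {"id", "updated_at"}

def validate(records: Iterable[dict]) -> list[dict]:
    """Keep only records carrying the assumed required fields."""
    return [r for r in records if REQUIRED_FIELDS <= r.keys()]

def deduplicate(records: list[dict]) -> list[dict]:
    """Within a micro-batch, keep the latest record per id — the
    in-memory analogue of a window/row_number dedup in Spark."""
    latest: dict = {}
    for r in records:
        prev = latest.get(r["id"])
        if prev is None or r["updated_at"] > prev["updated_at"]:
            latest[r["id"]] = r
    return list(latest.values())

def merge_into(table: dict, batch: list[dict]) -> None:
    """Upsert by id — the in-memory analogue of Iceberg's
    MERGE INTO ... WHEN MATCHED UPDATE / WHEN NOT MATCHED INSERT."""
    for r in batch:
        table[r["id"]] = r
```

Running `merge_into(table, deduplicate(validate(batch)))` per micro-batch mirrors the upsert semantics the streaming jobs rely on for exactly-once table state.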
curl -X POST -F "file=@data/employees/employees_4.json" http://localhost:8000/upload/employees_4
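The same upload can be scripted from Python with only the standard library, which is roughly what the data simulator does. The endpoint path and the `file` form field follow the curl example above; everything else in this sketch is illustrative:

```python
import uuid

def build_multipart(field: str, filename: str, payload: bytes):
    """Build a multipart/form-data body equivalent to curl's
    -F "file=@...". Returns the encoded body and request headers."""
    boundary = uuid.uuid4().hex
    body = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        "Content-Type: application/json\r\n\r\n"
    ).encode() + payload + f"\r\n--{boundary}--\r\n".encode()
    headers = {"Content-Type": f"multipart/form-data; boundary={boundary}"}
    return body, headers

# Sending (requires the docker-compose stack to be running):
# import http.client
# with open("data/employees/employees_4.json", "rb") as f:
#     body, headers = build_multipart("file", "employees_4.json", f.read())
# conn = http.client.HTTPConnection("localhost", 8000)
# conn.request("POST", "/upload/employees_4", body, headers)
```

In practice the simulator would loop over the sample files in data/ and fire one request per file to drive the pipeline end to end.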
docker exec -it trino trino
Run queries:
SELECT * FROM iceberg_datalake.default.sessions;
To deploy on AWS:
- Replace MinIO with Amazon S3.
- Use EMR for Spark and MSK for Kafka.
- Migrate Nessie Catalog to AWS Glue Catalog.
- Deploy FastAPI on EC2 or Fargate behind an ALB.
- Use Trino on EMR or Athena for querying.
This project is licensed under the MIT License. See the LICENSE file for details.
Built with:
Apache Iceberg | Spark | Kafka | Trino | Docker
