🚀 Scalable near-real-time data pipeline using Apache Iceberg, Spark, Kafka, and Trino. ACID-compliant JSON ingestion, processing, and analytics. Dockerized for easy deployment. #DataEngineering #DataLake

Elkoumy/real_time_data_lake

Real-Time Data Lake Pipeline with Iceberg, Spark, and Kafka

A scalable, near-real-time data pipeline for ingesting, processing, and querying JSON data using Apache Iceberg, Spark Structured Streaming, and Kafka. Designed for ACID-compliant storage, efficient upserts, and seamless cloud deployment.

📌 Features

  • Near-Real-Time Ingestion: API layer with FastAPI for JSON file uploads and Kafka for event streaming.
  • ACID-Compliant Storage: Apache Iceberg tables managed by Nessie Catalog for versioning and schema enforcement.
  • Distributed Processing: Spark Structured Streaming with micro-batches (0.1s intervals) for validation, deduplication, and merging.
  • Optimized Querying: Trino SQL engine for low-latency analytics and time-travel queries.
  • Cloud-Ready: Dockerized components (MinIO, Kafka, Spark, Nessie, Trino) with AWS deployment guidelines.
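
The validation, deduplication, and merge steps named above can be illustrated in plain Python, independent of Spark. This is a minimal sketch of the idea, not the repository's Spark job; the `id` and `updated_at` field names and the in-memory `table` dictionary are assumptions made for illustration.

```python
from typing import Dict, Iterable, List

# Hypothetical schema: these field names are assumptions, not taken from the repo.
REQUIRED_FIELDS = ("id", "updated_at")


def validate(records: Iterable[dict]) -> List[dict]:
    """Keep only records that carry every required field with a non-null value."""
    return [
        r for r in records
        if all(r.get(field) is not None for field in REQUIRED_FIELDS)
    ]


def deduplicate(records: Iterable[dict]) -> List[dict]:
    """Within one micro-batch, keep the newest record per id."""
    latest: Dict[object, dict] = {}
    for rec in records:
        current = latest.get(rec["id"])
        if current is None or rec["updated_at"] >= current["updated_at"]:
            latest[rec["id"]] = rec
    return list(latest.values())


def merge(table: Dict[object, dict], batch: Iterable[dict]) -> Dict[object, dict]:
    """Upsert: insert unseen ids, overwrite existing ids only with newer rows."""
    merged = dict(table)  # copy so the input table is left untouched
    for rec in batch:
        existing = merged.get(rec["id"])
        if existing is None or rec["updated_at"] > existing["updated_at"]:
            merged[rec["id"]] = rec
    return merged


def process_micro_batch(table: Dict[object, dict], raw: Iterable[dict]) -> Dict[object, dict]:
    """Run the three steps in order on one micro-batch."""
    return merge(table, deduplicate(validate(raw)))
```

In a Spark job the same upsert would typically be expressed as a `MERGE INTO` statement against the Iceberg table rather than a dictionary update.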

πŸ— Architecture

Data Lake Pipeline Architecture

  1. API Layer: FastAPI endpoints ingest JSON files into MinIO and publish metadata to Kafka.
  2. Event Streaming: Kafka decouples ingestion from processing, ensuring fault tolerance.
  3. Spark Processing: Micro-batch jobs validate, clean, and merge data into Iceberg tables.
  4. Iceberg Storage: Partitioned, compressed tables with ZSTD and automated compaction.
  5. Trino Analytics: SQL queries on Iceberg tables with Nessie versioning.
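
To make the hand-off between steps 1 and 2 concrete, the sketch below builds and parses a small metadata message of the kind the API layer might publish to Kafka after storing a file. The message layout (`dataset`, `object_key`, `size_bytes`, `sha256`, `uploaded_at`) is hypothetical; the actual format used by this repository is not documented here.

```python
import hashlib
import json
from datetime import datetime, timezone


def build_upload_event(dataset: str, object_key: str, payload: bytes) -> bytes:
    """Serialize a metadata record describing one uploaded JSON file.

    All field names are illustrative assumptions, not the repo's schema.
    """
    event = {
        "dataset": dataset,
        "object_key": object_key,
        "size_bytes": len(payload),
        "sha256": hashlib.sha256(payload).hexdigest(),
        "uploaded_at": datetime.now(timezone.utc).isoformat(),
    }
    return json.dumps(event, sort_keys=True).encode("utf-8")


def parse_upload_event(message: bytes) -> dict:
    """Decode a message on the consumer side and check the fields it relies on."""
    event = json.loads(message.decode("utf-8"))
    missing = {"dataset", "object_key", "sha256"} - event.keys()
    if missing:
        raise ValueError(f"upload event missing fields: {sorted(missing)}")
    return event
```

Publishing only metadata keeps Kafka messages small: the consumer reads the file itself from object storage using `object_key`.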

🛠 Prerequisites

  • Docker & Docker Compose

🚀 Getting Started

1. Clone the Repository

git clone https://github.com/Elkoumy/real_time_data_lake.git

2. Start the Docker Containers

cd real_time_data_lake
docker-compose up -d
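
Containers started with `-d` may take a while before they accept connections. As an optional helper, the following sketch polls TCP ports until they open. It uses only the Python standard library; which ports to check depends on the port mappings in `docker-compose.yml`, so any list you pass in is your own assumption.

```python
import socket
import time
from typing import Iterable, List, Tuple


def port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def wait_for_ports(
    targets: Iterable[Tuple[str, int]],
    deadline_s: float = 60.0,
    interval_s: float = 1.0,
) -> List[Tuple[str, int]]:
    """Poll until every (host, port) accepts connections.

    Returns the targets that were still closed when the deadline passed
    (an empty list means everything is reachable).
    """
    pending = list(targets)
    stop_at = time.monotonic() + deadline_s
    while pending and time.monotonic() < stop_at:
        pending = [t for t in pending if not port_open(*t)]
        if pending:
            time.sleep(interval_s)
    return pending
```

For example, `wait_for_ports([("localhost", 8000)])` waits for the upload API on the port used in the Usage section.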

Services

Services included in the docker-compose file:

  • MinIO: S3-compatible object storage for JSON files.
  • Kafka: Distributed event streaming platform.
  • Spark: Unified analytics engine for big data processing.
  • Nessie: Git-like versioning for Iceberg tables.
  • Trino: Distributed SQL query engine for Iceberg tables.
  • FastAPI: Web API framework for JSON file uploads.
  • Data Simulator: Python script for generating sample JSON data upload requests.

📂 Directory Structure

├── webservice/            # FastAPI upload service
├── simulator/             # Upload JSON data simulator
├── spark-jobs/            # Spark Structured Streaming jobs
├── trino/                 # Trino configuration and queries
├── docker-compose.yml     # Orchestration
├── docs/                  # Architecture diagrams and notes
└── data/                  # Sample JSON data and schema

🖥 Usage

1. Upload JSON Files via API

curl -X POST -F "file=@data/employees/employees_4.json" http://localhost:8000/upload/employees_4
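
The same upload can be scripted from Python without third-party packages. The sketch below builds the `multipart/form-data` body that `curl -F "file=@..."` would send and posts it with `urllib`. The form field name `file` and the URL come from the command above; the helper names and the `application/json` part type are assumptions.

```python
import os
import urllib.request
import uuid
from typing import Tuple


def build_multipart(
    field: str,
    filename: str,
    content: bytes,
    content_type: str = "application/json",
) -> Tuple[bytes, str]:
    """Encode one file as a multipart/form-data body.

    Returns (body, value for the Content-Type request header).
    """
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        f"Content-Type: {content_type}\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + content + tail, f"multipart/form-data; boundary={boundary}"


def upload_json(path: str, url: str) -> int:
    """POST a local JSON file to the upload endpoint and return the HTTP status."""
    with open(path, "rb") as fh:
        content = fh.read()
    body, header = build_multipart("file", os.path.basename(path), content)
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": header}, method="POST"
    )
    with urllib.request.urlopen(request) as response:
        return response.status
```

With the stack running, `upload_json("data/employees/employees_4.json", "http://localhost:8000/upload/employees_4")` mirrors the curl command.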

2. Query Data with Trino

docker exec -it trino trino

Run queries:

SELECT * FROM iceberg_datalake.default.sessions;
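
Queries can also be issued programmatically through Trino's HTTP client protocol: the statement is POSTed to `/v1/statement`, and the client follows each `nextUri` until none is returned, collecting rows from the `data` field. The sketch below uses only the standard library. The base URL `http://localhost:8080` and the user name `demo` are assumptions; check the port mapping in `docker-compose.yml` before using them.

```python
import json
import urllib.request
from typing import Callable, List, Optional


def _http_json(url: str, data: Optional[bytes] = None, headers: Optional[dict] = None) -> dict:
    """POST when data is given, otherwise GET; decode the JSON response."""
    request = urllib.request.Request(url, data=data, headers=headers or {})
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


def run_query(
    sql: str,
    base_url: str = "http://localhost:8080",  # assumed host port, not from the README
    user: str = "demo",                       # assumed user name
    http: Callable[..., dict] = _http_json,
) -> List[list]:
    """Submit a statement and page through the results until the query finishes."""
    headers = {"X-Trino-User": user}
    page = http(f"{base_url}/v1/statement", data=sql.encode("utf-8"), headers=headers)
    rows: List[list] = []
    while True:
        if "error" in page:
            raise RuntimeError(page["error"].get("message", "query failed"))
        rows.extend(page.get("data", []))
        next_uri = page.get("nextUri")
        if not next_uri:
            return rows
        page = http(next_uri, headers=headers)
```

The `http` parameter exists so the paging logic can be exercised with a stub instead of a live Trino server. For anything beyond a quick check, the official `trino` Python client is the more robust choice.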

☁️ AWS Deployment

To deploy on AWS:

  1. Replace MinIO with Amazon S3.
  2. Use EMR for Spark and MSK for Kafka.
  3. Migrate Nessie Catalog to AWS Glue Catalog.
  4. Deploy FastAPI on EC2 or Fargate behind an ALB.
  5. Use Trino on EMR or Athena for querying.
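
For steps 1 and 3, most of the change on the Spark side is catalog configuration. The sketch below generates Iceberg catalog properties for a local Nessie/MinIO setup and for an AWS Glue/S3 setup so the two can be compared. The property keys follow the Iceberg Spark configuration conventions; the catalog name `lake`, the URIs, and the bucket names in the usage example are placeholders, and this is not the configuration shipped in the repository.

```python
def iceberg_catalog_conf(name: str, target: str, warehouse: str, **options: str) -> dict:
    """Return Spark properties for an Iceberg catalog.

    target="local": Nessie catalog with an S3-compatible endpoint (e.g. MinIO).
                    Requires options nessie_uri and s3_endpoint; ref defaults to "main".
    target="aws":   AWS Glue catalog with Amazon S3 (default endpoint).
    """
    prefix = f"spark.sql.catalog.{name}"
    conf = {
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        f"{prefix}.warehouse": warehouse,
        f"{prefix}.io-impl": "org.apache.iceberg.aws.s3.S3FileIO",
    }
    if target == "local":
        conf[f"{prefix}.catalog-impl"] = "org.apache.iceberg.nessie.NessieCatalog"
        conf[f"{prefix}.uri"] = options["nessie_uri"]
        conf[f"{prefix}.ref"] = options.get("ref", "main")
        # Point S3FileIO at the S3-compatible store instead of AWS.
        conf[f"{prefix}.s3.endpoint"] = options["s3_endpoint"]
    elif target == "aws":
        conf[f"{prefix}.catalog-impl"] = "org.apache.iceberg.aws.glue.GlueCatalog"
    else:
        raise ValueError(f"unknown target: {target!r}")
    return conf


# Placeholder values for illustration only.
local_conf = iceberg_catalog_conf(
    "lake", "local", "s3://warehouse/",
    nessie_uri="http://nessie:19120/api/v1",
    s3_endpoint="http://minio:9000",
)
aws_conf = iceberg_catalog_conf("lake", "aws", "s3://example-bucket/warehouse/")
```

Each key/value pair can be passed to Spark with `--conf key=value` or through `SparkSession.builder.config(key, value)`. Note that moving from Nessie to Glue gives up Nessie's branch-based versioning, so queries that rely on Nessie references would need to be revisited.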

📝 License

This project is licensed under the MIT License. See the LICENSE file for details.


Built with:
Apache Iceberg | Spark | Kafka | Trino | Docker