A scalable, near-real-time data pipeline for ingesting, processing, and querying JSON data using Apache Iceberg, Spark Structured Streaming, and Kafka. Designed for ACID-compliant storage, efficient upserts, and seamless cloud deployment.
- Near-Real-Time Ingestion: API layer with FastAPI for JSON file uploads and Kafka for event streaming.
- ACID-Compliant Storage: Apache Iceberg tables managed by Nessie Catalog for versioning and schema enforcement.
- Distributed Processing: Spark Structured Streaming with micro-batches (0.1s intervals) for validation, deduplication, and merging.
- Optimized Querying: Trino SQL engine for low-latency analytics and time-travel queries.
- Cloud-Ready: Dockerized components (MinIO, Kafka, Spark, Nessie, Trino) with AWS deployment guidelines.
- API Layer: FastAPI endpoints ingest JSON files into MinIO and publish metadata to Kafka.
- Event Streaming: Kafka decouples ingestion from processing, ensuring fault tolerance.
- Spark Processing: Micro-batch jobs validate, clean, and merge data into Iceberg tables.
- Iceberg Storage: Partitioned, compressed tables with ZSTD and automated compaction.
- Trino Analytics: SQL queries on Iceberg tables with Nessie versioning.
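The ingestion hand-off between the API layer and Kafka can be sketched in plain Python. This is a minimal illustration of the metadata event the FastAPI service would publish after a file lands in MinIO; the field names and topic conventions here are assumptions, not the project's actual schema:

```python
import json
import uuid
from datetime import datetime, timezone

def build_upload_event(bucket: str, object_key: str, size_bytes: int) -> str:
    """Build the JSON metadata event published to Kafka once a raw
    JSON file has been stored in MinIO. Field names are illustrative
    assumptions, not the project's actual event schema."""
    event = {
        "event_id": str(uuid.uuid4()),
        "bucket": bucket,
        "object_key": object_key,
        "size_bytes": size_bytes,
        "uploaded_at": datetime.now(timezone.utc).isoformat(),
    }
    return json.dumps(event)

# Inside the FastAPI handler this payload would be handed to a Kafka
# producer (e.g. kafka-python's KafkaProducer.send), decoupling upload
# from the Spark processing that consumes the topic.
```

Because the Spark consumers only receive lightweight metadata rather than file contents, the Kafka topic stays small and the raw payloads remain in object storage.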
- Docker & Docker Compose
git clone https://github.com/Elkoumy/real_time_data_lake.git
cd real_time_data_lake
docker-compose up -d
Services included in the docker-compose file:
- MinIO: S3-compatible object storage for JSON files.
- Kafka: Distributed event streaming platform.
- Spark: Unified analytics engine for big data processing.
- Nessie: Git-like versioning for Iceberg tables.
- Trino: Distributed SQL query engine for Iceberg tables.
- FastAPI: Web API framework for JSON file uploads.
- Data Simulator: Python script for generating sample JSON data upload requests.
├── webservice/         # FastAPI upload service
├── simulator/          # Upload JSON data simulator
├── spark-jobs/         # Spark Structured Streaming jobs
├── trino/              # Trino configuration and queries
├── docker-compose.yml  # Orchestration
├── docs/               # Architecture diagrams and notes
└── data/               # Sample JSON data and schema
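The validate/deduplicate/merge step performed per micro-batch can be illustrated with a plain-Python sketch. The real jobs in spark-jobs/ operate on PySpark DataFrames and issue an Iceberg MERGE INTO; the helper names and the assumed `id`/`updated_at` columns below are illustrative only:

```python
from typing import Iterable

# Assumed key and ordering columns; the actual tables may differ.
REQUIRED_FIELDS = {"id", "updated_at"}

def validate(records: Iterable[dict]) -> list[dict]:
    """Keep only records carrying the assumed required fields."""
    return [r for r in records if REQUIRED_FIELDS <= r.keys()]

def deduplicate(records: list[dict]) -> list[dict]:
    """Within a micro-batch, keep the latest record per id — the
    in-memory analogue of a window/row_number dedup in Spark."""
    latest: dict = {}
    for r in records:
        prev = latest.get(r["id"])
        if prev is None or r["updated_at"] > prev["updated_at"]:
            latest[r["id"]] = r
    return list(latest.values())

def merge_into(table: dict, batch: list[dict]) -> None:
    """Upsert by id — the in-memory analogue of Iceberg's
    MERGE INTO ... WHEN MATCHED UPDATE / WHEN NOT MATCHED INSERT."""
    for r in batch:
        table[r["id"]] = r
```

Running `merge_into(table, deduplicate(validate(batch)))` per micro-batch mirrors the upsert semantics the streaming jobs rely on for exactly-once table state.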
curl -X POST -F "file=@data/employees/employees_4.json" http://localhost:8000/upload/employees_4
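The same upload can be scripted from Python with only the standard library, which is roughly what the data simulator does. The endpoint path and the `file` form field follow the curl example above; everything else in this sketch is illustrative:

```python
import uuid

def build_multipart(field: str, filename: str, payload: bytes):
    """Build a multipart/form-data body equivalent to curl's
    -F "file=@...". Returns the encoded body and request headers."""
    boundary = uuid.uuid4().hex
    body = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        "Content-Type: application/json\r\n\r\n"
    ).encode() + payload + f"\r\n--{boundary}--\r\n".encode()
    headers = {"Content-Type": f"multipart/form-data; boundary={boundary}"}
    return body, headers

# Sending (requires the docker-compose stack to be running):
# import http.client
# with open("data/employees/employees_4.json", "rb") as f:
#     body, headers = build_multipart("file", "employees_4.json", f.read())
# conn = http.client.HTTPConnection("localhost", 8000)
# conn.request("POST", "/upload/employees_4", body, headers)
```

In practice the simulator would loop over the sample files in data/ and fire one request per file to drive the pipeline end to end.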
docker exec -it trino trino
Run queries:
SELECT * FROM iceberg_datalake.default.sessions;
To deploy on AWS:
- Replace MinIO with Amazon S3.
- Use EMR for Spark and MSK for Kafka.
- Migrate Nessie Catalog to AWS Glue Catalog.
- Deploy FastAPI on EC2 or Fargate behind an ALB.
- Use Trino on EMR or Athena for querying.
This project is licensed under the MIT License. See the LICENSE file for details.
Built with:
Apache Iceberg | Spark | Kafka | Trino | Docker
