This repository provides a reusable starting point for building end-to-end real-time data pipelines with Apache Kafka, Apache Spark Structured Streaming, PostgreSQL, and a Flask REST API. A 🇫🇷 French version of this guide is available in README.fr.md.
Producer -> Kafka -> Spark Structured Streaming -> PostgreSQL <- Flask API
- Apache Kafka for event ingestion
- Apache Spark Structured Streaming for real-time processing
- PostgreSQL for durable storage
- Flask for REST exposure
- Docker Compose for local orchestration
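
To make the flow above concrete, here is a minimal plain-Python sketch of what each stage does to an event: the producer serializes it to bytes, the consumer parses it, and an aggregation step turns parsed events into rows ready for storage. It does not use Kafka or Spark. The field names `event_type` and `occurred_at`, the one-minute window, and all function names are assumptions made for this illustration, not the repository's schema.

```python
import json
from collections import Counter
from datetime import datetime, timezone


def encode_event(event_type, occurred_at):
    """Producer side: serialize one event to the bytes a message would carry."""
    payload = {"event_type": event_type, "occurred_at": occurred_at.isoformat()}
    return json.dumps(payload).encode("utf-8")


def decode_event(raw):
    """Consumer side: parse message bytes, returning None for malformed records."""
    try:
        data = json.loads(raw.decode("utf-8"))
        return data["event_type"], datetime.fromisoformat(data["occurred_at"])
    except (ValueError, KeyError, TypeError):
        # ValueError also covers invalid UTF-8 and invalid JSON.
        return None


def count_per_minute(raw_messages):
    """Aggregate parsed events into (window_start, event_type, count) rows."""
    counts = Counter()
    for raw in raw_messages:
        parsed = decode_event(raw)
        if parsed is None:
            continue  # skip records that could not be parsed
        event_type, occurred_at = parsed
        window_start = occurred_at.replace(second=0, microsecond=0)
        counts[(window_start.isoformat(), event_type)] += 1
    return [(window, kind, n) for (window, kind), n in sorted(counts.items())]
```

In the real pipeline the aggregation runs inside Spark Structured Streaming and the resulting rows are written to PostgreSQL; the sketch only mirrors the shape of that transformation.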
.
├── checkpoint/ # Spark checkpoints (mounted volume)
├── docker/ # Dockerfiles (Spark, API, producer)
├── scripts/ # helper scripts (Kafka init, …)
├── src/
│ ├── api/ # Flask endpoints
│ ├── consumer/ # Spark job
│ ├── producer/ # Kafka producer
│ └── utils/ # shared helpers
├── tests/
│ ├── integration/ # Docker Compose smoke tests
│ └── unit/ # Spark unit tests
├── .env.example # environment template
├── docker-compose.yml # service orchestration
├── requirements.txt # runtime dependencies
├── requirements-dev.txt # test dependencies
└── README.md
- Copy `.env.example` to `.env` and tweak the values if necessary.
- Build the images: `docker compose build`
- Launch the stack: `docker compose up -d`
- Check the API: `curl http://localhost:5000/health`
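
Right after `docker compose up -d` the API may need a few seconds before it answers. The following sketch polls the health endpoint from Python instead of retrying `curl` by hand. The function name `wait_for_health`, the retry count, and the delay are assumptions for this example; only the URL comes from the steps above.

```python
import http.client
import time
import urllib.request


def wait_for_health(url="http://localhost:5000/health", attempts=10, delay=1.0):
    """Poll `url` until it answers with HTTP 200 or the attempts run out."""
    for attempt in range(1, attempts + 1):
        try:
            with urllib.request.urlopen(url, timeout=2) as response:
                if response.status == 200:
                    return True
        except (OSError, http.client.HTTPException):
            # URLError, HTTPError, refused connections, and timeouts are all
            # OSError subclasses; treat every one of them as "not ready yet".
            pass
        if attempt < attempts:
            time.sleep(delay)
    return False
```

Calling `wait_for_health()` returns `True` as soon as the endpoint responds with status 200 and `False` if it never does within the allotted attempts.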
- Kafka (`9092`) and Zookeeper (`2181`)
- Spark monitoring UI (`4040`, `8080`)
- PostgreSQL (`5432`) and pgAdmin (`5050`)
- Flask API (`5000` by default)
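
To see which of the ports listed above are accepting connections, a small TCP probe is often enough. This is a minimal sketch using only the standard library. The dictionary keys are labels invented for this example, and `localhost` assumes the ports are published on the host as listed.

```python
import socket

# Port numbers from the list above; the keys are labels for this sketch only.
SERVICE_PORTS = {
    "kafka": 9092,
    "zookeeper": 2181,
    "spark-ui-4040": 4040,
    "spark-ui-8080": 8080,
    "postgres": 5432,
    "pgadmin": 5050,
    "flask-api": 5000,
}


def port_is_open(host, port, timeout=1.0):
    """Return True if a TCP connection to (host, port) succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_ports(host="localhost", ports=SERVICE_PORTS):
    """Map each service label to whether its port accepts connections."""
    return {name: port_is_open(host, port) for name, port in ports.items()}
```

An open port only shows that something is listening; it does not prove the service behind it is healthy.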
- Install the dependencies: `pip install -r requirements.txt -r requirements-dev.txt`
- Unit tests: `pytest tests/unit` (skipped on Python ≥ 3.12 because PySpark 3.4 lacks support)
- Integration smoke test (requires Docker): `RUN_INTEGRATION_TESTS=1 pytest tests/integration`
- Integration tests are disabled by default and only validate the Compose configuration.
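
The two skip rules above can be expressed as small predicates. The sketch below shows one way such gating might look; it is not the repository's actual test code. The helper names and the test class are hypothetical, and it uses `unittest` from the standard library (pytest collects `unittest.TestCase` classes too). The smoke test relies on `docker compose config --quiet`, which validates the Compose file without printing it.

```python
import os
import subprocess
import sys
import unittest


def integration_tests_enabled(env=None):
    """True only when RUN_INTEGRATION_TESTS is set to "1"."""
    env = os.environ if env is None else env
    return env.get("RUN_INTEGRATION_TESTS") == "1"


def pyspark_supported(version_info=None):
    """True on Python versions below 3.12 (assumed limit for PySpark 3.4)."""
    version_info = sys.version_info if version_info is None else version_info
    return tuple(version_info[:2]) < (3, 12)


@unittest.skipUnless(integration_tests_enabled(), "set RUN_INTEGRATION_TESTS=1")
class ComposeConfigTest(unittest.TestCase):
    def test_compose_file_is_valid(self):
        # Validate docker-compose.yml without starting any container.
        result = subprocess.run(
            ["docker", "compose", "config", "--quiet"],
            capture_output=True,
            text=True,
        )
        self.assertEqual(result.returncode, 0, result.stderr)
```

A unit test that needs Spark could be guarded the same way with `@unittest.skipUnless(pyspark_supported(), "PySpark 3.4 does not support this Python")`.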
- Extend the Spark logic through `parse_kafka_records` and `aggregate_events` in `src/consumer/streaming_job.py`.
- Adjust Kafka topics, schemas, and PostgreSQL tables to match your use case.
- Add observability (structured logging, metrics, tracing) before promoting to production.
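
For the structured-logging part of the observability item, one common approach is a JSON formatter on top of the standard `logging` module. The following is a minimal sketch under that assumption. `JsonFormatter`, `build_logger`, and the `context` field are names invented for this example and do not exist in the repository.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object per line."""

    def format(self, record):
        payload = {
            "time": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Optional structured fields passed via `extra={"context": {...}}`.
        context = getattr(record, "context", None)
        if context:
            payload["context"] = context
        if record.exc_info:
            payload["exception"] = self.formatException(record.exc_info)
        return json.dumps(payload, default=str)


def build_logger(name="pipeline", stream=None):
    """Return a logger that writes JSON lines to `stream` (stderr by default)."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # avoid duplicate output through the root logger
    if not logger.handlers:
        handler = logging.StreamHandler(stream)
        handler.setFormatter(JsonFormatter())
        logger.addHandler(handler)
    return logger
```

A call such as `build_logger().info("batch written", extra={"context": {"rows": 42}})` then emits one JSON line that log collectors can parse without custom regular expressions.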
- Initialize Git: `git init && git add . && git commit -m "Initial skeleton"`
- Create an empty repository on GitHub and link it: `git remote add origin https://github.com/<user>/<repo>.git`
- Push the main branch: `git push -u origin main`