This project implements a complete, end-to-end data lakehouse platform designed to simulate the ingestion, processing, and analysis of real-time ride-share event data. It is built entirely on a modern, open-source stack and is fully containerized with Docker Compose for easy and reproducible local development.
The platform simulates real-time events, publishing them to an Apache Kafka topic. These events are consumed by an Apache Flink streaming job, which both persists the raw events and computes real-time aggregations, writing the results to Apache Iceberg tables in the lakehouse.
The lakehouse is built with MinIO as the S3-compatible object storage layer and Project Nessie as a transactional, Git-like catalog for Iceberg, providing data versioning capabilities.
Data can be queried and analyzed via a standalone Apache Spark cluster, which includes a Thrift Server to allow SQL querying from BI tools. Batch processing and data maintenance tasks, such as creating dimension tables and running compaction, are orchestrated with Apache Airflow. Finally, the BI layer is served by Apache Superset, which connects to the lakehouse through the Spark Thrift Server to create dashboards and visualizations.
The platform is composed of several logical stacks, each managed by its own docker-compose file. The data flows from ingestion through processing to the analytics and BI layers.
```mermaid
---
config:
  look: neo
  theme: default
---
flowchart TD
    subgraph s1[" "]
        n13["agg tables"]
        n9["fct tables"]
        n10["dim tables"]
    end
    n8["Spark"] --> n9 & n10
    n4["trip_id<br>trip_type_id<br>event_type<br>event_datetime<br>passenger_count<br>pickup_location<br>dropoff_location<br>trip_distance<br>payment_type"] --> n2["KAFKA"]
    n3["User"] --- n4
    n2 --> n5["Flink"]
    n5 --> n13
    n12["Airflow"] --> n8
    n7["Iceberg Lakehouse"] --- s1
    s1 --> n6["Live Dashboards<br>- Trending Pickup Zones<br>- Popular Dropoff Zones<br>- Dynamic pricing"] & n11["Final Tables<br>- Analytics Team<br>- ML Team"]
    n9@{ shape: subproc}
    n10@{ shape: subproc}
    n13@{ shape: subproc}
    n8@{ shape: hex}
    n4@{ shape: internal-storage}
    n2@{ shape: h-cyl}
    n3@{ shape: dbl-circ}
    n5@{ shape: hex}
    n12@{ shape: event}
    n7@{ shape: disk}
    n6@{ shape: procs}
    n11@{ shape: procs}
```
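The event payload shown in the diagram can be illustrated with a small sketch. The field names come from the diagram above; the concrete values, the JSON encoding, and the producer behavior implied by the comment are illustrative assumptions, not the project's exact wire format:

```python
import json
from datetime import datetime, timezone

# Hypothetical ride event matching the fields in the diagram above.
# The concrete values are made up for illustration.
event = {
    "trip_id": "a1b2c3d4",
    "trip_type_id": 1,
    "event_type": "pickup",
    "event_datetime": datetime(2025, 1, 1, 8, 30, tzinfo=timezone.utc).isoformat(),
    "passenger_count": 2,
    "pickup_location": 132,   # NYC taxi zone IDs
    "dropoff_location": 48,
    "trip_distance": 3.7,
    "payment_type": 1,
}

# Serialize to bytes, as a Kafka producer would before publishing to the topic.
payload = json.dumps(event).encode("utf-8")
```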
| Component | Technology | Role in the Platform |
|---|---|---|
| Ingestion | Apache Kafka | Provides a durable, high-throughput message queue for real-time events. |
| Real-Time | Apache Flink | Powers stateful stream processing and real-time aggregations with low latency. |
| Storage | MinIO (S3) | Acts as the scalable object storage layer for the data lakehouse. |
| Table Format | Apache Iceberg | Provides a high-performance, transactional table format for the data lake. |
| Catalog | Project Nessie | Provides Git-like, transactional versioning for the Iceberg tables, enabling data governance. |
| Query Engine | Apache Spark | The primary engine for batch ETL, ad-hoc queries, and serving the BI layer. |
| Orchestration | Apache Airflow | Manages and schedules batch data pipelines, like dimension table creation and compaction. |
| BI Layer | Apache Superset | Provides an interactive web UI for building dashboards and visualizing data. |
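For context on how the query engine reaches the catalog and storage layers: wiring Spark to Nessie and MinIO typically uses Iceberg's standard catalog properties. The property keys below are standard Iceberg/Nessie settings, but the hostnames, ports, branch name, API path, and warehouse bucket are assumptions sketched from the stack described above, not this project's actual configuration (see the `.env` and compose files for the real values):

```properties
spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions,org.projectnessie.spark.extensions.NessieSparkSessionExtensions
spark.sql.catalog.nessie=org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.nessie.catalog-impl=org.apache.iceberg.nessie.NessieCatalog
spark.sql.catalog.nessie.uri=http://nessie:19120/api/v2
spark.sql.catalog.nessie.ref=main
spark.sql.catalog.nessie.warehouse=s3a://warehouse/
spark.hadoop.fs.s3a.endpoint=http://minio:9000
spark.hadoop.fs.s3a.path.style.access=true
```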
```mermaid
---
config:
  look: neo
  theme: default
---
erDiagram
    fct_ride_events {
        string trip_id
        int trip_type_id
        string event_type
        timestamp event_datetime
        int passenger_count
        int pickup_location
        int dropoff_location
        double trip_distance
        int payment_type
        timestamp event_time
        date event_date
        int event_hour
        int date_key
        timestamp processing_timestamp
    }
    dim_payment_type {
        int payment_type_id PK
        string payment_type_description
        string payment_category
        boolean is_customer_facing
        boolean current
        timestamp valid_from
        timestamp valid_to
    }
    dim_trip_type {
        string trip_type_id PK
        string vehicle_class
        int max_passengers
        decimal base_fare_multiplier
        boolean current
        timestamp valid_from
        timestamp valid_to
    }
    dim_zone {
        int location_id PK
        string borough
        string zone
        string service_zone
        boolean current
        timestamp valid_from
        timestamp valid_to
    }
    dim_date {
        int date_key PK
        date full_date
        int year
        int month
        int day
        int quarter
        string month_name
        string day_name
        string quarter_name
        int day_of_week
        int day_of_year
        int week_of_year
        boolean is_weekend
        boolean is_month_start
        boolean is_month_end
        timestamp last_updated
    }
    agg_pickup_demand_by_location_1m_30m_hop_5m_preceeding {
        timestamp event_time PK
        int pickup_location PK
        int ride_count
        double preceeding_5_min_avg
        double demand_factor
        date event_date
        int event_hour
        int date_key
        timestamp processing_time
    }
    agg_dropoff_count_by_location_15m_tumble {
        timestamp event_time PK
        int dropoff_location PK
        int ride_count
        date event_date
        int event_hour
        int date_key
        timestamp processing_time
    }
    fct_ride_events ||--o{ dim_payment_type : "by payment_type"
    fct_ride_events ||--o{ dim_trip_type : "by trip_type"
    fct_ride_events ||--o{ dim_zone : "by pickup_location"
    fct_ride_events ||--o{ dim_zone : "by dropoff_location"
    fct_ride_events ||--o{ dim_date : "by date_key"
    agg_dropoff_count_by_location_15m_tumble ||--o{ dim_zone : "by dropoff_location"
    agg_dropoff_count_by_location_15m_tumble ||--o{ dim_date : "by date_key"
    agg_pickup_demand_by_location_1m_30m_hop_5m_preceeding ||--o{ dim_zone : "by pickup_location"
    agg_pickup_demand_by_location_1m_30m_hop_5m_preceeding ||--o{ dim_date : "by date_key"
```
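As a query sketch against this star schema: the table and column names below come from the diagram, while the exact namespace prefix and filter semantics are assumptions for illustration (run via SQL Lab, the Thrift Server, or Spark SQL):

```sql
-- Illustrative: busiest weekday dropoff zones from the 15-minute tumbling
-- window aggregate, joined to the current zone and date dimensions.
-- Adjust the catalog/namespace prefix to match your environment.
SELECT z.borough,
       z.zone,
       SUM(a.ride_count) AS total_dropoffs
FROM agg_dropoff_count_by_location_15m_tumble a
JOIN dim_zone z
  ON z.location_id = a.dropoff_location
 AND z.current = true
JOIN dim_date d
  ON d.date_key = a.date_key
WHERE d.is_weekend = false
GROUP BY z.borough, z.zone
ORDER BY total_dropoffs DESC
LIMIT 10;
```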
- Docker: Install Docker
- Docker Compose: Install Docker Compose
- GNU Make: Required to use the `Makefile` commands.
- Clone the Repository:

  ```sh
  git clone https://github.com/mxaviersmp/ride-share-olap.git
  cd ride-share-olap
  ```

- Environment Configuration: Create a `.env` file in the root of the project by copying the provided template. This file contains all the passwords, user names, and endpoints for the services.

  ```sh
  cp sample.env .env
  ```

  Review the values in the `.env` file and change them if necessary.

- Build Docker Images: The project uses several custom Docker images for Flink, Spark, and Airflow. Build them all with a single command:

  ```sh
  make build
  ```

- Start All Services: This command starts all the services defined in the `docker-compose` files in detached mode. It's recommended to start services one stack at a time to manage resources.

  ```sh
  # Start the core data platform
  make lakehouse-up
  make kafka-up
  make flink-up

  # Start the orchestration and BI layers as needed
  make airflow-up
  make bi-up
  ```

  Alternatively, to start everything at once:

  ```sh
  make up
  ```
- Data Generation: The real-time simulation is based on a CSV file of events, which can be generated from the raw NYC Taxi dataset Parquet files.

  - Download the Data: Download the raw Parquet files (e.g., for yellow and green taxis for a specific month) and place them in the `./data` directory.
  - Run the Preparation Script: Execute the data preparation script to convert the Parquet files into a single, chronologically sorted CSV file that the event producer will use.

  ```sh
  python data_prep/prepare_data.py \
      --y_filepath ./data/yellow_tripdata_2025-01.parquet \
      --g_filepath ./data/green_tripdata_2025-01.parquet \
      --dst_filepath ./data/tripdata_2025-01.csv
  ```

- Start the Data Producer: This command starts the Python script that simulates real-time events and sends them to Kafka. It runs in the background.

  ```sh
  make produce-events
  ```

- Submit the Flink Streaming Job: This command submits the main Flink job, which reads from Kafka, processes the data, and writes to the Iceberg tables.

  ```sh
  make consume-events
  ```

  To resume a stopped job from its last checkpoint, use:

  ```sh
  make consume-events restore_checkpoint=true
  ```
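Conceptually, the preparation step interleaves the yellow and green trip records into a single chronological stream. The sketch below shows that merge with stdlib tools only; the real script reads Parquet and writes CSV, and the `event_datetime` key name is an assumption mirroring the event schema used elsewhere in this README:

```python
import heapq

def merge_chronologically(yellow, green, key="event_datetime"):
    """Merge two event streams that are each already sorted by timestamp.

    `yellow` and `green` are iterables of dicts. heapq.merge streams the
    result without loading everything into memory at once.
    """
    return list(heapq.merge(yellow, green, key=lambda e: e[key]))

# Tiny illustrative inputs (ISO-8601 strings sort chronologically).
yellow = [{"event_datetime": "2025-01-01T00:00:05", "src": "yellow"},
          {"event_datetime": "2025-01-01T00:01:10", "src": "yellow"}]
green = [{"event_datetime": "2025-01-01T00:00:30", "src": "green"}]

timeline = merge_chronologically(yellow, green)
```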
- Kafka UI: http://localhost:8090
- Flink UI: http://localhost:8081
- MinIO Console: http://localhost:9001 (use credentials from `.env`)
- Spark Master UI: http://localhost:9090
- Spark History Server: http://localhost:18080
- Superset UI: http://localhost:8088 (use credentials from `.env`)
- Airflow UI: http://localhost:8080 (use credentials from `.env`)

- Superset: Connect to the Spark Thrift Server using the connection string `hive://spark-thrift-server:10000/default`, then query the tables in SQL Lab and create dashboards.
- Spark Notebook: Access the Jupyter notebook at http://localhost:8888 to run interactive PySpark queries.
- Airflow: Use the Airflow UI to trigger DAGs that submit Spark jobs for batch processing tasks like populating dimension tables or running table compaction.
The Makefile provides a convenient interface for managing the platform.
- `make help`: Shows a list of all available commands.
- `make up`: Starts all services.
- `make down`: Stops all services.
- `make destroy`: Stops all services and removes their volumes (deletes all data).
- `make build`: Builds all custom Docker images.
- `make logs`: Tails the logs for all services.
- `make <service>-up`: Starts a specific service stack (e.g., `make kafka-up`).
- `make <service>-logs container=<container_name>`: Tails the logs for a specific container (e.g., `make flink-logs container=taskmanager-1`).
- `make <service>-reset`: Destroys, rebuilds, and starts a specific service stack.
