
Open Data Lakehouse Ride-Share Analytics Platform

1. Overview

This project implements a complete, end-to-end data lakehouse platform designed to simulate the ingestion, processing, and analysis of real-time ride-share event data. It is built entirely on a modern, open-source stack and is fully containerized with Docker Compose for easy and reproducible local development.

The platform simulates real-time events, publishing them to an Apache Kafka topic. These events are then consumed by an Apache Flink streaming job, which writes the raw data and computes real-time aggregations, storing them in an Apache Iceberg data lakehouse format.
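A minimal sketch of one such event and how it might be published. The field names match the event schema used throughout this project; the concrete values, topic name, and broker port are illustrative assumptions, not the project's actual configuration:

```python
import json
import uuid
from datetime import datetime, timezone

def make_ride_event(event_type: str, pickup_location: int, dropoff_location: int) -> dict:
    """Build one ride-share event with the fields consumed by the Flink job."""
    return {
        "trip_id": str(uuid.uuid4()),
        "trip_type_id": 1,                 # illustrative trip type
        "event_type": event_type,
        "event_datetime": datetime.now(timezone.utc).isoformat(),
        "passenger_count": 2,
        "pickup_location": pickup_location,
        "dropoff_location": dropoff_location,
        "trip_distance": 3.4,
        "payment_type": 1,
    }

event = make_ride_event("pickup", pickup_location=132, dropoff_location=236)
payload = json.dumps(event).encode("utf-8")

# With kafka-python and this stack's broker, publishing would look roughly like:
# from kafka import KafkaProducer
# producer = KafkaProducer(bootstrap_servers="localhost:9092")  # hypothetical port
# producer.send("ride-events", payload)                         # hypothetical topic name
```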

The lakehouse is built with MinIO as the S3-compatible object storage layer and Project Nessie as a transactional, Git-like catalog for Iceberg, providing data versioning capabilities.

Data can be queried and analyzed via a standalone Apache Spark cluster, which includes a Thrift Server to allow SQL querying from BI tools. Batch processing and data maintenance tasks, such as creating dimension tables and running compaction, are orchestrated with Apache Airflow. Finally, the BI layer is served by Apache Superset, which connects to the lakehouse through the Spark Thrift Server to create dashboards and visualizations.


2. Architecture

The platform is composed of several logical stacks, each managed by its own docker-compose file. The data flows from ingestion through processing to the analytics and BI layers.

System Design

```mermaid
---
config:
  look: neo
  theme: default
---
flowchart TD
 subgraph s1[" "]
        n13["agg tables"]
        n9["fct tables"]
        n10["dim tables"]
  end
    n8["Spark"] --> n9 & n10
    n4["trip_id<br>trip_type_id<br>event_type<br>event_datetime<br>passenger_count<br>pickup_location<br>dropoff_location<br>trip_distance<br>payment_type"] --> n2["KAFKA"]
    n3["User"] --- n4
    n2 --> n5["Flink"]
    n5 --> n13
    n12["Airflow"] --> n8
    n7["Iceberg Lakehouse"] --- s1
    s1 --> n6["Live Dashboards<br>- Trending Pickup Zones<br>- Popular Dropoff Zones<br>- Dynamic pricing"] & n11["Final Tables<br>- Analytics Team<br>- ML Team"]
    n9@{ shape: subproc}
    n10@{ shape: subproc}
    n13@{ shape: subproc}
    n8@{ shape: hex}
    n4@{ shape: internal-storage}
    n2@{ shape: h-cyl}
    n3@{ shape: dbl-circ}
    n5@{ shape: hex}
    n12@{ shape: event}
    n7@{ shape: disk}
    n6@{ shape: procs}
    n11@{ shape: procs}
```
| Component | Technology | Role in the Platform |
| --- | --- | --- |
| Ingestion | Apache Kafka | Provides a durable, high-throughput message queue for real-time events. |
| Real-Time | Apache Flink | Powers stateful stream processing and real-time aggregations with low latency. |
| Storage | MinIO (S3) | Acts as the scalable object storage layer for the data lakehouse. |
| Table Format | Apache Iceberg | Provides a high-performance, transactional table format for the data lake. |
| Catalog | Project Nessie | Provides Git-like, transactional versioning for the Iceberg tables, enabling data governance. |
| Query Engine | Apache Spark | The primary engine for batch ETL, ad-hoc queries, and serving the BI layer. |
| Orchestration | Apache Airflow | Manages and schedules batch data pipelines, like dimension table creation and compaction. |
| BI Layer | Apache Superset | Provides an interactive web UI for building dashboards and visualizing data. |
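The shape of the Flink job's real-time aggregations can be pictured with a plain-Python analogue of a tumbling-window count like `agg_dropoff_count_by_location_15m_tumble`. This is only a sketch of the windowing idea, not the Flink code; the sample events are made up:

```python
from collections import Counter
from datetime import datetime

WINDOW_MIN = 15

def tumble_key(event_datetime: str) -> str:
    """Floor a timestamp to the start of its 15-minute tumbling window."""
    ts = datetime.fromisoformat(event_datetime)
    floored = ts.replace(minute=ts.minute - ts.minute % WINDOW_MIN,
                         second=0, microsecond=0)
    return floored.isoformat()

# (event_datetime, dropoff_location) pairs -- illustrative data
events = [
    ("2025-01-01T00:02:00", 236),
    ("2025-01-01T00:14:59", 236),
    ("2025-01-01T00:16:00", 236),
]

# Count rides per (window_start, dropoff_location) key
counts = Counter((tumble_key(t), loc) for t, loc in events)
# counts -> {("2025-01-01T00:00:00", 236): 2, ("2025-01-01T00:15:00", 236): 1}
```

Flink maintains these counts incrementally and with event-time watermarks rather than over a finished list, but the grouping key is the same.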

Data Model

```mermaid
---
config:
  look: neo
  theme: default
---
erDiagram
    fct_ride_events {
        string trip_id
        int trip_type_id
        string event_type
        timestamp event_datetime
        int passenger_count
        int pickup_location
        int dropoff_location
        double trip_distance
        int payment_type
        timestamp event_time
        date event_date
        int event_hour
        int date_key
        timestamp processing_timestamp
    }
    dim_payment_type {
        int payment_type_id PK
        string payment_type_description
        string payment_category
        boolean is_customer_facing
        boolean current
        timestamp valid_from
        timestamp valid_to
    }
    dim_trip_type {
        string trip_type_id PK
        string vehicle_class
        int max_passengers
        decimal base_fare_multiplier
        boolean current
        timestamp valid_from
        timestamp valid_to
    }
    dim_zone {
        int location_id PK
        string borough
        string zone
        string service_zone
        boolean current
        timestamp valid_from
        timestamp valid_to
    }
    dim_date {
        int date_key PK
        date full_date
        int year
        int month
        int day
        int quarter
        string month_name
        string day_name
        string quarter_name
        int day_of_week
        int day_of_year
        int week_of_year
        boolean is_weekend
        boolean is_month_start
        boolean is_month_end
        timestamp last_updated
    }
    agg_pickup_demand_by_location_1m_30m_hop_5m_preceeding {
        timestamp event_time PK
        int pickup_location PK
        int ride_count
        double preceeding_5_min_avg
        double demand_factor
        date event_date
        int event_hour
        int date_key
        timestamp processing_time
    }
    agg_dropoff_count_by_location_15m_tumble {
        timestamp event_time PK
        int dropoff_location PK
        int ride_count
        date event_date
        int event_hour
        int date_key
        timestamp processing_time
    }

    fct_ride_events ||--o{ dim_payment_type : "by payment_type"
    fct_ride_events ||--o{ dim_trip_type : "by trip_type"
    fct_ride_events ||--o{ dim_zone : "by pickup_location"
    fct_ride_events ||--o{ dim_zone : "by dropoff_location"
    fct_ride_events ||--o{ dim_date : "by date_key"

    agg_dropoff_count_by_location_15m_tumble ||--o{ dim_zone : "by dropoff_location"
    agg_dropoff_count_by_location_15m_tumble ||--o{ dim_date : "by date_key"

    agg_pickup_demand_by_location_1m_30m_hop_5m_preceeding ||--o{ dim_zone : "by pickup_location"
    agg_pickup_demand_by_location_1m_30m_hop_5m_preceeding ||--o{ dim_date : "by date_key"
```
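Alongside the raw payload, `fct_ride_events` carries derived columns (`event_date`, `event_hour`, `date_key`) used for partitioning and for joining to `dim_date`. A minimal sketch of that derivation, assuming `date_key` follows the common `yyyymmdd` surrogate-key convention:

```python
from datetime import datetime

def derive_event_columns(event_datetime: str) -> dict:
    """Derive event_date, event_hour, and date_key from an ISO timestamp."""
    ts = datetime.fromisoformat(event_datetime)
    return {
        "event_date": ts.date().isoformat(),
        "event_hour": ts.hour,
        # Assumption: date_key is the yyyymmdd integer key of dim_date.
        "date_key": ts.year * 10000 + ts.month * 100 + ts.day,
    }

cols = derive_event_columns("2025-01-15T08:42:10")
# cols -> {"event_date": "2025-01-15", "event_hour": 8, "date_key": 20250115}
```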

3. Prerequisites

  • Docker and Docker Compose: the entire platform runs as containers.
  • GNU Make: all setup and management commands go through the Makefile.
  • Python 3: required for the data preparation script.

4. Setup

  1. Clone the Repository:

    git clone https://github.com/mxaviersmp/ride-share-olap.git
    cd ride-share-olap
  2. Environment Configuration: Create a .env file in the root of the project by copying the provided template. This file contains all the passwords, usernames, and endpoints for the services.

    cp sample.env .env

    Review the values in the .env file and change them if necessary.

  3. Build Docker Images: The project uses several custom Docker images for Flink, Spark, and Airflow. Build them all with a single command:

    make build
  4. Start All Services: The following commands start the services defined in the docker-compose files in detached mode. To keep resource usage manageable, it's recommended to start one stack at a time.

    # Start the core data platform
    make lakehouse-up
    make kafka-up
    make flink-up
    
    # Start the orchestration and BI layers as needed
    make airflow-up
    make bi-up

    Alternatively, to start everything at once:

    make up
  5. Data Generation: The real-time simulation is based on a CSV file of events. This file can be generated from the raw NYC Taxi dataset Parquet files.

    Download the Data: Download the raw Parquet files (e.g., for yellow and green taxis for a specific month) and place them in the ./data directory.

    Run the Preparation Script: Execute the data preparation script to convert the Parquet files into a single, chronologically sorted CSV file that the event producer will use.

    python data_prep/prepare_data.py \
        --y_filepath ./data/yellow_tripdata_2025-01.parquet \
        --g_filepath ./data/green_tripdata_2025-01.parquet \
        --dst_filepath ./data/tripdata_2025-01.csv
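The core of a preparation step like this is tagging each source, combining the files, and sorting chronologically so the producer can replay events in order. A rough pandas sketch of that idea, with tiny synthetic DataFrames standing in for the Parquet files (column names and trip-type ids here are illustrative, not the actual script's):

```python
import pandas as pd

# Synthetic stand-ins for the yellow and green Parquet files;
# the real script would read them with pd.read_parquet(...).
yellow = pd.DataFrame({
    "event_datetime": ["2025-01-01 00:05", "2025-01-01 00:01"],
    "trip_distance": [1.2, 3.4],
})
green = pd.DataFrame({
    "event_datetime": ["2025-01-01 00:03"],
    "trip_distance": [2.0],
})

# Tag the source, combine, and sort chronologically.
yellow["trip_type_id"] = 1  # hypothetical id for yellow taxis
green["trip_type_id"] = 2   # hypothetical id for green taxis
combined = (
    pd.concat([yellow, green], ignore_index=True)
    .sort_values("event_datetime")
    .reset_index(drop=True)
)
# combined.to_csv("./data/tripdata_2025-01.csv", index=False)
```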

5. Usage

5.1. Running the Data Pipeline

  1. Start the Data Producer: This command starts the Python script that simulates real-time events and sends them to Kafka. It runs in the background.

    make produce-events
  2. Submit the Flink Streaming Job: This command submits the main Flink job, which reads from Kafka, processes the data, and writes to the Iceberg tables.

    make consume-events

    To resume a stopped job from its last checkpoint, use:

    make consume-events restore_checkpoint=true
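To replay historical events "in real time", a producer like the one in step 1 can pace itself by scaling the gaps between consecutive event timestamps. A minimal sketch, assuming a simple speedup factor (the project's actual pacing logic may differ):

```python
from datetime import datetime

def replay_delays(timestamps: list[str], speedup: float = 60.0) -> list[float]:
    """Seconds to sleep between successive events so historical
    timestamps are replayed at `speedup`x real time."""
    parsed = [datetime.fromisoformat(t) for t in timestamps]
    return [
        (later - earlier).total_seconds() / speedup
        for earlier, later in zip(parsed, parsed[1:])
    ]

delays = replay_delays(
    ["2025-01-01T00:00:00", "2025-01-01T00:01:00", "2025-01-01T00:03:00"],
    speedup=60.0,
)
# delays -> [1.0, 2.0]: at 60x, gaps of 1 and 2 minutes become 1 and 2 seconds
```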

5.2. Accessing UIs and Services

Each service stack exposes its web UI (e.g., Flink, Airflow, Superset, MinIO) on a localhost port; the exact ports are defined in the docker-compose files and the .env file.

5.3. Querying Data

  • Superset: Connect to the Spark Thrift Server using the connection string hive://spark-thrift-server:10000/default, then query the tables in SQL Lab and build dashboards.
    • Some pre-made Datasets and Charts are available in the superset/examples folder.
    • Instructions for using the Live Map are in the spark/code/zone-heatmap.ipynb notebook.
  • Spark Notebook: Access the Jupyter notebook at http://localhost:8888 to run interactive PySpark queries.

5.4. Orchestrating Batch Jobs

  • Use the Airflow UI to trigger DAGs that submit Spark jobs for batch processing tasks like populating dimension tables or running table compaction.

6. Makefile Commands

The Makefile provides a convenient interface for managing the platform.

  • make help: Shows a list of all available commands.
  • make up: Starts all services.
  • make down: Stops all services.
  • make destroy: Stops all services and removes their volumes (deletes all data).
  • make build: Builds all custom Docker images.
  • make logs: Tails the logs for all services.
  • make <service>-up: Starts a specific service stack (e.g., make kafka-up).
  • make <service>-logs container=<container_name>: Tails the logs for a specific container (e.g., make flink-logs container=taskmanager-1).
  • make <service>-reset: Destroys, rebuilds, and starts a specific service stack.
