
Open Data Lakehouse Ride-Share Analytics Platform

1. Overview

This project implements a complete, end-to-end data lakehouse platform designed to simulate the ingestion, processing, and analysis of real-time ride-share event data. It is built entirely on a modern, open-source stack and is fully containerized with Docker Compose for easy and reproducible local development.

The platform simulates real-time events, publishing them to an Apache Kafka topic. These events are then consumed by an Apache Flink streaming job, which writes the raw data and computes real-time aggregations, storing them in an Apache Iceberg data lakehouse format.
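A minimal sketch of one such event and how it might be published. The field names match the event schema used throughout this project; the concrete values, topic name, and broker port are illustrative assumptions, not the project's actual configuration:

```python
import json
import uuid
from datetime import datetime, timezone

def make_ride_event(event_type: str, pickup_location: int, dropoff_location: int) -> dict:
    """Build one ride-share event with the fields consumed by the Flink job."""
    return {
        "trip_id": str(uuid.uuid4()),
        "trip_type_id": 1,                 # illustrative trip type
        "event_type": event_type,
        "event_datetime": datetime.now(timezone.utc).isoformat(),
        "passenger_count": 2,
        "pickup_location": pickup_location,
        "dropoff_location": dropoff_location,
        "trip_distance": 3.4,
        "payment_type": 1,
    }

event = make_ride_event("pickup", pickup_location=132, dropoff_location=236)
payload = json.dumps(event).encode("utf-8")

# With kafka-python and this stack's broker, publishing would look roughly like:
# from kafka import KafkaProducer
# producer = KafkaProducer(bootstrap_servers="localhost:9092")  # hypothetical port
# producer.send("ride-events", payload)                         # hypothetical topic name
```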

The lakehouse is built with MinIO as the S3-compatible object storage layer and Project Nessie as a transactional, Git-like catalog for Iceberg, providing data versioning capabilities.

Data can be queried and analyzed via a standalone Apache Spark cluster, which includes a Thrift Server to allow SQL querying from BI tools. Batch processing and data maintenance tasks, such as creating dimension tables and running compaction, are orchestrated with Apache Airflow. Finally, the BI layer is served by Apache Superset, which connects to the lakehouse through the Spark Thrift Server to create dashboards and visualizations.


2. Architecture

The platform is composed of several logical stacks, each managed by its own docker-compose file. The data flows from ingestion through processing to the analytics and BI layers.

System Design

```mermaid
---
config:
  look: neo
  theme: default
---
flowchart TD
 subgraph s1[" "]
        n13["agg tables"]
        n9["fct tables"]
        n10["dim tables"]
  end
    n8["Spark"] --> n9 & n10
    n4["trip_id<br>trip_type_id<br>event_type<br>event_datetime<br>passenger_count<br>pickup_location<br>dropoff_location<br>trip_distance<br>payment_type"] --> n2["KAFKA"]
    n3["User"] --- n4
    n2 --> n5["Flink"]
    n5 --> n13
    n12["Airflow"] --> n8
    n7["Iceberg Lakehouse"] --- s1
    s1 --> n6["Live Dashboards<br>- Trending Pickup Zones<br>- Popular Dropoff Zones<br>- Dynamic pricing"] & n11["Final Tables<br>- Analytics Team<br>- ML Team"]
    n9@{ shape: subproc}
    n10@{ shape: subproc}
    n13@{ shape: subproc}
    n8@{ shape: hex}
    n4@{ shape: internal-storage}
    n2@{ shape: h-cyl}
    n3@{ shape: dbl-circ}
    n5@{ shape: hex}
    n12@{ shape: event}
    n7@{ shape: disk}
    n6@{ shape: procs}
    n11@{ shape: procs}
```
| Component | Technology | Role in the Platform |
| --- | --- | --- |
| Ingestion | Apache Kafka | Provides a durable, high-throughput message queue for real-time events. |
| Real-Time | Apache Flink | Powers stateful stream processing and real-time aggregations with low latency. |
| Storage | MinIO (S3) | Acts as the scalable object storage layer for the data lakehouse. |
| Table Format | Apache Iceberg | Provides a high-performance, transactional table format for the data lake. |
| Catalog | Project Nessie | Provides Git-like, transactional versioning for the Iceberg tables, enabling data governance. |
| Query Engine | Apache Spark | The primary engine for batch ETL, ad-hoc queries, and serving the BI layer. |
| Orchestration | Apache Airflow | Manages and schedules batch data pipelines, like dimension table creation and compaction. |
| BI Layer | Apache Superset | Provides an interactive web UI for building dashboards and visualizing data. |
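The shape of the Flink job's real-time aggregations can be pictured with a plain-Python analogue of a tumbling-window count like `agg_dropoff_count_by_location_15m_tumble`. This is only a sketch of the windowing idea, not the Flink code; the sample events are made up:

```python
from collections import Counter
from datetime import datetime

WINDOW_MIN = 15

def tumble_key(event_datetime: str) -> str:
    """Floor a timestamp to the start of its 15-minute tumbling window."""
    ts = datetime.fromisoformat(event_datetime)
    floored = ts.replace(minute=ts.minute - ts.minute % WINDOW_MIN,
                         second=0, microsecond=0)
    return floored.isoformat()

# (event_datetime, dropoff_location) pairs -- illustrative data
events = [
    ("2025-01-01T00:02:00", 236),
    ("2025-01-01T00:14:59", 236),
    ("2025-01-01T00:16:00", 236),
]

# Count rides per (window_start, dropoff_location) key
counts = Counter((tumble_key(t), loc) for t, loc in events)
# counts -> {("2025-01-01T00:00:00", 236): 2, ("2025-01-01T00:15:00", 236): 1}
```

Flink maintains these counts incrementally and with event-time watermarks rather than over a finished list, but the grouping key is the same.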

Data Model

```mermaid
---
config:
  look: neo
  theme: default
---
erDiagram
    fct_ride_events {
        string trip_id
        int trip_type_id
        string event_type
        timestamp event_datetime
        int passenger_count
        int pickup_location
        int dropoff_location
        double trip_distance
        int payment_type
        timestamp event_time
        date event_date
        int event_hour
        int date_key
        timestamp processing_timestamp
    }
    dim_payment_type {
        int payment_type_id PK
        string payment_type_description
        string payment_category
        boolean is_customer_facing
        boolean current
        timestamp valid_from
        timestamp valid_to
    }
    dim_trip_type {
        string trip_type_id PK
        string vehicle_class
        int max_passengers
        decimal base_fare_multiplier
        boolean current
        timestamp valid_from
        timestamp valid_to
    }
    dim_zone {
        int location_id PK
        string borough
        string zone
        string service_zone
        boolean current
        timestamp valid_from
        timestamp valid_to
    }
    dim_date {
        int date_key PK
        date full_date
        int year
        int month
        int day
        int quarter
        string month_name
        string day_name
        string quarter_name
        int day_of_week
        int day_of_year
        int week_of_year
        boolean is_weekend
        boolean is_month_start
        boolean is_month_end
        timestamp last_updated
    }
    agg_pickup_demand_by_location_1m_30m_hop_5m_preceeding {
        timestamp event_time PK
        int pickup_location PK
        int ride_count
        double preceeding_5_min_avg
        double demand_factor
        date event_date
        int event_hour
        int date_key
        timestamp processing_time
    }
    agg_dropoff_count_by_location_15m_tumble {
        timestamp event_time PK
        int dropoff_location PK
        int ride_count
        date event_date
        int event_hour
        int date_key
        timestamp processing_time
    }

    fct_ride_events ||--o{ dim_payment_type : "by payment_type"
    fct_ride_events ||--o{ dim_trip_type : "by trip_type"
    fct_ride_events ||--o{ dim_zone : "by pickup_location"
    fct_ride_events ||--o{ dim_zone : "by dropoff_location"
    fct_ride_events ||--o{ dim_date : "by date_key"

    agg_dropoff_count_by_location_15m_tumble ||--o{ dim_zone : "by dropoff_location"
    agg_dropoff_count_by_location_15m_tumble ||--o{ dim_date : "by date_key"

    agg_pickup_demand_by_location_1m_30m_hop_5m_preceeding ||--o{ dim_zone : "by pickup_location"
    agg_pickup_demand_by_location_1m_30m_hop_5m_preceeding ||--o{ dim_date : "by date_key"
```
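Alongside the raw payload, `fct_ride_events` carries derived columns (`event_date`, `event_hour`, `date_key`) used for partitioning and for joining to `dim_date`. A minimal sketch of that derivation, assuming `date_key` follows the common `yyyymmdd` surrogate-key convention:

```python
from datetime import datetime

def derive_event_columns(event_datetime: str) -> dict:
    """Derive event_date, event_hour, and date_key from an ISO timestamp."""
    ts = datetime.fromisoformat(event_datetime)
    return {
        "event_date": ts.date().isoformat(),
        "event_hour": ts.hour,
        # Assumption: date_key is the yyyymmdd integer key of dim_date.
        "date_key": ts.year * 10000 + ts.month * 100 + ts.day,
    }

cols = derive_event_columns("2025-01-15T08:42:10")
# cols -> {"event_date": "2025-01-15", "event_hour": 8, "date_key": 20250115}
```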

3. Prerequisites

  • Docker and Docker Compose: the entire platform runs as containers.
  • GNU Make: all setup and management commands go through the Makefile.
  • Python 3: required for the data preparation script.

4. Setup

  1. Clone the Repository:

    git clone https://github.com/mxaviersmp/ride-share-olap.git
    cd ride-share-olap
  2. Environment Configuration: Create a .env file in the root of the project by copying the provided template. This file contains all the passwords, usernames, and endpoints for the services.

    cp sample.env .env

    Review the values in the .env file and change them if necessary.

  3. Build Docker Images: The project uses several custom Docker images for Flink, Spark, and Airflow. Build them all with a single command:

    make build
  4. Start All Services: The following commands start the services defined in the docker-compose files in detached mode. To keep resource usage manageable, it's recommended to start one stack at a time.

    # Start the core data platform
    make lakehouse-up
    make kafka-up
    make flink-up
    
    # Start the orchestration and BI layers as needed
    make airflow-up
    make bi-up

    Alternatively, to start everything at once:

    make up
  5. Data Generation: The real-time simulation is based on a CSV file of events. This file can be generated from the raw NYC Taxi dataset Parquet files.

    Download the Data: Download the raw Parquet files (e.g., for yellow and green taxis for a specific month) and place them in the ./data directory.

    Run the Preparation Script: Execute the data preparation script to convert the Parquet files into a single, chronologically sorted CSV file that the event producer will use.

    python data_prep/prepare_data.py \
        --y_filepath ./data/yellow_tripdata_2025-01.parquet \
        --g_filepath ./data/green_tripdata_2025-01.parquet \
        --dst_filepath ./data/tripdata_2025-01.csv
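The core of a preparation step like this is tagging each source, combining the files, and sorting chronologically so the producer can replay events in order. A rough pandas sketch of that idea, with tiny synthetic DataFrames standing in for the Parquet files (column names and trip-type ids here are illustrative, not the actual script's):

```python
import pandas as pd

# Synthetic stand-ins for the yellow and green Parquet files;
# the real script would read them with pd.read_parquet(...).
yellow = pd.DataFrame({
    "event_datetime": ["2025-01-01 00:05", "2025-01-01 00:01"],
    "trip_distance": [1.2, 3.4],
})
green = pd.DataFrame({
    "event_datetime": ["2025-01-01 00:03"],
    "trip_distance": [2.0],
})

# Tag the source, combine, and sort chronologically.
yellow["trip_type_id"] = 1  # hypothetical id for yellow taxis
green["trip_type_id"] = 2   # hypothetical id for green taxis
combined = (
    pd.concat([yellow, green], ignore_index=True)
    .sort_values("event_datetime")
    .reset_index(drop=True)
)
# combined.to_csv("./data/tripdata_2025-01.csv", index=False)
```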

5. Usage

5.1. Running the Data Pipeline

  1. Start the Data Producer: This command starts the Python script that simulates real-time events and sends them to Kafka. It runs in the background.

    make produce-events
  2. Submit the Flink Streaming Job: This command submits the main Flink job, which reads from Kafka, processes the data, and writes to the Iceberg tables.

    make consume-events

    To resume a stopped job from its last checkpoint, use:

    make consume-events restore_checkpoint=true
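To replay historical events "in real time", a producer like the one in step 1 can pace itself by scaling the gaps between consecutive event timestamps. A minimal sketch, assuming a simple speedup factor (the project's actual pacing logic may differ):

```python
from datetime import datetime

def replay_delays(timestamps: list[str], speedup: float = 60.0) -> list[float]:
    """Seconds to sleep between successive events so historical
    timestamps are replayed at `speedup`x real time."""
    parsed = [datetime.fromisoformat(t) for t in timestamps]
    return [
        (later - earlier).total_seconds() / speedup
        for earlier, later in zip(parsed, parsed[1:])
    ]

delays = replay_delays(
    ["2025-01-01T00:00:00", "2025-01-01T00:01:00", "2025-01-01T00:03:00"],
    speedup=60.0,
)
# delays -> [1.0, 2.0]: at 60x, gaps of 1 and 2 minutes become 1 and 2 seconds
```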

5.2. Accessing UIs and Services

Each service stack exposes its web UI (e.g., Flink, Airflow, Superset, MinIO) on a localhost port; the exact ports are defined in the docker-compose files and the .env file.

5.3. Querying Data

  • Superset: Connect to the Spark Thrift Server using the connection string hive://spark-thrift-server:10000/default, then query the tables in SQL Lab and build dashboards.
    • Some pre-made Datasets and Charts are available in the superset/examples folder.
    • Instructions for using the Live Map are in the spark/code/zone-heatmap.ipynb notebook.
  • Spark Notebook: Access the Jupyter notebook at http://localhost:8888 to run interactive PySpark queries.

5.4. Orchestrating Batch Jobs

  • Use the Airflow UI to trigger DAGs that submit Spark jobs for batch processing tasks like populating dimension tables or running table compaction.

6. Makefile Commands

The Makefile provides a convenient interface for managing the platform.

  • make help: Shows a list of all available commands.
  • make up: Starts all services.
  • make down: Stops all services.
  • make destroy: Stops all services and removes their volumes (deletes all data).
  • make build: Builds all custom Docker images.
  • make logs: Tails the logs for all services.
  • make <service>-up: Starts a specific service stack (e.g., make kafka-up).
  • make <service>-logs container=<container_name>: Tails the logs for a specific container (e.g., make flink-logs container=taskmanager-1).
  • make <service>-reset: Destroys, rebuilds, and starts a specific service stack.
