divakaivan/transaction-stream-data-pipeline

Project overview

(Project architecture diagram: transactions_stream_project_diagram)

  • Stripe: Stripe's API is used as the source of realistic, generated transaction data (a producer sketch follows this list).

  • Apache Kafka: Stripe transaction data is streamed into Apache Kafka, which handles the data streams and ensures they are processed in real time.

  • Apache ZooKeeper: ZooKeeper runs alongside Kafka to manage and coordinate the Kafka brokers, maintaining configuration information and providing naming, synchronization, and group services.

  • PySpark: The data from Kafka is processed with PySpark Structured Streaming, transforming individual transaction events into rows fit for a database (see the streaming sketch after this list).

  • Forex API: GBP/x exchange rates are fetched from an online API and refreshed every 24 hours.

  • PostgreSQL: After processing, the data is stored in PostgreSQL.

  • dbt (Data Build Tool): dbt manages and transforms the data within PostgreSQL, splitting it into dimension and fact tables.

  • Grafana: Finally, the data stored in PostgreSQL is visualized using Grafana.
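
As a rough illustration of the ingestion side, the sketch below creates test-mode Stripe PaymentIntents and publishes a few of their fields to a Kafka topic. The topic name (transactions), the selected fields, the broker address, and the send interval are illustrative assumptions, not taken from this repository.

```python
# Minimal sketch of the Stripe -> Kafka producer side; names are illustrative.
import json
import os
import time

import stripe
from kafka import KafkaProducer

stripe.api_key = os.environ["STRIPE_API_KEY"]  # test-mode key from .env

producer = KafkaProducer(
    bootstrap_servers="kafka:9092",  # assumed broker address
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

while True:
    # Create a test-mode PaymentIntent so Stripe generates a realistic event.
    intent = stripe.PaymentIntent.create(
        amount=1000, currency="gbp", payment_method_types=["card"]
    )
    producer.send("transactions", {
        "id": intent.id,
        "amount": intent.amount,
        "currency": intent.currency,
        "created": intent.created,
    })
    time.sleep(1)  # stay well under Stripe's test-mode rate limit
```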
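On the processing side, a minimal PySpark Structured Streaming sketch could read the same topic, flatten the JSON payload into typed columns, and append each micro-batch to PostgreSQL. The schema, table name (raw_transactions), hosts, and credentials below are again assumptions for illustration, not this repo's actual code.

```python
# Minimal sketch of the Kafka -> PySpark -> PostgreSQL step.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, LongType

spark = SparkSession.builder.appName("transaction-stream").getOrCreate()

# Assumed shape of one transaction event (hypothetical fields).
schema = StructType([
    StructField("id", StringType()),
    StructField("amount", LongType()),
    StructField("currency", StringType()),
    StructField("created", LongType()),
])

raw = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "kafka:9092")
    .option("subscribe", "transactions")
    .load()
)

# Kafka delivers bytes; decode and flatten the JSON payload into columns.
rows = (
    raw.select(F.from_json(F.col("value").cast("string"), schema).alias("t"))
    .select("t.*")
    .withColumn("created_at", F.to_timestamp(F.from_unixtime("created")))
)

def write_batch(df, epoch_id):
    # Each micro-batch is appended to Postgres with the batch JDBC writer.
    (df.write
       .format("jdbc")
       .option("url", "jdbc:postgresql://postgres:5432/transactions")
       .option("dbtable", "raw_transactions")
       .option("user", "postgres")       # placeholder credentials
       .option("password", "postgres")
       .mode("append")
       .save())

query = rows.writeStream.foreachBatch(write_batch).start()
query.awaitTermination()
```

foreachBatch is used here because Spark's streaming writer has no native JDBC sink; wrapping the batch JDBC writer per micro-batch is the usual workaround.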

Data model

(Data model diagram: transactions_stream_data_model)

dbt documentation

Link to the docs

dbt lineage

(dbt lineage graph image)

Visualisation

(Grafana dashboard image)

Considerations for improvements

  • add PySpark tests (a minimal test sketch is shown below)
  • use an orchestrator
  • use more data
    • Spark might be overkill given the current data volume, but I wanted to learn how to set up Spark Streaming for cases where the data is much larger
    • more data would also allow for a better Grafana visualisation
    • maybe find an alternative source of transaction data, because the Stripe API is rate-limited to 25 requests per second
    • also, many of the generated values in a transaction from the Stripe API are null
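
As an illustration of the first point, a PySpark transformation that is factored into a plain function can be unit-tested against a local SparkSession. The parse_transactions function and its schema below are hypothetical stand-ins for this repo's actual transformation, shown only to sketch the test pattern.

```python
# Sketch of a pytest-based PySpark test; parse_transactions is hypothetical.
import pytest
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, LongType

SCHEMA = StructType([
    StructField("id", StringType()),
    StructField("amount", LongType()),
    StructField("currency", StringType()),
])

def parse_transactions(df):
    """Decode the Kafka value column into typed transaction columns."""
    return (
        df.select(F.from_json(F.col("value").cast("string"), SCHEMA).alias("t"))
          .select("t.*")
    )

@pytest.fixture(scope="session")
def spark():
    # A small local session is enough for transformation tests.
    return SparkSession.builder.master("local[1]").appName("tests").getOrCreate()

def test_parse_transactions(spark):
    df = spark.createDataFrame(
        [('{"id": "txn_1", "amount": 100, "currency": "gbp"}',)], ["value"]
    )
    out = parse_transactions(df).collect()
    assert out[0]["id"] == "txn_1"
    assert out[0]["amount"] == 100
    assert out[0]["currency"] == "gbp"
```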

Setup

  1. git clone https://github.com/divakaivan/transaction-stream-data-pipeline.git
  2. Rename sample.env to .env and fill in the necessary environment variables
  3. Type make in the terminal to see the setup options:
Usage: make [option]

Options:
  help                 Show this help message
  build                Build docker services
  start                Start docker services (detached mode)
  stop                 Stop docker services
  dbt-test             Run dbt with LIMIT 100
  dbt-full             Run dbt with full data

About

Transaction processing & visualisation pipeline using PySpark Streaming
