This project demonstrates a near real-time data pipeline using Apache Pinot, Apache Superset, Apache Airflow, and Redpanda Kafka for ingesting and querying dimension and fact data.
Ensure you have the following installed:
- Docker and Docker Compose
- A web browser for accessing the services
Run the following command to build and start all services:
```shell
docker compose up --build -d
```

Once the containers are running, the services are available at:

- Airflow: http://localhost:8180 (login with `airflow` / `airflow`)
- Redpanda Console: http://localhost:8080
- Apache Pinot: http://localhost:9091
- Apache Superset: http://localhost:8088 (login with `admin` / `admin`)
- In Airflow, run each of the following DAGs once to generate dimension data:
  - `account_dim_generator`
  - `branch_dim_generator`
  - `customer_dim_generator`
  - `date_dim_generator`
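For reference, a dimension-generator DAG typically just writes synthetic rows to a CSV file. Below is a minimal standalone sketch of what `customer_dim_generator` might produce; the column names, sample values, and output path are illustrative assumptions, not taken from this project.

```python
import csv
import random
import uuid

def generate_customer_dim(path: str, n: int = 100) -> None:
    """Write n synthetic customer dimension rows to a CSV file.

    Columns are hypothetical -- the real DAG may use different ones.
    """
    first_names = ["Alice", "Bob", "Carol", "Dan"]
    last_names = ["Smith", "Jones", "Lee", "Brown"]
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["customer_id", "first_name", "last_name", "email"])
        for _ in range(n):
            first = random.choice(first_names)
            last = random.choice(last_names)
            writer.writerow([
                str(uuid.uuid4()),
                first,
                last,
                f"{first.lower()}.{last.lower()}@example.com",
            ])

generate_customer_dim("customer_dim.csv", n=10)
```

In the actual pipeline this logic would live inside an Airflow task so the DAG run produces the file for downstream loading.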
- Run the `transaction_facts_generator` DAG once.
- Navigate to Redpanda Console and confirm the `transaction_facts` topic has been created.
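The `transaction_facts_generator` DAG presumably serializes each synthetic transaction as JSON before producing it to the `transaction_facts` topic. Here is a hedged sketch of building one such message; all field names are assumptions about what the real generator emits.

```python
import json
import random
import time
import uuid

def build_transaction_message() -> bytes:
    """Build one synthetic transaction fact as a JSON-encoded Kafka value.

    Field names are hypothetical; the real DAG may differ.
    """
    record = {
        "transaction_id": str(uuid.uuid4()),
        "account_id": random.randint(1, 1000),
        "customer_id": random.randint(1, 1000),
        "branch_id": random.randint(1, 50),
        "transaction_amount": round(random.uniform(1.0, 5000.0), 2),
        "currency": random.choice(["USD", "EUR", "NIS"]),
        "transaction_ts": int(time.time() * 1000),
    }
    return json.dumps(record).encode("utf-8")

msg = build_transaction_message()
# A real producer task would then send it, e.g.:
# producer.produce("transaction_facts", value=msg)
```

Using an epoch-millisecond timestamp keeps the value directly usable as a Pinot time column.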
- Run the following Airflow DAGs in sequence:
  - `schema_dag`
  - `table_dag`
  - `load_dag`
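These DAGs presumably push a Pinot schema and a REALTIME table config to the Pinot controller's REST API (`/schemas` and `/tables`). The sketch below shows what such payloads might contain for the `transaction_facts` stream; every field name and broker address here is an assumption, not copied from the project.

```python
import json

# Hypothetical Pinot schema for the transaction_facts topic.
schema = {
    "schemaName": "transaction_facts",
    "dimensionFieldSpecs": [
        {"name": "transaction_id", "dataType": "STRING"},
        {"name": "account_id", "dataType": "INT"},
        {"name": "currency", "dataType": "STRING"},
    ],
    "metricFieldSpecs": [
        {"name": "transaction_amount", "dataType": "DOUBLE"},
    ],
    "dateTimeFieldSpecs": [
        {
            "name": "transaction_ts",
            "dataType": "LONG",
            "format": "1:MILLISECONDS:EPOCH",
            "granularity": "1:MILLISECONDS",
        }
    ],
}

# Hypothetical REALTIME table config consuming from Redpanda.
table_config = {
    "tableName": "transaction_facts",
    "tableType": "REALTIME",
    "segmentsConfig": {
        "timeColumnName": "transaction_ts",
        "schemaName": "transaction_facts",
        "replication": "1",
    },
    "tableIndexConfig": {
        "streamConfigs": {
            "streamType": "kafka",
            "stream.kafka.topic.name": "transaction_facts",
            "stream.kafka.broker.list": "redpanda:9092",
            "stream.kafka.consumer.type": "lowlevel",
        }
    },
}

# A DAG task would POST these to the controller, e.g.
# POST http://pinot-controller:9000/schemas and /tables
payloads = (json.dumps(schema), json.dumps(table_config))
```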
- Navigate to Apache Pinot, click on `Tables` and then `Query Console`, and verify that data has been consumed from Kafka.
- Navigate to Apache Superset at http://localhost:8088.
- Login with `admin` / `admin`.
- Connect a database:
  - Go to `+ -> Data -> Connect a database`.
  - Select `Apache Pinot` as the database.
  - Fill in the SQLAlchemy URI: `pinot://pinot-broker:8099/query?server=http://pinot-controller:9000`
  - Click `TEST CONNECTION` and make sure it reports "Looks good".
  - Press `Connect`.
- Go to `SQL Lab` > `SQL LAB`.
- Choose a table and view its schema with `SEE TABLE SCHEMA`.
- Run the following query in SQL Lab:

```sql
SELECT
  tf.*,
  CONCAT(cd.first_name, ' ', cd.last_name) AS full_name,
  email,
  phone_number,
  registration_date,
  branch_name,
  branch_address,
  city,
  state,
  zipcode,
  account_type,
  status,
  balance
FROM transaction_facts tf
LEFT JOIN account_dim ad ON tf.account_id = ad.account_id
LEFT JOIN customer_dim cd ON tf.customer_id = cd.customer_id
LEFT JOIN branch_dim bd ON tf.branch_id = bd.branch_id;
```
- Save the query results as a dataset:
  - Click `Save Dropdown` > `Save as New`.
  - Name the dataset `transaction_fact_combined`.
  - Click `SAVE & EXPLORE`.
- Bar Chart:
  - X-Axis: `branch_name`
  - Metrics: `transaction_amount (SUM)`
  - Row Limit: 10
  - Name the chart `Top 10 Profitable Branches` and save it to the dashboard.
- Big Number Chart:
  - Metric: `Count`
  - Name the chart `Total Records` and save it to the dashboard.
- Pie Chart (Currency Distribution):
  - Dimensions: `currency`
  - Metrics: `transaction_amount`
  - Name the chart `Currency Distribution` and save it to the dashboard.
- Pie Chart (Account Type Distribution):
  - Dimensions: `account_type`
  - Metrics: `transaction_amount`
  - Name the chart `Account Type Distribution`.
  - Under `CUSTOMIZE`, enable `SHOW TOTAL` and set the Currency Format to `NIS`.
  - Save it to the dashboard.
- Go to the dashboard, click `...` > `Edit`.
- Set the dashboard refresh rate to 10 seconds.
- In Airflow, run the `transaction_facts_generator` DAG to simulate new transactions.
- Navigate to Superset > Dashboard.
- Watch the values update dynamically as new data arrives from Kafka.
- For production, scale Kafka with more brokers to ensure reliability and high throughput.
- Improve how the CSVs are moved from Airflow to Pinot.
- Airflow: Orchestrates data pipelines (http://localhost:8180).
- Redpanda (Kafka): Real-time message broker (http://localhost:8080).
- Pinot: OLAP datastore for real-time analytics (http://localhost:9091).
- Superset: Data visualization and dashboarding (http://localhost:8088).