
ML Feature Pipeline & Data Quality Platform

A comprehensive data platform for ML feature engineering with data quality validation using Great Expectations, dbt transformations, and Apache Airflow orchestration.

Overview

This project implements:

  • Extract & Load pipelines from API/DB sources
  • Data quality validation with Great Expectations
  • dbt transformations (staging → intermediate → features)
  • Feature store tables with point-in-time correctness
  • Apache Airflow orchestration
  • Comprehensive monitoring and observability
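To illustrate the kind of rules the raw-zone quality gates enforce, here is a hedged pure-Python sketch that mirrors two typical Great Expectations expectations (non-null keys, non-negative amounts). The column names and bounds are hypothetical illustrations, not the project's actual expectation suite:

```python
# Hypothetical sketch of the checks a raw-zone Great Expectations
# suite encodes; field names and bounds are assumptions.

def validate_raw_orders(rows):
    """Return (success, failures) for a batch of raw order records."""
    failures = []
    for i, row in enumerate(rows):
        # mirrors expect_column_values_to_not_be_null("order_id")
        if row.get("order_id") is None:
            failures.append((i, "order_id is null"))
        # mirrors expect_column_values_to_be_between("amount", min_value=0)
        amount = row.get("amount")
        if amount is not None and amount < 0:
            failures.append((i, "amount below 0"))
    return (not failures, failures)

batch = [
    {"order_id": "o-1", "amount": 42.0},
    {"order_id": None, "amount": -5.0},
]
ok, errs = validate_raw_orders(batch)  # ok is False: row 1 fails twice
```

In the real pipeline these rules live in expectation suites and run against whole batches in the "Validate Raw" and "Validate Features" stages.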

Architecture

[Sources: API, DB, Files]
        |
        v
+-------------------+
| Extract & Load    |  Python/Spark -> Raw zone
+--------+----------+
         |
         v
+-------------------+
| Validate Raw      |  Great Expectations
+--------+----------+
         |
         v
+---------------------+
| dbt Transformations |  staging -> intermediate -> features
+--------+------------+
         |
         v
+-------------------+
| Feature Build     |  User features, Order features (incremental)
+--------+----------+
         |
         v
+-------------------+
| Validate Features |  Great Expectations
+--------+----------+
         |
         v
[Output: Feature store tables]
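Point-in-time correctness means each training example sees only feature values that were already effective at that example's timestamp, so no future data leaks into training. A minimal sketch of the as-of lookup (names hypothetical; the actual pipeline performs this join in dbt/SQL):

```python
from bisect import bisect_right

def as_of_lookup(feature_history, as_of):
    """Return the latest feature value with effective_ts <= as_of.

    feature_history: list of (effective_ts, value) pairs sorted by
    timestamp (ISO date strings compare correctly as strings).
    Rows that become effective after `as_of` are never visible,
    which prevents training-time leakage.
    """
    timestamps = [ts for ts, _ in feature_history]
    idx = bisect_right(timestamps, as_of)
    if idx == 0:
        return None  # no feature value existed yet at `as_of`
    return feature_history[idx - 1][1]

history = [("2024-01-01", 3), ("2024-01-02", 5), ("2024-01-03", 9)]
```

For example, a lookup as of 2024-01-02 returns 5, and a lookup before any history exists returns None.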

Quick Start

  1. Install dependencies:
pip install -r requirements.txt
  2. Initialize Great Expectations:
great_expectations init
  3. Run extract:
python pipelines/extract/extract_orders_api.py \
  --from-date 2024-01-01 \
  --to-date 2024-01-02 \
  --output-path ./data/raw/orders
  4. Run dbt:
cd dbt_project
dbt run
  5. Build features:
python pipelines/features/build_user_features.py --as-of-date 2024-01-02
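Conceptually, the feature-build step in the last command is an as-of-date aggregation: only orders placed on or before `--as-of-date` contribute to a user's features. A hedged pure-Python sketch (field names are assumptions; the real script reads from and writes to the warehouse incrementally):

```python
from collections import defaultdict

def build_user_features(orders, as_of_date):
    """Aggregate per-user order features using only data visible
    on or before `as_of_date` (ISO date strings)."""
    feats = defaultdict(lambda: {"order_count": 0, "total_amount": 0.0})
    for o in orders:
        if o["order_date"] <= as_of_date:  # point-in-time filter
            f = feats[o["user_id"]]
            f["order_count"] += 1
            f["total_amount"] += o["amount"]
    return dict(feats)

orders = [
    {"user_id": "u1", "order_date": "2024-01-01", "amount": 10.0},
    {"user_id": "u1", "order_date": "2024-01-03", "amount": 20.0},
    {"user_id": "u2", "order_date": "2024-01-02", "amount": 5.0},
]
# As of 2024-01-02, u1's second order is not yet visible.
features = build_user_features(orders, "2024-01-02")
```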

License

MIT License

Author

Mehdi Jahani
