A comprehensive data platform for ML feature engineering with data quality validation using Great Expectations, dbt transformations, and Apache Airflow orchestration.
This project implements:
- Extract & Load pipelines from API/DB sources
- Data quality validation with Great Expectations
- dbt transformations (staging → intermediate → features)
- Feature store tables with point-in-time correctness
- Apache Airflow orchestration
- Comprehensive monitoring and observability
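Point-in-time correctness means each training row may only see feature values computed at or before that row's event timestamp, never later ones. A minimal pure-Python sketch of such an as-of lookup (the function and field names here are illustrative, not the project's actual API):

```python
from bisect import bisect_right
from datetime import datetime


def as_of_lookup(history, as_of):
    """Return the latest feature snapshot with timestamp <= as_of.

    `history` is a list of (timestamp, value) tuples sorted by timestamp.
    Returns None if no snapshot existed yet at `as_of`.
    """
    timestamps = [ts for ts, _ in history]
    idx = bisect_right(timestamps, as_of)
    return history[idx - 1][1] if idx else None


# Feature snapshots for one user, sorted by time.
history = [
    (datetime(2024, 1, 1), {"order_count": 1}),
    (datetime(2024, 1, 5), {"order_count": 3}),
]

# A label dated Jan 3 must see the Jan 1 snapshot, not the later Jan 5 one.
print(as_of_lookup(history, datetime(2024, 1, 3)))   # {'order_count': 1}
print(as_of_lookup(history, datetime(2023, 12, 31)))  # None (no snapshot yet)
```

The same rule is what the feature store tables enforce at scale: joins are keyed on (entity, as-of timestamp) rather than on the entity alone.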
```
[Sources: API, DB, Files]
           |
           v
+---------------------+
|   Extract & Load    |  Python/Spark -> Raw zone
+----------+----------+
           |
           v
+---------------------+
|    Validate Raw     |  Great Expectations
+----------+----------+
           |
           v
+---------------------+
| dbt Transformations |  staging -> intermediate -> features
+----------+----------+
           |
           v
+---------------------+
|    Feature Build    |  User features, order features (incremental)
+----------+----------+
           |
           v
+---------------------+
|  Validate Features  |  Great Expectations
+----------+----------+
           |
           v
[Output: Feature store tables]
```
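Both validation stages run Great Expectations suites; conceptually, each expectation is a predicate over a column plus a pass/fail result, and any failure blocks the downstream step. A hand-rolled sketch of that idea without the library (function names mimic the Great Expectations style but are defined here, and the sample rows are invented):

```python
def expect_column_values_to_not_be_null(rows, column):
    """Fail if any row has a null in `column` (GE-style check, hand-rolled)."""
    bad = [i for i, row in enumerate(rows) if row.get(column) is None]
    return {"success": not bad, "unexpected_rows": bad}


def expect_column_values_to_be_between(rows, column, min_value, max_value):
    """Fail if any value in `column` falls outside [min_value, max_value]."""
    bad = [i for i, row in enumerate(rows)
           if not (min_value <= row[column] <= max_value)]
    return {"success": not bad, "unexpected_rows": bad}


raw_orders = [
    {"order_id": "a1", "amount": 25.0},
    {"order_id": None, "amount": 12.5},   # missing key -> should fail
    {"order_id": "a3", "amount": -3.0},   # negative amount -> should fail
]

results = [
    expect_column_values_to_not_be_null(raw_orders, "order_id"),
    expect_column_values_to_be_between(raw_orders, "amount", 0, 10_000),
]

# The orchestrator would fail the run if any expectation fails.
run_passed = all(r["success"] for r in results)
print("validation passed" if run_passed else "validation failed")
```

In the pipeline itself this gate runs twice: once on the raw zone before dbt, and again on the built feature tables before they are published.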
- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- Initialize Great Expectations:

  ```bash
  great_expectations init
  ```

- Run the extract:

  ```bash
  python pipelines/extract/extract_orders_api.py \
    --from-date 2024-01-01 \
    --to-date 2024-01-02 \
    --output-path ./data/raw/orders
  ```

- Run dbt:

  ```bash
  cd dbt_project
  dbt run
  ```

- Build features:

  ```bash
  python pipelines/features/build_user_features.py --as-of-date 2024-01-02
  ```

MIT License
Mehdi Jahani