Goals of this project:
- Automate ETL and ELT with Airflow.
- Demonstrate the value of applying ML to real-world jobs.
- Interpret plain data through intuitive charts and dashboards.
In this project, we use Airflow to orchestrate three fully functional pipelines. They are:
- ETL pipeline
- ELT pipeline
- Chart pipeline
We also build miscellaneous functions as downstream tasks of the ETL/ELT pipelines. These functions cover three topics: ML model training, data visualization, and printing a formatted table. We also add a Metabase case, as an extension of ELT and an enhancement of data visualization.
The ETL pipeline will:
- Extract data from a local CSV file into a Pandas DataFrame.
- Transform the Pandas DataFrame.
- Load the transformed Pandas DataFrame into a DuckDB database.
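The real pipeline uses pandas and DuckDB; purely to illustrate the same extract-transform-load shape, here is a minimal stdlib sketch with `sqlite3` standing in for DuckDB (file contents, column names, and the transform are illustrative, not the project's actual code):

```python
import csv
import io
import sqlite3

def etl(csv_text: str, conn: sqlite3.Connection) -> None:
    # Extract: parse the CSV into rows (pandas.read_csv in the real DAG).
    # The winequality CSV uses ";" as its delimiter.
    rows = list(csv.DictReader(io.StringIO(csv_text), delimiter=";"))
    # Transform: cast fields to numeric types before loading
    # (done on the DataFrame in the real DAG).
    records = [(float(r["alcohol"]), int(r["quality"])) for r in rows]
    # Load: write the transformed records into the database (DuckDB in the real DAG).
    conn.execute("CREATE TABLE IF NOT EXISTS wine_data (alcohol REAL, quality INTEGER)")
    conn.executemany("INSERT INTO wine_data VALUES (?, ?)", records)

sample = "alcohol;quality\n9.4;5\n10.8;7\n"
con = sqlite3.connect(":memory:")
etl(sample, con)
```

The key point is the ordering: the data is transformed in Python before it ever reaches the database.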
The ETL pipeline is followed by a branch task, which decides which downstream task will be triggered.
- One possible choice is the ML task group, consisting of a data preprocessing task, a model selection task, and a model training task. The trained model can predict the alcohol index of a wine from residual sugar, pH, and other chemical or physical indices.
- The other choice is a report task, named "print_loaded_wine". It prints basic statistics of alcohol content for each quality level to the logs.
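The branch decision itself can be sketched as a plain function; in the DAG this would be the body of an Airflow `@task.branch` callable that returns the task_id to follow. The task IDs below are hypothetical stand-ins, not necessarily the project's real ones, and `ml_sample_count_threshold` is the DAG parameter mentioned later in this README:

```python
def choose_branch(high_quality_sample_count: int, ml_sample_count_threshold: int) -> str:
    # Route to the ML task group only when enough samples are available;
    # otherwise fall back to the report task.
    if high_quality_sample_count >= ml_sample_count_threshold:
        return "ml_task_group.preprocess_data"  # hypothetical group/task id
    return "print_loaded_wine"                  # the report task from the text
```

Airflow then skips every downstream task whose id the branch callable did not return.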
The ELT pipeline will:
- Extract data from a local CSV file into a Pandas DataFrame (shared with the ETL pipeline).
- Load the Pandas DataFrame into a PostgreSQL database.
- Transform the data inside PostgreSQL.
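The difference from ETL is that the transform happens inside the database. A minimal sketch of that shape, with `sqlite3` standing in for PostgreSQL (table and column names are illustrative):

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Load: raw rows go into the database first, untransformed.
con.execute("CREATE TABLE wine_raw (alcohol TEXT, quality TEXT)")
con.executemany("INSERT INTO wine_raw VALUES (?, ?)",
                [("9.4", "5"), ("10.8", "7"), ("11.2", "7")])
# Transform: done in SQL, inside the database (PostgreSQL in the real DAG).
con.execute("""
    CREATE TABLE wine_data AS
    SELECT CAST(alcohol AS REAL) AS alcohol,
           CAST(quality AS INTEGER) AS quality
    FROM wine_raw
""")
```

Pushing the transform into SQL lets the database engine do the heavy lifting, which is the usual motivation for ELT over ETL.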
The ELT pipeline is followed by the "chart_kde" task. This task computes the kernel density estimate (KDE) of alcohol content for each quality level from the ELT output, and draws the result as a chart named "alcohol_kde.pdf".
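Purely to illustrate what a Gaussian KDE computes (the task itself would typically use a library routine such as `scipy.stats.gaussian_kde`, and matplotlib to save the PDF), here is a numpy sketch with an arbitrarily chosen bandwidth:

```python
import numpy as np

def gaussian_kde_1d(samples, grid, bandwidth=0.5):
    # Sum a Gaussian kernel centred on each sample, evaluated on `grid`,
    # and normalise so the estimate integrates to 1.
    samples = np.asarray(samples, dtype=float)
    grid = np.asarray(grid, dtype=float)
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z**2) / np.sqrt(2.0 * np.pi)
    return kernels.sum(axis=1) / (samples.size * bandwidth)

grid = np.linspace(5.0, 16.0, 1101)  # a plausible alcohol (% vol) range
density = gaussian_kde_1d([9.4, 10.8, 11.2], grid)
```

In the real task, one such curve is computed per quality level and all curves are drawn into "alcohol_kde.pdf".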
To get the big picture, we merge the above tasks into one DAG named "ETL_ELT_wine".
Metabase accesses the PostgreSQL database as a data source, for further data analysis or BI jobs. For example, we visualise the wine quality distribution with the following steps.
Steps:
- Add `winequalitydb` as a Metabase data source.
- Select `winequalitydb` from Databases.
- Click on `Wine Data` to view the table contents.
- Select `>_ SQL query` from the `+ New` drop-down list at the top right of the screen.
- Confirm or set `winequalitydb` as the selected database, on the left side of the screen.
- Put the following SQL in the editor.

```sql
SELECT
  quality,
  COUNT(*) AS "count"
FROM
  wine_data
GROUP BY
  quality
ORDER BY
  quality ASC
```

- Click on `Run query` or press the shortcut `Ctrl + Enter`.
- Click on `Visualization` at the bottom left of the screen.
- Double-click the `Bar` icon.
- Set the `Bar` options:
  - X-axis: quality; add "Class" as a prefix.
  - Y-axis: Count
- Click `Done` at the bottom.
Then save this as a new question and add it to a dashboard. Following similar steps, create a Pie chart and add it to the same dashboard.
The results look like:
The chart pipeline will:
- Use a branch task to decide which extraction task will be triggered.
- Extract data from PostgreSQL or DuckDB, according to the branch task's return value.
- Draw a chart with the extracted data.
This pipeline is a specific use case of the Airflow Datasets feature: it is triggered automatically after the datasets it depends on are updated. To emphasize its relation to the ETL and ELT pipelines above, we named it "ETL_ELT_wine_downstream".
A successful "ETL_ELT_wine_downstream" run will create a chart like:
This project stands on the shoulders of giants:
- Airflow
- pandas
- numpy
- scikit-learn
- mlflow
- duckdb
- postgresql
- adbc-driver-postgresql
- Metabase
- matplotlib
Note: Create a PostgreSQL database named `winequalitydb`, and set the URI according to your real configuration, before running the DAGs.
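One way to create the database, assuming a local PostgreSQL server and the standard client tools on your PATH (adjust user and host to your configuration):

```shell
createdb winequalitydb
# or equivalently, via psql:
psql -c 'CREATE DATABASE winequalitydb;'
```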
Start the MLflow Tracking Server with the following command.

```shell
mlflow server --host 127.0.0.1 --port 5000
```

Then access the MLflow UI at localhost:5000.
Start Airflow with the following command.

```shell
airflow standalone
```

After Airflow has started, access the Airflow UI at localhost:8080.
Filter DAGs by tag wine_quality.
- Trigger DAGs via the Airflow UI:

  Manually trigger the `ETL_ELT_wine` DAG by clicking on the `Trigger DAG` button on the right side of the DAG. Then confirm or set the `high_quality_threshold` and `ml_sample_count_threshold` values. Next, click on the `Trigger` button at the bottom left of the window. The `ETL_ELT_wine_downstream` DAG will be triggered automatically after the `ETL_ELT_wine` DAG updates its datasets.

- Trigger DAGs via the Airflow CLI:

  Run the following command from a terminal.

  ```shell
  airflow dags trigger ETL_ELT_wine
  ```
Observe the DAGs run according to the dependencies set by the Datasets feature. The following screenshots clearly show the relationship between the two DAGs.
ETL_ELT_wine is running.
ETL_ELT_wine_downstream is triggered.
ETL_ELT_wine_downstream finished. ETL_ELT_wine is still running.
ETL_ELT_wine finished.
A successful ETL_ELT_wine DAG run looks like:
A successful ETL_ELT_wine_downstream DAG run looks like:
MLflow, as part of an ETL_ELT_wine DAG run, looks like:
This project contains the following files and folders:
- `dags`:
  - `etl_elt_wine.py`: a DAG performing ETL and ELT.
  - `etl_elt_wine_downstream.py`: a DAG to draw a chart report.
- `include`:
  - `utils.py`: contains utility functions to support ETL, ELT, ML, and other jobs.
- `data`:
  - `winequality-white.csv`: source data.
- `images`: contains a series of images used in this README.
- `requirements.txt`: required Python packages.
- `README.md`: this Readme.
- This project implements ETL, ELT, and downstream tasks with Airflow. It deals with practical problems by applying the XCom, Datasets, and dynamic task mapping features of Airflow. Both DAGs are written in TaskFlow style rather than the traditional operator style.
- Highlights the potential value of integrating ML applications into real-world jobs.
- Demonstrates how to use Metabase for data analysis and visualization.
















