This project showcases an ETL (Extract, Transform, Load) pipeline that integrates a MinIO S3 bucket for file ingestion, processes uploaded files, and stores the data in a PostgreSQL database. The entire system is containerized using Docker and features Grafana for visualization. Airflow is used to orchestrate the pipeline.
As a working example, the pipeline ingests a coffee dataset containing historical coffee production figures by country, offering insight into global coffee trends. It can also process data tracking the evolution of coffee prices over time, enabling comprehensive analysis and visualization of coffee market dynamics.
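To give a feel for the transform step, here is a minimal sketch of reshaping a wide per-country CSV (one column per year) into rows suitable for a relational table. The `Country`/year column layout is an assumption for illustration, not the dataset's exact schema:

```python
import csv
import io

def reshape_wide_to_long(csv_text):
    """Reshape a wide per-country CSV (one column per year) into
    (country, year, value) rows suitable for a relational table."""
    rows = []
    reader = csv.DictReader(io.StringIO(csv_text))
    for record in reader:
        country = record.pop("Country")
        for year, value in record.items():
            if value:  # skip empty cells
                rows.append((country, int(year), float(value)))
    return rows

# Hypothetical sample resembling Coffee_production.csv
sample = "Country,1990,1991\nBrazil,1500,1600\nColombia,800,\n"
print(reshape_wide_to_long(sample))
# [('Brazil', 1990, 1500.0), ('Brazil', 1991, 1600.0), ('Colombia', 1990, 800.0)]
```

The long format makes per-year aggregation and Grafana time-series queries straightforward.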
- Docker (see the installation wiki)
The project is divided into two Docker Compose stacks:
- Main stack: Includes MinIO, Grafana, a Python script to initialize the system, and the PostgreSQL database.
- Airflow stack: Contains Airflow and its internal services for orchestration.
1. Clone the repository:

   ```sh
   git clone https://github.com/zemendes1/ETL-Pipeline-Coffee-Data.git
   ```
2. Build and run the main stack:

   Build and start the main services (MinIO, PostgreSQL, and Grafana):

   ```sh
   docker compose build
   docker compose up
   ```
3. Build and run the Airflow stack:

   Navigate to the Airflow directory, set up its environment, and start the services:

   ```sh
   cd Airflow
   mkdir -p ./dags ./logs ./plugins ./config
   echo -e "AIRFLOW_UID=$(id -u)" > .env
   docker compose up airflow-init
   docker compose up
   ```

   NOTE: If using a platform other than linux/amd64, you may need to modify docker-compose.yml to match your architecture. Refer to Airflow's official documentation for details.
4. Configure MinIO:

   - Access the MinIO Web UI, which should be located at http://127.0.0.1:9001/.
   - Log in with the default credentials (username: ROOTNAME, password: CHANGEME123).
   - Create an S3 bucket named coffee-dataset-example.
   - Set up an event notification to send data to the PostgreSQL database.

     Figure 1: Setting up an event destination from MinIO to PostgreSQL.

     Figure 2: Configuring event notifications for file uploads.

   - Upload the example files (Coffee_production.csv and Coffee_domestic_consumption.csv) from the Example-CSV-Files/Test-Data folder to the bucket.
   - Restart the Docker services and verify that the data has been ingested into the coffee_production and coffee_domestic_consumption database tables.
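As a rough illustration of how an uploaded file can map to its destination table, the sketch below derives a table name from an object key (e.g. Coffee_production.csv → coffee_production). This mirrors the naming seen above; the project's actual ingestion script may use different logic:

```python
import os

def table_name_for(object_key):
    """Derive a PostgreSQL table name from an uploaded object key,
    e.g. 'Coffee_production.csv' -> 'coffee_production'.
    Illustrative convention only; check the project's Python script
    for the real mapping."""
    base = os.path.basename(object_key)   # strip any prefix/folders
    stem, _ = os.path.splitext(base)      # drop the .csv extension
    return stem.lower()

print(table_name_for("Coffee_production.csv"))            # coffee_production
print(table_name_for("Coffee_domestic_consumption.csv"))  # coffee_domestic_consumption
```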
5. Configure Airflow:

   - Access the Airflow Web UI, which should be located at http://127.0.0.1:8080/.
   - Log in with the default credentials (username: airflow, password: airflow).
   - Create connections to the following:
     - MinIO S3 bucket
     - PostgreSQL database

     Figure 3: Connection setup from Airflow to MinIO.

     Figure 4: Connection setup from Airflow to PostgreSQL.

     Figure 5: Connections overview in Airflow.

   - Enable the DAG file_upload_processor in the Airflow dashboard. This DAG runs every 5 minutes to monitor new file uploads.
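Conceptually, each 5-minute DAG run boils down to diffing the current bucket listing against the keys already processed. A stdlib-only sketch of that check (the real DAG presumably uses Airflow's S3 hooks; the function name here is hypothetical):

```python
def find_new_uploads(current_keys, processed_keys):
    """Return bucket keys not yet processed, in a stable order.
    In the DAG, current_keys would come from an S3 listing and
    processed_keys from the database or prior task state."""
    return sorted(set(current_keys) - set(processed_keys))

new = find_new_uploads(
    ["Coffee_production.csv", "Coffee_prices.csv"],
    ["Coffee_production.csv"],
)
print(new)  # ['Coffee_prices.csv']
```

Only the keys returned here would be downloaded, transformed, and loaded into PostgreSQL.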
6. Configure Grafana: Lastly, we can create a simple dashboard in Grafana to proudly display our ETL and testing data.

   - Access the Grafana Web UI, which should be located at http://127.0.0.1:3111/.
   - Create a data source connection to the PostgreSQL database.

     Figure 7: Connecting Grafana to PostgreSQL.

   - Design a dashboard to visualize your ETL pipeline data. Refer to the dashboard folder in this repository for examples.
- MinIO S3 Bucket: Manages file uploads and triggers notifications.
- Airflow DAG: Automates ETL processing.
- PostgreSQL Database: Stores transformed data.
- Grafana: Visualizes pipeline performance and data insights.
- Ensure Docker is installed and running correctly.
- Verify all services are accessible at the specified URLs.
- Review the logs (./logs) for detailed error information.
Feel free to fork this repository and submit pull requests. Contributions are welcome!
This project is licensed under the MIT License.


