Welcome to the Data Mesh Hackathon! This workshop demonstrates how to build a modular, scalable, and collaborative data platform using modern open-source components.
This project uses the following tools, all running via Docker Compose:
| Component | Purpose |
|---|---|
| Apache Airflow | Data pipeline orchestration (ETL) |
| Trino | SQL query engine across MinIO/Iceberg |
| MinIO | S3-compatible object storage |
| Apache Iceberg | Table format for big data analytics (via Hive) |
| Hive Metastore | Catalog for Iceberg tables |
| Jupyter Notebook | Interactive analysis for data scientists |
| Apache Superset | Dashboards and visualizations for analysts |
Before running this project, make sure you have the following installed and configured on your machine:
- Operating System: macOS, Linux, or Windows
- Memory: minimum 8 GB of RAM allocated to Docker
| Tool | Version (Recommended) | Purpose |
|---|---|---|
| Docker | 20.10+ | Container runtime |
| Docker Compose | v2.0+ | Multi-container orchestration |
| Git | Stable version | Clone and manage the repository |
| Python (Optional) | 3.9+ | Run local scripts/debug steps |
```
[ Airflow ] → [ XLS Processing + Transform ]
      ↓
[ Trino ] → [ Write to Iceberg / MinIO ]
      ↓
[ Jupyter Notebook ] ← View & Analyze
      ↓
[ Superset ] ← Build Dashboards
```
```
finos-hackathon/
├── airflow/                 # Airflow DAGs and config
├── notebooks/               # Jupyter notebooks
├── dashboards/              # Superset dashboards (optional exports)
├── docker-compose.yml       # Compose file for all services
├── dags/
│   └── xls_etl_pipeline.py  # Airflow DAG to process XLS files
├── scripts/
│   └── transform.py         # Business logic for transformation
└── README.md                # This file
```
- Airflow DAG triggers on new XLS files.
- Applies data transformation and business rules (via a Python script).
- Extracts key fields and outputs a clean dataset.
- Writes the processed data to an Iceberg table on MinIO using Trino.
- Jupyter Notebook lets data scientists explore the processed data.
- Superset builds dashboards for analysts using the same Trino connection.
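A minimal sketch of what `dags/xls_etl_pipeline.py` could look like is shown below. It is illustrative only: the file paths, task names, and Trino catalog/schema are assumptions, not the actual hackathon implementation.

```python
# dags/xls_etl_pipeline.py -- illustrative sketch only; paths and names are assumptions
from datetime import datetime

import pandas as pd
from airflow import DAG
from airflow.operators.python import PythonOperator

RAW_XLS_PATH = "/opt/airflow/data/incoming/loans.xls"       # hypothetical landing path
CLEAN_PARQUET_PATH = "/opt/airflow/data/clean/loans.parquet"


def transform_xls(**_):
    # Read the raw XLS, apply the business rules from scripts/transform.py,
    # and stage a clean dataset for loading.
    df = pd.read_excel(RAW_XLS_PATH)
    # ... apply channel / buyer / loan-purpose mappings here ...
    df.to_parquet(CLEAN_PARQUET_PATH, index=False)


def load_to_iceberg(**_):
    # Write the staged data into an Iceberg table through Trino.
    # Catalog, schema, and the actual INSERT/CTAS statement are assumptions.
    from trino.dbapi import connect

    conn = connect(host="trino", port=8080, user="airflow",
                   catalog="iceberg", schema="loans")
    cur = conn.cursor()
    cur.execute("SELECT 1")  # placeholder for the real write statement


with DAG(
    dag_id="xls_etl_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule_interval=None,  # triggered when new XLS files arrive
    catchup=False,
) as dag:
    transform = PythonOperator(task_id="transform_xls", python_callable=transform_xls)
    load = PythonOperator(task_id="load_to_iceberg", python_callable=load_to_iceberg)
    transform >> load
```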
As part of the ETL pipeline, we standardize and clean incoming data using the following rules:
| Field | Raw Value | Transformed Value |
|---|---|---|
| **Channel** | R | Retail |
| | B | Broker |
| | C | Correspondent |
| | T | TPO Not Specified |
| | 9 | Not Available |
| **First Time Home Buyer** | Y | Yes |
| | N | No |
| **Loan Purpose** | P | Purchase |
| | C | Refinance - Cash Out |
| | N | Refinance - No Cash Out |
| | R | Refinance - Not Specified |
| | 9 | Not Available |
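As a hedged sketch (not the actual `scripts/transform.py`), the mappings above could be applied with plain dictionaries; the dictionary and column names here are assumptions:

```python
# Illustrative sketch of the business-rule mappings; names are assumptions.
CHANNEL_MAP = {"R": "Retail", "B": "Broker", "C": "Correspondent",
               "T": "TPO Not Specified", "9": "Not Available"}
FIRST_TIME_BUYER_MAP = {"Y": "Yes", "N": "No"}
LOAN_PURPOSE_MAP = {"P": "Purchase", "C": "Refinance - Cash Out",
                    "N": "Refinance - No Cash Out",
                    "R": "Refinance - Not Specified", "9": "Not Available"}


def apply_business_rules(df):
    """Replace raw codes with human-readable values (column names assumed)."""
    df["Channel"] = df["Channel"].map(CHANNEL_MAP).fillna(df["Channel"])
    df["First Time Home Buyer Indicator"] = (
        df["First Time Home Buyer Indicator"].map(FIRST_TIME_BUYER_MAP)
        .fillna(df["First Time Home Buyer Indicator"])
    )
    df["Loan Purpose"] = df["Loan Purpose"].map(LOAN_PURPOSE_MAP).fillna(df["Loan Purpose"])
    return df
```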
The following **key fields** are cleansed, extracted from the source, and persisted to an S3 bucket:
```python
REQUIRED_COLUMNS = [
"Reference Pool ID", "Loan Identifier", "Monthly Reporting Period", "Channel",
"Seller Name", "Servicer Name", "Master Servicer", "Original Interest Rate",
"Current Interest Rate", "Original UPB", "UPB at Issuance", "Current Actual UPB",
"Original Loan Term", "Origination Date", "First Payment Date", "Loan Age",
"Remaining Months to Legal Maturity", "Remaining Months To Maturity", "Maturity Date",
"Original Loan to Value Ratio (LTV)", "Original Combined Loan to Value Ratio (CLTV)",
"Number of Borrowers", "Debt-To-Income (DTI)", "Borrower Credit Score at Origination",
"Co-Borrower Credit Score at Origination", "First Time Home Buyer Indicator",
"Loan Purpose", "Property Type", "Number of Units", "Occupancy Status",
"Property State", "Metropolitan Statistical Area (MSA)", "Zip Code Short",
"Mortgage Insurance Percentage", "Amortization Type", "Prepayment Penalty Indicator",
"Interest Only Loan Indicator", "Interest Only First Principal And Interest Payment Date",
"Months to Amortization", "Current Loan Delinquency Status", "Loan Payment History",
"Modification Flag"
]
```
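A minimal sketch of how these columns might be selected and persisted to MinIO; the bucket name, object key, endpoint, and use of `pandas` with `s3fs` are assumptions about the local setup:

```python
# Illustrative sketch: keep only the required columns and persist to MinIO (S3-compatible).
# Bucket, key, endpoint, and credentials are assumptions for the local docker-compose stack.
import pandas as pd


def persist_clean_dataset(df: pd.DataFrame) -> None:
    clean = df[REQUIRED_COLUMNS]
    clean.to_parquet(
        "s3://warehouse/loans/clean_loans.parquet",  # hypothetical bucket/key
        index=False,
        storage_options={
            "key": "minioAdmin",
            "secret": "minio1234",
            "client_kwargs": {"endpoint_url": "http://minio:9000"},
        },
    )
```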
Clone the repository:

```bash
git clone https://github.com/your-org/data-mesh-hackathon.git
cd data-mesh-hackathon
```

This deployment requires an active internet connection to pull container images from the Quay image repository.
If your system does not have internet access, follow the Offline Deployment Guide for instructions on how to proceed without an internet connection.

Then start all services:

```bash
docker-compose up
```
This project supports fully offline deployment using pre-saved Docker images.
**Note:** The `docker_images.tar` file will be provided during the hackathon.
Run the following script to load the images and start the application:

```bash
./load_docker_images.sh
```

This script will:

- Load Docker images from `docker_images.tar`
- Start Docker Compose using only local images
- Avoid pulling from the internet or rebuilding any services
Alternatively, once the images are loaded, you can start the application manually with:

```bash
docker-compose up --no-build
```
This ensures the entire deployment runs without requiring an internet connection.
| Service | URL | Credentials (user / password) |
| --- | --- | --- |
| Airflow | [http://localhost:8080](http://localhost:8080) | airflow / airflow |
| Jupyter Notebook | [http://localhost:8888](http://localhost:8888) | N/A |
| Superset | [http://localhost:8088](http://localhost:8088) | admin / admin |
| MinIO Console | [http://localhost:9001](http://localhost:9001) | minioAdmin / minio1234 |
| Trino UI | [http://localhost:8081](http://localhost:8081) | Admin / N/A |
| Marquez Data Lineage UI | [http://localhost:3000](http://localhost:3000) | N/A |
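Once the services are up, you can sanity-check the stack from a Jupyter notebook by querying Trino directly. This is a hedged sketch using the `trino` Python client; the hostname/port choice and any catalog, schema, or table names are assumptions about the local setup:

```python
# Quick connectivity check from Jupyter; hostnames/ports are assumptions.
from trino.dbapi import connect

# Inside the docker-compose network Trino is typically reachable by service name;
# from the host machine, use localhost:8081 as listed in the table above.
conn = connect(host="trino", port=8080, user="admin")
cur = conn.cursor()
cur.execute("SHOW CATALOGS")
print(cur.fetchall())
```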