
(End-to-end architecture: Airflow orchestrates data movement, Terraform provisions AWS infra, MLflow tracks experiments, and Tableau visualizes results.)
This project demonstrates an end-to-end data engineering pipeline using the Inside Airbnb Los Angeles dataset (March 2025). The goal is to predict whether a listing will receive a high guest rating (≥ 4.5 stars) based on its metadata (e.g., room type, property type, host attributes).
The project is part of the WGU M.S. Data Analytics – Data Engineering specialization capstone.
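The prediction target described above can be sketched as a simple thresholding step. This is a minimal illustration, not the project's actual cleaning code: the column names `id` and `review_scores_rating` are assumed from the usual Inside Airbnb listings export, and the sample rows are invented for the example.

```python
import csv
import io
from typing import Iterator, Optional, Tuple

HIGH_RATING_THRESHOLD = 4.5  # threshold stated in this README


def is_high_rating(raw_score: str, threshold: float = HIGH_RATING_THRESHOLD) -> Optional[int]:
    """Return 1 or 0 for a rating string, or None when the rating is missing."""
    raw_score = (raw_score or "").strip()
    if not raw_score:
        return None  # a listing without reviews has no rating to label
    return int(float(raw_score) >= threshold)


def label_rows(csv_text: str, rating_column: str = "review_scores_rating") -> Iterator[Tuple[str, int]]:
    """Yield (listing id, label) pairs, skipping rows that have no rating."""
    for row in csv.DictReader(io.StringIO(csv_text)):
        label = is_high_rating(row.get(rating_column, ""))
        if label is not None:
            yield row["id"], label


# Invented sample rows; column names are assumptions, not taken from this repo.
SAMPLE = (
    "id,room_type,review_scores_rating\n"
    "1,Entire home/apt,4.8\n"
    "2,Private room,4.2\n"
    "3,Shared room,\n"
    "4,Private room,4.5\n"
)
labels = list(label_rows(SAMPLE))
```

Rows without a rating are dropped rather than labeled 0, since a missing score is not evidence of a low rating. Whether the project handles missing ratings this way is not stated here.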
- Airflow (Docker Compose) – Orchestrates the pipeline (ingestion → cleaning → model training → prediction).
- Terraform (AWS S3 + IAM) – Infrastructure-as-code for cloud storage.
- AWS S3 – Stores raw, cleaned, model, and prediction datasets.
- MLflow – Tracks model runs, parameters, and metrics.
- scikit-learn (Random Forest) – Trains the classification model.
- Tableau – Visualizes results.
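The stage order named in the Airflow bullet (ingestion → cleaning → model training → prediction) can be pictured as a small dependency graph. The sketch below uses only the Python standard library and is not Airflow code; the stage names come from the list above, while `run_pipeline` and the placeholder handlers are hypothetical.

```python
from graphlib import TopologicalSorter

# Each stage maps to the set of stages that must finish before it starts.
STAGE_DEPENDENCIES = {
    "ingestion": set(),
    "cleaning": {"ingestion"},
    "model_training": {"cleaning"},
    "prediction": {"model_training"},
}


def run_pipeline(stages, handlers):
    """Call each stage's handler after all of its upstream stages have run."""
    executed = []
    for stage in TopologicalSorter(stages).static_order():
        handlers[stage]()
        executed.append(stage)
    return executed


# Placeholder handlers that only record that they ran.
log = []
handlers = {
    name: (lambda n=name: log.append(f"ran {n}"))
    for name in STAGE_DEPENDENCIES
}
order = run_pipeline(STAGE_DEPENDENCIES, handlers)
```

In an Airflow DAG the same ordering would be expressed as task dependencies, with the scheduler handling retries and logging for each task.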
.
├── airflow/ # Airflow DAGs, plugins, include/ (runtime files ignored)
├── data/ # Local development data (ignored)
├── notebooks/ # Exploratory analysis (outputs/ckpts ignored)
├── terraform/ # Terraform IaC for AWS buckets
├── docker-compose.yml # Multi-service orchestration
├── Dockerfile # Custom Airflow image
├── sync_env_from_tf # Script to sync TF outputs → .env
├── .env.example # Placeholder environment file
└── README.md
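The `sync_env_from_tf` script listed in the tree above is not reproduced in this README. As a rough idea of what such a step might do, the sketch below converts the JSON printed by `terraform output -json` into `KEY=value` lines and merges them into existing `.env` text. The function names, the upper-casing rule, and the sample output name `raw_bucket` are assumptions made for illustration.

```python
import json
from typing import List


def tf_outputs_to_env_lines(tf_output_json: str) -> List[str]:
    """Convert `terraform output -json` text into sorted KEY=value lines."""
    outputs = json.loads(tf_output_json)
    lines = []
    for name, meta in sorted(outputs.items()):
        value = meta["value"]
        if not isinstance(value, str):
            value = json.dumps(value)  # keep lists, maps, and numbers as JSON text
        lines.append(f"{name.upper()}={value}")
    return lines


def merge_env(existing: str, new_lines: List[str]) -> str:
    """Overwrite keys already present in the .env text and append new ones."""
    updates = dict(line.split("=", 1) for line in new_lines)
    merged = []
    for line in existing.splitlines():
        key = line.split("=", 1)[0].strip()
        if key in updates:
            merged.append(f"{key}={updates.pop(key)}")
        else:
            merged.append(line)  # comments, blanks, and unrelated keys are kept
    merged.extend(f"{key}={value}" for key, value in updates.items())
    return "\n".join(merged) + "\n"


# Hypothetical Terraform output and .env contents, for illustration only.
SAMPLE_TF_JSON = (
    '{"raw_bucket": {"sensitive": false, "type": "string", '
    '"value": "example-raw-bucket"}}'
)
env_text = merge_env(
    "AWS_REGION=us-west-2\nRAW_BUCKET=\n",
    tf_outputs_to_env_lines(SAMPLE_TF_JSON),
)
```

Merging instead of overwriting keeps any hand-entered values, such as credentials, that Terraform does not manage.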
- Docker & Docker Compose
- Terraform
- AWS credentials with S3 access
Clone the repository:
    git clone https://github.com/<your-username>/<repo-name>.git
    cd <repo-name>
Set up environment:
    cp .env.example .env
    # Fill in AWS credentials
Provision infrastructure:
    cd terraform
    terraform init
    terraform apply
Sync Terraform outputs to .env (from the repository root, where the script lives):
    ./sync_env_from_tf
Start the services:
    docker compose up -d
Access the Airflow UI at: http://localhost:8080
Trigger the DAG: ``
Start MLflow:
    docker compose up -d mlflow
Access MLflow UI at: http://localhost:5000
Exported predictions can be visualized in Tableau dashboards.
- Key Finding: Listing characteristics such as `room_type`, `host_is_superhost`, and `property_type` significantly influence guest ratings.
- Model: Random Forest Classifier
- Accuracy: ~XX% (see MLflow for full metrics)
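For reference, the accuracy figure above is the share of listings whose predicted class matches the true class. The sketch below shows that calculation next to a majority-class baseline, which is a useful comparison if one class dominates the data. The labels are invented for the example and are not results from this project.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot compute accuracy of an empty sample")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def majority_baseline(y_true):
    """Accuracy obtained by always predicting the most common class."""
    if not y_true:
        raise ValueError("cannot compute a baseline for an empty sample")
    most_common_count = Counter(y_true).most_common(1)[0][1]
    return most_common_count / len(y_true)


# Invented labels: 1 = high rating (>= 4.5), 0 = otherwise.
y_true = [1, 0, 1, 1, 0]
y_pred = [1, 0, 0, 1, 0]
score = accuracy(y_true, y_pred)      # 4 of 5 predictions match
baseline = majority_baseline(y_true)  # 3 of 5 labels are class 1
```

A model is only informative to the extent that its accuracy exceeds this baseline.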
- Infrastructure codified in Terraform.
- Environment managed with Docker.
- Model tracking with MLflow.
- `.env.example` included for safe setup.
- `.env` and Terraform state files are git-ignored.
- Do not commit AWS credentials.
- If secrets are ever committed, rotate immediately.
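One way to follow the guidance above in pipeline code is to read credentials from the environment at runtime and fail fast when they are absent, instead of hardcoding them. The sketch below is a hypothetical helper, not code from this repository. `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` are the standard AWS environment variable names, and the values shown are fake.

```python
import os

REQUIRED_VARS = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")


def load_aws_credentials(environ=None):
    """Read AWS credentials from the environment, failing fast if any is unset."""
    environ = os.environ if environ is None else environ
    missing = [name for name in REQUIRED_VARS if not environ.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: environ[name] for name in REQUIRED_VARS}


def redact(secret, visible=4):
    """Mask a secret for log output, keeping only the last few characters."""
    if len(secret) <= visible:
        return "*" * len(secret)
    return "*" * (len(secret) - visible) + secret[-visible:]


# Fake values for illustration; real ones would come from .env via Docker Compose.
fake_env = {
    "AWS_ACCESS_KEY_ID": "AKIAEXAMPLE",
    "AWS_SECRET_ACCESS_KEY": "not-a-real-secret",
}
creds = load_aws_credentials(fake_env)
masked_key_id = redact(creds["AWS_ACCESS_KEY_ID"])
```

Passing the environment as a parameter keeps the helper easy to test without touching real credentials.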
- Experiment with additional models (XGBoost, LightGBM).
- Expand to other Airbnb markets for generalization.
- Automate Tableau dashboard refresh via Airflow.