Vacation Rentals ML Pipeline

Production-Grade ML System for Location Retrieval

End-to-end ML pipeline implementing contextual multi-armed bandits for vacation rental location retrieval, inspired by Airbnb's RL approach.

🎯 Overview

This project demonstrates how to build a production-ready machine learning system that dynamically determines relevant search areas for vacation rental queries using:

Deep Learning: PyTorch Lightning neural network with custom loss function
Data Engineering: dbt transformations with DuckDB
Orchestration: Apache Airflow (via Astro CLI)
Experiment Tracking: MLflow for model lifecycle management
Reinforcement Learning: Monte Carlo Dropout + UCB exploration for contextual bandits
CI/CD: GitHub Actions for automated testing

🚀 Quick Start

# 1. Install Astro CLI
brew install astro  # macOS; on Linux: curl -sSL install.astronomer.io | sudo bash -s

# 2. Start services (Airflow + MLflow)
astro dev start

# 3. Access UIs
# Airflow: http://localhost:8080 (admin/admin)
# MLflow:  http://localhost:5000

# 4. Prepare data (download Inside Airbnb data)
cd data/raw
wget http://data.insideairbnb.com/germany/bv/munich/2024-09-24/data/listings.csv.gz
wget http://data.insideairbnb.com/germany/bv/munich/2024-09-24/data/reviews.csv.gz
cd ../..

# 5. Run pipelines
astro dev run dags trigger data_pipeline
astro dev run dags trigger ml_training_pipeline

# 6. View results
open http://localhost:5000  # MLflow experiments (macOS; use xdg-open on Linux)

See QUICKSTART.md for detailed setup instructions.

📐 Architecture

┌─────────────────────────────────────────┐
│  Apache Airflow 2.8 (Orchestration)    │
│         via Astro CLI                   │
└─────────────────────────────────────────┘
         │           │            │
         ▼           ▼            ▼
    ┌───────┐  ┌──────────┐  ┌────────┐
    │  dbt  │  │ PyTorch  │  │ MLflow │
    │ 1.7   │  │ Light.   │  │  2.10  │
    └───────┘  └──────────┘  └────────┘
         │           │            │
         └───────────┴────────────┘
                     │
         ┌───────────────────────┐
         │ DuckDB 1.0 + Parquet  │
         └───────────────────────┘

Data Flow: Raw CSV → DuckDB → dbt → Parquet → PyTorch → MLflow

🎯 Key Features

Orchestration

✅ Automated pipelines with Apache Airflow
✅ Scheduled execution and monitoring
✅ Error handling and retries
✅ Task dependencies and parallelism

Data Engineering

✅ SQL-based transformations with dbt
✅ Data quality tests
✅ Incremental models
✅ Documentation generation

Machine Learning

✅ PyTorch Lightning for scalable training
✅ Custom loss function (containment + area + validity)
✅ Monte Carlo Dropout for uncertainty estimation
✅ UCB exploration strategy
✅ Early stopping and checkpointing

MLOps

✅ MLflow experiment tracking
✅ Model registry and versioning
✅ Artifact storage
✅ Hyperparameter logging

DevOps

✅ Docker-based infrastructure
✅ GitHub Actions CI/CD
✅ Automated testing (pytest, dbt test)
✅ Code quality checks (black, isort, flake8, mypy)

📁 Project Structure

vacation_rentals/
├── airflow/              # Orchestration
│   └── dags/
│       ├── data_pipeline_dag.py       # dbt transformations
│       └── ml_training_dag.py         # PyTorch training
│
├── dbt/                  # Data transformation
│   ├── models/
│   │   ├── staging/      # Clean raw data
│   │   ├── features/     # Feature engineering
│   │   └── exports/      # Parquet exports
│   └── dbt_project.yml
│
├── ml/                   # Machine learning
│   ├── models/
│   │   ├── location_model.py         # PyTorch Lightning
│   │   ├── mc_dropout.py             # Uncertainty estimation
│   │   └── ucb_exploration.py        # RL strategy
│   ├── data/
│   │   └── dataset.py                # PyTorch DataLoader
│   └── training/
│       └── train_with_mlflow.py      # Training + MLflow
│
├── notebooks/            # Demonstrations
│   └── rl_demonstration.ipynb        # RL showcase
│
├── .github/              # CI/CD
│   └── workflows/
│       ├── ci.yml
│       └── deploy.yml
│
├── docker-compose.override.yml       # MLflow service
├── Dockerfile                        # Astro Runtime
└── docs/                             # Documentation

📚 Documentation

Quick Start Guide - Get started in 15 minutes
Architecture Overview - System design and components
Project Summary - Executive summary and features
Next Steps - Implementation checklist
Pre-commit Guide - Code quality setup

🔧 Installation

Prerequisites

Python 3.10+
Docker Desktop (running)
8GB+ RAM
10GB+ free disk space

Setup

# Install Astro CLI
brew install astro  # macOS
# OR (Linux)
curl -sSL install.astronomer.io | sudo bash -s

# Verify installation
astro version

# Start services
cd vacation_rentals  # root of your local clone
astro dev start

This will start:

Apache Airflow (webserver, scheduler, database)
MLflow tracking server
All necessary dependencies

🎓 Technical Details

Model Architecture

LocationRetrievalModel(LightningModule)
├── Embeddings: Categorical features (dim=64)
├── FC1: Linear(input_dim, 256) + ReLU + Dropout(0.5)
├── FC2: Linear(256, 256) + ReLU + Dropout(0.5)
└── Output: Linear(256, 4)  # [lat_min, lng_min, lat_max, lng_max]
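
As a rough sizing check, the layer list above can be turned into a parameter count. This is only a sketch: it assumes (the README does not confirm this) that each of the 3 categorical features gets its own 64-dimensional embedding and that the embeddings are concatenated with the 11 continuous features to form the FC1 input. Embedding tables are left out because their vocabulary sizes are not stated.

```python
# Assumption: one 64-dim embedding per categorical feature, concatenated
# with the continuous features to form the FC1 input.
N_CONTINUOUS = 11
N_CATEGORICAL = 3
EMBEDDING_DIM = 64
HIDDEN_DIM = 256
OUTPUT_DIM = 4  # [lat_min, lng_min, lat_max, lng_max]


def linear_params(in_features: int, out_features: int) -> int:
    """Weights plus biases of a fully connected layer."""
    return in_features * out_features + out_features


input_dim = N_CONTINUOUS + N_CATEGORICAL * EMBEDDING_DIM
layer_params = {
    "FC1": linear_params(input_dim, HIDDEN_DIM),
    "FC2": linear_params(HIDDEN_DIM, HIDDEN_DIM),
    "Output": linear_params(HIDDEN_DIM, OUTPUT_DIM),
}
total_dense_params = sum(layer_params.values())

if __name__ == "__main__":
    print(f"input_dim = {input_dim}")
    for name, count in layer_params.items():
        print(f"{name}: {count:,} parameters")
    print(f"total (excluding embeddings): {total_dense_params:,}")
```

Under these assumptions the dense layers stay small (on the order of 10^5 parameters), which is consistent with running 32 Monte Carlo forward passes per query at modest cost.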

Input Features

Continuous (11):

search_latitude, search_longitude
search_year, search_month, search_dayofweek
is_weekend, is_summer
price_per_night
neighborhood_listing_count, neighborhood_avg_price, neighborhood_median_price

Categorical (3):

preferred_room_type
preferred_property_type
booked_neighborhood
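
The calendar features in the continuous list can all be derived from a single search date. The sketch below shows one way to do that in Python; the function name is hypothetical, and the definitions of weekend (Saturday or Sunday), summer (June through August), and day-of-week numbering (Monday = 0) are assumptions, since the README does not define them. In this project the equivalent logic would presumably live in the dbt feature models.

```python
from datetime import date


def calendar_features(search_date: date) -> dict:
    """Derive the calendar features from a search date.

    Assumptions: Monday=0 day-of-week (Python convention), weekend means
    Saturday or Sunday, summer means June through August.
    """
    day_of_week = search_date.weekday()
    return {
        "search_year": search_date.year,
        "search_month": search_date.month,
        "search_dayofweek": day_of_week,
        "is_weekend": int(day_of_week >= 5),
        "is_summer": int(search_date.month in (6, 7, 8)),
    }


if __name__ == "__main__":
    print(calendar_features(date(2024, 7, 6)))
```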

Loss Function

Total Loss = λ₁·Containment + λ₂·Area + λ₃·Validity

Containment: Penalizes predictions whose box does not contain the target location
Area: Penalizes overly large boxes
Validity: Penalizes boxes whose min coordinates are not below their max coordinates

Weights: λ₁=10.0, λ₂=1.0, λ₃=5.0
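
To make the three terms concrete, here is a minimal single-sample sketch in plain Python. The exact form of each term is not given in this README, so the hinge-style penalties below are one plausible formulation, not the project's actual implementation; only the weights (10.0, 1.0, 5.0) come from the text above. The function name `box_loss` is hypothetical.

```python
def relu(x: float) -> float:
    return max(0.0, x)


def box_loss(box, target, w_contain=10.0, w_area=1.0, w_valid=5.0):
    """Weighted loss for one predicted box and one target point.

    box    = (lat_min, lng_min, lat_max, lng_max)
    target = (lat, lng) of the booked listing
    """
    lat_min, lng_min, lat_max, lng_max = box
    lat, lng = target

    # Containment: distance by which the target falls outside each edge.
    containment = (
        relu(lat_min - lat) + relu(lat - lat_max)
        + relu(lng_min - lng) + relu(lng - lng_max)
    )
    # Area: box size in squared degrees; inverted boxes count as zero area.
    area = relu(lat_max - lat_min) * relu(lng_max - lng_min)
    # Validity: amount by which a min coordinate exceeds its max.
    validity = relu(lat_min - lat_max) + relu(lng_min - lng_max)

    return w_contain * containment + w_area * area + w_valid * validity


if __name__ == "__main__":
    print(box_loss((0.0, 0.0, 1.0, 2.0), (0.5, 1.0)))  # target inside the box
    print(box_loss((0.0, 0.0, 1.0, 2.0), (2.0, 1.0)))  # target outside the box
```

With the containment weight ten times the area weight, missing the target by some distance costs far more than growing the box by a comparable amount, which pushes the model toward boxes that cover the booking first and shrink second.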

Reinforcement Learning Strategy

Monte Carlo Dropout:

predictor = MCDropoutPredictor(model, n_samples=32)
mean, std = predictor.predict(features)
# std represents epistemic uncertainty
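
The internals of `MCDropoutPredictor` are not shown here, so the following is a framework-agnostic sketch of the general technique: call a forward pass that keeps dropout active several times, then summarize the samples per output. Both function names and the toy forward pass are hypothetical stand-ins; in PyTorch the stochastic forward would be the real model with its dropout layers left in training mode.

```python
import random
import statistics


def mc_dropout_predict(stochastic_forward, features, n_samples=32):
    """Run n stochastic forward passes; return per-output mean and std."""
    samples = [stochastic_forward(features) for _ in range(n_samples)]
    per_output = list(zip(*samples))  # transpose: one tuple per output
    mean = [statistics.fmean(values) for values in per_output]
    std = [statistics.pstdev(values) for values in per_output]
    return mean, std


def make_toy_forward(weights, drop_prob=0.5, seed=0):
    """Hypothetical stand-in for a network with dropout left active."""
    rng = random.Random(seed)

    def forward(features):
        outputs = []
        for row in weights:
            total = 0.0
            for w, x in zip(row, features):
                keep = rng.random() >= drop_prob
                # Inverted dropout: rescale kept inputs by 1 / (1 - p).
                total += w * x / (1.0 - drop_prob) if keep else 0.0
            outputs.append(total)
        return outputs

    return forward


if __name__ == "__main__":
    forward = make_toy_forward([[1.0, 1.0], [0.5, -0.5]])
    mean, std = mc_dropout_predict(forward, [1.0, 2.0])
    print("mean:", mean)
    print("std: ", std)
```

A deterministic forward pass yields a standard deviation of exactly zero, which is why dropout must stay enabled at inference time for this estimate to carry any information.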

UCB Exploration:

ucb = UCBExploration(exploration_factor=2.0)
explored_pred = ucb.compute_ucb(mean, std)
# Higher uncertainty → larger areas → more exploration
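
The comment "higher uncertainty → larger areas" suggests that the upper confidence bound is applied by widening the box. A minimal sketch of that idea follows; it assumes each edge is pushed outward by `exploration_factor` standard deviations, which may differ from what `UCBExploration.compute_ucb` actually does. The function name `compute_ucb_box` is hypothetical.

```python
def compute_ucb_box(mean, std, exploration_factor=2.0):
    """Widen a mean box by exploration_factor standard deviations per edge.

    mean and std are (lat_min, lng_min, lat_max, lng_max) tuples.
    """
    lat_min, lng_min, lat_max, lng_max = mean
    s_lat_min, s_lng_min, s_lat_max, s_lng_max = std
    return (
        lat_min - exploration_factor * s_lat_min,  # push lower edges down
        lng_min - exploration_factor * s_lng_min,
        lat_max + exploration_factor * s_lat_max,  # push upper edges up
        lng_max + exploration_factor * s_lng_max,
    )


if __name__ == "__main__":
    mean_box = (48.10, 11.50, 48.20, 11.60)
    std_box = (0.01, 0.01, 0.01, 0.01)
    print(compute_ucb_box(mean_box, std_box))
```

When the standard deviation is zero the box is returned unchanged, so confident predictions are exploited as-is and only uncertain ones trigger a wider search area.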

📊 Data Pipeline

dbt Models

Staging: Clean raw data
- stg_listings.sql - Validate and clean listings
- stg_reviews.sql - Parse review dates
Features: Feature engineering
- fct_bookings.sql - Generate synthetic bookings from reviews
- fct_searches.sql - Create search queries with spatial variation
- fct_features.sql - Compute all ML features
Exports: Parquet files for PyTorch
- exp_train.sql - Training set (70%)
- exp_val.sql - Validation set (15%)
- exp_test.sql - Test set (15%)
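
How the export models assign rows to the 70/15/15 split is not described here. One common approach, sketched below in Python for illustration, is to hash a stable record ID into 100 buckets so that the assignment is reproducible across runs. The function name, the ID format, and the choice of MD5 are all assumptions; the project itself performs the split in dbt SQL.

```python
import hashlib


def assign_split(record_id: str) -> str:
    """Deterministically map a record ID to train, val, or test."""
    digest = hashlib.md5(record_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100  # stable bucket in [0, 100)
    if bucket < 70:
        return "train"
    if bucket < 85:
        return "val"
    return "test"


if __name__ == "__main__":
    counts = {"train": 0, "val": 0, "test": 0}
    for i in range(10_000):
        counts[assign_split(f"search_{i}")] += 1
    print(counts)
```

Because the split depends only on the ID, a record never moves between sets when the pipeline is rerun on refreshed data, which avoids leaking test rows into training.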

Airflow DAGs

data_pipeline: Daily at 2 AM UTC
- Load raw CSV.gz → DuckDB
- Run dbt staging models
- Run dbt feature models
- Export to Parquet
- Validate data quality
ml_training_pipeline: Daily at 4 AM UTC (after data pipeline)
- Load Parquet data
- Train PyTorch Lightning model
- Track experiments in MLflow
- Register best model
- Evaluate on test set
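
The schedules and step order above can be written down as cron strings and a task dependency graph. The sketch below uses only the Python standard library to illustrate the ordering; it is not the DAG code from this repository, and the task names are hypothetical labels for the five `data_pipeline` steps.

```python
from graphlib import TopologicalSorter

# Cron equivalents of the schedules above (UTC).
SCHEDULES = {
    "data_pipeline": "0 2 * * *",         # daily at 02:00
    "ml_training_pipeline": "0 4 * * *",  # daily at 04:00
}

# Hypothetical task names; each task maps to the tasks it depends on.
DATA_PIPELINE_DEPS = {
    "load_raw_to_duckdb": set(),
    "dbt_run_staging": {"load_raw_to_duckdb"},
    "dbt_run_features": {"dbt_run_staging"},
    "export_parquet": {"dbt_run_features"},
    "validate_data_quality": {"export_parquet"},
}


def execution_order(deps):
    """Return one valid execution order for a task dependency graph."""
    return list(TopologicalSorter(deps).static_order())


if __name__ == "__main__":
    for task in execution_order(DATA_PIPELINE_DEPS):
        print(task)
```

Note that a two-hour gap between the schedules only orders the pipelines by time. If training must never start before the data run has finished, an explicit cross-DAG dependency would be needed.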

🧪 Testing

# Test Airflow DAGs
astro dev pytest airflow/dags/

# Test dbt models
astro dev bash
cd /usr/local/airflow/dbt
dbt test
exit

# Test PyTorch models
cd ml && python -m pytest

# Run all tests
astro dev pytest

🎨 RL Demonstration

Interactive Jupyter notebook showcasing:

Load trained model from MLflow
MC Dropout predictions with uncertainty
UCB exploration strategy comparison
Visualization of exploration vs exploitation
Simulated retraining cycle

cd notebooks
jupyter notebook rl_demonstration.ipynb

📈 Results

Expected Performance (Munich Dataset)

Metric	Target	Notes
Booking Recall	>70%	Predicted areas contain booked listings
Mean Area	<50 km²	Compact, efficient retrieval
Calibration (1σ)	~68%	Well-calibrated uncertainty
Exploration Ratio	1.5-3x	Adaptive based on uncertainty
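
For reference, the first two metrics in the table could be computed along the following lines. This is a sketch under stated assumptions: function names are hypothetical, recall is taken to mean the fraction of bookings that fall inside their predicted box, and area uses a flat-earth approximation (about 111.32 km per degree of latitude, with longitude scaled by the cosine of the latitude), which is adequate for city-sized boxes.

```python
import math

KM_PER_DEGREE_LAT = 111.32  # approximate length of one degree of latitude


def contains(box, point):
    """True if (lat, lng) lies inside (lat_min, lng_min, lat_max, lng_max)."""
    lat_min, lng_min, lat_max, lng_max = box
    lat, lng = point
    return lat_min <= lat <= lat_max and lng_min <= lng <= lng_max


def booking_recall(boxes, bookings):
    """Fraction of bookings that fall inside their predicted box."""
    hits = sum(contains(b, p) for b, p in zip(boxes, bookings))
    return hits / len(bookings)


def box_area_km2(box):
    """Approximate box area using a local flat-earth projection."""
    lat_min, lng_min, lat_max, lng_max = box
    mid_lat = math.radians((lat_min + lat_max) / 2.0)
    height_km = (lat_max - lat_min) * KM_PER_DEGREE_LAT
    width_km = (lng_max - lng_min) * KM_PER_DEGREE_LAT * math.cos(mid_lat)
    return height_km * width_km


if __name__ == "__main__":
    munich_box = (48.10, 11.50, 48.15, 11.55)
    print(f"area: {box_area_km2(munich_box):.1f} km^2")
    print(booking_recall([munich_box] * 2, [(48.12, 11.52), (48.30, 11.52)]))
```

At Munich's latitude a box of 0.05° by 0.05° covers roughly 20 km², so the <50 km² target corresponds to boxes of well under a tenth of a degree per side.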

🔄 Development Workflow

# Make changes to code
vim ml/models/location_model.py

# Restart if needed
astro dev restart

# View logs
astro dev logs

# Run specific DAG
astro dev run dags trigger data_pipeline

# Access container
astro dev bash

🚀 Deployment

Local Development

astro dev start  # Already set up!

Production (Astronomer Cloud)

# Deploy to Astronomer
astro deploy <deployment-id>

# Or use GitHub Actions
git push origin main  # Triggers CI/CD

📦 Data Source

Inside Airbnb - Munich, Germany

Listings: ~20,000 properties
Reviews: ~200,000 (used as booking proxy)
License: CC BY 4.0

🔮 Future Enhancements

Short Term

Multi-city support (Paris, Barcelona, Mallorca)
A/B testing framework
Real-time inference API
Advanced monitoring dashboards

Medium Term

Automated retraining on drift
Feature store integration
Model ensembles
Thompson sampling

Long Term

Multi-modal features (images, text)
Personalization layers
Real-time streaming
Global deployment

📄 License

MIT License - See LICENSE file for details

🙏 Acknowledgments

Inspired by Airbnb's excellent work on location retrieval optimization. This is an educational project demonstrating production ML engineering practices with modern tools (Airflow, dbt, PyTorch, MLflow).

💼 Skills Demonstrated

Data Engineering: dbt + DuckDB + SQL
ML Engineering: PyTorch Lightning + custom loss functions
MLOps: Airflow + MLflow + Docker
RL Algorithms: Contextual bandits + MC Dropout + UCB
DevOps: CI/CD + GitHub Actions + testing
Software Engineering: Code quality + documentation + best practices

📞 Contact

For questions or suggestions, please open an issue on GitHub.

Version: 2.0.0 (Production Architecture)
Status: ✅ Complete and ready for demo
Built with: Airflow + dbt + PyTorch + MLflow + Docker

ColtAllen/vacation-rentals
