An end-to-end data science workflow leveraging reinforcement learning for vacation rental search queries.

ColtAllen/vacation-rentals

Vacation Rentals ML Pipeline

Production-Grade ML System for Location Retrieval

Python 3.10+ · Apache Airflow · dbt · PyTorch · MLflow · pre-commit

End-to-end ML pipeline implementing contextual multi-armed bandits for vacation rental location retrieval, inspired by Airbnb's RL approach.

🎯 Overview

This project demonstrates how to build a production-ready machine learning system that dynamically determines relevant search areas for vacation rental queries using:

  • Deep Learning: PyTorch Lightning neural network with custom loss function
  • Data Engineering: dbt transformations with DuckDB
  • Orchestration: Apache Airflow (via Astro CLI)
  • Experiment Tracking: MLflow for model lifecycle management
  • Reinforcement Learning: Monte Carlo Dropout + UCB exploration for contextual bandits
  • CI/CD: GitHub Actions for automated testing

🚀 Quick Start

# 1. Install Astro CLI
brew install astro  # or: pip install astro-cli

# 2. Start services (Airflow + MLflow)
astro dev start

# 3. Access UIs
# Airflow: http://localhost:8080 (admin/admin)
# MLflow:  http://localhost:5000

# 4. Prepare data (download Inside Airbnb data)
cd data/raw
wget http://data.insideairbnb.com/germany/bv/munich/2024-09-24/data/listings.csv.gz
wget http://data.insideairbnb.com/germany/bv/munich/2024-09-24/data/reviews.csv.gz
cd ../..

# 5. Run pipelines
astro dev run dags trigger data_pipeline
astro dev run dags trigger ml_training_pipeline

# 6. View results
open http://localhost:5000  # MLflow experiments

See QUICKSTART.md for detailed setup instructions.

📐 Architecture

┌─────────────────────────────────────────┐
│  Apache Airflow 2.8 (Orchestration)    │
│         via Astro CLI                   │
└─────────────────────────────────────────┘
         │           │            │
         ▼           ▼            ▼
    ┌───────┐  ┌──────────┐  ┌────────┐
    │  dbt  │  │ PyTorch  │  │ MLflow │
    │ 1.7   │  │ Light.   │  │  2.10  │
    └───────┘  └──────────┘  └────────┘
         │           │            │
         └───────────┴────────────┘
                     │
         ┌───────────────────────┐
         │ DuckDB 1.0 + Parquet  │
         └───────────────────────┘

Data Flow: Raw CSV → DuckDB → dbt → Parquet → PyTorch → MLflow

🎯 Key Features

Orchestration

  • ✅ Automated pipelines with Apache Airflow
  • ✅ Scheduled execution and monitoring
  • ✅ Error handling and retries
  • ✅ Task dependencies and parallelism

Data Engineering

  • ✅ SQL-based transformations with dbt
  • ✅ Data quality tests
  • ✅ Incremental models
  • ✅ Documentation generation

Machine Learning

  • ✅ PyTorch Lightning for scalable training
  • ✅ Custom loss function (containment + area + validity)
  • ✅ Monte Carlo Dropout for uncertainty estimation
  • ✅ UCB exploration strategy
  • ✅ Early stopping and checkpointing

MLOps

  • ✅ MLflow experiment tracking
  • ✅ Model registry and versioning
  • ✅ Artifact storage
  • ✅ Hyperparameter logging

DevOps

  • ✅ Docker-based infrastructure
  • ✅ GitHub Actions CI/CD
  • ✅ Automated testing (pytest, dbt test)
  • ✅ Code quality checks (black, isort, flake8, mypy)

📁 Project Structure

vacation_rentals/
├── airflow/              # Orchestration
│   └── dags/
│       ├── data_pipeline_dag.py       # dbt transformations
│       └── ml_training_dag.py         # PyTorch training
│
├── dbt/                  # Data transformation
│   ├── models/
│   │   ├── staging/      # Clean raw data
│   │   ├── features/     # Feature engineering
│   │   └── exports/      # Parquet exports
│   └── dbt_project.yml
│
├── ml/                   # Machine learning
│   ├── models/
│   │   ├── location_model.py         # PyTorch Lightning
│   │   ├── mc_dropout.py             # Uncertainty estimation
│   │   └── ucb_exploration.py        # RL strategy
│   ├── data/
│   │   └── dataset.py                # PyTorch DataLoader
│   └── training/
│       └── train_with_mlflow.py      # Training + MLflow
│
├── notebooks/            # Demonstrations
│   └── rl_demonstration.ipynb        # RL showcase
│
├── .github/              # CI/CD
│   └── workflows/
│       ├── ci.yml
│       └── deploy.yml
│
├── docker-compose.override.yml       # MLflow service
├── Dockerfile                        # Astro Runtime
└── docs/                             # Documentation

📚 Documentation

🔧 Installation

Prerequisites

  • Python 3.10+
  • Docker Desktop (running)
  • 8GB+ RAM
  • 10GB+ free disk space

Setup

# Install Astro CLI
brew install astro  # macOS
# OR
pip install astro-cli

# Verify installation
astro version

# Start services
cd vacation_rentals  # path to your local clone of the repository
astro dev start

This will start:

  • Apache Airflow (webserver, scheduler, database)
  • MLflow tracking server
  • All necessary dependencies

🎓 Technical Details

Model Architecture

LocationRetrievalModel(LightningModule)
├── Embeddings: Categorical features (dim=64)
├── FC1: Linear(input_dim, 256) + ReLU + Dropout(0.5)
├── FC2: Linear(256, 256) + ReLU + Dropout(0.5)
└── Output: Linear(256, 4)  # [lat_min, lng_min, lat_max, lng_max]

Input Features

Continuous (11):

  • search_latitude, search_longitude
  • search_year, search_month, search_dayofweek
  • is_weekend, is_summer
  • price_per_night
  • neighborhood_listing_count, neighborhood_avg_price, neighborhood_median_price

Categorical (3):

  • preferred_room_type
  • preferred_property_type
  • booked_neighborhood

Loss Function

Total Loss = λ₁·Containment + λ₂·Area + λ₃·Validity

Containment: Ensures target within predicted box
Area: Penalizes overly large boxes
Validity: Ensures min < max coordinates

Weights: λ₁=10.0 (containment), λ₂=1.0 (area), λ₃=5.0 (validity)
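
The three terms can be illustrated for a single example in plain Python. This is only a sketch under assumptions: the hinge-style penalties, the use of width times height in degrees as the area term, and the function name `box_loss` are illustrative and not taken from the project, whose PyTorch implementation may differ.

```python
def box_loss(pred, target, w_contain=10.0, w_area=1.0, w_valid=5.0):
    """Weighted sum of containment, area and validity penalties for one example.

    pred   -- (lat_min, lng_min, lat_max, lng_max) predicted box
    target -- (lat, lng) of the booked listing
    """
    lat_min, lng_min, lat_max, lng_max = pred
    lat, lng = target

    # Containment: how far the target lies outside each edge (0 when inside).
    containment = (
        max(0.0, lat_min - lat) + max(0.0, lat - lat_max)
        + max(0.0, lng_min - lng) + max(0.0, lng - lng_max)
    )

    # Area: width * height, clamped at zero so inverted boxes add no negative area.
    area = max(0.0, lat_max - lat_min) * max(0.0, lng_max - lng_min)

    # Validity: penalize min > max on either axis.
    validity = max(0.0, lat_min - lat_max) + max(0.0, lng_min - lng_max)

    return w_contain * containment + w_area * area + w_valid * validity
```

With these weights, a target that falls outside the box costs ten times as much per degree as the same amount of extra box area, which pushes the model toward boxes that contain the booking before it shrinks them.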

Reinforcement Learning Strategy

Monte Carlo Dropout:

predictor = MCDropoutPredictor(model, n_samples=32)
mean, std = predictor.predict(features)
# std represents epistemic uncertainty
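
The internals of `MCDropoutPredictor` are not shown in this README. The general technique is to keep dropout active at inference, run several stochastic forward passes, and summarize them per output coordinate. A minimal plain-Python sketch, where `mc_dropout_predict` and the noisy `toy_forward` stand-in are hypothetical names used only for illustration:

```python
import random
import statistics

def mc_dropout_predict(stochastic_forward, features, n_samples=32):
    """Run n_samples stochastic passes and return per-output mean and std."""
    samples = [stochastic_forward(features) for _ in range(n_samples)]
    # zip(*samples) regroups the passes by output coordinate.
    columns = list(zip(*samples))
    mean = [statistics.fmean(col) for col in columns]
    std = [statistics.pstdev(col) for col in columns]
    return mean, std

_rng = random.Random(0)  # fixed seed so the toy example is repeatable

def toy_forward(features):
    """Hypothetical stand-in for a network with dropout left on at inference."""
    lat, lng = features
    base = [lat - 0.05, lng - 0.05, lat + 0.05, lng + 0.05]
    return [b + _rng.gauss(0.0, 0.01) for b in base]

mean, std = mc_dropout_predict(toy_forward, (48.14, 11.58))
```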

UCB Exploration:

ucb = UCBExploration(exploration_factor=2.0)
explored_pred = ucb.compute_ucb(mean, std)
# Higher uncertainty → larger areas → more exploration
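
How `compute_ucb` turns uncertainty into a larger area is not spelled out here. One plausible reading, offered as an assumption rather than the project's actual code, is to push each minimum coordinate down and each maximum coordinate up by `exploration_factor` times its standard deviation. The helper name `expand_box` is hypothetical.

```python
def expand_box(mean, std, exploration_factor=2.0):
    """Widen a [lat_min, lng_min, lat_max, lng_max] box by k * std per edge."""
    lat_min, lng_min, lat_max, lng_max = mean
    s_lat_min, s_lng_min, s_lat_max, s_lng_max = std
    k = exploration_factor
    return [
        lat_min - k * s_lat_min,
        lng_min - k * s_lng_min,
        lat_max + k * s_lat_max,
        lng_max + k * s_lng_max,
    ]

# Same mean box, two levels of uncertainty.
confident = expand_box([48.10, 11.50, 48.20, 11.60], [0.001] * 4)
uncertain = expand_box([48.10, 11.50, 48.20, 11.60], [0.02] * 4)
```

Under this reading, the uncertain query retrieves listings from a visibly wider box than the confident one, which is what generates new feedback in poorly covered regions.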

📊 Data Pipeline

dbt Models

  1. Staging: Clean raw data

    • stg_listings.sql - Validate and clean listings
    • stg_reviews.sql - Parse review dates
  2. Features: Feature engineering

    • fct_bookings.sql - Generate synthetic bookings from reviews
    • fct_searches.sql - Create search queries with spatial variation
    • fct_features.sql - Compute all ML features
  3. Exports: Parquet files for PyTorch

    • exp_train.sql - Training set (70%)
    • exp_val.sql - Validation set (15%)
    • exp_test.sql - Test set (15%)
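
The SQL behind the export models is not reproduced here. As an illustration of how a repeatable 70/15/15 split can be made, the sketch below assigns each row to a split from a hash of a stable key, so a row keeps its split across pipeline runs. The function name `assign_split` and the `booking_id` keys are hypothetical, and the project may split differently (for example by date).

```python
import hashlib

def assign_split(key, train=0.70, val=0.15):
    """Map a stable row key to 'train', 'val' or 'test' via a hash bucket."""
    digest = hashlib.sha256(str(key).encode("utf-8")).digest()
    # First 8 bytes as an integer, scaled to the unit interval.
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    if bucket < train:
        return "train"
    if bucket < train + val:
        return "val"
    return "test"

counts = {"train": 0, "val": 0, "test": 0}
for booking_id in range(10_000):  # hypothetical row keys
    counts[assign_split(booking_id)] += 1
```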

Airflow DAGs

  1. data_pipeline: Daily at 2 AM UTC

    • Load raw CSV.gz → DuckDB
    • Run dbt staging models
    • Run dbt feature models
    • Export to Parquet
    • Validate data quality
  2. ml_training_pipeline: Daily at 4 AM UTC (after data pipeline)

    • Load Parquet data
    • Train PyTorch Lightning model
    • Track experiments in MLflow
    • Register best model
    • Evaluate on test set
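
The DAG code itself is not shown in this README. As a rough stand-in for the load and validation steps, the sketch below reads a gzipped CSV with the standard library and keeps only rows with usable coordinates. The function name `read_valid_listings` is hypothetical, and the `latitude` and `longitude` column names are assumed from the Inside Airbnb listings file; the real pipeline loads the data into DuckDB instead.

```python
import csv
import gzip

def read_valid_listings(path):
    """Yield listing rows whose latitude/longitude parse and are in range."""
    with gzip.open(path, "rt", encoding="utf-8", newline="") as handle:
        for row in csv.DictReader(handle):
            try:
                lat = float(row["latitude"])
                lng = float(row["longitude"])
            except (KeyError, TypeError, ValueError):
                continue  # skip rows with missing or malformed coordinates
            if -90.0 <= lat <= 90.0 and -180.0 <= lng <= 180.0:
                yield row
```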

🧪 Testing

# Test Airflow DAGs
astro dev pytest airflow/dags/

# Test dbt models
astro dev bash
cd /usr/local/airflow/dbt
dbt test
exit

# Test PyTorch models
cd ml && python -m pytest

# Run all tests
astro dev pytest

🎨 RL Demonstration

An interactive Jupyter notebook that walks through:

  • Loading the trained model from MLflow
  • MC Dropout predictions with uncertainty
  • A comparison of UCB exploration strategies
  • Visualization of exploration vs. exploitation
  • A simulated retraining cycle

cd notebooks
jupyter notebook rl_demonstration.ipynb

📈 Results

Expected Performance (Munich Dataset)

Metric             Target     Notes
-----------------  ---------  ---------------------------------------
Booking Recall     >70%       Predicted areas contain booked listings
Mean Area          <50 km²    Compact, efficient retrieval
Calibration (1σ)   ~68%       Well-calibrated uncertainty
Exploration Ratio  1.5-3x     Adaptive based on uncertainty
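
How these metrics are computed is not shown in this README. A minimal sketch of the first two, assuming boxes are `(lat_min, lng_min, lat_max, lng_max)` tuples and converting degrees to kilometres with a flat-earth approximation (about 111.32 km per degree of latitude, with longitude scaled by the cosine of the latitude). All function names and the sample coordinates are made up for illustration.

```python
import math

KM_PER_DEG = 111.32  # approximate length of one degree of latitude

def contains(box, point):
    """True when the (lat, lng) point lies inside the box, edges included."""
    lat_min, lng_min, lat_max, lng_max = box
    lat, lng = point
    return lat_min <= lat <= lat_max and lng_min <= lng <= lng_max

def area_km2(box):
    """Approximate box area in square kilometres."""
    lat_min, lng_min, lat_max, lng_max = box
    mid_lat = math.radians((lat_min + lat_max) / 2)
    height = (lat_max - lat_min) * KM_PER_DEG
    width = (lng_max - lng_min) * KM_PER_DEG * math.cos(mid_lat)
    return max(0.0, height) * max(0.0, width)

def booking_recall(boxes, bookings):
    """Fraction of predicted boxes that contain their booked listing."""
    hits = sum(contains(b, p) for b, p in zip(boxes, bookings))
    return hits / len(boxes)

# Illustrative values only, not project results.
boxes = [(48.10, 11.50, 48.16, 11.60), (48.12, 11.55, 48.15, 11.60)]
bookings = [(48.14, 11.58), (48.20, 11.57)]
recall = booking_recall(boxes, bookings)
mean_area = sum(area_km2(b) for b in boxes) / len(boxes)
```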

🔄 Development Workflow

# Make changes to code
vim ml/models/location_model.py

# Restart if needed
astro dev restart

# View logs
astro dev logs

# Run specific DAG
astro dev run dags trigger data_pipeline

# Access container
astro dev bash

🚀 Deployment

Local Development

astro dev start  # Already set up!

Production (Astronomer Cloud)

# Deploy to Astronomer
astro deploy <deployment-id>

# Or use GitHub Actions
git push origin main  # Triggers CI/CD

📦 Data Source

Inside Airbnb - Munich, Germany

  • Listings: ~20,000 properties
  • Reviews: ~200,000 (used as booking proxy)
  • License: CC BY 4.0

🎓 Learning Resources

🔮 Future Enhancements

Short Term

  • Multi-city support (Paris, Barcelona, Mallorca)
  • A/B testing framework
  • Real-time inference API
  • Advanced monitoring dashboards

Medium Term

  • Automated retraining on drift
  • Feature store integration
  • Model ensembles
  • Thompson sampling

Long Term

  • Multi-modal features (images, text)
  • Personalization layers
  • Real-time streaming
  • Global deployment

📄 License

MIT License - See LICENSE file for details

🙏 Acknowledgments

Inspired by Airbnb's excellent work on location retrieval optimization. This is an educational project demonstrating production ML engineering practices with modern tools (Airflow, dbt, PyTorch, MLflow).

💼 Skills Demonstrated

  • Data Engineering: dbt + DuckDB + SQL
  • ML Engineering: PyTorch Lightning + custom loss functions
  • MLOps: Airflow + MLflow + Docker
  • RL Algorithms: Contextual bandits + MC Dropout + UCB
  • DevOps: CI/CD + GitHub Actions + testing
  • Software Engineering: Code quality + documentation + best practices

📞 Contact

For questions or suggestions, please open an issue on GitHub.


Version: 2.0.0 (Production Architecture)
Status: ✅ Complete and ready for demo
Built with: Airflow + dbt + PyTorch + MLflow + Docker
