🏡 Predicting Airbnb Listing Success

This project builds an end-to-end machine learning pipeline that predicts the success tier of an Airbnb listing (Excellent, Good, or Average) using only publicly available listing metadata.

🔎 Goal: Help real estate agents, analysts, and hosts evaluate a listing's potential before it goes live, based on observable features such as price, reviews, availability, and host trust signals.

🎯 Project Summary

Type: Multi-class classification
Target: success_label (Excellent / Good / Average)
Model: Tuned Random Forest
Evaluation: Accuracy, F1 Score, Confusion Matrix, Feature Importance

🧠 What Defines Success?

Since Airbnb doesn’t provide a built-in success score, we engineered a label using:

✅ adj_rating: Adjusted guest rating (normalized scale)
✅ log_reviews_ltm: Log-transformed review count (last 12 months)
✅ has_rating: Has the listing ever been rated?
✅ instant_bookable: Can it be booked without host approval?
✅ is_superhost: Airbnb’s trust-based badge

These five signals were combined into a single composite success score, which was then quantile-binned into 3 classes so that each class holds roughly the same number of listings.
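The labeling step can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the equal weights, the min-max scaling, the `make_success_label` helper, and the synthetic `demo` data are all assumptions, since this README does not state how the signals were weighted.

```python
import numpy as np
import pandas as pd

def make_success_label(df: pd.DataFrame, weights: dict) -> pd.Series:
    """Combine signals into one score, then cut it into three equal-sized tiers."""
    cols = list(weights)
    # Min-max scale each signal to [0, 1] so that no single column dominates.
    span = (df[cols].max() - df[cols].min()).replace(0, 1)
    scaled = (df[cols] - df[cols].min()) / span
    score = sum(w * scaled[c] for c, w in weights.items())
    # Rank first so that tied scores cannot produce duplicate bin edges.
    return pd.qcut(score.rank(method="first"), q=3,
                   labels=["Average", "Good", "Excellent"])

# Hypothetical equal weights; the real weighting is not documented here.
WEIGHTS = {"adj_rating": 1.0, "log_reviews_ltm": 1.0, "has_rating": 1.0,
           "instant_bookable": 1.0, "is_superhost": 1.0}

# Synthetic stand-in data, only to make the sketch runnable.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "adj_rating": rng.uniform(0, 5, 300),
    "log_reviews_ltm": np.log1p(rng.poisson(5, 300)),
    "has_rating": rng.integers(0, 2, 300),
    "instant_bookable": rng.integers(0, 2, 300),
    "is_superhost": rng.integers(0, 2, 300),
})
demo["success_label"] = make_success_label(demo, WEIGHTS)
print(demo["success_label"].value_counts())
```

Quantile binning is what makes the three tiers equal in size: the cut points are taken from the score distribution itself rather than from fixed thresholds.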

🧩 Features Used

Cleaned and engineered features from 5 thematic blocks:

| Block | Examples |
|---|---|
| 📍 Location | `latitude`, `longitude`, `neighbourhood_encoded` |
| 💵 Price | `log_price`, `availability_365`, `instant_book_int` |
| 🧽 Reviews | `has_rating`, `log_reviews_ltm`, `adj_rating` |
| 🏘️ Property | `property_type_encoded`, `beds_per_guest`, `baths_per_guest` |
| 🧑 Host Trust | `hrt_ord` (ordinal host response time), `is_superhost` |

Also:

Target leakage was avoided by explicitly dropping leaking columns before training
Encoders were fit on the training split only (for example, train-only target encoding), so no test-set information reaches the features
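As a rough illustration of train-only target encoding, the sketch below learns a smoothed per-category mean from the training rows and reuses that mapping for the test rows. The helper names, the `smoothing` parameter, the toy data, and the ordinal coding of the target (Average=0, Good=1, Excellent=2) are assumptions made for the example, not details taken from the notebook.

```python
import pandas as pd

def fit_target_encoding(train: pd.DataFrame, col: str, target: str,
                        smoothing: float = 10.0):
    """Learn a smoothed per-category target mean from the training rows only."""
    global_mean = train[target].mean()
    stats = train.groupby(col)[target].agg(["mean", "count"])
    # Shrink small categories toward the global mean to limit overfitting.
    encoding = ((stats["count"] * stats["mean"] + smoothing * global_mean)
                / (stats["count"] + smoothing))
    return encoding, global_mean

def apply_target_encoding(df: pd.DataFrame, col: str,
                          encoding: pd.Series, global_mean: float) -> pd.Series:
    """Map categories to learned values; unseen categories get the global mean."""
    return df[col].map(encoding).fillna(global_mean)

train = pd.DataFrame({
    "neighbourhood": ["A", "A", "B", "B", "B", "C"],
    "target_ord": [2, 1, 0, 0, 1, 2],  # hypothetical ordinal coding of the label
})
test = pd.DataFrame({"neighbourhood": ["A", "C", "Z"]})  # "Z" never appears in train

enc, prior = fit_target_encoding(train, "neighbourhood", "target_ord", smoothing=2.0)
train["neighbourhood_encoded"] = apply_target_encoding(train, "neighbourhood", enc, prior)
test["neighbourhood_encoded"] = apply_target_encoding(test, "neighbourhood", enc, prior)
print(test)
```

The key point is that `fit_target_encoding` never sees the test rows, so the encoded column cannot carry test labels into training.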

📊 Modeling Approach

Baseline: Random Forest (n_estimators=100, default hyperparameters)
Tuned Model: GridSearchCV over max_depth, min_samples_split, and n_estimators
Scoring: f1_weighted (to account for class imbalance)
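A tuning run along these lines might look like the sketch below. The grid values, the 3-fold stratified cross-validation, and the synthetic data are assumptions; this README names the tuned parameters and the scorer but not their ranges.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic three-class data standing in for the real training set.
X, y = make_classification(n_samples=300, n_features=8, n_informative=5,
                           n_classes=3, random_state=42)

# Hypothetical grid values for the three tuned parameters.
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [None, 10],
    "min_samples_split": [2, 5],
}

search = GridSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_grid=param_grid,
    scoring="f1_weighted",  # per-class F1 averaged with class-support weights
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=42),
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

After fitting, `search.best_estimator_` holds the model refit on all of the supplied data with the best parameter combination.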

Key Metrics

| Metric | Tuned Model |
|---|---|
| Accuracy | ~0.71 |
| Weighted F1 Score | ~0.71 |
| Macro F1 Score | ~0.68 |
| Avg Tree Depth | ~30–40 |
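For reference, metrics of this kind can be computed as in the sketch below. It runs on synthetic data with an untuned forest, so its numbers will not match the table; the variable names are illustrative only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score
from sklearn.model_selection import train_test_split

# Synthetic three-class data standing in for the prepared datasets.
X, y = make_classification(n_samples=400, n_features=8, n_informative=5,
                           n_classes=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
y_pred = model.predict(X_test)

metrics = {
    "accuracy": accuracy_score(y_test, y_pred),
    "weighted_f1": f1_score(y_test, y_pred, average="weighted"),
    "macro_f1": f1_score(y_test, y_pred, average="macro"),
    # Mean depth over the individual trees in the fitted forest.
    "avg_tree_depth": float(np.mean([t.get_depth() for t in model.estimators_])),
}
cm = confusion_matrix(y_test, y_pred)  # rows: true class, columns: predicted class
print(metrics)
print(cm)
```

Macro F1 averages the per-class scores equally, while weighted F1 weights them by class support, which is why the two can differ when one class is harder to predict.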

📈 Feature Importance (Top 7)

log_reviews_ltm
has_reviews
latitude, longitude
log_price
availability_365
hrt_ord
neighbourhood_encoded
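A ranking like the one above can be read off a fitted forest through its `feature_importances_` attribute. The sketch below uses synthetic data and borrows six feature names from this README purely as labels, so the resulting order is not meaningful.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Names borrowed from this README as labels for synthetic columns.
feature_names = ["log_price", "availability_365", "latitude",
                 "longitude", "hrt_ord", "neighbourhood_encoded"]

X, y = make_classification(n_samples=300, n_features=6, n_informative=4,
                           n_redundant=2, n_classes=3, random_state=0)
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Impurity-based importances, normalized by scikit-learn to sum to 1.
importances = (pd.Series(model.feature_importances_, index=feature_names)
               .sort_values(ascending=False))
print(importances.head(7))
```

Impurity-based importances can favor continuous or high-cardinality features, so `sklearn.inspection.permutation_importance` on held-out data is a common cross-check.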

🔒 Clean Modeling Practices

🚫 No leakage: all features used in target creation were excluded from training
🧪 Stratified train-test split
🧼 Final datasets (train_df_ready.csv, test_df_ready.csv) exported cleanly
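The first two practices can be sketched together as below. The column list is taken from the "What Defines Success?" section; the synthetic data, the 80/20 split ratio, and the random seed are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Columns this README lists as inputs to the engineered success label.
LABEL_INPUTS = ["adj_rating", "log_reviews_ltm", "has_rating",
                "instant_bookable", "is_superhost"]

# Synthetic stand-in data, only to make the sketch runnable.
rng = np.random.default_rng(1)
n = 300
df = pd.DataFrame({
    "log_price": rng.normal(4.5, 0.6, n),
    "availability_365": rng.integers(0, 366, n),
    "adj_rating": rng.uniform(0, 5, n),
    "log_reviews_ltm": np.log1p(rng.poisson(5, n)),
    "has_rating": rng.integers(0, 2, n),
    "instant_bookable": rng.integers(0, 2, n),
    "is_superhost": rng.integers(0, 2, n),
    "success_label": rng.choice(["Average", "Good", "Excellent"], n),
})

# Drop the label and every column that fed into it before training.
X = df.drop(columns=LABEL_INPUTS + ["success_label"])
y = df["success_label"]

# Stratify on the label so both splits keep the same class proportions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)
print(list(X_train.columns))
print(y_train.value_counts(normalize=True).round(2))
```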

📁 Files in this Repository

| File | Description |
|---|---|
| `airbnb_success_prediction.ipynb` | Full project notebook |
| `train_df_ready.csv` | Cleaned training data |
| `test_df_ready.csv` | Cleaned test data |
| `README.md` | This file |

📂 Download Processed Datasets

Due to GitHub's file size limitations, the full processed CSVs are hosted externally:

🔗 Download csv files here

To use them:

1. Download all files.
2. Place them in the root of this project directory.
3. Open `airbnb_success_prediction.ipynb` and run the notebook.

👨‍💻 Author

Shawn Waringu
Data Scientist & Analyst

LinkedIn GitHub
