Welcome to my implementation of the House Prices - Advanced Regression Techniques competition. This project leverages data preprocessing, feature engineering, and advanced machine learning techniques to accurately predict housing prices.
This notebook solves a supervised regression problem using the Ames Housing Dataset, which contains 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa.
The goal is to predict the SalePrice of each house in the test set.
- `train.csv` – Training dataset with features and `SalePrice` (target)
- `test.csv` – Test dataset without `SalePrice`
- `data_description.txt` – Full metadata for all 79 variables
- `submission.csv` – Final prediction file
- Loaded `train.csv` and `test.csv` using pandas
- Combined train and test for unified cleaning
- Imputed missing values:
  - Mode for categorical columns
  - Median for numerical columns
- Applied `LabelEncoder` to all object-type columns
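The cleaning steps above can be sketched as follows. This is a minimal illustration on a toy DataFrame (the column names and values here are made up; the notebook reads the real CSV files instead):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy stand-ins for train.csv / test.csv
train = pd.DataFrame({"LotArea": [8450, None, 11250],
                      "MSZoning": ["RL", None, "RM"],
                      "SalePrice": [208500, 181500, 223500]})
test = pd.DataFrame({"LotArea": [9550, 14260],
                     "MSZoning": ["RL", "RL"]})

# Set the target aside, then combine train and test for unified cleaning
target = train.pop("SalePrice")
combined = pd.concat([train, test], ignore_index=True)

# Impute: mode for categorical columns, median for numerical columns
for col in combined.columns:
    if combined[col].dtype == "object":
        combined[col] = combined[col].fillna(combined[col].mode()[0])
    else:
        combined[col] = combined[col].fillna(combined[col].median())

# Label-encode all object-type columns
for col in combined.select_dtypes(include="object"):
    combined[col] = LabelEncoder().fit_transform(combined[col])

# Split back into train and test rows
train_clean = combined.iloc[:len(target)]
test_clean = combined.iloc[len(target):]
```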
Created new features that improved predictive power:
- `TotalSF` = Total livable square footage
- `TotalBath` = Combined bathrooms (full + half)
- `Age` = Years since built
- `RemodAge` = Years since last remodel
- `HasGarage`, `HasPool` = Binary flags
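A sketch of how these features could be derived. The source columns follow the Ames data dictionary, but the single row of values here is illustrative only:

```python
import pandas as pd

df = pd.DataFrame({
    "TotalBsmtSF": [856], "1stFlrSF": [856], "2ndFlrSF": [854],
    "FullBath": [2], "HalfBath": [1],
    "YrSold": [2008], "YearBuilt": [2003], "YearRemodAdd": [2003],
    "GarageArea": [548], "PoolArea": [0],
})

# Total livable square footage across basement and both floors
df["TotalSF"] = df["TotalBsmtSF"] + df["1stFlrSF"] + df["2ndFlrSF"]
# Combined bathrooms, counting a half bath as 0.5
df["TotalBath"] = df["FullBath"] + 0.5 * df["HalfBath"]
# Years since built and since last remodel, relative to the sale year
df["Age"] = df["YrSold"] - df["YearBuilt"]
df["RemodAge"] = df["YrSold"] - df["YearRemodAdd"]
# Binary presence flags
df["HasGarage"] = (df["GarageArea"] > 0).astype(int)
df["HasPool"] = (df["PoolArea"] > 0).astype(int)
```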
- Used XGBoost Regressor (`XGBRegressor`)
- Applied log-transformation to the target (`SalePrice`) for normalization
- Evaluated with cross-validation RMSE
- Generated final predictions and saved to `submission.csv`
- Model Used: XGBoost
- Cross-Validation RMSE: 0.1270
- Target Transform: `log1p` + `expm1` for `SalePrice`
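The round trip behind the target transform can be checked directly: `np.expm1` is the exact inverse of `np.log1p`, so predictions made in log space map cleanly back to the original price scale.

```python
import numpy as np

# log1p(x) = ln(1 + x); expm1(y) = exp(y) - 1, its inverse
price = 208500.0
recovered = np.expm1(np.log1p(price))
```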
| File | Description |
|---|---|
| `House_Price_Prediction.ipynb` | Full Colab notebook |
| `train.csv` / `test.csv` | Dataset files |
| `submission.csv` | Final prediction file |
- Real-world handling of missing data
- Importance of feature engineering in boosting model performance
- How to combine preprocessing and modeling pipelines effectively
- Power of XGBoost in tabular regression tasks