🏡 House Prices - Advanced Regression Techniques


Predicting residential home prices in Ames, Iowa using advanced regression techniques

Kaggle Competition · Kaggle Notebook · Documentation


📊 Performance

| Metric | Score |
|--------|-------|
| Public Leaderboard | 0.12029 |
| Evaluation Metric | RMSE (log scale) |
| Model Architecture | Stacking ensemble |

🎯 Project Overview

This repository contains a comprehensive solution for the House Prices: Advanced Regression Techniques competition on Kaggle. The project implements sophisticated feature engineering techniques and ensemble modeling to predict residential home prices with high accuracy.

Competition Context

With 79 explanatory variables describing various aspects of residential homes in Ames, Iowa, this competition challenges participants to predict the final sale price of each property. The dataset offers a modernized alternative to the classic Boston Housing dataset, providing rich opportunities for creative feature engineering and advanced modeling techniques.


🚀 Key Features

Feature Engineering

  • YrBltAndRemod: Combined year built and remodel information
  • TotalSF: Total square footage across all floors
  • Total_sqr_footage: Comprehensive basement and floor area calculation
  • Total_Bathrooms: Weighted bathroom count (full + 0.5 × half baths)
  • Total_porch_sf: Aggregate porch and deck square footage
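The TotalSF and Total_Bathrooms constructions are shown in the Methodology section below; the other three engineered features can be built the same way. A minimal sketch using standard Ames column names (the exact components chosen in the notebook may differ):

```python
import pandas as pd

# One-row stand-in using standard Ames column names.
df = pd.DataFrame({
    "YearBuilt": [1961], "YearRemodAdd": [1978],
    "BsmtFinSF1": [468.0], "BsmtFinSF2": [144.0],
    "1stFlrSF": [896], "2ndFlrSF": [0],
    "OpenPorchSF": [0], "EnclosedPorch": [0], "3SsnPorch": [0],
    "ScreenPorch": [120], "WoodDeckSF": [140],
})

# Combined build/remodel year signal
df["YrBltAndRemod"] = df["YearBuilt"] + df["YearRemodAdd"]

# Finished basement area plus above-ground floor area
df["Total_sqr_footage"] = (df["BsmtFinSF1"] + df["BsmtFinSF2"]
                           + df["1stFlrSF"] + df["2ndFlrSF"])

# All porch types plus the wood deck
df["Total_porch_sf"] = (df["OpenPorchSF"] + df["EnclosedPorch"]
                        + df["3SsnPorch"] + df["ScreenPorch"]
                        + df["WoodDeckSF"])
```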

Data Processing

  • Intelligent Missing Value Handling: Neighborhood-based imputation for LotFrontage
  • Categorical Encoding: OneHotEncoder with drop='first' to avoid multicollinearity
  • Comprehensive Preprocessing: Separate handling for numeric and categorical features
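Neighborhood-based imputation for LotFrontage can be sketched as below; the group median is an assumption, as the notebook may use a different statistic:

```python
import pandas as pd

# Toy frame; in the notebook this would be the combined train/test data.
df = pd.DataFrame({
    "Neighborhood": ["NAmes", "NAmes", "NAmes", "CollgCr", "CollgCr"],
    "LotFrontage": [70.0, None, 80.0, 60.0, None],
})

# Fill each missing LotFrontage with the median frontage of its neighborhood.
df["LotFrontage"] = df.groupby("Neighborhood")["LotFrontage"].transform(
    lambda s: s.fillna(s.median())
)
```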

Model Architecture

```
┌─────────────────────────────────────┐
│       Stacking Regressor            │
├─────────────────────────────────────┤
│  Base Models:                       │
│  • Ridge Regression (α=15)          │
│  • XGBoost Regressor (tuned)        │
├─────────────────────────────────────┤
│  Meta-Model:                        │
│  • Linear Regression                │
└─────────────────────────────────────┘
```

XGBoost Hyperparameters

```python
{
    'max_depth': 4,
    'learning_rate': 0.00875,
    'n_estimators': 3515,
    'min_child_weight': 2,
    'colsample_bytree': 0.205,
    'subsample': 0.404,
    'reg_alpha': 0.330,
    'reg_lambda': 0.046
}
```

📁 Repository Structure

```
House-Prices-Advanced-Regression-Techniques/
│
├── house-prices-advanced-regression-techniques.ipynb
│   └── Complete analysis and model training notebook
│
├── submission.csv
│   └── Best-scoring predictions (public score: 0.12029)
│
└── README.md
    └── Project documentation
```

🛠️ Technologies Used

| Category | Tools |
|----------|-------|
| Language | Python 3.8+ |
| Data Processing | pandas, NumPy |
| Machine Learning | scikit-learn, XGBoost |
| Feature Engineering | OneHotEncoder, custom transformations |
| Model Ensemble | StackingRegressor |

📈 Methodology

1. Data Preprocessing

  • Load training and test datasets
  • Combine datasets for consistent preprocessing
  • Handle missing values with domain-specific strategies
  • Create engineered features
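The combine-then-split pattern keeps train and test preprocessing identical; a minimal sketch with toy data in place of the real CSVs:

```python
import pandas as pd

# Toy stand-ins for train.csv / test.csv.
train = pd.DataFrame({"LotArea": [8450, 9600], "SalePrice": [208500, 181500]})
test = pd.DataFrame({"LotArea": [11622]})

y = train.pop("SalePrice")                        # set the target aside
full = pd.concat([train, test], ignore_index=True)

# ...apply all imputation and feature engineering to `full` here...

X_train = full.iloc[:len(y)]                      # split back out by row count
X_test = full.iloc[len(y):]
```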

2. Feature Engineering

```python
# Example: total square footage
df['TotalSF'] = df['TotalBsmtSF'] + df['1stFlrSF'] + df['2ndFlrSF']

# Example: total bathrooms (half baths weighted 0.5)
df['Total_Bathrooms'] = (df['FullBath'] + 0.5 * df['HalfBath']
                         + df['BsmtFullBath'] + 0.5 * df['BsmtHalfBath'])
```

3. Encoding

  • OneHotEncoder for categorical variables
  • Drop first category to prevent multicollinearity
  • Preserve numeric features as-is
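These three points map naturally onto a ColumnTransformer; a minimal sketch (toy columns, not the notebook's exact pipeline):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({
    "MSZoning": ["RL", "RM", "RL"],   # categorical
    "GrLivArea": [1710, 1262, 1786],  # numeric, passed through unchanged
})

pre = ColumnTransformer(
    [("cat", OneHotEncoder(drop="first"), ["MSZoning"])],
    remainder="passthrough",          # numeric features preserved as-is
)
X = pre.fit_transform(df)
# Two categories collapse to one dummy column; GrLivArea is appended unchanged.
```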

4. Model Training

  • Ridge Regression with L2 regularization (α=15)
  • XGBoost with optimized hyperparameters
  • Stacking ensemble with Linear Regression as meta-model
  • Target transformed with log1p so training aligns with the log-scale evaluation metric

5. Prediction

  • Generate predictions on test set
  • Apply expm1 to invert the log1p transform back to dollar prices
  • Create submission file in required format
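Steps 4 and 5 hinge on the log1p/expm1 pair; a minimal sketch with synthetic data and the Ridge base model (the full stacking ensemble is substituted for brevity):

```python
import numpy as np
from sklearn.linear_model import Ridge

# Tiny synthetic stand-in for the processed Ames features and prices.
rng = np.random.default_rng(0)
X = rng.uniform(500, 4000, size=(200, 1))          # e.g. living area
y = 50 * X[:, 0] + rng.normal(0, 5000, size=200)   # sale prices

model = Ridge(alpha=15)
model.fit(X, np.log1p(y))             # train on log1p(price)
preds = np.expm1(model.predict(X))    # invert back to dollars
```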

🔧 Installation & Usage

Prerequisites

```bash
pip install pandas numpy scikit-learn xgboost
```

Quick Start

1. Clone the repository

   ```bash
   git clone https://github.com/Adinath-Jagtap/House-Prices-Advanced-Regression-Techniques.git
   cd House-Prices-Advanced-Regression-Techniques
   ```

2. Download the dataset

   • Visit the Kaggle competition page
   • Download train.csv, test.csv, sample_submission.csv, and data_description.txt
   • Place the files in the data/ directory

3. Run the notebook

   ```bash
   jupyter notebook house-prices-advanced-regression-techniques.ipynb
   ```

4. Generate predictions. The notebook automatically creates submission.csv with the model's predictions.

📊 Model Performance Analysis

Evaluation Metric

Root Mean Squared Error (RMSE) on logarithmic scale:

```
RMSE = sqrt(mean((log(predicted) - log(actual))²))
```

This metric ensures errors in predicting expensive and inexpensive houses are weighted equally.
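That property is easy to verify directly (a hypothetical helper, not code from the notebook): a 10% over-prediction incurs the same penalty on a $100k house as on a $1M house.

```python
import numpy as np

def rmse_log(predicted, actual):
    """RMSE between log-scale sale prices, as used by the competition."""
    predicted, actual = np.asarray(predicted), np.asarray(actual)
    return np.sqrt(np.mean((np.log(predicted) - np.log(actual)) ** 2))

# A 10% over-prediction costs the same whether the house is cheap or expensive.
cheap = rmse_log([110_000], [100_000])
pricey = rmse_log([1_100_000], [1_000_000])
```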

Performance Breakdown

  • Ridge Regression: Provides stable baseline with regularization
  • XGBoost: Captures non-linear relationships and interactions
  • Stacking: Combines strengths of both models for optimal performance

🎓 Key Learnings

Feature Engineering

  • Domain knowledge significantly improves prediction accuracy
  • Combining related features often creates more predictive variables
  • Proper handling of missing values is crucial for model performance

Model Selection

  • Ensemble methods consistently outperform individual models
  • Regularization prevents overfitting on high-dimensional data
  • Hyperparameter tuning is essential for XGBoost performance

Competition Strategy

  • Log transformation of target variable improves RMSE
  • Stacking different model types captures diverse patterns
  • Careful preprocessing ensures consistent train/test predictions

🔍 Future Improvements

  • Implement cross-validation for robust performance estimation
  • Explore additional feature interactions
  • Test alternative ensemble methods (e.g., Voting, Blending)
  • Incorporate neural network models
  • Add feature selection techniques
  • Implement automated hyperparameter tuning (Optuna, GridSearchCV)

📚 References

Dataset

The Ames Housing dataset was compiled by Dean De Cock for data science education as a modernized alternative to the Boston Housing dataset.


👤 Author

Adinath Jagtap


🙏 Acknowledgments

  • Kaggle for hosting the competition and providing the platform
  • Dean De Cock for compiling the Ames Housing dataset
  • DataCanary and Anna Montoya for competition organization
  • The Kaggle community for valuable discussions and insights

📞 Contact

For questions, suggestions, or collaboration opportunities:


If you find this project helpful, please consider giving it a ⭐!

Made with ❤️ for the Data Science Community
