Predicting residential home prices in Ames, Iowa, using advanced regression techniques
| Metric | Score |
|---|---|
| Public Leaderboard | 0.12029 |
| Evaluation Metric | RMSE (Log Scale) |
| Model Architecture | Stacking Ensemble |
This repository contains a comprehensive solution for the House Prices: Advanced Regression Techniques competition on Kaggle. The project implements sophisticated feature engineering techniques and ensemble modeling to predict residential home prices with high accuracy.
With 79 explanatory variables describing various aspects of residential homes in Ames, Iowa, this competition challenges participants to predict the final sale price of each property. The dataset offers a modernized alternative to the classic Boston Housing dataset, providing rich opportunities for creative feature engineering and advanced modeling techniques.
- YrBltAndRemod: Combined year built and remodel information
- TotalSF: Total square footage across all floors
- Total_sqr_footage: Comprehensive basement and floor area calculation
- Total_Bathrooms: Weighted bathroom count (full + 0.5 × half baths)
- Total_porch_sf: Aggregate porch and deck square footage
- Intelligent Missing Value Handling: Neighborhood-based imputation for LotFrontage
- Categorical Encoding: OneHotEncoder with drop='first' to avoid multicollinearity
- Comprehensive Preprocessing: Separate handling for numeric and categorical features
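The neighborhood-based imputation for LotFrontage can be sketched as below. This is a minimal illustration on a toy frame; the grouping column names match the Ames dataset, but the use of the per-neighborhood median is an assumption about the exact imputation rule.

```python
import pandas as pd

# Toy stand-in for the Ames data (real column names, invented values).
df = pd.DataFrame({
    "Neighborhood": ["NAmes", "NAmes", "CollgCr", "CollgCr", "CollgCr"],
    "LotFrontage": [70.0, None, 60.0, None, 80.0],
})

# Fill each missing LotFrontage with the median of its own neighborhood
# (assumed statistic), rather than a single global value.
df["LotFrontage"] = df.groupby("Neighborhood")["LotFrontage"].transform(
    lambda s: s.fillna(s.median())
)
print(df["LotFrontage"].tolist())  # → [70.0, 70.0, 60.0, 70.0, 80.0]
```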
```
┌─────────────────────────────────────┐
│         Stacking Regressor          │
├─────────────────────────────────────┤
│ Base Models:                        │
│  • Ridge Regression (α=15)          │
│  • XGBoost Regressor (tuned)        │
├─────────────────────────────────────┤
│ Meta-Model:                         │
│  • Linear Regression                │
└─────────────────────────────────────┘
```
```python
{
    'max_depth': 4,
    'learning_rate': 0.00875,
    'n_estimators': 3515,
    'min_child_weight': 2,
    'colsample_bytree': 0.205,
    'subsample': 0.404,
    'reg_alpha': 0.330,
    'reg_lambda': 0.046
}
```

```
House-Prices-Advanced-Regression-Techniques/
│
├── house-prices-advanced-regression-techniques.ipynb
│   └── Complete analysis and model training notebook
│
├── submission.csv
│   └── Best scoring predictions (Public Score: 0.12029)
│
└── README.md
    └── Project documentation
```
| Category | Tools |
|---|---|
| Language | Python 3.8+ |
| Data Processing | pandas, NumPy |
| Machine Learning | scikit-learn, XGBoost |
| Feature Engineering | OneHotEncoder, Custom transformations |
| Model Ensemble | StackingRegressor |
- Load training and test datasets
- Combine datasets for consistent preprocessing
- Handle missing values with domain-specific strategies
- Create engineered features
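The load-and-combine steps above can be sketched as follows, on tiny stand-in frames (the real CSVs carry 79 explanatory variables):

```python
import pandas as pd

# Tiny stand-ins for train.csv / test.csv.
train = pd.DataFrame({
    "Id": [1, 2], "LotArea": [8450, 9600], "SalePrice": [208500, 181500],
})
test = pd.DataFrame({"Id": [3], "LotArea": [11250]})

# Combine for consistent preprocessing; remember the split point so the
# two sets can be separated again after feature engineering.
n_train = len(train)
full = pd.concat([train.drop(columns="SalePrice"), test], ignore_index=True)
# full.iloc[:n_train] is train, full.iloc[n_train:] is test.
```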
```python
# Example: Total square footage
df['TotalSF'] = df['TotalBsmtSF'] + df['1stFlrSF'] + df['2ndFlrSF']

# Example: Total bathrooms (weighted)
df['Total_Bathrooms'] = (df['FullBath'] + 0.5 * df['HalfBath'] +
                         df['BsmtFullBath'] + 0.5 * df['BsmtHalfBath'])
```

- OneHotEncoder for categorical variables
- Drop first category to prevent multicollinearity
- Preserve numeric features as-is
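The encoding setup can be sketched with a `ColumnTransformer` (a minimal example on invented data; the real pipeline covers many more columns):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({
    "Neighborhood": ["NAmes", "CollgCr", "NAmes"],
    "TotalSF": [1500.0, 2100.0, 1750.0],
})

pre = ColumnTransformer([
    # drop='first' removes one dummy per category to avoid multicollinearity.
    ("cat", OneHotEncoder(drop="first"), ["Neighborhood"]),
    # Numeric features pass through unchanged.
    ("num", "passthrough", ["TotalSF"]),
])
X = pre.fit_transform(df)
print(X.shape)  # → (3, 2): one dummy column + TotalSF
```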
- Ridge Regression with L2 regularization (α=15)
- XGBoost with optimized hyperparameters
- Stacking ensemble with Linear Regression as meta-model
- Target transformation using log1p to align training with the log-scale evaluation metric
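The target transformation can be sketched like this (toy data; Ridge alone stands in for the full ensemble):

```python
import numpy as np
from sklearn.linear_model import Ridge

# Invented square footages and sale prices for illustration.
X = np.array([[1500.0], [2100.0], [1750.0], [2400.0]])
y = np.array([150000.0, 230000.0, 180000.0, 260000.0])

model = Ridge(alpha=15)
model.fit(X, np.log1p(y))              # train against log1p(SalePrice)
price_pred = np.expm1(model.predict(X))  # invert back to dollars
```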
- Generate predictions on test set
- Apply expm1 to reverse the log1p transformation of the target
- Create submission file in required format
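Those final steps can be sketched as below; the log-scale predictions and test `Id` values here are hypothetical placeholders.

```python
import numpy as np
import pandas as pd

log_preds = np.array([11.8, 12.1, 12.4])  # hypothetical log-scale predictions
test_ids = [1461, 1462, 1463]             # hypothetical test Ids

# Reverse the log1p transform applied to SalePrice during training.
sale_price = np.expm1(log_preds)

# Kaggle expects two columns: Id and SalePrice.
submission = pd.DataFrame({"Id": test_ids, "SalePrice": sale_price})
submission.to_csv("submission.csv", index=False)
```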
```bash
pip install pandas numpy scikit-learn xgboost
```

- Clone the repository

```bash
git clone https://github.com/Adinath-Jagtap/House-Prices-Advanced-Regression-Techniques.git
cd House-Prices-Advanced-Regression-Techniques
```

- Download the dataset
- Visit the Kaggle competition page
- Download `train.csv`, `test.csv`, `sample_submission.csv`, and `data_description.txt`
- Place the files in the `data/` directory
- Run the notebook
```bash
jupyter notebook house-prices-advanced-regression-techniques.ipynb
```

- Generate predictions

The notebook will automatically create `submission.csv` with predictions.
Root Mean Squared Error (RMSE) on the logarithmic scale:

```
RMSE = sqrt(mean((log(predicted) - log(actual))²))
```
This metric ensures errors in predicting expensive and inexpensive houses are weighted equally.
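A quick way to see the equal weighting: the same 10% overshoot contributes an identical log-scale error at any price level.

```python
import numpy as np

def rmse_log(predicted, actual):
    """RMSE between log prices, as used on the leaderboard."""
    return np.sqrt(np.mean((np.log(predicted) - np.log(actual)) ** 2))

# A 10% overshoot on a cheap house vs. an expensive one (invented prices):
cheap = rmse_log(np.array([110_000.0]), np.array([100_000.0]))
pricey = rmse_log(np.array([550_000.0]), np.array([500_000.0]))
# Both equal |log(1.1)|, so the errors are weighted identically.
```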
- Ridge Regression: Provides stable baseline with regularization
- XGBoost: Captures non-linear relationships and interactions
- Stacking: Combines strengths of both models for optimal performance
- Domain knowledge significantly improves prediction accuracy
- Combining related features often creates more predictive variables
- Proper handling of missing values is crucial for model performance
- Ensemble methods consistently outperform individual models
- Regularization prevents overfitting on high-dimensional data
- Hyperparameter tuning is essential for XGBoost performance
- Log transformation of target variable improves RMSE
- Stacking different model types captures diverse patterns
- Careful preprocessing ensures consistent train/test predictions
- Implement cross-validation for robust performance estimation
- Explore additional feature interactions
- Test alternative ensemble methods (e.g., Voting, Blending)
- Incorporate neural network models
- Add feature selection techniques
- Implement automated hyperparameter tuning (Optuna, GridSearchCV)
The Ames Housing dataset was compiled by Dean De Cock for data science education as a modernized alternative to the Boston Housing dataset.
Adinath Jagtap
- GitHub: @Adinath-Jagtap
- Kaggle: @adinathjagtap777
- Kaggle for hosting the competition and providing the platform
- Dean De Cock for compiling the Ames Housing dataset
- DataCanary and Anna Montoya for competition organization
- The Kaggle community for valuable discussions and insights
For questions, suggestions, or collaboration opportunities:
If you find this project helpful, please consider giving it a ⭐!
Made with ❤️ for the Data Science Community