The objective of this project is to build a machine learning model that predicts house prices based on various features such as lot size, building type, year built, and more. This project aims to assist potential buyers and sellers in making informed decisions by providing accurate price predictions.
The dataset used for this project contains 13 features and 2919 records. It includes the following key attributes:
- Id: Record identifier.
- MSSubClass: Type of dwelling involved in the sale.
- MSZoning: General zoning classification of the sale.
- LotArea: Lot size in square feet.
- LotConfig: Configuration of the lot.
- BldgType: Type of dwelling.
- OverallCond: Overall condition of the house.
- YearBuilt: Original construction year.
- YearRemodAdd: Remodel date.
- Exterior1st: Exterior covering on the house.
- BsmtFinSF2: Type 2 finished square feet.
- TotalBsmtSF: Total square feet of basement area.
- SalePrice: Target variable to be predicted.
- Categorization of Features: Features were categorized based on their data types (categorical, integer, float).
  - Categorical variables: 4
  - Integer variables: 6
  - Float variables: 3
- Handling Missing Values:
  - Columns irrelevant to prediction (e.g., `Id`) were dropped.
  - Missing values in `SalePrice` were replaced with the column mean, with the aim of keeping the data distribution symmetric.
  - Records with null values in other features were dropped if their count was minimal.
- OneHotEncoding: Categorical features were converted into binary vectors using OneHotEncoder to make them suitable for machine learning models.
Exploratory data analysis (EDA) was conducted to uncover patterns and relationships in the data:
- Heatmap: A correlation heatmap was created using the Seaborn library to identify relationships between features and the target variable (`SalePrice`).
- Barplots: Barplots were used to analyze the distribution of categorical features like `Exterior1st`, which has 16 unique categories. This helped in understanding the frequency of each category.
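As a rough sketch of these two EDA steps, the snippet below computes the correlation matrix and the category counts that the plots are based on, then wraps the Seaborn calls in a helper. The data is synthetic and the helper name `plot_eda` is hypothetical; the project's notebook may structure this differently.

```python
import pandas as pd

# Synthetic sample with a few of the dataset's columns (values are made up).
df = pd.DataFrame({
    "LotArea": [8450, 9600, 11250, 9550, 14260, 14115],
    "YearBuilt": [2003, 1976, 2001, 1915, 2000, 1993],
    "Exterior1st": ["VinylSd", "MetalSd", "VinylSd", "Wd Sdng", "VinylSd", "MetalSd"],
    "SalePrice": [208500, 181500, 223500, 140000, 250000, 143000],
})

# Pairwise correlations between numeric columns: the table a heatmap renders.
corr = df.select_dtypes(include="number").corr()

# Frequency of each category in Exterior1st: the heights of the bars.
counts = df["Exterior1st"].value_counts()


def plot_eda(corr, counts):
    """Draw the heatmap and barplot side by side (needs matplotlib and seaborn)."""
    import matplotlib.pyplot as plt
    import seaborn as sns

    fig, (ax_heat, ax_bar) = plt.subplots(1, 2, figsize=(14, 5))
    sns.heatmap(corr, annot=True, fmt=".2f", cmap="BrBG", ax=ax_heat)
    sns.barplot(x=counts.index, y=counts.values, ax=ax_bar)
    ax_bar.set_title("Exterior1st category counts")
    ax_bar.tick_params(axis="x", rotation=90)
    return fig
```

Calling `plot_eda(corr, counts)` returns a Matplotlib figure that can be shown in a notebook or saved with `fig.savefig(...)`.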
The following regression models were used to predict house prices:
- Support Vector Machine (SVM):
  - SVM was used for regression by finding the optimal hyperplane in an n-dimensional space.
  - Mean Absolute Percentage Error (MAPE): 0.18705129
- Random Forest Regressor:
  - An ensemble technique that uses multiple decision trees for regression.
  - MAPE: 0.1929469
- Linear Regression:
  - A simple regression model that predicts the dependent variable (`SalePrice`) based on independent features.
  - MAPE: 0.187416838
- CatBoost Regressor:
  - A gradient boosting algorithm optimized for categorical data.
  - MAPE: 0.383511698
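A comparison like the one above could be set up along these lines. This is a sketch under stated assumptions: the data is randomly generated, the hyperparameters and the 80/20 split are illustrative choices, and CatBoost is left out so that the example depends only on scikit-learn. The scores it produces are unrelated to the figures reported for the project.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_percentage_error
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR

# Synthetic regression data standing in for the preprocessed features.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(200, 3))
y = 100_000 + 50_000 * X[:, 0] + 30_000 * X[:, 1] + rng.normal(0, 5_000, size=200)

# Hold out part of the data for validation (split ratio is an assumption).
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, train_size=0.8, random_state=0
)

models = {
    "SVR": SVR(),
    "RandomForest": RandomForestRegressor(n_estimators=10, random_state=0),
    "LinearRegression": LinearRegression(),
}

# Fit each model and score it with MAPE on the validation split.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    preds = model.predict(X_valid)
    scores[name] = mean_absolute_percentage_error(y_valid, preds)

# The model with the lowest MAPE is the most accurate by this metric.
best = min(scores, key=scores.get)
```

Note that `mean_absolute_percentage_error` returns a fraction rather than a percentage, which matches the scale of the values listed above.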
The Support Vector Machine (SVM) model achieved the best performance, with the lowest Mean Absolute Percentage Error (MAPE) of 0.187. By this metric it is the most accurate of the models tested for predicting house prices, although Linear Regression is nearly identical (0.18742 versus 0.18705). Further improvements may be possible using ensemble techniques like Bagging and Boosting.
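For reference, the standard fractional definition of MAPE is given below. The source does not state which formula was used, so this is an assumption, though it is consistent with the scale of the reported values. Here $y_i$ is the actual sale price, $\hat{y}_i$ the predicted price, and $n$ the number of validation records.

```latex
\mathrm{MAPE} = \frac{1}{n} \sum_{i=1}^{n} \left| \frac{y_i - \hat{y}_i}{y_i} \right|
```

Under this definition, a value of 0.187 means the predictions deviate from the actual prices by about 18.7% on average.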
For a detailed view of the analysis and visualizations, you can access the Jupyter Notebook here.