A complete machine learning solution for predicting house prices using the Ames Housing dataset. This project implements an end-to-end pipeline from data preprocessing to web deployment, achieving competitive performance on the Kaggle leaderboard.
- Data Processing Pipeline: Automated cleaning and feature engineering for optimal model performance
- Multiple ML Models: Linear Regression and XGBoost implementations with performance comparison
- Interactive Web Application: Streamlit-based interface for real-time predictions and custom model training
- Production Ready: Docker containerization and comprehensive evaluation metrics
- Extensible Architecture: Modular design supporting easy integration of new models and features
├── 📂 data/ # Datasets and description
│ ├── 📄 train.csv # Training data
│ ├── 📄 test.csv # Test data
│ ├── 📄 new_train.csv # Processed training data
│ ├── 📄 new_test.csv # Processed test data
│ └── 📄 data_description.txt
│
├── 📂 models/ # Trained models and schemas
│ ├── 📦 linear_regression_model.pkl
│ └── 📄 schemas.py # Pydantic data validation
│
├── 📂 configs/ # Application configuration
│ └── ⚙️ config.py
│
├── 📂 views/ # Streamlit application components
│ ├── 🏡 house_price.py # Main prediction interface
│ ├── 🛠️ custom_linear_app.py # Custom model training
│ └── 🚀 custom_xgboost.py # XGBoost implementation
│
├── 📓 data_cleaning_and_feature_engineering.ipynb
├── 🚀 main.py # Application entry point
├── 📦 requirements.txt # Dependencies
├── 🐳 Dockerfile # Container configuration
└── 🖥️ start.sh # Launch script
- Python 3.8+
- pip or conda
- Kaggle account (for dataset access)
- Clone the repository
git clone <repository-url>
cd Kaggle_Advanced_House_Prices
- Set up environment
| Model | MAE | MSE | RMSE |
|---|---|---|---|
| Linear Regression | 19,452.09 | 711,102,117.35 | 26,666.50 |
| XGBoost | 16,483.55 | 578,007,680.00 | 24,041.79 |
XGBoost outperforms Linear Regression on all three metrics, lowering MAE by about 15%, MSE by about 19%, and RMSE by about 10%.
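For reference, the three metrics in the table can be computed as in the sketch below. It uses only the standard library; the function name `regression_metrics` and the sample prices are illustrative assumptions and are not taken from this project's code or data.

```python
import math
from typing import Dict, Sequence


def regression_metrics(y_true: Sequence[float], y_pred: Sequence[float]) -> Dict[str, float]:
    """Return MAE, MSE and RMSE for paired true/predicted values."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and of equal length")
    errors = [p - t for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / len(errors)
    mse = sum(e * e for e in errors) / len(errors)
    # RMSE is the square root of MSE, so it is in the same unit as the prices.
    return {"MAE": mae, "MSE": mse, "RMSE": math.sqrt(mse)}


# Illustrative prices only, not project data.
metrics = regression_metrics([200000.0, 150000.0], [210000.0, 140000.0])
print(metrics)
```

Because MSE squares each error, it penalizes large misses more heavily than MAE does, which is why both are worth reporting side by side.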
python3.11 -m venv .venv
source .venv/bin/activate
Or alternatively:
conda env create -f environment.yml
conda activate kaggle-house-prices
conda install ipykernel
python -m ipykernel install --user --name loan-approval --display-name "Loan Approval"
- In VSCode, press Ctrl+Shift+P and select "Python: Select Interpreter", then choose the "Loan Approval" interpreter.
- Once you open the Jupyter Notebook, it should automatically use the "Loan Approval" kernel. If it does not, restart VSCode. If that does not help, select the kernel manually by clicking the kernel picker in the top right corner of the notebook and choosing "Loan Approval". You will most likely find it in the Jupyter kernel list.
On Windows:
python -m venv .venv
.\.venv\Scripts\activate
Then install the dependencies:
pip install -r requirements.txt
Build and start the app using Docker:
docker build -t streamlit_app .
docker run --rm -p 8501:8501 streamlit_app
Or use the provided shell script:
bash start.sh
To use Kaggle datasets or APIs, you need to set up your Kaggle credentials:
- 🔗 Go to your Kaggle account settings: https://www.kaggle.com/account.
- 🛡️ Scroll down to the API section and click Create New API Token.
- 📥 This will download a file named kaggle.json.
- 📂 Place kaggle.json in the folder:
  - 🐧 Linux/macOS: ~/.kaggle/ or sometimes in ~/.config/kaggle/kaggle.json
  - 🪟 Windows: C:\Users\<YourUsername>\.kaggle\
- 🔒 Make sure the file permissions are secure (Linux/macOS):
chmod 600 ~/.kaggle/kaggle.json
- To download the dataset, you must first join the competition and accept its rules.
- 📦 You can now use the Kaggle CLI to download datasets:
kaggle competitions download -c house-prices-advanced-regression-techniques -p data/
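If the download fails with an authentication error, it can help to confirm that kaggle.json is where the steps above put it. The following is a minimal sketch using only the standard library; the helper names `find_kaggle_token` and `is_private` are hypothetical and are not part of the Kaggle CLI.

```python
import os
import stat
from pathlib import Path
from typing import List, Optional


def find_kaggle_token(candidates: Optional[List[Path]] = None) -> Optional[Path]:
    """Return the first existing kaggle.json among the candidate paths."""
    if candidates is None:
        home = Path.home()
        # The two locations mentioned in the setup steps above.
        candidates = [
            home / ".kaggle" / "kaggle.json",
            home / ".config" / "kaggle" / "kaggle.json",
        ]
    for path in candidates:
        if path.is_file():
            return path
    return None


def is_private(path: Path) -> bool:
    """True if group and others have no permissions (the result of chmod 600).

    Only meaningful on Linux/macOS; Windows does not use POSIX mode bits.
    """
    mode = stat.S_IMODE(path.stat().st_mode)
    return mode & (stat.S_IRWXG | stat.S_IRWXO) == 0


token = find_kaggle_token()
if token is None:
    print("kaggle.json not found in the usual locations")
elif os.name == "posix" and not is_private(token):
    print(f"{token} is accessible to other users; run: chmod 600 {token}")
else:
    print(f"Found credentials at {token}")
```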
- Open and run data_cleaning_and_feature_engineering.ipynb to preprocess the raw data and engineer features.
- Then continue in the notebook to train a Linear Regression model on this difficult case.
- Use scripts or the notebook to train and save models in models/ (a default model is provided). Note: running the data_cleaning_and_feature_engineering.ipynb notebook trains the model and saves it automatically.
- Run the following command, or use Docker:
streamlit run main.py
- Use the sidebar to select between house price prediction and custom regression modules.
- Train your own model on your own data using the custom regression module.
Virtual Environment:
python -m venv .venv
source .venv/bin/activate # Windows: .venv\Scripts\activate
pip install -r requirements.txt
Conda Environment:
conda env create -f environment.yml
conda activate kaggle-house-prices
- Download dataset (requires Kaggle account)
kaggle competitions download -c house-prices-advanced-regression-techniques -p data/
- Run the application
streamlit run main.py
# Build and run with Docker
docker build -t house-prices-app .
docker run --rm -p 8501:8501 house-prices-app
Scored using the competition metric: RMSE on log(SalePrice).
- submission_blended.csv: 0.12234 (best) ✨
- submission_outlier.csv: 0.12442
- submission.csv (base): 0.14783
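The competition metric named above, RMSE on log(SalePrice), can be sketched as follows. This is an illustration of the metric's definition, not the project's evaluation code; the function name `rmse_log` and the sample prices are assumptions.

```python
import math
from typing import Sequence


def rmse_log(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """RMSE between log(actual) and log(predicted); all prices must be positive."""
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(math.log(p) - math.log(a)) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


# Illustrative values only: every prediction is 10% too high.
score = rmse_log([100000.0, 200000.0], [110000.0, 220000.0])
print(round(score, 5))
```

Because the error is taken in log space, a 10% miss on an inexpensive house counts the same as a 10% miss on an expensive one, and lower scores are better.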
- House Price Prediction: Input property features to get price predictions
- Custom Model Training: Upload your own dataset and train models interactively
- Model Comparison: Compare performance between different algorithms
Run data_cleaning_and_feature_engineering.ipynb to:
- Explore the dataset through comprehensive EDA
- Apply feature engineering techniques
- Train and evaluate machine learning models
- Generate submission files for Kaggle
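For the last step, Kaggle expects a two-column CSV with an Id and a SalePrice per test row. A minimal sketch of writing such a file is shown below; the function name `write_submission`, the output file name, and the sample predictions are hypothetical and may differ from what the notebook does.

```python
import csv
from pathlib import Path
from typing import Iterable, Tuple


def write_submission(rows: Iterable[Tuple[int, float]], path: Path) -> None:
    """Write (Id, SalePrice) pairs as a CSV with an Id,SalePrice header."""
    with path.open("w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["Id", "SalePrice"])
        for house_id, price in rows:
            writer.writerow([house_id, f"{price:.2f}"])


# Hypothetical predictions; in practice these would come from a trained model.
predictions = [(1461, 120000.0), (1462, 155500.5)]
write_submission(predictions, Path("submission_example.csv"))
```

If a model is trained on log-transformed prices, its outputs need to be converted back to prices before they are written.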
Key configuration options in configs/config.py:
- Model file paths
- Feature definitions and default values
- Application settings and parameters
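To make the three groups of options concrete, here is one way such a configuration module could be laid out. This is a hypothetical sketch and not the contents of configs/config.py: the class name `AppConfig`, the field names, and the default values are all assumptions. Only the model path and port come from elsewhere in this README.

```python
from dataclasses import dataclass, field
from pathlib import Path
from typing import Dict


@dataclass(frozen=True)
class AppConfig:
    """Hypothetical settings container; field names are illustrative."""

    # Model file path (this file is listed in the project tree).
    linear_model_path: Path = Path("models/linear_regression_model.pkl")
    # Feature names mapped to default form values (values are made up).
    feature_defaults: Dict[str, float] = field(
        default_factory=lambda: {
            "overall_qual": 5,
            "gr_liv_area": 1500.0,
            "garage_cars": 2,
        }
    )
    # Application settings (8501 is the port used in the Docker commands).
    app_title: str = "House Price Prediction"
    port: int = 8501


config = AppConfig()
print(config.linear_model_path, config.port)
```

Keeping these values in one frozen object means views can read them without accidentally mutating shared settings.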
The application uses Pydantic schemas for data validation:
from typing import Optional, Tuple

from pydantic import BaseModel


# Input features validation
class HouseFeatures(BaseModel):
    overall_qual: int
    gr_liv_area: float
    garage_cars: float
    # ... additional features


# Prediction output
class PricePrediction(BaseModel):
    predicted_price: float
    confidence_interval: Optional[Tuple[float, float]]

- Fork the repository
- Create a feature branch (git checkout -b feature/amazing-feature)
- Commit your changes (git commit -m 'Add amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
- Kaggle for providing the dataset and competition platform
- The Ames Housing dataset compiled by Dean De Cock
- Streamlit for the excellent web framework
- The open-source community for the amazing machine learning libraries
For detailed setup and learning instructions, see:
- Student Instructions - Complete learning guide
- Environment Setup - Development environment
- Kaggle Setup - API and data access
- GitHub Setup - Version control setup