A machine learning project to predict annual salaries of professionals based on career, education, and workplace features. Built during the IBM PBEL Internship 2025 using the Stack Overflow survey dataset.
This project explores various factors that influence a software developer's salary. The goal is to build a robust regression model that can accurately predict salary based on education level, years of experience, job role, remote status, and more.
- Salary prediction using ensemble learning
- Data cleaning and category reduction
- Feature encoding with label encoders
- Model tuning with RandomizedSearchCV
- Feature importance analysis and visualizations
- Streamlit-compatible outputs for deployment
- Languages: Python
- Libraries: Pandas, NumPy, Scikit-learn, Seaborn, Matplotlib, Streamlit
- Modeling: RandomForestRegressor (tuned), RandomizedSearchCV
- Serialization: Pickle
- Source: Stack Overflow Developer Survey (via Kaggle)
- Years Covered: 2020–2023
- Records: 10,000+ salary entries
- Target Variable: `ConvertedCompYearly` (renamed to `Salary`)
- Removed incomplete or irrelevant rows (e.g., non-full-time workers)
- Handled missing values and outliers (salary bounds: $12,000–$250,000)
- Categorical encoding using `LabelEncoder` for:
  - Country
  - Education Level
  - Developer Type
  - Organization Size
  - Remote Work Level
- Applied `log1p` transformation on salary for normality
- Saved cleaned data and encoders to `.pkl` for reuse
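
The preprocessing steps above can be sketched roughly as follows. This is a minimal illustration on a tiny synthetic table, not the project's actual script: apart from `ConvertedCompYearly` and `Salary`, the column names (`EdLevel`, `LogSalary`) and all the values are assumptions made for the example.

```python
import pickle

import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Tiny synthetic stand-in for the survey data (illustrative values only).
raw = pd.DataFrame({
    "Country": ["India", "USA", "USA", "Germany", "India"],
    "EdLevel": ["Bachelor", "Master", "Bachelor", "Master", None],  # assumed column name
    "ConvertedCompYearly": [15_000, 120_000, 900_000, 70_000, 20_000],
})

# Rename the target, drop incomplete rows, and keep salaries within the stated bounds.
clean = raw.rename(columns={"ConvertedCompYearly": "Salary"}).dropna()
clean = clean[clean["Salary"].between(12_000, 250_000)].copy()

# Fit one LabelEncoder per categorical column and keep it for reuse at prediction time.
encoders = {}
for col in ["Country", "EdLevel"]:
    enc = LabelEncoder()
    clean[col] = enc.fit_transform(clean[col])
    encoders[col] = enc

# log1p compresses the long right tail of salaries; expm1 reverses it later.
clean["LogSalary"] = np.log1p(clean["Salary"])  # assumed column name

# Serialize data and encoders together; the bytes could be written to a .pkl file.
payload = pickle.dumps({"data": clean, "encoders": encoders})
restored = pickle.loads(payload)
```

Keeping the fitted encoders alongside the data matters because the same category-to-integer mapping has to be applied to any new input before prediction.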
- Model Used: Tuned `RandomForestRegressor`
- Tuning Method: `RandomizedSearchCV` with 30 iterations
- Train-Test Split: 80-20 ratio
- Evaluation Metrics:
  - R² Score: ~0.89
  - RMSE: ~$15,000
- Feature Importance: Visualized and exported
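
As a rough sketch of the training setup listed above, the snippet below runs a 30-iteration randomized search over a random forest on an 80-20 split and scores it on the original salary scale. The data is synthetic, and the search space, `cv=3`, and the random seeds are assumptions for illustration; the README does not state the project's actual hyperparameter ranges or whether its metrics were computed on the log or dollar scale.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic, label-encoded-style features and a salary driven mostly by feature 0.
rng = np.random.default_rng(0)
X = rng.integers(0, 5, size=(200, 5)).astype(float)
salary = 40_000 + 15_000 * X[:, 0] + 5_000 * X[:, 1] + rng.normal(0, 2_000, 200)
y = np.log1p(salary)  # train on the log1p-transformed target

# 80-20 train-test split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Hypothetical search space (81 combinations, 30 of which are sampled).
param_distributions = {
    "n_estimators": [20, 40, 60],
    "max_depth": [None, 5, 10],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
}
search = RandomizedSearchCV(
    RandomForestRegressor(random_state=42),
    param_distributions,
    n_iter=30,
    cv=3,
    scoring="r2",
    random_state=42,
)
search.fit(X_train, y_train)

# Undo the log1p transform before scoring so RMSE is expressed in dollars.
pred = np.expm1(search.best_estimator_.predict(X_test))
actual = np.expm1(y_test)
r2 = r2_score(actual, pred)
rmse = np.sqrt(mean_squared_error(actual, pred))

# Impurity-based importances, one value per feature, summing to 1.
importances = search.best_estimator_.feature_importances_
print(f"R2={r2:.3f}  RMSE=${rmse:,.0f}  best params={search.best_params_}")
```

The numbers printed here describe only the synthetic data and are not comparable to the project's reported results.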
| Metric | Value |
|---|---|
| R² Score | ~0.89 |
| RMSE | ~$15,000 |
| Best Features | Country, Dev Type, Org Size, Remote Work |
git clone https://github.com/Riyaa-Bajpai/Salary_Prediction.git
cd Salary_Prediction