#ReDI #26s_dcp_data_circle #data_project
Original idea proposed by Nelson in 2024 as a student project for the Data Circle course at ReDI School of Digital Integration.
This project applies a structured, end-to-end analytical workflow to the Stack Overflow 2025 Developer Survey dataset to investigate global developer trends. The workflow emphasises statistical rigour, reproducibility, and alignment with industry benchmarks, with the aim of providing actionable insights into the evolving professional landscape, identifying high-value skill sets, and supporting data-driven career positioning.
The dataset is a comprehensive primary source representing the global software development ecosystem: Stack Overflow conducted a structured online survey targeting developers of all experience levels and specialisations.
This case study examines the factors that influence global developer salaries using regression techniques on the 2025 Stack Overflow dataset. The primary objective is to quantify the predictive power of various indicators (age, formal education level, professional experience, developer role, employment status, work environment, geographic location, and related attributes) on annual compensation. The results will serve as a data-driven benchmark that students can use to assess their career prospects and negotiate compensation against objective global market trends.
- Form teams (2-3 students) and assign roles
- Identify required datasets
- Acquire the raw data and verify its completeness
- Set up the project environment, GitHub repository, and documentation
- Select the features needed to answer the research questions
- Clean and standardise raw fields
- Handle missing values, duplicates, and inconsistent formats
- Normalise numeric fields and parse text‑based features
- Produce a first “clean dataset” for exploration
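The cleaning steps above can be sketched in pandas. The column names (`ConvertedCompYearly`, `LanguageHaveWorkedWith`) follow past Stack Overflow survey schemas and are assumptions about the 2025 release; the salary bounds are illustrative, not prescribed:

```python
import pandas as pd

def clean_survey(df: pd.DataFrame) -> pd.DataFrame:
    """First cleaning pass: duplicates, missing values, numeric coercion,
    and parsing of text-based multi-select answers."""
    df = df.drop_duplicates()
    # Drop rows missing the target variable (annual compensation).
    # Column names are assumptions based on earlier survey schemas.
    df = df.dropna(subset=["ConvertedCompYearly"])
    # Coerce compensation to numeric and keep only plausible values
    # (bounds here are illustrative placeholders).
    df["ConvertedCompYearly"] = pd.to_numeric(df["ConvertedCompYearly"], errors="coerce")
    df = df[df["ConvertedCompYearly"].between(1_000, 1_000_000)]
    # Stack Overflow encodes multi-select answers as semicolon-delimited
    # strings; parse them into Python lists for feature engineering.
    df["LanguageHaveWorkedWith"] = df["LanguageHaveWorkedWith"].fillna("").str.split(";")
    return df.reset_index(drop=True)

# Tiny synthetic sample in the assumed schema.
demo = pd.DataFrame({
    "ConvertedCompYearly": ["60000", None, "75000", "60000"],
    "LanguageHaveWorkedWith": ["Python;SQL", "Go", None, "Python;SQL"],
})
cleaned = clean_survey(demo)
```

Running `clean_survey` on the sample drops the exact duplicate row and the row with no compensation, leaving two usable observations.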
- Analyse distributions, outliers, and skewness
- Explore relationships between features
- Generate geographic visualisations
- Document insights that inform feature engineering and modelling
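A minimal sketch of the skewness and outlier checks, using a hypothetical salary sample (compensation data is typically right-skewed, so a log transform usually reduces the skew):

```python
import numpy as np
import pandas as pd

# Hypothetical, illustrative salary sample with one extreme value.
salary = pd.Series([30_000, 45_000, 52_000, 60_000, 75_000, 90_000, 400_000])

# Compare skewness before and after a log transform.
raw_skew = salary.skew()
log_skew = np.log1p(salary).skew()

# Classic 1.5*IQR rule for flagging outliers.
q1, q3 = salary.quantile([0.25, 0.75])
iqr = q3 - q1
outliers = salary[(salary < q1 - 1.5 * iqr) | (salary > q3 + 1.5 * iqr)]
```

Here the 400,000 observation is flagged by the IQR rule, and the log-transformed series is noticeably less skewed than the raw one.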
- Create new features (programming languages, frameworks, tools)
- Finalise modelling dataset with all engineered features
- Split the final dataset into training (75-80% of observations) and test (20-25% of observations) sets
- Train baseline models (e.g., linear regression) for interpretability
- Evaluate performance metrics
- Identify feature gaps or data issues requiring refinement
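A baseline of this kind might look as follows; the data is synthetic with a known linear signal, so it only illustrates the fit-predict-evaluate loop, not real survey results:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score

# Synthetic data with a known linear relationship (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

# Fit the interpretable baseline on the first 150 rows,
# then evaluate on the held-out 50.
model = LinearRegression().fit(X[:150], y[:150])
pred = model.predict(X[150:])

mae = mean_absolute_error(y[150:], pred)
r2 = r2_score(y[150:], pred)
```

The fitted coefficients (`model.coef_`) recover the generating weights almost exactly, which is the interpretability payoff of the linear baseline.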
- Train ensemble models (e.g., Random Forest, Gradient Boosting, CatBoost)
- Tune hyperparameters using k‑fold cross‑validation
- Compare models on cross‑validated performance metrics
- Select top candidates for final evaluation
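Hyperparameter tuning with k-fold cross-validation can be sketched with scikit-learn's `GridSearchCV`; the grid below is deliberately tiny and the data synthetic, so treat the parameter choices as placeholders:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic nonlinear data (illustrative only).
rng = np.random.default_rng(1)
X = rng.normal(size=(120, 4))
y = X[:, 0] ** 2 + X[:, 1] + rng.normal(scale=0.1, size=120)

# 5-fold cross-validated search over a small hyperparameter grid.
grid = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid={"n_estimators": [50, 100], "max_depth": [3, None]},
    cv=5,
    scoring="r2",
)
grid.fit(X, y)

best_params = grid.best_params_      # winning combination
best_score = grid.best_score_        # mean cross-validated R^2
```

`grid.cv_results_` holds the per-fold scores for every combination, which is what the model-comparison step uses.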
- Evaluate final models on the test set
- Interpret model results to identify which features have the strongest impact (SHAP or similar techniques)
- Create visualisations showing feature importance
- Document model strengths, limitations, and interpretability findings
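As one of the "similar techniques" alongside SHAP, scikit-learn's built-in permutation importance gives a model-agnostic importance score; the sketch below uses synthetic data where only the first feature carries signal:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

# Synthetic data: only feature 0 drives the target, the rest are noise.
rng = np.random.default_rng(2)
X = rng.normal(size=(300, 3))
y = 3.0 * X[:, 0] + rng.normal(scale=0.1, size=300)

model = RandomForestRegressor(random_state=0).fit(X, y)

# Shuffle each feature in turn and measure the drop in model score;
# informative features cause a large drop.
result = permutation_importance(model, X, y, n_repeats=5, random_state=0)
ranking = np.argsort(result.importances_mean)[::-1]
```

The informative feature should come out on top of `ranking`; with the real survey data, `result.importances_mean` would be plotted as the feature-importance chart.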
- Build interactive dashboard (Streamlit)
- Include map visualisations, predictions, and feature importance
- Compile full project report (methodology, results, recommendations)
- Deliver model, data dictionary, and reproducibility notes
- Give next‑phase recommendations (e.g., deployment, monitoring, further improvements)