This repository contains a group project completed for the Master's program in Data Science and Advanced Analytics at Nova Information Management School (Nova IMS). The project develops a machine learning model to help the New York Workers' Compensation Board (WCB) automate decisions on workers' compensation claims. With over 5 million claims under the WCB's jurisdiction, our goal was to predict the "Claim Injury Type" of each claim using supervised learning, and then to tune the best-performing model.

Data Understanding and Preparation:
- Perform exploratory data analysis to uncover key dataset characteristics.
- Handle missing values, outliers, and imbalances in the data.
- Engineer and select features for robust predictive modeling.
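
The README does not say which imputation or outlier rules were applied, so the following is only a minimal sketch of one common approach: median imputation followed by clipping to Tukey's IQR fences. The function names and the sample column are hypothetical and use only the Python standard library.

```python
import statistics


def impute_median(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    median = statistics.median(observed)
    return [median if v is None else v for v in values]


def clip_iqr(values, k=1.5):
    """Clip values to the range [Q1 - k*IQR, Q3 + k*IQR] (Tukey's fences)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [min(max(v, low), high) for v in values]


# Hypothetical numeric column with one missing entry and one extreme value.
column = [500.0, 520.0, None, 480.0, 510.0, 495.0, 10_000.0]
cleaned = clip_iqr(impute_median(column))
print(cleaned)
```

Clipping keeps the row in the dataset while limiting the influence of the extreme value; dropping such rows instead is an equally valid choice that depends on the data.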

Model Development and Comparison:
- Train and evaluate multiple classification models, including KNN, Support Vector Classifier, Logistic Regression, Decision Trees, Random Forest, Neural Networks, and XGBoost.
- Identify the best-performing model and optimize its parameters for improved accuracy and reliability.
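
As an illustration of the comparison step, the sketch below scores a few of the listed model families with stratified cross-validation in scikit-learn. It runs on synthetic data rather than the WCB claims, leaves out XGBoost, SVC, and neural networks to keep the dependencies small, and uses macro F1 as the metric; that metric is an assumption, since the README does not state how the Kaggle score is computed.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the claims data (assumption: 4 imbalanced classes).
X, y = make_classification(
    n_samples=600, n_features=12, n_informative=6,
    n_classes=4, weights=[0.55, 0.25, 0.15, 0.05], random_state=0,
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

# Stratified folds keep the class proportions similar in every split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = {
    name: cross_val_score(model, X, y, cv=cv, scoring="f1_macro").mean()
    for name, model in models.items()
}

best_name = max(scores, key=scores.get)
for name, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {score:.3f}")
print("Best:", best_name)
```

The same loop extends to any estimator that follows the scikit-learn interface, which is one way to keep the comparison fair: every model sees identical folds and is judged by the same metric.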

Open-Ended Exploration:
- Analyze and predict the variable ‘Agreement Reached’ and assess its potential as an additional feature for enhancing primary model performance.

Key Findings:
- Best Model: XGBoost achieved the highest Kaggle score, 0.31619, for the prediction of "Claim Injury Type."
- Data Preprocessing: Careful handling of skewed numerical variables and missing data, together with encoding of high-cardinality categorical variables, improved the model training process.
- Feature Selection: Techniques such as Lasso and Recursive Feature Elimination (RFE) were essential in mitigating dimensionality issues and computational complexity.
- Open-Ended Analysis: A Support Vector Classifier (SVC) predicted ‘Agreement Reached’ with an AUC of 0.8732, which showed potential to enhance the primary prediction model, though any improvement depended on the correlation of the added features.

Repository Structure:
- Notebooks:
  - Exploratory Data Analysis
  - Data Preprocessing
  - Feature Engineering and Selection
  - Model Training and Evaluation
  - Open-Ended Analysis for ‘Agreement Reached’
- Reports: Comprehensive documentation of the methodology, findings, and conclusions.

Limitations:
- High dimensionality and imbalanced target classes increased computational demands and the risk of overfitting.
- Dependence on the quality of the dataset limited the potential for external validation.

Future Work:
- Address target class imbalances using techniques like SMOTE (Synthetic Minority Oversampling Technique).
- Explore ensemble methods (e.g., stacking or blending) to combine model strengths.
- Incorporate external data sources such as economic and geographic indicators to enrich the feature space.
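
To make the SMOTE idea mentioned above concrete, here is a simplified sketch of its core step: each synthetic point is placed at a random position on the line segment between a minority-class sample and one of its nearest minority-class neighbours. The function name `smote` and the toy points are illustrative assumptions; this is not the project's code, and in practice a maintained implementation such as the one in the imbalanced-learn library would normally be preferred.

```python
import math
import random


def smote(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority-class neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours of `base` within the minority class,
        # excluding the base point itself.
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> other
        synthetic.append(tuple(b + gap * (o - b) for b, o in zip(base, other)))
    return synthetic


# Toy minority class with two numeric features (hypothetical values).
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote(minority, n_new=6)
print(new_points)
```

Because every synthetic point is a convex combination of two real minority samples, it stays inside the region the minority class already occupies. Oversampling should be applied to the training folds only, so that validation scores are not inflated by synthetic data.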

This project demonstrates that machine learning can help automate complex decision-making processes in the insurance sector. While the findings align with our expectations, the limitations and opportunities for improvement identified in this work point to room for more advanced analyses in the future.
