The project focused on predicting player engagement in online gaming, specifically identifying "high-engagement" players using behavioral, demographic, and game-specific data from a Steam-based dataset.
Problem Definition: Player retention is key to a sustainable business, so the goal was to predict high engagement and understand purchase behaviors.
- Dataset had 13 features related to demographics, gameplay, and engagement.
- Observed a class imbalance (only ~26% were high-engagement players).
- Variables like SessionsPerWeek and AvgSessionDurationMinutes were found to be strong predictors of engagement.
- One-hot encoding was used for categorical features.
- No missing values found.
- The target variable was redefined as a binary label (High vs. Low-Medium engagement).
- Four models were tested: Logistic Regression, Random Forest, K-Nearest Neighbors (KNN), and Multi-Layer Perceptron (MLP).
- Hyperparameter tuning was performed for each model to improve performance.
- SHAP analysis was used to interpret feature importance.
- Random Forest was the best model, with the highest accuracy (~95% after tuning).
- Key predictors across models were SessionsPerWeek, AvgSessionDurationMinutes, and AchievementsUnlocked.
- KNN performed the worst of the four models, while MLP was a close second to Random Forest.
- Session frequency and duration are the most important factors for retention.
- Regional differences matter for marketing strategies.
- Adjustable game difficulty was associated with higher engagement.
- Strong recommendation to use ensemble models like Random Forest for retention prediction.
- The study used cross-sectional data (no time-series evolution).
- No causal inferences could be made.
- Future work could involve time-series analysis, social network analysis, and churn prediction models.
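
The preprocessing and modeling steps summarized above (binary target, one-hot encoding, four tuned classifiers) could be sketched roughly as follows. This is a minimal illustration, not the project's actual code. It assumes scikit-learn and pandas, and it runs on synthetic data. The column names `Location`, `GameDifficulty`, and `EngagementLevel`, the labelling rule, and the hyperparameter grids are all hypothetical. Because only about a quarter of players are high-engagement, the sketch also reports F1 and a majority-class baseline next to accuracy.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the real dataset; all values are made up.
rng = np.random.default_rng(0)
n = 600
df = pd.DataFrame({
    "SessionsPerWeek": rng.integers(0, 20, n),
    "AvgSessionDurationMinutes": rng.integers(10, 180, n),
    "AchievementsUnlocked": rng.integers(0, 50, n),
    "Location": rng.choice(["Asia", "Europe", "USA", "Other"], n),  # hypothetical
    "GameDifficulty": rng.choice(["Easy", "Medium", "Hard"], n),    # hypothetical
})

# Hypothetical three-level label: roughly the top 26% by weekly play time is "High".
play_time = df["SessionsPerWeek"] * df["AvgSessionDurationMinutes"]
df["EngagementLevel"] = np.select(
    [play_time >= play_time.quantile(0.74), play_time >= play_time.quantile(0.40)],
    ["High", "Medium"],
    default="Low",
)

# Redefine the target as binary: High (1) vs. Low-Medium (0).
X = df.drop(columns="EngagementLevel")
y = (df["EngagementLevel"] == "High").astype(int)

# A stratified split keeps the class ratio the same in train and test.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# One-hot encode categorical columns; scale numeric ones (helps KNN and MLP).
preprocess = ColumnTransformer([
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["Location", "GameDifficulty"]),
    ("num", StandardScaler(),
     ["SessionsPerWeek", "AvgSessionDurationMinutes", "AchievementsUnlocked"]),
])

# The four model families, each with a small illustrative grid.
candidates = {
    "LogisticRegression": (LogisticRegression(max_iter=1000),
                           {"model__C": [0.1, 1.0, 10.0]}),
    "RandomForest": (RandomForestClassifier(random_state=0),
                     {"model__n_estimators": [100, 200],
                      "model__max_depth": [None, 8]}),
    "KNN": (KNeighborsClassifier(), {"model__n_neighbors": [5, 15]}),
    "MLP": (MLPClassifier(max_iter=500, random_state=0),
            {"model__hidden_layer_sizes": [(16,), (32, 16)]}),
}

results = {}
for name, (estimator, grid) in candidates.items():
    pipe = Pipeline([("prep", preprocess), ("model", estimator)])
    search = GridSearchCV(pipe, grid, cv=3, scoring="f1")  # tune on F1, not accuracy
    search.fit(X_train, y_train)
    pred = search.predict(X_test)
    results[name] = {
        "accuracy": accuracy_score(y_test, pred),
        "f1": f1_score(y_test, pred),
    }

# Accuracy of always predicting the majority class (Low-Medium).
baseline_accuracy = 1 - y_test.mean()
print(f"majority-class baseline accuracy: {baseline_accuracy:.2f}")
for name, metrics in results.items():
    print(name, {k: round(v, 3) for k, v in metrics.items()})
```

With a ~26% positive class, a model that always predicts Low-Medium already reaches about 74% accuracy, so accuracy figures are easier to interpret next to that baseline and a class-aware metric such as F1.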