Loan Default Prediction Model

Predicting Borrower Risk to Reduce Financial Losses

Executive Summary

We built and evaluated three classification models (Logistic Regression, Random Forest, and XGBoost) to predict loan defaults on a retail lending portfolio. Our goal was to identify high-risk borrowers early, enabling targeted interventions and minimizing losses.

Key Findings:

Dataset size: 2M+ loan records
Default rate in data: 12.4%
Best recall (catching most defaults): Random Forest (0.86)
Best ROC-AUC (overall discrimination): Random Forest (0.65)
Highest F1-score (balance between precision and recall): Logistic Regression & Random Forest (0.26)

Critical Insights:

High Recall Priority: In credit risk modeling, recall is more critical than precision. Catching more defaulters, even with some false alarms, prevents financial losses. A missed defaulter (false negative) can result in unrecoverable capital, whereas a false positive can be handled via manual review or adjusted loan terms.
Model Comparison:
- Random Forest achieves the highest recall (0.86) and ROC-AUC (0.65), making it ideal for minimizing undetected defaults.
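- Logistic Regression provides better interpretability and the same F1-score (0.26), with slightly higher precision (0.17) and therefore fewer false positives.
- XGBoost has the lowest recall (0.41), F1-score (0.22), and ROC-AUC (0.57), and ties Random Forest for the lowest precision (0.15).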
Model Selection Rationale: We recommend Random Forest as the primary model due to its superior recall and discrimination power. Logistic Regression can still be useful for interpretability and quick deployment.

The Challenge

Loan default prediction is critical for retail lenders to manage credit risk and maintain profitability. Our data contained demographic, credit bureau, and historical repayment features. The low base rate of defaults (12.4%) and noisy real-world signals make it difficult to achieve high precision and high recall at the same time.

Model Performance Summary

Model	Precision	Recall	F1-score	ROC-AUC
Logistic Regression	0.17	0.52	0.26	0.60
Random Forest	0.15	0.86	0.26	0.65
XGBoost	0.15	0.41	0.22	0.57

Key Discovery

The ROC curve shows that Random Forest delivers the best trade-off between sensitivity and specificity of the three models (ROC-AUC 0.65). Its recall of 0.86 means it flags 86% of all actual defaulters. Its precision (0.15) is low, but this is an acceptable compromise for use cases focused on minimizing undetected risk.
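To make the precision/recall trade-off concrete, the reported figures can be turned into approximate counts per 10,000 loans. This is only a back-of-the-envelope sketch: it assumes the precision and recall in the Model Performance Summary table were measured at a single operating point on a population with the 12.4% default rate, and the helper name `implied_counts` is ours, not part of the repository.

```python
# Back-of-the-envelope check (illustrative helper, not from the repository):
# what do the reported precision and recall imply per 10,000 loans,
# assuming the 12.4% default rate applies to the evaluated population?

def implied_counts(precision, recall, default_rate, n_loans=10_000):
    """Derive approximate confusion-matrix counts from precision, recall, and prevalence."""
    defaulters = default_rate * n_loans
    true_pos = recall * defaulters        # defaulters correctly flagged
    flagged = true_pos / precision        # all borrowers flagged as risky
    false_pos = flagged - true_pos        # non-defaulters flagged
    false_neg = defaulters - true_pos     # defaulters missed
    return {
        "flagged": round(flagged),
        "true_pos": round(true_pos),
        "false_pos": round(false_pos),
        "false_neg": round(false_neg),
        "flag_rate": round(flagged / n_loans, 2),
    }

# (precision, recall) pairs copied from the Model Performance Summary table.
REPORTED = {
    "Logistic Regression": (0.17, 0.52),
    "Random Forest": (0.15, 0.86),
    "XGBoost": (0.15, 0.41),
}

for model, (precision, recall) in REPORTED.items():
    print(model, implied_counts(precision, recall, default_rate=0.124))
```

Under these assumptions, the Random Forest figures imply roughly 7,100 flagged borrowers per 10,000 loans, of which about 6,000 are non-defaulters, while about 170 defaulters are missed. This is the operational load that the manual review process has to absorb.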

Why High Recall Matters

Financial Risk Reduction: Catching more defaulters reduces charge-offs and improves portfolio health.
Business Justification: False positives (non-defaulters flagged as defaulters) can be managed operationally with manual checks or conservative credit terms.
Compliance: Models must show robustness in identifying risk accurately, even at the cost of precision.

Best Performing Model

Random Forest is the most effective model, delivering:
- Highest recall (0.86)
- Best ROC-AUC (0.65)
- Competitive F1-score (0.26)

Because it has the highest recall and ROC-AUC of the three models, Random Forest is recommended as the production candidate, especially in high-stakes lending environments.

Error Analysis & Next Steps for Low Precision

Issue: Precision is low across all models (0.15–0.17), so roughly 83–85% of flagged borrowers are not actual defaulters. Many non-defaulters are incorrectly flagged.
Next Steps:
1. Threshold Tuning: Adjust probability thresholds to optimize for a better precision-recall balance suited to business goals.
2. Segment Analysis: Identify borrower subgroups where the model struggles, such as thin-file or new-to-credit applicants.
3. Cost-Sensitive Training: Penalize false positives more explicitly to force the model to improve its precision.
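The segment analysis in step 2 could be sketched as below. The segment labels and the toy records are made up for illustration, and `metrics_by_segment` is a hypothetical helper rather than code from this repository. In practice the records would come from the model's predictions on a validation set.

```python
from collections import defaultdict

def metrics_by_segment(records):
    """Compute precision and recall per borrower segment.

    records: iterable of (segment, actual_default, predicted_default),
    where the last two values are 0 or 1.
    """
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0, "tn": 0})
    for segment, actual, predicted in records:
        if predicted and actual:
            key = "tp"
        elif predicted and not actual:
            key = "fp"
        elif actual:
            key = "fn"
        else:
            key = "tn"
        counts[segment][key] += 1

    report = {}
    for segment, c in counts.items():
        flagged = c["tp"] + c["fp"]
        defaulters = c["tp"] + c["fn"]
        report[segment] = {
            # None marks a metric that is undefined for this segment.
            "precision": c["tp"] / flagged if flagged else None,
            "recall": c["tp"] / defaulters if defaulters else None,
            "n": sum(c.values()),
        }
    return report

# Toy data, invented for illustration only.
records = [
    ("thin_file", 1, 1), ("thin_file", 0, 1),
    ("thin_file", 0, 1), ("thin_file", 1, 0),
    ("established", 1, 1), ("established", 0, 0),
    ("established", 0, 1), ("established", 1, 1),
]

report = metrics_by_segment(records)
for segment, metrics in report.items():
    print(segment, metrics)
```

Segments whose precision or recall falls well below the portfolio-wide figures are candidates for extra features or a separate decision threshold.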

Call to Action

Deploy Random Forest
- Launch in a pilot phase to monitor recall and operational false positives.
- Create a fallback review process for flagged borrowers.
Enhance Dataset
- Add alternative data (social data, utility payments).
- Enrich features to improve signal-to-noise ratio.
Optimize Decision Thresholds
- Calibrate probability cutoffs using business impact simulations.
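One simple form of business impact simulation is to assign a cost to each missed defaulter and each false alarm, then pick the cutoff with the lowest total cost on validation data. The sketch below assumes this framing. The cost figures, scores, and labels are hypothetical placeholders, and the function names are ours. Real costs would need to come from the credit policy team.

```python
def expected_cost(scores, labels, threshold, cost_fn, cost_fp):
    """Total cost of flagging every borrower whose score is >= threshold."""
    false_neg = sum(1 for s, y in zip(scores, labels) if y == 1 and s < threshold)
    false_pos = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= threshold)
    return false_neg * cost_fn + false_pos * cost_fp

def best_threshold(scores, labels, cost_fn, cost_fp, candidates=None):
    """Return the lowest-cost cutoff (the first one found if several tie)."""
    if candidates is None:
        candidates = [i / 100 for i in range(1, 100)]  # 0.01 .. 0.99
    return min(
        candidates,
        key=lambda t: expected_cost(scores, labels, t, cost_fn, cost_fp),
    )

# Toy predicted default probabilities and true outcomes (1 = defaulted).
scores = [0.10, 0.20, 0.35, 0.40, 0.55, 0.60, 0.80, 0.90]
labels = [0, 0, 1, 0, 0, 1, 1, 1]

# Hypothetical costs: a missed defaulter is far more expensive
# than a manual review of a non-defaulter.
COST_MISSED_DEFAULT = 5000
COST_MANUAL_REVIEW = 200

cutoff = best_threshold(scores, labels, COST_MISSED_DEFAULT, COST_MANUAL_REVIEW)
print("cutoff:", cutoff)
print("cost:", expected_cost(scores, labels, cutoff,
                             COST_MISSED_DEFAULT, COST_MANUAL_REVIEW))
```

When missed defaults are much more expensive than reviews, the chosen cutoff moves down and recall rises, which matches the recall-first priority stated earlier. Reversing the cost ratio pushes the cutoff up and favors precision.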

Looking Forward

Target Metrics:

ROC-AUC: ≥ 0.75
Recall: ≥ 0.70 (even if precision is ~0.20)
Precision: ≥ 0.25 (at recall ≥ 0.50)

Implementation Timeline:

Phase 1 (30 days): Deploy Random Forest and monitor KPI drift.
Phase 2 (60 days): Conduct more robust feature engineering and threshold tuning.
Phase 3 (90 days): Retrain with improved data and refine business integration.

All metrics are derived from cross-validation on hold-out datasets. Model selection and thresholds should be reviewed jointly by data science and credit policy teams.

kausikds/loan-default-prediction-model
