Credit risk scoring model using XGBoost on the German Credit Dataset, with SHAP explanations and fairness checks.
Predicts loan default probability and converts it to a traditional credit score (300–850 range). The model also generates adverse action reasons — telling the applicant why they were rejected, which lenders are legally required to do (ECOA, GDPR Article 22).
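The probability-to-score conversion follows the standard scorecard log-odds scaling. A minimal sketch, assuming illustrative calibration choices (score 600 at 30:1 good:bad odds, 20 "points to double the odds") — the project's actual anchor points may differ:

```python
import math

def pd_to_score(p_default, base_score=600, base_odds=30, pdo=20,
                floor=300, cap=850):
    """Map a default probability onto a 300-850 scorecard range.

    Log-odds scaling: a score of `base_score` corresponds to
    `base_odds`:1 good:bad odds, and every `pdo` points the
    odds double ("points to double odds"). All three values
    are hypothetical calibration choices for this sketch.
    """
    factor = pdo / math.log(2)
    offset = base_score - factor * math.log(base_odds)
    odds = (1 - p_default) / p_default  # good:bad odds
    score = offset + factor * math.log(odds)
    return int(min(cap, max(floor, round(score))))
```

Lower default probability maps to a higher score, clamped to the 300–850 band.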
German Credit Dataset from UCI — 1000 loan applications, 20 features, binary outcome. Not huge, but it's a standard benchmark and the categorical encoding is a pain to deal with (everything is coded as A11, A12, etc).
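Decoding those A-codes means hand-building a lookup per attribute from the UCI documentation. A sketch for the checking-account attribute (the raw `german.data` file has no header row, so the column name `checking_status` is a label this sketch assigns, not something in the file):

```python
import pandas as pd

# A-codes for attribute 1 (status of existing checking account),
# per the UCI dataset documentation.
CHECKING_STATUS = {
    "A11": "lt_0_dm",
    "A12": "0_to_200_dm",
    "A13": "ge_200_dm",
    "A14": "no_checking_account",
}

def decode_checking(df: pd.DataFrame) -> pd.DataFrame:
    """Replace raw A-codes with readable category labels."""
    out = df.copy()
    out["checking_status"] = out["checking_status"].map(CHECKING_STATUS)
    return out

raw = pd.DataFrame({"checking_status": ["A11", "A14"]})
print(decode_checking(raw)["checking_status"].tolist())
# the same pattern repeats for the other 12 categorical attributes
```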
| Metric | Score |
|---|---|
| AUC-ROC | 0.78 |
| Gini | 0.56 |
| KS | 0.46 |
Not amazing, but reasonable for 1000 samples with no heavy tuning.
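The three metrics are tightly related — Gini is just 2·AUC − 1 (2·0.78 − 1 = 0.56, consistent with the table), and KS is the maximum gap between the cumulative score distributions of the two classes. A pure-Python sketch (ignores ties in scores, which sklearn handles properly):

```python
def auc_gini_ks(y_true, y_prob):
    """Compute AUC-ROC, Gini, and KS from labels (1 = default) and scores.

    AUC via the Mann-Whitney rank formulation; Gini = 2*AUC - 1;
    KS = max |F_bad(t) - F_good(t)| over score thresholds t.
    """
    pairs = sorted(zip(y_prob, y_true))
    pos = sum(y_true)
    neg = len(y_true) - pos
    # Mann-Whitney U: count (good, bad) pairs ranked correctly
    seen_neg, u = 0, 0.0
    for _, y in pairs:
        if y == 0:
            seen_neg += 1
        else:
            u += seen_neg
    auc = u / (pos * neg)
    # KS: walk the sorted scores, tracking both cumulative distributions
    cum_pos = cum_neg = 0
    ks = 0.0
    for _, y in pairs:
        if y == 1:
            cum_pos += 1
        else:
            cum_neg += 1
        ks = max(ks, abs(cum_pos / pos - cum_neg / neg))
    return auc, 2 * auc - 1, ks
```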
```
pip install -r requirements.txt
python main.py
```

Or run the dashboard:

```
streamlit run streamlit_app/app.py
```

- `src/data/` — loading + preprocessing the German Credit data
- `src/features/` — feature engineering (debt ratios, stability scores, etc.)
- `src/models/` — XGBoost trainer + scorecard conversion
- `src/explainability/` — SHAP explanations + Fairlearn fairness audit
- `streamlit_app/` — interactive dashboard
- `tests/` — unit tests
Each prediction comes with SHAP values showing which features pushed the score up or down. There's also an adverse action module that maps SHAP contributions to human-readable denial reasons (e.g. "Insufficient checking account history").
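The SHAP-to-reason mapping can be sketched like this. Feature names and reason texts below are illustrative, not the project's actual mapping, and the sketch assumes the convention that a positive SHAP value pushes predicted default probability up (i.e., lowers the score):

```python
# Hypothetical reason-code table: feature -> adverse action text.
REASON_CODES = {
    "checking_status": "Insufficient checking account history",
    "credit_history": "Delinquency on prior obligations",
    "duration_months": "Loan term too long relative to profile",
    "savings_status": "Insufficient savings reserves",
}

def adverse_action_reasons(shap_contrib: dict, top_n: int = 3) -> list:
    """Reasons for the top_n features that most increased default risk."""
    harmful = [(f, v) for f, v in shap_contrib.items() if v > 0]
    harmful.sort(key=lambda fv: fv[1], reverse=True)
    return [REASON_CODES.get(f, f"Unfavorable value for {f}")
            for f, _ in harmful[:top_n]]

print(adverse_action_reasons(
    {"checking_status": 0.42, "savings_status": 0.10, "age": -0.05}))
```

Only score-lowering contributions become reasons; features that helped the applicant (negative SHAP) are excluded.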
The model gets audited across age groups and gender using Fairlearn. The audit checks the four-fifths rule (80% rule) and demographic parity. On this dataset the model passes, but barely — the age-group disparity sits close to the threshold.
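The project uses Fairlearn for the audit itself; as a plain-Python illustration of what the four-fifths rule actually computes (assuming the convention that a prediction of 0 means "approved"):

```python
def four_fifths_check(y_pred, groups, threshold=0.8):
    """Disparate-impact (80% rule) check.

    Every group's approval rate must be at least `threshold` times
    the highest group's approval rate. Convention assumed here:
    y_pred == 0 means "approved" (no predicted default).
    """
    rates = {}
    for g in set(groups):
        idx = [i for i, gg in enumerate(groups) if gg == g]
        rates[g] = sum(1 for i in idx if y_pred[i] == 0) / len(idx)
    best = max(rates.values())
    ratio = min(rates.values()) / best if best else 0.0
    return rates, ratio, ratio >= threshold
```

A ratio just above 0.8 is exactly the "passes, but barely" situation described above.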
- The German Credit dataset is from 1994 and uses Deutsche Marks. Would be better with more recent data
- I should have tried WoE (Weight of Evidence) binning — that's what actual banks use for scorecards
- The feature engineering is manual. Could try automated feature selection with Boruta or similar
- Hyperparameter tuning is basically default XGBoost params with minor tweaks
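For the WoE point above: Weight of Evidence scores each bin by ln(share of goods / share of bads), which is what makes scorecard points linear in log-odds. A hypothetical sketch (the bin definitions are illustrative; real scorecards also tune the binning itself):

```python
import math

def woe_table(values, y, bins):
    """Weight of Evidence per bin: ln(dist_good / dist_bad).

    `bins` is a list of (label, predicate) pairs; y == 1 marks a
    default ("bad"). Add-0.5 smoothing keeps empty cells from
    blowing up the log.
    """
    total_good = sum(1 for t in y if t == 0)
    total_bad = sum(1 for t in y if t == 1)
    table = {}
    for label, pred in bins:
        good = sum(1 for v, t in zip(values, y) if pred(v) and t == 0)
        bad = sum(1 for v, t in zip(values, y) if pred(v) and t == 1)
        woe = math.log(((good + 0.5) / total_good) /
                       ((bad + 0.5) / total_bad))
        table[label] = woe
    return table
```

Positive WoE marks a bin dominated by goods, negative WoE one dominated by bads; a scorecard then assigns points proportional to each bin's WoE.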
MIT