| 🎯 CORE | 📊 DATA & METHODS | 💡 INSIGHTS | 🚀 MORE |
|---|---|---|---|
| Overview | Dataset | Key Findings | Quick Start |
| The Problem | Methodology | Visualizations | Team |
| Our Solution | Results | Tech Stack | References |

Gradient Capital is a machine learning solution for student loan approval decisions. Using ensemble methods and SMOTENC class balancing, we turned models that were biased toward rejection into models that make fair, accurate, and consistent lending decisions.

- 94.71% accuracy with Random Forest
- Near-perfect discrimination (0.9912 ROC-AUC)
- 40-point recall improvement for KNN through class balancing
- Production-ready models for real-world deployment
- Transparent feature importance for explainable AI

Manual loan reviews are:

Human decision-making leads to:

Traditional systems suffer from:

Original Dataset Distribution

Result: Models heavily biased toward rejection!

Impact on Model Performance

<sub>*Figure 2: Class distribution before and after SMOTENC application (panels: Before SMOTENC ❌, After SMOTENC ✅)*</sub>

Origin: Kaggle Loan Approval Dataset

- 💰 Financial Indicators
- 📋 Applicant Profile
- 🚨 Risk Factors

```mermaid
graph LR
    A[45K Loans] --> B[Filter Education<br/>9,153 loans]
    B --> C[Remove Outliers<br/>Upper Bound]
    C --> D[Equal-Width<br/>Binning]
    D --> E[One-Hot<br/>Encoding]
    E --> F[Train-Test Split<br/>80-20]
    F --> G[Ready for<br/>Training]
    style A fill:#e1f5ff
    style G fill:#d4edda
```

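The preprocessing flow above can be sketched roughly as follows. This is an illustration on a small synthetic table, not the project's notebook code: the column names, the choice of `person_income` as the binned column, the `Q3 + 1.5 * IQR` upper bound, and the four bins are all assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the Kaggle data; column names are assumptions.
rng = np.random.default_rng(0)
n = 400
df = pd.DataFrame({
    "loan_intent": rng.choice(["EDUCATION", "MEDICAL", "PERSONAL"], size=n),
    "person_income": rng.lognormal(mean=11, sigma=0.5, size=n),
    "person_home_ownership": rng.choice(["RENT", "OWN", "MORTGAGE"], size=n),
    "loan_status": rng.choice([0, 1], size=n, p=[0.8, 0.2]),
})

# 1. Keep only education loans.
edu = df[df["loan_intent"] == "EDUCATION"].drop(columns="loan_intent").copy()

# 2. Drop rows above an upper bound (assumed here: Q3 + 1.5 * IQR).
q1, q3 = edu["person_income"].quantile([0.25, 0.75])
upper = q3 + 1.5 * (q3 - q1)
edu = edu[edu["person_income"] <= upper].copy()

# 3. Equal-width binning: an integer `bins` splits the value range evenly.
edu["income_bin"] = pd.cut(edu["person_income"], bins=4, labels=False)

# 4. One-hot encode the categorical column.
X = pd.get_dummies(edu.drop(columns="loan_status"),
                   columns=["person_home_ownership"])
y = edu["loan_status"]

# 5. 80-20 stratified split keeps the class ratio in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
print(X_train.shape, X_test.shape)
```
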
- **Naive Bayes**: Probabilistic
- **Decision Tree**: Rule-Based
- **K-Nearest Neighbors**: Distance-Based
- **Random Forest**: Ensemble

<details>
<summary><b>🎯 K-Nearest Neighbors Setup</b></summary>
<br>

```python
# Data Preparation
✓ StandardScaler (critical for distance calculations)
✓ One-hot encoding for categorical features
✓ 80-20 stratified train-test split
# Model Parameters
├─ Imbalanced Data: K=18 (via 5-fold CV)
├─ Balanced Data: K=5 (re-optimized)
├─ weights='distance' (closer neighbors weighted higher)
└─ random_state=0 (imbalanced) / 42 (balanced)
```

</details>
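
A minimal sketch of how the KNN setup above could be reproduced with scikit-learn. It is not the project's notebook code: the synthetic data from `make_classification`, the 1 to 30 search range for K, and the default accuracy scoring are assumptions. `KNeighborsClassifier` has no `random_state` parameter, so the seed is assumed here to apply to the train-test split.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic imbalanced data standing in for the encoded loan features.
X, y = make_classification(n_samples=600, n_features=8,
                           weights=[0.8, 0.2], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Scaling sits inside the pipeline so each CV fold is scaled on its own
# training part only.
pipe = make_pipeline(StandardScaler(),
                     KNeighborsClassifier(weights="distance"))

# 5-fold cross-validation over K (search range is an assumption).
search = GridSearchCV(
    pipe,
    {"kneighborsclassifier__n_neighbors": list(range(1, 31))},
    cv=5,
)
search.fit(X_train, y_train)

best_k = search.best_params_["kneighborsclassifier__n_neighbors"]
print("best K:", best_k, "test accuracy:", search.score(X_test, y_test))
```

Re-running the same search on the balanced data is what "re-optimized" would correspond to in this sketch.
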
<details>
<summary><b>🌲 Random Forest Setup</b></summary>
<br>

```python
# Data Preparation
✓ One-hot encoding for categorical features
✓ No scaling required (tree-based model)
✓ 80-20 stratified train-test split
# Model Parameters
├─ n_estimators=100 (100 decision trees)
├─ max_depth=None (full depth for max learning)
├─ min_samples_split=2 (default)
├─ min_samples_leaf=1 (detailed patterns)
└─ random_state=1 (imbalanced) / 42 (balanced)
```

</details>
<details>
<summary><b>⚖️ SMOTENC Configuration</b></summary>
<br>

```python
# SMOTENC Parameters
├─ random_state=42 (reproducibility)
├─ categorical_features=[1, 2, 5, 11] (specify categorical indices)
├─ k_neighbors=5 (default for synthetic sample generation)
└─ sampling_strategy='auto' (balance to 50-50)

Before: 7,600 rejections | 1,500 approvals (5:1 ratio)
After: 7,632 rejections | 7,632 approvals (1:1 ratio)
```

</details>
<br>
### **📏 Evaluation Metrics**
<table>
<tr>
<td width="20%" align="center">
**Accuracy**
<sub>Overall correctness</sub>
</td>
<td width="20%" align="center">
**Precision**
<sub>Predicted approvals<br>that were correct</sub>
</td>
<td width="20%" align="center">
**Recall**
<sub>Actual approvals<br>we caught</sub>
</td>
<td width="20%" align="center">
**F1-Score**
<sub>Harmonic mean of<br>Precision & Recall</sub>
</td>
<td width="20%" align="center">
**ROC-AUC**
<sub>Discrimination<br>ability</sub>
</td>
</tr>
</table>
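
For reference, the five metrics can be computed from confusion-matrix counts and predicted scores as in the sketch below. The function names and the example counts are hypothetical; in practice `sklearn.metrics` provides equivalent functions.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall, and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)   # predicted approvals that were correct
    recall = tp / (tp + fn)      # actual approvals that were caught
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


def roc_auc(labels, scores):
    """ROC-AUC as the probability that a random positive outscores a
    random negative (ties count as half)."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical counts, not taken from the project's results.
m = classification_metrics(tp=40, fp=10, fn=20, tn=130)
auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(m, auc)
```
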
---
<div align="center">
## 📈 **RESULTS**
</div>
### **🔴 Performance on Imbalanced Data**
<table>
<tr>
<td width="70%">

| Model | Accuracy | Precision | Recall | F1-Score | Status |
|-------|:--------:|:---------:|:------:|:--------:|:------:|
| Naive Bayes | 81-82% | - | - | - | ❌ Excluded |
| Decision Tree | 92% | 89% | **56%** ⚠️ | - | Poor Recall |
| **KNN** | 90.3% | 82.4% | **55%** ⚠️ | 66% | Poor Recall |
| **Random Forest** | **93%** ✅ | 93% | 93% | 92% | Best |

</td>
<td width="30%">
### **🚨 Critical Issue**
<br>

**Rejecting qualified applicants** (share of actual approvals missed, i.e. 100% minus recall):

- Decision Tree: 44% ❌
- KNN: 45% ❌

<br>
**Nearly HALF rejected!**
</td>
</tr>
</table>
---
### **🟢 Performance on Balanced Data (SMOTENC)**
<div align="center">
<table>
<thead>
<tr>
<th>Model</th>
<th>Accuracy</th>
<th>Precision</th>
<th>Recall</th>
<th>F1-Score</th>
<th>ROC-AUC</th>
</tr>
</thead>
<tbody>
<tr>
<td><b>Decision Tree</b></td>
<td>91%</td>
<td>89%</td>
<td><b>91%</b> ⬆️</td>
<td>~90%</td>
<td>-</td>
</tr>
<tr>
<td><b>KNN</b></td>
<td>90.43%</td>
<td>86.91%</td>
<td><b>95.20%</b> ⬆️⬆️</td>
<td>90.86%</td>
<td>0.9599</td>
</tr>
<tr style="background-color: #d4edda;">
<td><b>Random Forest</b></td>
<td><b>94.71%</b> 🏆</td>
<td><b>94.79%</b> 🏆</td>
<td><b>94.61%</b></td>
<td><b>94.70%</b> 🏆</td>
<td><b>0.9912</b> 🏆</td>
</tr>
</tbody>
</table>
</div>
<br>
<table>
<tr>
<td width="50%">

### **📊 SMOTENC Impact Summary**

| Model | Metric | Before | After | Improvement |
|-------|--------|--------|-------|-------------|
| **KNN** | Recall | 55% | 95.20% | <span style="color: green;">**+40.20 pts**</span> 🚀 |
| **Decision Tree** | Recall | 56% | 91% | <span style="color: green;">**+35 pts**</span> 🚀 |
| **Random Forest** | Accuracy | 93% | 94.71% | <span style="color: green;">**+1.71 pts**</span> ✅ |

</td>
<td width="50%">
### **🎯 Winner: Random Forest**
<br>

✅ Best Accuracy: 94.71%<br>
✅ Best Precision: 94.79%<br>
✅ Best F1-Score: 94.70%<br>
✅ Best ROC-AUC: 0.9912

⚖️ Balanced Errors: false positives 2.60%, false negatives 2.70% (as shares of all test predictions)

🏆 Production-Ready Model

</td>
</tr>
</table>
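
The reported error figures can be cross-checked from precision and recall alone. The sketch below assumes the balanced test set is exactly 50% approvals (a consequence of a stratified split on 1:1 data, but still an assumption) and expresses errors as shares of all test predictions; `error_shares` is a hypothetical helper name.

```python
def error_shares(precision, recall, positive_share=0.5):
    """Return (FP, FN, accuracy) as fractions of all test samples."""
    tp = positive_share * recall            # true positives
    fn = positive_share - tp                # approvals that were missed
    fp = tp * (1 - precision) / precision   # from precision = tp / (tp + fp)
    accuracy = 1 - fn - fp
    return fp, fn, accuracy


# Precision and recall taken from the balanced-data results table.
rf_fp, rf_fn, rf_acc = error_shares(precision=0.9479, recall=0.9461)
knn_fp, knn_fn, knn_acc = error_shares(precision=0.8691, recall=0.9520)

print(f"Random Forest: FP={rf_fp:.4f} FN={rf_fn:.4f} acc={rf_acc:.4f}")
print(f"KNN:           FP={knn_fp:.4f} FN={knn_fn:.4f} acc={knn_acc:.4f}")
```

Under that assumption the derived values agree, within rounding, with the reported 2.60% / 2.70% error shares and 94.71% accuracy for Random Forest, and with the 7.17% false-positive share and 90.43% accuracy for KNN.
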
<br>
<div align="center">
<!--
📊 ADD IMAGE HERE: "confusion_matrices.png"
2x2 grid showing confusion matrices
Top: KNN (Imbalanced vs Balanced)
Bottom: Random Forest (Imbalanced vs Balanced)
-->
<img src="visualizations/confusion_matrices.png" alt="Confusion Matrices Comparison" width="90%"/>
<sub>*Figure 3: Confusion matrices showing dramatic improvement from class balancing*</sub>
</div>
---
<div align="center">
## 🔍 **KEY FINDINGS**
</div>
### **1️⃣ Class Balancing Changed Everything**
<table>
<tr>
<td width="33%" align="center">
### **KNN**
<h1>🚀</h1>
<br>
**55%** → **95.20%**
<sub>Recall improved by<br><b>+40.20 percentage points</b></sub>
<br>
🎯 **Most Dramatic Improvement**
</td>
<td width="33%" align="center">
### **Decision Tree**
<h1>📈</h1>
<br>
**56%** → **91%**
<sub>Recall improved by<br><b>+35 percentage points</b></sub>
<br>
📊 **Significant Boost**
</td>
<td width="33%" align="center">
### **Random Forest**
<h1>🏆</h1>
<br>
**93%** → **94.71%**
<sub>Maintained excellence<br><b>Balanced predictions</b></sub>
<br>
👑 **Best Overall**
</td>
</tr>
</table>
<br>
### **2️⃣ Feature Importance Insights**
<div align="center">
<!--
📊 ADD IMAGE HERE: "feature_importance_rf.png"
Horizontal bar chart of top 15 features
Random Forest Gini importance
Green color scheme
-->
<img src="visualizations/feature_importance_rf.png" alt="Feature Importance Analysis" width="85%"/>
<sub>*Figure 4: Top features driving loan approval predictions (Random Forest)*</sub>
</div>
<br>
<table>
<tr>
<td width="50%">

### **🔝 Top 5 Predictors**

| Rank | Feature | Importance | 📊 |
|:----:|---------|:----------:|:--:|
| 🥇 | `previous_loan_defaults` | **36.4%** | ████████████████ |
| 🥈 | `loan_percent_income` | 12.2% | █████ |
| 🥉 | `person_income` | 9.6% | ████ |
| 4️⃣ | `loan_int_rate` | 9.2% | ████ |
| 5️⃣ | `person_home_ownership_RENT` | 8.5% | ███ |

</td>
<td width="50%">
### **💡 Key Discovery**
<br>
> **🎓 Education Level: Minimal Impact**
>
> Despite filtering for education loans, the applicant's
> education level (High School, Bachelor's, Master's, etc.)
> had **negligible predictive power**.
<br>
### **What Actually Matters:**
✅ **Financial responsibility** (loan defaults)
✅ **Debt-to-income ratio**
✅ **Income & earning capacity**
✅ **Credit risk indicators**
❌ **NOT educational credentials**
</td>
</tr>
</table>
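
A ranking like the one above can be read from a fitted Random Forest's `feature_importances_` attribute (impurity-based, i.e. Gini, importance). The sketch below uses synthetic data and placeholder feature names, so its numbers are unrelated to the project's results; the hyperparameters mirror the Random Forest setup listed earlier.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data; feature names are placeholders.
X, y = make_classification(n_samples=500, n_features=6,
                           n_informative=3, random_state=42)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]

rf = RandomForestClassifier(
    n_estimators=100,
    max_depth=None,
    min_samples_split=2,
    min_samples_leaf=1,
    random_state=42,
)
rf.fit(X, y)

# Importances are normalized to sum to 1 across all features.
importances = pd.Series(rf.feature_importances_, index=feature_names)
importances = importances.sort_values(ascending=False)
top5 = importances.head(5)
print(top5)
```

With one-hot encoded inputs, each dummy column (for example `person_home_ownership_RENT`) gets its own importance value, which is why a single category can appear in the ranking.
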
<br>
### **3️⃣ Model Performance Trade-offs**
<div align="center">
<!--
📊 ADD IMAGE HERE: "roc_curves.png"
Overlaid ROC curves for KNN and Random Forest
KNN in blue, Random Forest in green
Should show Random Forest closer to perfect discrimination
-->
<img src="visualizations/roc_curves.png" alt="ROC Curve Comparison" width="70%"/>
<sub>*Figure 5: ROC curves demonstrating superior discrimination ability*</sub>
</div>
<br>
<table>
<tr>
<td width="50%">
### **🌲 Random Forest Strengths**
✅ **Best Overall Accuracy** (94.71%)
✅ **Best Precision** (94.79%)
✅ **Best F1-Score** (94.70%)
✅ **Near-Perfect ROC-AUC** (0.9912)
✅ **Balanced Error Rates** (2.6% FP / 2.7% FN, as shares of all test predictions)
✅ **Low False Positives** (Only 79 wrongly approved)
**💼 Best for Production Deployment**
</td>
<td width="50%">
### **🎯 KNN Strengths**
✅ **Highest Recall** (95.20%)
✅ **Catches Almost All Approvals** (4.8% miss rate)
✅ **Excellent ROC-AUC** (0.9599)
⚠️ **Higher False Positives** (7.17% vs 2.60% of all test predictions)
⚠️ **Lower Precision** (86.91% vs 94.79%)
**💡 Best When Missing Approvals is Costly**
</td>
</tr>
</table>
---
<div align="center">
## 🛠️ **TECH STACK**
</div>
<table>
<tr>
<td align="center" width="20%">
<h1>🐍</h1>
**Python 3.8+**
Core Language
</td>
<td align="center" width="20%">
<h1>🤖</h1>
**scikit-learn**
ML Framework
</td>
<td align="center" width="20%">
<h1>🔢</h1>
**NumPy**
Numerical Computing
</td>
<td align="center" width="20%">
<h1>🐼</h1>
**Pandas**
Data Manipulation
</td>
<td align="center" width="20%">
<h1>📊</h1>
**Matplotlib & Seaborn**
Visualization
</td>
</tr>
</table>
<br>
<br>
### **🤖 Algorithms & Techniques**
<table>
<tr>
<td>
**Machine Learning Models:**
- 🌲 Random Forest Classifier
- 🎯 K-Nearest Neighbors (KNN)
- 🌳 Decision Trees
- 📊 Naive Bayes (Gaussian & Categorical)
</td>
<td>
**Data Processing:**
- ⚖️ SMOTENC (Class Balancing)
- 📏 StandardScaler (Feature Scaling)
- 🔄 One-Hot Encoding
- ✂️ Train-Test Split (Stratified)
</td>
<td>
**Development Tools:**
- 📓 Jupyter Notebooks
- ☁️ Google Colab
- 🔧 Git & GitHub
- 📊 Pandas & NumPy
</td>
</tr>
</table>
---
<div align="center">
## 🚀 **QUICK START**
</div>
<table>
<tr>
<td width="50%">
### **⚡ Option 1: Google Colab** (Recommended)
**No installation required!** Click below to run in your browser:
<br>
<div align="center">

[Open in Colab](YOUR_COLAB_LINK_HERE)

</div>
<br>
**Features:**
- ✅ Free GPU access
- ✅ Pre-installed libraries
- ✅ Instant execution
- ✅ Easy sharing
<br>
**Just click and run!** 🎯
</td>
<td width="50%">
### **💻 Option 2: Run Locally**
```bash
# 1. Clone repository
git clone https://github.com/YOUR_USERNAME/gradient-capital.git
cd gradient-capital
# 2. Create virtual environment
python -m venv venv
source venv/bin/activate # Windows: venv\Scripts\activate
# 3. Install dependencies
pip install -r requirements.txt
# 4. Download dataset from Kaggle
# https://www.kaggle.com/datasets/taweilo/loan-approval-classification-data/
# 5. Launch Jupyter
jupyter notebook
# 6. Open notebook
# notebooks/gradient_capital.ipynb
```

</td>
</tr>
</table>

- Lead Developer & ML Engineer
- Data Scientist & Model Testing
- Data Preprocessing & Analysis

🎓 Fordham University | 📚 Data Mining (CISC 4800) | 📅 Fall 2024

This project is licensed under the MIT License
See the LICENSE file for details
Special thanks to:
Dr. [Professor Name], Fordham University • Kaggle Community • scikit-learn Team • imbalanced-learn Developers
Questions? Opportunities? Collaboration?
Built with ❤️ by Bryan Pineda, Carl Delos Santos, and Katherine Bonilla
Fordham University • Fall 2024




