🎯

Transforming Student Loan Approvals with Machine Learning

🚀 Live Demo • 📊 Full Report • 👥 Meet the Team

## 🎯 **AT A GLANCE**

| 🎯 Accuracy | 🚀 Recall (KNN) | 📈 Recall Boost | 🏆 ROC-AUC |
|:---:|:---:|:---:|:---:|
| 94.71% | 95.20% | +40 pts | 0.9912 |

## 📑 **QUICK NAVIGATION**

| 🎯 CORE | 📊 DATA & METHODS | 💡 INSIGHTS | 🚀 MORE |
|---|---|---|---|
| Overview | Dataset | Key Findings | Quick Start |
| The Problem | Methodology | Visualizations | Team |
| Our Solution | Results | Tech Stack | References |

## 🌟 **OVERVIEW**

Gradient Capital is a machine learning project that predicts student loan approval decisions. The original data was heavily imbalanced, so our first models were biased toward rejection. By combining SMOTENC class balancing with ensemble methods, we obtained models that reach 94.71% accuracy and 0.9912 ROC-AUC (Random Forest) on the balanced dataset, with nearly equal false-positive and false-negative rates.

🏆 Why This Project Stands Out

+ 94.71% accuracy with Random Forest
+ Near-perfect discrimination (0.9912 ROC-AUC)
+ 40-point recall improvement through smart balancing
+ Production-ready models for real-world deployment
+ Transparent feature importance for explainable AI

⚡ Quick Stats

| Metric | Achievement |
|--------|-------------|
| 📈 Best Model | Random Forest |
| 🎯 Accuracy | 94.71% |
| 🔍 Precision | 94.79% |
| 🎪 Recall | 95.20% (KNN) |
| 📊 F1-Score | 94.70% |
| 🚀 ROC-AUC | 0.9912 |
| ⚖️ Balance | 2.6% FP / 2.7% FN |

<sub>*Figure 1: Comprehensive model performance comparison on balanced dataset*</sub>
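The metrics in the Quick Stats table all follow from a confusion matrix by simple arithmetic. The sketch below is a minimal illustration with made-up counts per 1,000 predictions, chosen so that false positives and false negatives are 2.6% and 2.7% of all predictions. It assumes (the source does not say so explicitly) that the FP/FN figures are shares of all test predictions, and the counts are not the project's actual confusion matrix.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute standard binary-classification metrics from confusion counts."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
        "fp_share": fp / total,
        "fn_share": fn / total,
    }


# Illustrative counts per 1,000 predictions (not the project's real counts).
metrics = classification_metrics(tp=473, fp=26, fn=27, tn=474)
for name, value in metrics.items():
    print(f"{name:>9}: {value:.4f}")
```

With these illustrative counts the accuracy is 1 - 0.026 - 0.027 = 0.947, which is in line with the reported 94.71%.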

## 🚨 **THE PROBLEM**

Traditional Lending Challenges

⏱️ Slow & Manual

Manual loan reviews are:

- Time-consuming
- Labor-intensive
- Expensive to operate
- Prone to delays

🎲 Inconsistent & Biased

Human decision-making leads to:

- Subjective evaluations
- Demographic prejudice
- Lack of transparency
- Unfair outcomes

📉 Poor Performance

Traditional systems suffer from:

- Rejecting qualified applicants
- Missed revenue opportunities
- Customer dissatisfaction
- High error rates

Our Initial ML Challenge: Class Imbalance

Original Dataset Distribution

🔴 Rejected Loans:  ████████████████  ~7,600
🟢 Approved Loans:  ███               ~1,500

Ratio: 83% Rejected : 17% Approved

Result: Models heavily biased toward rejection!

→

Impact on Model Performance

| Model | Recall | Qualified Applicants Rejected |
|-------|:------:|:-----------------------------:|
| Decision Tree | 56% | 44% ❌ |
| KNN | 55% | 45% ❌ |

Nearly HALF of qualified borrowers were incorrectly denied!
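To see why accuracy alone hides this problem, consider a trivial model that rejects every applicant. The sketch below uses the approximate class counts from the distribution above and the recall values from the table; `wrongly_rejected_share` is a hypothetical helper introduced only for this illustration.

```python
# Approximate class counts from the original distribution above.
rejected, approved = 7600, 1500
total = rejected + approved

# A "model" that rejects everyone is correct on every rejected loan...
baseline_accuracy = rejected / total
# ...but it finds none of the approvals, so its recall is zero.
baseline_recall = 0 / approved


def wrongly_rejected_share(recall):
    """Share of qualified applicants that a model with this recall rejects."""
    return 1 - recall


print(f"always-reject accuracy:  {baseline_accuracy:.1%}")
print(f"always-reject recall:    {baseline_recall:.1%}")
print(f"Decision Tree miss rate: {wrongly_rejected_share(0.56):.0%}")
print(f"KNN miss rate:           {wrongly_rejected_share(0.55):.0%}")
```

An accuracy above 80% is therefore achievable without approving anyone, which is why recall on the approved class is the metric to watch here.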

## 💡 **OUR SOLUTION**

🔬 STEP 1

Data Preparation

📊

- Clean 45K loan records
- Filter to 9,153 education loans
- Handle outliers & binning
- One-hot encode features
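As a rough illustration of the binning and encoding steps, the sketch below bins a numeric column into equal-width intervals and one-hot encodes a categorical column using only the standard library. The sample values, the number of bins, and the helper names are assumptions made for the example; the project notebook may well use pandas or scikit-learn utilities instead.

```python
def equal_width_bins(values, n_bins):
    """Assign each value to one of n_bins equally wide intervals (0-based)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        return [0 for _ in values]
    # The maximum value would fall just past the last bin, so clamp it.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def one_hot(values):
    """Return (categories, rows); each row has a single 1 for its category."""
    categories = sorted(set(values))
    index = {c: i for i, c in enumerate(categories)}
    rows = [
        [1 if index[v] == i else 0 for i in range(len(categories))]
        for v in values
    ]
    return categories, rows


ages = [21, 24, 30, 37, 45, 61]                   # made-up person_age values
ownership = ["RENT", "OWN", "MORTGAGE", "RENT"]   # made-up person_home_ownership values

age_bins = equal_width_bins(ages, n_bins=4)
categories, encoded = one_hot(ownership)
print(age_bins)
print(categories)
print(encoded)
```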

⚖️ STEP 2

SMOTENC Balancing

🔄

- Synthetic minority sampling
- Achieve 50-50 class balance
- Preserve feature distributions
- Smart categorical handling
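The idea behind synthetic sampling with mixed feature types can be sketched in a few lines. This is a simplified, standard-library illustration of a SMOTENC-style step, not the imbalanced-learn implementation: numeric features are interpolated between a minority row and one of its nearest minority neighbors, while categorical features take the most common value among those neighbors. The fixed penalty for categorical mismatches, the sample rows, and the function name are all assumptions.

```python
import random
from collections import Counter


def smotenc_like_sample(minority, numeric_idx, categorical_idx, k=5, rng=None):
    """Create one synthetic row from a list of minority-class rows."""
    rng = rng or random.Random(42)
    base = rng.choice(minority)

    def distance(a, b):
        numeric = sum((a[i] - b[i]) ** 2 for i in numeric_idx)
        # Simplification: a fixed penalty of 1 per categorical mismatch.
        mismatches = sum(a[i] != b[i] for i in categorical_idx)
        return numeric + mismatches

    others = [row for row in minority if row is not base]
    neighbors = sorted(others, key=lambda row: distance(base, row))[:k]
    partner = rng.choice(neighbors)
    gap = rng.random()  # position on the segment between base and partner

    synthetic = list(base)
    for i in numeric_idx:
        synthetic[i] = base[i] + gap * (partner[i] - base[i])
    for i in categorical_idx:
        # Categories are never interpolated: take the neighbors' mode.
        synthetic[i] = Counter(row[i] for row in neighbors).most_common(1)[0][0]
    return synthetic


# Made-up minority rows: [scaled income, loan_percent_income, home ownership]
approved = [
    [0.40, 0.10, "RENT"],
    [0.55, 0.12, "RENT"],
    [0.50, 0.08, "OWN"],
    [0.62, 0.15, "RENT"],
]
new_row = smotenc_like_sample(approved, numeric_idx=[0, 1], categorical_idx=[2], k=3)
print(new_row)
```

Repeating this step until both classes have the same number of rows gives the 50-50 balance described above.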

🤖 STEP 3

Model Training

🧠

- Test 4 ML algorithms
- Optimize hyperparameters
- 5-fold cross-validation
- Comprehensive evaluation
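For readers unfamiliar with 5-fold cross-validation, the sketch below shows the index bookkeeping: the data is cut into five folds and each fold serves once as the validation set while the other four are used for training. It is a minimal illustration without shuffling or stratification, and `k_fold_indices` and `cross_val_mean` are hypothetical helpers, not functions from the project.

```python
def k_fold_indices(n_samples, n_folds=5):
    """Yield (train_idx, val_idx) pairs; each sample is validated exactly once."""
    indices = list(range(n_samples))
    # Spread the remainder so that fold sizes differ by at most one.
    base, extra = divmod(n_samples, n_folds)
    start = 0
    for fold in range(n_folds):
        size = base + (1 if fold < extra else 0)
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size


def cross_val_mean(score_fn, n_samples, n_folds=5):
    """Average a caller-supplied score(train_idx, val_idx) over all folds."""
    scores = [score_fn(tr, va) for tr, va in k_fold_indices(n_samples, n_folds)]
    return sum(scores) / len(scores)


folds = list(k_fold_indices(n_samples=12, n_folds=5))
for train_idx, val_idx in folds:
    print(len(train_idx), val_idx)
```

A hyperparameter such as K for KNN can then be chosen by computing `cross_val_mean` for each candidate value and keeping the one with the best average score.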

🎯 The Transformation

BEFORE SMOTENC ❌

Imbalanced Data
┌─────────────────┐
│  Rejections: 83%│
│  Approvals: 17% │
└─────────────────┘
        ↓
   Poor Recall
   (55-56%)
        ↓
   Rejected 45% of
   Qualified Applicants

→

AFTER SMOTENC ✅

Balanced Data
┌─────────────────┐
│  Rejections: 50%│
│  Approvals: 50% │
└─────────────────┘
        ↓
  Excellent Recall
   (95.20%)
        ↓
  Fair & Accurate
  Predictions

<sub>*Figure 2: Class distribution before and after SMOTENC application*</sub>

## 📊 **DATASET**

📁 Source & Statistics

Origin: Kaggle Loan Approval Dataset

| Property | Value |
|----------|-------|
| Total Records | 45,000 loans |
| After Filtering | 9,153 education loans |
| Features | 14 attributes |
| Target | Binary (Approved/Rejected) |
| Missing Values | 0 (Clean dataset ✅) |
| Class Ratio (Original) | 83% : 17% (Imbalanced) |
| Class Ratio (SMOTENC) | 50% : 50% (Balanced) |

🔑 Key Features

💰 Financial Indicators

- `person_income` - Annual income
- `loan_percent_income` - Loan amount as a share of annual income
- `credit_score` - Creditworthiness
- `loan_int_rate` - Interest rate
- `loan_amnt` - Loan amount

📋 Applicant Profile

- `person_emp_exp` - Employment years
- `person_age` - Applicant age
- `person_home_ownership` - Rent/Own/Mortgage
- `person_education` - Education level

🚨 Risk Factors

- `previous_loan_defaults` - Default history
- `cb_person_cred_hist_length` - Credit history length

## 🔬 **METHODOLOGY**

🛠️ Data Preprocessing Pipeline

graph LR
    A[45K Loans] --> B[Filter Education<br/>9,153 loans]
    B --> C[Remove Outliers<br/>Upper Bound]
    C --> D[Equal-Width<br/>Binning]
    D --> E[One-Hot<br/>Encoding]
    E --> F[Train-Test Split<br/>80-20]
    F --> G[Ready for<br/>Training]
    
    style A fill:#e1f5ff
    style G fill:#d4edda
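The outlier and split stages of this pipeline can be sketched as follows. The source only says outliers are removed at an "upper bound"; the Tukey fence (Q3 + 1.5 × IQR) used here is an assumption, as are the sample values and helper names. The split keeps the class proportions the same in both parts, which is what a stratified 80-20 split means.

```python
import random
import statistics


def upper_bound(values):
    """Tukey-style upper fence: Q3 + 1.5 * IQR (an assumed definition)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    return q3 + 1.5 * (q3 - q1)


def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_idx, test_idx), sampling each class separately."""
    rng = random.Random(seed)
    by_class = {}
    for i, label in enumerate(labels):
        by_class.setdefault(label, []).append(i)
    train_idx, test_idx = [], []
    for idx in by_class.values():
        rng.shuffle(idx)
        n_test = round(len(idx) * test_size)
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return sorted(train_idx), sorted(test_idx)


incomes = [30, 35, 40, 42, 45, 50, 55, 60, 65, 400]  # made-up values, in thousands
limit = upper_bound(incomes)
kept = [v for v in incomes if v <= limit]

labels = [0] * 80 + [1] * 20                         # made-up 80/20 class mix
train_idx, test_idx = stratified_split(labels)
print(limit, kept)
print(len(train_idx), len(test_idx))
```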

🤖 Models Tested

📉

Naive Bayes

Probabilistic
Baseline Model

81-82% Accuracy

🌳

Decision Tree

Rule-Based
Interpretable

91% Accuracy

🎯

K-Nearest Neighbors

Distance-Based
High Recall

90.43% Accuracy

🌲

Random Forest

Ensemble
Best Overall

94.71% Accuracy ✅
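Because KNN is distance-based, how neighbors are weighted matters. The sketch below is a minimal standard-library illustration of the `weights='distance'` idea listed in the KNN configuration: each of the k nearest neighbors votes with weight 1/distance, so close neighbors can outvote a larger number of distant ones. The toy points (assumed to be already scaled) and the function name are assumptions, and scikit-learn's handling of edge cases may differ.

```python
import math
from collections import defaultdict


def knn_predict(train_X, train_y, query, k=5):
    """Distance-weighted k-NN vote: each neighbor counts with weight 1/distance."""
    distances = sorted(
        (math.dist(x, query), label) for x, label in zip(train_X, train_y)
    )
    votes = defaultdict(float)
    for dist, label in distances[:k]:
        if dist == 0:
            return label  # an exact match decides the vote outright
        votes[label] += 1.0 / dist
    return max(votes, key=votes.get)


# Made-up, already-scaled points: two rejected near the origin, three approved far away.
train_X = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0), (1.1, 1.0), (0.9, 1.2)]
train_y = ["rejected", "rejected", "approved", "approved", "approved"]

prediction = knn_predict(train_X, train_y, query=(0.2, 0.1), k=5)
print(prediction)
```

In this toy case an unweighted majority vote over all five neighbors would favor the three distant "approved" points, while the distance-weighted vote follows the two nearby "rejected" points.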

⚙️ Model Configuration

<details>
<summary><b>🎯 K-Nearest Neighbors Setup</b></summary>

<br>

```python
# Data Preparation
✓ StandardScaler (critical for distance calculations)
✓ One-hot encoding for categorical features
✓ 80-20 stratified train-test split

# Model Parameters
├─ Imbalanced Data: K=18 (via 5-fold CV)
├─ Balanced Data: K=5 (re-optimized)
├─ weights='distance' (closer neighbors weighted higher)
└─ random_state=0 (imbalanced) / 42 (balanced)
```


</details>

<details>
<summary><b>🌲 Random Forest Setup</b></summary>

<br>
```python
# Data Preparation
✓ One-hot encoding for categorical features
✓ No scaling required (tree-based model)
✓ 80-20 stratified train-test split

# Model Parameters
├─ n_estimators=100 (100 decision trees)
├─ max_depth=None (full depth for max learning)
├─ min_samples_split=2 (default)
├─ min_samples_leaf=1 (detailed patterns)
└─ random_state=1 (imbalanced) / 42 (balanced)
```

</details>

<details>
<summary><b>⚖️ SMOTENC Configuration</b></summary>

<br>

```python
# SMOTENC Parameters
├─ random_state=42 (reproducibility)
├─ categorical_features=[1, 2, 5, 11] (specify categorical indices)
├─ k_neighbors=5 (default for synthetic sample generation)
└─ sampling_strategy='auto' (balance to 50-50)

# Result
Before: ~7,600 rejections | ~1,500 approvals (about 5:1 ratio)
After:   7,632 rejections |  7,632 approvals (1:1 ratio)
```


</details>

<br>

### **📏 Evaluation Metrics**

<table>
<tr>
<td width="20%" align="center">

**Accuracy**  
<sub>Overall correctness</sub>

</td>
<td width="20%" align="center">

**Precision**  
<sub>Predicted approvals<br>that were correct</sub>

</td>
<td width="20%" align="center">

**Recall**  
<sub>Actual approvals<br>we caught</sub>

</td>
<td width="20%" align="center">

**F1-Score**  
<sub>Harmonic mean of<br>Precision & Recall</sub>

</td>
<td width="20%" align="center">

**ROC-AUC**  
<sub>Discrimination<br>ability</sub>

</td>
</tr>
</table>

---

<div align="center">

## 📈 **RESULTS**

</div>

### **🔴 Performance on Imbalanced Data**

<table>
<tr>
<td width="70%">

| Model | Accuracy | Precision | Recall | F1-Score | Status |
|-------|:--------:|:---------:|:------:|:--------:|:------:|
| Naive Bayes | 81-82% | - | - | - | ❌ Excluded |
| Decision Tree | 92% | 89% | **56%** ⚠️ | - | Poor Recall |
| **KNN** | 90.3% | 82.4% | **55%** ⚠️ | 66% | Poor Recall |
| **Random Forest** | **93%** ✅ | 93% | 93% | 92% | Best |

</td>
<td width="30%">

### **🚨 Critical Issue**

<br>

**Rejecting qualified applicants**

| Model | Qualified applicants rejected |
|-------|:-----------------------------:|
| Decision Tree | 44% ❌ |
| KNN | 45% ❌ |


<br>

**Nearly HALF rejected!**

</td>
</tr>
</table>

---

### **🟢 Performance on Balanced Data (SMOTENC)**

<div align="center">

<table>
<thead>
<tr>
<th>Model</th>
<th>Accuracy</th>
<th>Precision</th>
<th>Recall</th>
<th>F1-Score</th>
<th>ROC-AUC</th>
</tr>
</thead>
<tbody>
<tr>
<td><b>Decision Tree</b></td>
<td>91%</td>
<td>89%</td>
<td><b>91%</b> ⬆️</td>
<td>~90%</td>
<td>-</td>
</tr>
<tr>
<td><b>KNN</b></td>
<td>90.43%</td>
<td>86.91%</td>
<td><b>95.20%</b> ⬆️⬆️</td>
<td>90.86%</td>
<td>0.9599</td>
</tr>
<tr style="background-color: #d4edda;">
<td><b>Random Forest</b></td>
<td><b>94.71%</b> 🏆</td>
<td><b>94.79%</b> 🏆</td>
<td><b>94.61%</b></td>
<td><b>94.70%</b> 🏆</td>
<td><b>0.9912</b> 🏆</td>
</tr>
</tbody>
</table>

</div>

<br>

<table>
<tr>
<td width="50%">

### **📊 SMOTENC Impact Summary**

| Model | Metric | Before | After | Improvement |
|-------|--------|--------|-------|-------------|
| **KNN** | Recall | 55% | 95.20% | <span style="color: green;">**+40.20 pts**</span> 🚀 |
| **Decision Tree** | Recall | 56% | 91% | <span style="color: green;">**+35 pts**</span> 🚀 |
| **Random Forest** | Accuracy | 93% | 94.71% | <span style="color: green;">**+1.71 pts**</span> ✅ |

</td>
<td width="50%">

### **🎯 Winner: Random Forest**

<br>

✅ **Best Accuracy:** 94.71%  
✅ **Best Precision:** 94.79%  
✅ **Best F1-Score:** 94.70%  
✅ **Best ROC-AUC:** 0.9912  

⚖️ **Balanced Errors:** FP Rate 2.60%, FN Rate 2.70%  

🏆 **Production-Ready Model**


</td>
</tr>
</table>

<br>

<div align="center">

<!-- 
📊 ADD IMAGE HERE: "confusion_matrices.png"
2x2 grid showing confusion matrices
Top: KNN (Imbalanced vs Balanced)
Bottom: Random Forest (Imbalanced vs Balanced)
-->

<img src="visualizations/confusion_matrices.png" alt="Confusion Matrices Comparison" width="90%"/>

<sub>*Figure 3: Confusion matrices showing dramatic improvement from class balancing*</sub>

</div>

---

<div align="center">

## 🔍 **KEY FINDINGS**

</div>

### **1️⃣ Class Balancing Changed Everything**

<table>
<tr>
<td width="33%" align="center">

### **KNN**

<h1>🚀</h1>

<br>

**55%** → **95.20%**

<sub>Recall improved by<br><b>+40.20 percentage points</b></sub>

<br>

🎯 **Most Dramatic Improvement**

</td>
<td width="33%" align="center">

### **Decision Tree**

<h1>📈</h1>

<br>

**56%** → **91%**

<sub>Recall improved by<br><b>+35 percentage points</b></sub>

<br>

📊 **Significant Boost**

</td>
<td width="33%" align="center">

### **Random Forest**

<h1>🏆</h1>

<br>

**93%** → **94.71%**

<sub>Maintained excellence<br><b>Balanced predictions</b></sub>

<br>

👑 **Best Overall**

</td>
</tr>
</table>

<br>

### **2️⃣ Feature Importance Insights**

<div align="center">

<!-- 
📊 ADD IMAGE HERE: "feature_importance_rf.png"
Horizontal bar chart of top 15 features
Random Forest Gini importance
Green color scheme
-->

<img src="visualizations/feature_importance_rf.png" alt="Feature Importance Analysis" width="85%"/>

<sub>*Figure 4: Top features driving loan approval predictions (Random Forest)*</sub>

</div>

<br>

<table>
<tr>
<td width="50%">

### **🔝 Top 5 Predictors**

| Rank | Feature | Importance | 📊 |
|:----:|---------|:----------:|:--:|
| 🥇 | `previous_loan_defaults` | **36.4%** | ████████████████ |
| 🥈 | `loan_percent_income` | 12.2% | █████ |
| 🥉 | `person_income` | 9.6% | ████ |
| 4️⃣ | `loan_int_rate` | 9.2% | ████ |
| 5️⃣ | `person_home_ownership_RENT` | 8.5% | ███ |

</td>
<td width="50%">

### **💡 Key Discovery**

<br>

> **🎓 Education Level: Minimal Impact**
>
> Despite filtering for education loans, the applicant's  
> education level (High School, Bachelor's, Master's, etc.)  
> had **negligible predictive power**.

<br>

### **What Actually Matters:**

✅ **Financial responsibility** (loan defaults)  
✅ **Debt-to-income ratio**  
✅ **Income & earning capacity**  
✅ **Credit risk indicators**  

❌ **NOT educational credentials**

</td>
</tr>
</table>

<br>

### **3️⃣ Model Performance Trade-offs**

<div align="center">

<!-- 
📊 ADD IMAGE HERE: "roc_curves.png"
Overlaid ROC curves for KNN and Random Forest
KNN in blue, Random Forest in green
Should show Random Forest closer to perfect discrimination
-->

<img src="visualizations/roc_curves.png" alt="ROC Curve Comparison" width="70%"/>

<sub>*Figure 5: ROC curves demonstrating superior discrimination ability*</sub>

</div>

<br>

<table>
<tr>
<td width="50%">

### **🌲 Random Forest Strengths**

✅ **Best Overall Accuracy** (94.71%)  
✅ **Best Precision** (94.79%)  
✅ **Best F1-Score** (94.70%)  
✅ **Near-Perfect ROC-AUC** (0.9912)  
✅ **Balanced Error Rates** (2.6% FP / 2.7% FN)  
✅ **Low False Positives** (Only 79 wrongly approved)  

**💼 Best for Production Deployment**

</td>
<td width="50%">

### **🎯 KNN Strengths**

✅ **Highest Recall** (95.20%)  
✅ **Catches Almost All Approvals** (4.8% miss rate)  
✅ **Excellent ROC-AUC** (0.9599)  
⚠️ **Higher False Positives** (7.17% vs 2.60%)  
⚠️ **Lower Precision** (86.91% vs 94.79%)  

**💡 Best When Missing Approvals is Costly**

</td>
</tr>
</table>

---

<div align="center">

## 🛠️ **TECH STACK**

</div>

<table>
<tr>
<td align="center" width="20%">

<h1>🐍</h1>

**Python 3.8+**

Core Language

</td>
<td align="center" width="20%">

<h1>🤖</h1>

**scikit-learn**

ML Framework

</td>
<td align="center" width="20%">

<h1>🔢</h1>

**NumPy**

Numerical Computing

</td>
<td align="center" width="20%">

<h1>🐼</h1>

**Pandas**

Data Manipulation

</td>
<td align="center" width="20%">

<h1>📊</h1>

**Matplotlib & Seaborn**

Visualization

</td>
</tr>
</table>

<br>

<div align="center">

![Python](https://img.shields.io/badge/Python-3776AB?style=for-the-badge&logo=python&logoColor=white)
![NumPy](https://img.shields.io/badge/NumPy-013243?style=for-the-badge&logo=numpy&logoColor=white)
![Pandas](https://img.shields.io/badge/Pandas-150458?style=for-the-badge&logo=pandas&logoColor=white)
![scikit-learn](https://img.shields.io/badge/scikit--learn-F7931E?style=for-the-badge&logo=scikit-learn&logoColor=white)
![Jupyter](https://img.shields.io/badge/Jupyter-F37626?style=for-the-badge&logo=jupyter&logoColor=white)
![Matplotlib](https://img.shields.io/badge/Matplotlib-11557c?style=for-the-badge&logo=python&logoColor=white)
![Seaborn](https://img.shields.io/badge/Seaborn-3776AB?style=for-the-badge&logo=python&logoColor=white)

</div>

<br>

### **🤖 Algorithms & Techniques**

<table>
<tr>
<td>

**Machine Learning Models:**
- 🌲 Random Forest Classifier
- 🎯 K-Nearest Neighbors (KNN)
- 🌳 Decision Trees
- 📊 Naive Bayes (Gaussian & Categorical)

</td>
<td>

**Data Processing:**
- ⚖️ SMOTENC (Class Balancing)
- 📏 StandardScaler (Feature Scaling)
- 🔄 One-Hot Encoding
- ✂️ Train-Test Split (Stratified)

</td>
<td>

**Development Tools:**
- 📓 Jupyter Notebooks
- ☁️ Google Colab
- 🔧 Git & GitHub
- 📊 Pandas & NumPy

</td>
</tr>
</table>

---

<div align="center">

## 🚀 **QUICK START**

</div>

<table>
<tr>
<td width="50%">

### **⚡ Option 1: Google Colab** (Recommended)

**No installation required!** Click below to run in your browser:

<br>

<div align="center">

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](YOUR_COLAB_LINK_HERE)

</div>

<br>

**Features:**
- ✅ Free GPU access
- ✅ Pre-installed libraries
- ✅ Instant execution
- ✅ Easy sharing

<br>

**Just click and run!** 🎯

</td>
<td width="50%">

### **💻 Option 2: Run Locally**
```bash
# 1. Clone repository
git clone https://github.com/YOUR_USERNAME/gradient-capital.git
cd gradient-capital

# 2. Create virtual environment
python -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate

# 3. Install dependencies
pip install -r requirements.txt

# 4. Download dataset from Kaggle
# https://www.kaggle.com/datasets/taweilo/loan-approval-classification-data/

# 5. Launch Jupyter
jupyter notebook

# 6. Open notebook
# notebooks/gradient_capital.ipynb
```

</td>
</tr>
</table>

## 👥 **TEAM**

The Data Scientists Behind Gradient Capital

- **Bryan Pineda**: Lead Developer & ML Engineer
- **Carl Delos Santos**: Data Scientist & Model Testing
- **Katherine Bonilla**: Data Preprocessing & Analysis

🎓 Fordham University | 📚 Data Mining (CISC 4800) | 📅 Fall 2024

## 📚 **REFERENCES**

📊 Data & Research

- **Dataset Source:** Loan Approval Classification Data. Kaggle. https://www.kaggle.com/datasets/taweilo/loan-approval-classification-data/
- **SMOTENC Algorithm:** Chawla, N. V., et al. (2002). SMOTE: Synthetic Minority Over-sampling Technique. *Journal of Artificial Intelligence Research*, 16, 321-357.
- **Random Forest:** Breiman, L. (2001). Random Forests. *Machine Learning*, 45(1), 5-32.

🛠️ Tools & Libraries

- **scikit-learn:** https://scikit-learn.org/
- **imbalanced-learn:** https://imbalanced-learn.org/
- **pandas:** https://pandas.pydata.org/
- **NumPy:** https://numpy.org/
- **Matplotlib:** https://matplotlib.org/
- **Seaborn:** https://seaborn.pydata.org/

## 📄 **LICENSE**

This project is licensed under the MIT License. See the LICENSE file for details.

## 🙏 **ACKNOWLEDGMENTS**

Special thanks to:

Dr. [Professor Name], Fordham University • Kaggle Community • scikit-learn Team • imbalanced-learn Developers

## 📬 **GET IN TOUCH**

Questions? Opportunities? Collaboration?

⭐ If you found this project valuable, please star the repository! ⭐

<sub>Built with ❤️ by Bryan Pineda, Carl Delos Santos, and Katherine Bonilla</sub>

<sub>Fordham University • Fall 2024</sub>
