GRADIENT CAPITAL

Transforming Student Loan Approvals with Machine Learning

Python · scikit-learn · Jupyter · Colab


🎯 AT A GLANCE

| 🎯 Accuracy | 🚀 Recall (KNN) | 📈 Recall Boost | 🏆 ROC-AUC |
|:-----------:|:---------------:|:---------------:|:----------:|
| 94.71% | 95.20% | +40 pts | 0.9912 |



🌟 OVERVIEW

Gradient Capital is a machine learning solution for student loan approval decisions. By pairing ensemble methods with SMOTENC class balancing, we turned biased, low-recall models into production-ready classifiers that make fair, accurate, and consistent lending decisions.

🏆 Why This Project Stands Out

+ 94.71% accuracy with Random Forest
+ Near-perfect discrimination (0.9912 ROC-AUC)
+ 40-point recall improvement through smart balancing
+ Production-ready models for real-world deployment
+ Transparent feature importance for explainable AI

⚡ Quick Stats

| Metric | Achievement |
|--------|-------------|
| 📈 Best Model | Random Forest |
| 🎯 Accuracy | 94.71% |
| 🔍 Precision | 94.79% |
| 🎪 Recall | 95.20% (KNN) |
| 📊 F1-Score | 94.70% |
| 🚀 ROC-AUC | 0.9912 |
| ⚖️ Balance | 2.6% FP / 2.7% FN |
Figure 1: Comprehensive model performance comparison on balanced dataset


🚨 THE PROBLEM

Traditional Lending Challenges

⏱️ Slow & Manual

Manual loan reviews are:

  • Time-consuming
  • Labor-intensive
  • Expensive to operate
  • Prone to delays

🎲 Inconsistent & Biased

Human decision-making leads to:

  • Subjective evaluations
  • Demographic prejudice
  • Lack of transparency
  • Unfair outcomes

📉 Poor Performance

Traditional systems suffer from:

  • Rejecting qualified applicants
  • Missed revenue opportunities
  • Customer dissatisfaction
  • High error rates

Our Initial ML Challenge: Class Imbalance

Original Dataset Distribution

🔴 Rejected Loans:  ████████████████  7,600
🟢 Approved Loans:  ███               1,500

Ratio: 83% Rejected : 17% Approved

Result: Models heavily biased toward rejection!

Impact on Model Performance

| Model | Recall | Qualified Applicants Rejected |
|-------|:------:|:-----------------------------:|
| Decision Tree | 56% | 44% |
| KNN | 55% | 45% |

Nearly HALF of qualified borrowers were incorrectly denied!


💡 OUR SOLUTION

🔬 STEP 1

Data Preparation

📊

  • Clean 45K loan records
  • Filter to 9,153 education loans
  • Handle outliers & binning
  • One-hot encode features

⚖️ STEP 2

SMOTENC Balancing

🔄

  • Synthetic minority sampling
  • Achieve 50-50 class balance
  • Preserve feature distributions
  • Smart categorical handling

🤖 STEP 3

Model Training

🧠

  • Test 4 ML algorithms
  • Optimize hyperparameters
  • 5-fold cross-validation
  • Comprehensive evaluation
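A minimal sketch of how these three steps chain together, assuming an imbalanced-learn pipeline; the categorical column indices and the `X_train`/`y_train`/`X_test`/`y_test` names are placeholders, not the project's exact code:

```python
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTENC
from sklearn.ensemble import RandomForestClassifier

# Step 2 + Step 3 chained: balance the training folds, then fit the ensemble.
pipe = Pipeline(steps=[
    ("balance", SMOTENC(categorical_features=[1, 2, 5, 11], random_state=42)),
    ("model", RandomForestClassifier(n_estimators=100, random_state=42)),
])
pipe.fit(X_train, y_train)          # SMOTENC resamples training data only
print(pipe.score(X_test, y_test))   # accuracy on untouched held-out loans
```

Keeping the resampler inside the pipeline means synthetic samples are generated only from training data, so test metrics are not inflated by oversampling.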

🎯 The Transformation

BEFORE SMOTENC

Imbalanced Data
┌─────────────────┐
│  Rejections: 83%│
│  Approvals: 17% │
└─────────────────┘
        ↓
   Poor Recall
   (55-56%)
        ↓
   Rejected 45% of
   Qualified Applicants

AFTER SMOTENC

Balanced Data
┌─────────────────┐
│  Rejections: 50%│
│  Approvals: 50% │
└─────────────────┘
        ↓
  Excellent Recall
   (95.20%)
        ↓
  Fair & Accurate
  Predictions

Figure 2: Class distribution before and after SMOTENC application


📊 DATASET

📁 Source & Statistics

Origin: Kaggle Loan Approval Dataset

| Property | Value |
|----------|-------|
| Total Records | 45,000 loans |
| After Filtering | 9,153 education loans |
| Features | 14 attributes |
| Target | Binary (Approved/Rejected) |
| Missing Values | 0 (clean dataset ✅) |
| Class Ratio (Original) | 83% : 17% (imbalanced) |
| Class Ratio (SMOTENC) | 50% : 50% (balanced) |

🔑 Key Features


💰 Financial Indicators

  • person_income - Annual income
  • loan_percent_income - Debt-to-income ratio
  • credit_score - Credit worthiness
  • loan_int_rate - Interest rate
  • loan_amnt - Loan amount

📋 Applicant Profile

  • person_emp_exp - Employment years
  • person_age - Applicant age
  • person_home_ownership - Rent/Own/Mortgage
  • person_education - Education level

🚨 Risk Factors

  • previous_loan_defaults - Default history
  • cb_person_cred_hist_length - Credit history length

🔬 METHODOLOGY

🛠️ Data Preprocessing Pipeline

```mermaid
graph LR
    A[45K Loans] --> B[Filter Education<br/>9,153 loans]
    B --> C[Remove Outliers<br/>Upper Bound]
    C --> D[Equal-Width<br/>Binning]
    D --> E[One-Hot<br/>Encoding]
    E --> F[Train-Test Split<br/>80-20]
    F --> G[Ready for<br/>Training]

    style A fill:#e1f5ff
    style G fill:#d4edda
```
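The same pipeline could be sketched in code roughly as below, continuing from the `edu` frame loaded in the dataset section; the IQR outlier rule, the bin count, and the binned column are illustrative assumptions rather than the project's exact choices:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Remove upper-bound outliers on income (illustrative IQR rule)
q1, q3 = edu["person_income"].quantile([0.25, 0.75])
edu = edu[edu["person_income"] <= q3 + 1.5 * (q3 - q1)].copy()

# Equal-width binning of a continuous attribute
edu["age_bin"] = pd.cut(edu["person_age"], bins=5)

# One-hot encode categoricals, then an 80-20 stratified split
X = pd.get_dummies(edu.drop(columns="loan_status"))
y = edu["loan_status"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=42)
```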

🤖 Models Tested

📉

Naive Bayes

Probabilistic
Baseline Model

81-82% Accuracy

🌳

Decision Tree

Rule-Based
Interpretable

91% Accuracy

🎯

K-Nearest Neighbors

Distance-Based
High Recall

90.43% Accuracy

🌲

Random Forest

Ensemble
Best Overall

94.71% Accuracy
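A hedged sketch of how such a four-way comparison can be run with 5-fold cross-validation; the model settings here are defaults or illustrative values, `X`/`y` are the preprocessed features and target, and distance-based KNN assumes the features were scaled beforehand:

```python
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier

models = {
    "Naive Bayes":   GaussianNB(),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "KNN":           KNeighborsClassifier(n_neighbors=5, weights="distance"),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    print(f"{name:<14} mean CV accuracy: {scores.mean():.4f}")
```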


### **⚙️ Model Configuration**

<details>
<summary><b>🎯 K-Nearest Neighbors Setup</b></summary>

<br>

```python
# Data Preparation
✓ StandardScaler (critical for distance calculations)
✓ One-hot encoding for categorical features
✓ 80-20 stratified train-test split

# Model Parameters
├─ Imbalanced Data: K=18 (via 5-fold CV)
├─ Balanced Data: K=5 (re-optimized)
├─ weights='distance' (closer neighbors weighted higher)
└─ random_state=0 (imbalanced) / 42 (balanced)
```

</details>

<details>
<summary><b>🌲 Random Forest Setup</b></summary>

<br>

```python
# Data Preparation
✓ One-hot encoding for categorical features
✓ No scaling required (tree-based model)
✓ 80-20 stratified train-test split

# Model Parameters
├─ n_estimators=100 (100 decision trees)
├─ max_depth=None (full depth for max learning)
├─ min_samples_split=2 (default)
├─ min_samples_leaf=1 (detailed patterns)
└─ random_state=1 (imbalanced) / 42 (balanced)
```

</details>

<details>
<summary><b>⚖️ SMOTENC Configuration</b></summary>

<br>

```python
# SMOTENC Parameters
├─ random_state=42 (reproducibility)
├─ categorical_features=[1, 2, 5, 11] (indices of categorical columns)
├─ k_neighbors=5 (default for synthetic sample generation)
└─ sampling_strategy='auto' (balance to 50-50)

# Result
Before: 7,600 rejections | 1,500 approvals (5:1 ratio)
After:  7,632 rejections | 7,632 approvals (1:1 ratio)
```

</details>
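Translated into runnable scikit-learn / imbalanced-learn calls, the balanced-data configuration above looks roughly like this; `X_train`/`y_train` are placeholders, and the categorical indices assume the columns are ordered as in the prepared frame:

```python
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from imblearn.over_sampling import SMOTENC

# SMOTENC with the parameters listed above
smote = SMOTENC(categorical_features=[1, 2, 5, 11], k_neighbors=5,
                sampling_strategy="auto", random_state=42)
X_bal, y_bal = smote.fit_resample(X_train, y_train)

# KNN on the balanced data: scale first, K=5, distance-weighted votes
scaler = StandardScaler().fit(X_bal)
knn = KNeighborsClassifier(n_neighbors=5, weights="distance")
knn.fit(scaler.transform(X_bal), y_bal)

# Random Forest on the balanced data (tree models need no scaling)
rf = RandomForestClassifier(n_estimators=100, max_depth=None,
                            min_samples_split=2, min_samples_leaf=1,
                            random_state=42)
rf.fit(X_bal, y_bal)
```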

<br>

### **📏 Evaluation Metrics**

<table>
<tr>
<td width="20%" align="center">

**Accuracy**  
<sub>Overall correctness</sub>

</td>
<td width="20%" align="center">

**Precision**  
<sub>Predicted approvals<br>that were correct</sub>

</td>
<td width="20%" align="center">

**Recall**  
<sub>Actual approvals<br>we caught</sub>

</td>
<td width="20%" align="center">

**F1-Score**  
<sub>Harmonic mean of<br>Precision & Recall</sub>

</td>
<td width="20%" align="center">

**ROC-AUC**  
<sub>Discrimination<br>ability</sub>

</td>
</tr>
</table>
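All five metrics, plus the FP/FN rates quoted in the results below, come straight from `sklearn.metrics`. A minimal sketch, assuming a hypothetical fitted classifier `clf`, a held-out `X_test`/`y_test` split, and the approved class encoded as 1:

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score, confusion_matrix)

y_pred = clf.predict(X_test)
y_proba = clf.predict_proba(X_test)[:, 1]   # probability of the approved class

print("Accuracy :", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall   :", recall_score(y_test, y_pred))
print("F1-Score :", f1_score(y_test, y_pred))
print("ROC-AUC  :", roc_auc_score(y_test, y_proba))

tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
print("FP rate  :", fp / (fp + tn))          # wrongly approved
print("FN rate  :", fn / (fn + tp))          # wrongly rejected
```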

---

<div align="center">

## 📈 **RESULTS**

</div>

### **🔴 Performance on Imbalanced Data**

<table>
<tr>
<td width="70%">

| Model | Accuracy | Precision | Recall | F1-Score | Status |
|-------|:--------:|:---------:|:------:|:--------:|:------:|
| Naive Bayes | 81-82% | - | - | - | ❌ Excluded |
| Decision Tree | 92% | 89% | **56%** ⚠️ | - | Poor Recall |
| **KNN** | 90.3% | 82.4% | **55%** ⚠️ | 66% | Poor Recall |
| **Random Forest** | **93%** ✅ | 93% | 93% | 92% | Best |

</td>
<td width="30%">

### **🚨 Critical Issue**

<br>

REJECTING QUALIFIED APPLICANTS

| Model | Qualified Applicants Rejected |
|-------|:-----------------------------:|
| Decision Tree | 44% ❌ |
| KNN | 45% ❌ |


<br>

**Nearly HALF rejected!**

</td>
</tr>
</table>

---

### **🟢 Performance on Balanced Data (SMOTENC)**

<div align="center">

<table>
<thead>
<tr>
<th>Model</th>
<th>Accuracy</th>
<th>Precision</th>
<th>Recall</th>
<th>F1-Score</th>
<th>ROC-AUC</th>
</tr>
</thead>
<tbody>
<tr>
<td><b>Decision Tree</b></td>
<td>91%</td>
<td>89%</td>
<td><b>91%</b> ⬆️</td>
<td>~90%</td>
<td>-</td>
</tr>
<tr>
<td><b>KNN</b></td>
<td>90.43%</td>
<td>86.91%</td>
<td><b>95.20%</b> ⬆️⬆️</td>
<td>90.86%</td>
<td>0.9599</td>
</tr>
<tr style="background-color: #d4edda;">
<td><b>Random Forest</b></td>
<td><b>94.71%</b> 🏆</td>
<td><b>94.79%</b> 🏆</td>
<td><b>94.61%</b></td>
<td><b>94.70%</b> 🏆</td>
<td><b>0.9912</b> 🏆</td>
</tr>
</tbody>
</table>

</div>

<br>

<table>
<tr>
<td width="50%">

### **📊 SMOTENC Impact Summary**

| Model | Metric | Before | After | Improvement |
|-------|--------|--------|-------|-------------|
| **KNN** | Recall | 55% | 95.20% | <span style="color: green;">**+40.20 pts**</span> 🚀 |
| **Decision Tree** | Recall | 56% | 91% | <span style="color: green;">**+35 pts**</span> 🚀 |
| **Random Forest** | Accuracy | 93% | 94.71% | <span style="color: green;">**+1.71 pts**</span> ✅ |

</td>
<td width="50%">

### **🎯 Winner: Random Forest**

<br>

✅ **Best Accuracy:** 94.71%  
✅ **Best Precision:** 94.79%  
✅ **Best F1-Score:** 94.70%  
✅ **Best ROC-AUC:** 0.9912  

⚖️ **Balanced Errors:** 2.60% FP / 2.70% FN

🏆 Production-Ready Model


</td>
</tr>
</table>

<br>

<div align="center">

<!-- 
📊 ADD IMAGE HERE: "confusion_matrices.png"
2x2 grid showing confusion matrices
Top: KNN (Imbalanced vs Balanced)
Bottom: Random Forest (Imbalanced vs Balanced)
-->

<img src="visualizations/confusion_matrices.png" alt="Confusion Matrices Comparison" width="90%"/>

<sub>*Figure 3: Confusion matrices showing dramatic improvement from class balancing*</sub>

</div>

---

<div align="center">

## 🔍 **KEY FINDINGS**

</div>

### **1️⃣ Class Balancing Changed Everything**

<table>
<tr>
<td width="33%" align="center">

### **KNN**

<h1>🚀</h1>

<br>

**55%** → **95.20%**

<sub>Recall improved by<br><b>+40.20 percentage points</b></sub>

<br>

🎯 **Most Dramatic Improvement**

</td>
<td width="33%" align="center">

### **Decision Tree**

<h1>📈</h1>

<br>

**56%** → **91%**

<sub>Recall improved by<br><b>+35 percentage points</b></sub>

<br>

📊 **Significant Boost**

</td>
<td width="33%" align="center">

### **Random Forest**

<h1>🏆</h1>

<br>

**93%** → **94.71%**

<sub>Maintained excellence<br><b>Balanced predictions</b></sub>

<br>

👑 **Best Overall**

</td>
</tr>
</table>

<br>

### **2️⃣ Feature Importance Insights**

<div align="center">

<!-- 
📊 ADD IMAGE HERE: "feature_importance_rf.png"
Horizontal bar chart of top 15 features
Random Forest Gini importance
Green color scheme
-->

<img src="visualizations/feature_importance_rf.png" alt="Feature Importance Analysis" width="85%"/>

<sub>*Figure 4: Top features driving loan approval predictions (Random Forest)*</sub>

</div>

<br>

<table>
<tr>
<td width="50%">

### **🔝 Top 5 Predictors**

| Rank | Feature | Importance | 📊 |
|:----:|---------|:----------:|:--:|
| 🥇 | `previous_loan_defaults` | **36.4%** | ████████████████ |
| 🥈 | `loan_percent_income` | 12.2% | █████ |
| 🥉 | `person_income` | 9.6% | ████ |
| 4️⃣ | `loan_int_rate` | 9.2% | ████ |
| 5️⃣ | `person_home_ownership_RENT` | 8.5% | ███ |

</td>
<td width="50%">

### **💡 Key Discovery**

<br>

> **🎓 Education Level: Minimal Impact**
>
> Despite filtering for education loans, the applicant's  
> education level (High School, Bachelor's, Master's, etc.)  
> had **negligible predictive power**.

<br>

### **What Actually Matters:**

✅ **Financial responsibility** (loan defaults)  
✅ **Debt-to-income ratio**  
✅ **Income & earning capacity**  
✅ **Credit risk indicators**  

❌ **NOT educational credentials**

</td>
</tr>
</table>
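The ranking above can be reproduced from the fitted forest's Gini importances. A short sketch, where `rf` and `X_train` are the hypothetical fitted Random Forest and the one-hot-encoded training frame used to fit it:

```python
import pandas as pd

# Pair each importance with its column name and rank them
importances = pd.Series(rf.feature_importances_, index=X_train.columns)
print(importances.sort_values(ascending=False).head(5))
# In this project's run, previous_loan_defaults dominates (~36%),
# followed by loan_percent_income and person_income.
```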

<br>

### **3️⃣ Model Performance Trade-offs**

<div align="center">

<!-- 
📊 ADD IMAGE HERE: "roc_curves.png"
Overlaid ROC curves for KNN and Random Forest
KNN in blue, Random Forest in green
Should show Random Forest closer to perfect discrimination
-->

<img src="visualizations/roc_curves.png" alt="ROC Curve Comparison" width="70%"/>

<sub>*Figure 5: ROC curves demonstrating superior discrimination ability*</sub>

</div>

<br>

<table>
<tr>
<td width="50%">

### **🌲 Random Forest Strengths**

✅ **Best Overall Accuracy** (94.71%)  
✅ **Best Precision** (94.79%)  
✅ **Best F1-Score** (94.70%)  
✅ **Near-Perfect ROC-AUC** (0.9912)  
✅ **Balanced Error Rates** (2.6% FP / 2.7% FN)  
✅ **Low False Positives** (Only 79 wrongly approved)  

**💼 Best for Production Deployment**

</td>
<td width="50%">

### **🎯 KNN Strengths**

✅ **Highest Recall** (95.20%)  
✅ **Catches Almost All Approvals** (4.8% miss rate)  
✅ **Excellent ROC-AUC** (0.9599)  
⚠️ **Higher False Positives** (7.17% vs 2.60%)  
⚠️ **Lower Precision** (86.91% vs 94.79%)  

**💡 Best When Missing Approvals is Costly**

</td>
</tr>
</table>
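Figure 5 can be regenerated with scikit-learn's ROC display helper. A minimal sketch, assuming fitted `knn` and `rf` models and a shared `X_test`/`y_test` split (for KNN, pass the scaled test features):

```python
import matplotlib.pyplot as plt
from sklearn.metrics import RocCurveDisplay

ax = plt.gca()
RocCurveDisplay.from_estimator(knn, X_test, y_test, name="KNN", ax=ax)
RocCurveDisplay.from_estimator(rf, X_test, y_test, name="Random Forest", ax=ax)
ax.plot([0, 1], [0, 1], linestyle="--", color="grey", label="Chance")
ax.set_title("ROC Curves: KNN vs Random Forest")
ax.legend()
plt.show()
```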

---

<div align="center">

## 🛠️ **TECH STACK**

</div>

<table>
<tr>
<td align="center" width="20%">

<h1>🐍</h1>

**Python 3.8+**

Core Language

</td>
<td align="center" width="20%">

<h1>🤖</h1>

**scikit-learn**

ML Framework

</td>
<td align="center" width="20%">

<h1>🔢</h1>

**NumPy**

Numerical Computing

</td>
<td align="center" width="20%">

<h1>🐼</h1>

**Pandas**

Data Manipulation

</td>
<td align="center" width="20%">

<h1>📊</h1>

**Matplotlib & Seaborn**

Visualization

</td>
</tr>
</table>

<br>

<div align="center">

![Python](https://img.shields.io/badge/Python-3776AB?style=for-the-badge&logo=python&logoColor=white)
![NumPy](https://img.shields.io/badge/NumPy-013243?style=for-the-badge&logo=numpy&logoColor=white)
![Pandas](https://img.shields.io/badge/Pandas-150458?style=for-the-badge&logo=pandas&logoColor=white)
![scikit-learn](https://img.shields.io/badge/scikit--learn-F7931E?style=for-the-badge&logo=scikit-learn&logoColor=white)
![Jupyter](https://img.shields.io/badge/Jupyter-F37626?style=for-the-badge&logo=jupyter&logoColor=white)
![Matplotlib](https://img.shields.io/badge/Matplotlib-11557c?style=for-the-badge&logo=python&logoColor=white)
![Seaborn](https://img.shields.io/badge/Seaborn-3776AB?style=for-the-badge&logo=python&logoColor=white)

</div>

<br>

### **🤖 Algorithms & Techniques**

<table>
<tr>
<td>

**Machine Learning Models:**
- 🌲 Random Forest Classifier
- 🎯 K-Nearest Neighbors (KNN)
- 🌳 Decision Trees
- 📊 Naive Bayes (Gaussian & Categorical)

</td>
<td>

**Data Processing:**
- ⚖️ SMOTENC (Class Balancing)
- 📏 StandardScaler (Feature Scaling)
- 🔄 One-Hot Encoding
- ✂️ Train-Test Split (Stratified)

</td>
<td>

**Development Tools:**
- 📓 Jupyter Notebooks
- ☁️ Google Colab
- 🔧 Git & GitHub
- 📊 Pandas & NumPy

</td>
</tr>
</table>

---

<div align="center">

## 🚀 **QUICK START**

</div>

<table>
<tr>
<td width="50%">

### **⚡ Option 1: Google Colab** (Recommended)

**No installation required!** Click below to run in your browser:

<br>

<div align="center">

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](YOUR_COLAB_LINK_HERE)

</div>

<br>

**Features:**
- ✅ Free GPU access
- ✅ Pre-installed libraries
- ✅ Instant execution
- ✅ Easy sharing

<br>

**Just click and run!** 🎯

</td>
<td width="50%">

### **💻 Option 2: Run Locally**
```bash
# 1. Clone repository
git clone https://github.com/YOUR_USERNAME/gradient-capital.git
cd gradient-capital

# 2. Create virtual environment
python -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate

# 3. Install dependencies
pip install -r requirements.txt

# 4. Download dataset from Kaggle
# https://www.kaggle.com/datasets/taweilo/loan-approval-classification-data/

# 5. Launch Jupyter
jupyter notebook

# 6. Open notebook
# notebooks/gradient_capital.ipynb
```

</td>
</tr>
</table>

👥 TEAM

The Data Scientists Behind Gradient Capital


Bryan Pineda

Lead Developer & ML Engineer


LinkedIn GitHub Portfolio


Carl Delos Santos

Data Scientist & Model Testing


LinkedIn


Katherine Bonilla

Data Preprocessing & Analysis


LinkedIn


🎓 Fordham University | 📚 Data Mining (CISC 4800) | 📅 Fall 2024


📚 REFERENCES

📊 Data & Research

  1. Dataset Source:
    Loan Approval Classification Data. Kaggle.
    https://www.kaggle.com/datasets/taweilo/loan-approval-classification-data/

  2. SMOTENC Algorithm:
    Chawla, N. V., et al. (2002).
    SMOTE: Synthetic Minority Over-sampling Technique.
    Journal of Artificial Intelligence Research, 16, 321-357.

  3. Random Forest:
    Breiman, L. (2001).
    Random Forests.
    Machine Learning, 45(1), 5-32.

🛠️ Tools & Libraries

  1. scikit-learn:
    https://scikit-learn.org/

  2. imbalanced-learn:
    https://imbalanced-learn.org/

  3. pandas:
    https://pandas.pydata.org/

  4. NumPy:
    https://numpy.org/

  5. Matplotlib:
    https://matplotlib.org/

  6. Seaborn:
    https://seaborn.pydata.org/


📄 LICENSE

This project is licensed under the MIT License

See the LICENSE file for details


🙏 ACKNOWLEDGMENTS

Special thanks to:

Dr. [Professor Name], Fordham University • Kaggle Community • scikit-learn Team • imbalanced-learn Developers


📬 GET IN TOUCH

Questions? Opportunities? Collaboration?


Email LinkedIn Portfolio



If you found this project valuable, please star the repository!


Built with ❤️ by Bryan Pineda, Carl Delos Santos, and Katherine Bonilla

Fordham University • Fall 2024
