This end-to-end machine learning project predicts loan approval status based on customer profiles. Built using Python and scikit-learn, it covers data preprocessing, feature scaling, model training, evaluation, and hyperparameter tuning. Additionally, a user-friendly GUI was developed for seamless user interaction.
- Data Loading
- Exploratory Data Analysis (EDA)
- Data Cleaning & Missing Value Handling
- Feature Engineering (Encoding Categorical Features)
- Feature Scaling
- Model Training and Evaluation
- Hyperparameter Tuning
- Cross-Validation (K-Fold)
- Model Deployment & GUI Development
- Approach: Implemented StandardScaler to standardize numerical features like income and loan amount.
- Purpose: Essential for distance-based models such as Logistic Regression and SVC to ensure fair weight distribution among features.
- Approach: Utilized Label Encoding to transform categorical variables into numeric values for model compatibility.
- Example Mappings:
"Yes" → 1, "No" → 0
"Male" → 1, "Female" → 0
"Urban" → 2, "Semi-Urban" → 1, "Rural" → 0
- Approach: Applied 5-Fold Cross-Validation using cross_val_score for robust performance evaluation.
- Purpose: Helps ensure the model generalizes well across unseen data, mitigating overfitting risks.
- Approach: Employed RandomizedSearchCV to optimize model hyperparameters.
- Examples:
- Logistic Regression & SVC optimized for the C (regularization) parameter.
- Outcome: Significant performance boost, especially for SVC and Random Forest models.
- Credit History: The most critical factor influencing loan approval.
- Income: Applicant and co-applicant incomes significantly impacted the decision-making process.
Baseline Performance: Logistic Regression and SVC performed reasonably well but required tuning.
- Post-Tuning Results:
- Random Forest: Achieved the highest accuracy (~80%) after tuning.
- SVC: Performance improved post-tuning but remained slightly behind Random Forest.
- Minimal Missing Data (<5%): Columns like gender and dependents were dropped.
- Key Columns: Imputed columns such as self_employed and credit_history using the most frequent value (mode).
Why It Mattered: Standardized features (income, loan amount, loan term) ensured no single feature dominated due to scale differences.
Outcome: The 5-Fold Cross-Validation demonstrated that the Random Forest model generalizes effectively across various data splits, making it the top performer.




