A Machine Learning project developed to automatically evaluate and predict customer credit scores for a financial institution.
A bank needs to define the credit score of its customers to safely approve or deny loans and financial products. Instead of analyzing each customer's history manually, this project uses Artificial Intelligence to analyze complex financial data and automatically classify their credit score into three categories: Poor, Standard, or Good.
- Annual Salary & Profession
- Number of Bank Accounts & Credit Cards
- Payment Behavior & Delay Days
- Total Debt & Credit Mix
- Monthly Investments
- Python 3
- Pandas: For data manipulation, cleaning, and preprocessing.
- Scikit-Learn: For building, training, and evaluating the Machine Learning models (Random Forest and K-Nearest Neighbors classifiers).
- Jupyter Notebook / VS Code: For interactive development and data exploration.
- Data Preprocessing: Loading the historical data (clientes.csv) and applying Label Encoding to transform categorical text data (like profissao and comportamento_pagamento) into numerical values that the model can process.
- Splitting the Data: Dividing the dataset into training features ($X$) and the target labels ($y$, the score_credito column).
- Model Training: Teaching the machine learning algorithm using thousands of historical customer behaviors.
- Prediction: Using the trained and validated model to evaluate a new, unseen list of customers (novos_clientes.csv) and outputting their predicted credit scores.
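The four steps above can be sketched end to end as follows. This is a minimal illustration, not the notebook's actual code: the small in-memory tables stand in for clientes.csv and novos_clientes.csv, and the column names `salario_anual` and `num_contas`, the split ratio, and the model settings are assumptions made for the example.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Tiny stand-in for clientes.csv (the real file has far more rows and columns).
clientes = pd.DataFrame({
    "profissao": ["engenheiro", "professor", "medico", "professor",
                  "engenheiro", "medico", "professor", "engenheiro"],
    "comportamento_pagamento": ["baixo", "alto", "alto", "baixo",
                                "alto", "baixo", "alto", "baixo"],
    "salario_anual": [30000, 52000, 98000, 41000,
                      76000, 88000, 55000, 28000],   # hypothetical column
    "num_contas": [6, 2, 1, 4, 2, 3, 2, 7],          # hypothetical column
    "score_credito": ["Poor", "Good", "Good", "Standard",
                      "Good", "Standard", "Good", "Poor"],
})

# 1. Data preprocessing: label-encode each text column and keep the
#    fitted encoder so new customers get the same numeric codes.
encoders = {}
for col in ["profissao", "comportamento_pagamento"]:
    encoders[col] = LabelEncoder()
    clientes[col] = encoders[col].fit_transform(clientes[col])

# 2. Splitting the data: features X, target y, then a held-out test set.
X = clientes.drop(columns="score_credito")
y = clientes["score_credito"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# 3. Model training.
model = RandomForestClassifier(n_estimators=50, random_state=42)
model.fit(X_train, y_train)

# 4. Prediction on new, unseen customers (stand-in for novos_clientes.csv).
novos_clientes = pd.DataFrame({
    "profissao": ["professor", "medico"],
    "comportamento_pagamento": ["alto", "baixo"],
    "salario_anual": [50000, 90000],
    "num_contas": [2, 3],
})
for col, encoder in encoders.items():
    novos_clientes[col] = encoder.transform(novos_clientes[col])

predictions = model.predict(novos_clientes[X.columns])
print(predictions)
```

Reusing the fitted encoders matters: fitting a fresh encoder on the new file could assign different numbers to the same profession. Note that `transform` raises an error for a category that never appeared in the training data.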
To find the more accurate classifier, this project trained, tested, and compared two classic Machine Learning algorithms:
Random Forest is an ensemble learning algorithm. Instead of relying on a single decision tree to evaluate the customer, it creates a "forest" of dozens or hundreds of trees operating simultaneously, where the final decision is made by majority vote.
- Why it was used: It is a robust model. It copes well with a large number of financial variables and is less prone to overfitting (when the model "memorizes" training data instead of learning general patterns) than a single decision tree, which makes its credit risk predictions more reliable.
- Accuracy Achieved: 82.63%
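The majority vote described above can be illustrated in a few lines of plain Python. The votes below are made up for a single hypothetical customer, and `majority_vote` is a helper named for this example only.

```python
from collections import Counter


def majority_vote(tree_votes):
    """Return the label that the largest number of trees predicted."""
    return Counter(tree_votes).most_common(1)[0][0]


# Hypothetical predictions from five trees for one customer.
votes = ["Good", "Standard", "Good", "Poor", "Good"]
forest_prediction = majority_vote(votes)
print(forest_prediction)
```

Scikit-Learn's `RandomForestClassifier` combines trees slightly differently: it averages the class probabilities predicted by each tree rather than counting hard votes. The two approaches usually agree on the final class.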
K-Nearest Neighbors (KNN) is a distance-based algorithm. It classifies a new customer's score by measuring how close that customer is to the "K" most similar customers in the historical database.
- Why it was used: Its logic maps naturally onto credit scoring. The premise is simple: if a new customer has a salary, debt level, and payment habits very close to a group of customers who already have a "Good" score, the model assumes this new customer will have the same score.
- Accuracy Achieved: 73.46%
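The idea behind KNN can be sketched from scratch, assuming Euclidean distance and features already scaled to the 0 to 1 range. The customer data and the `knn_predict` helper are hypothetical; the project itself relies on Scikit-Learn rather than a hand-written version.

```python
import math
from collections import Counter


def knn_predict(history, new_customer, k=3):
    """Label a customer by majority vote among the k closest past customers."""
    nearest = sorted(
        history, key=lambda row: math.dist(row[0], new_customer)
    )[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


# Each entry: ((salary, debt, delay), label), all features scaled to 0..1.
history = [
    ((0.9, 0.1, 0.0), "Good"),
    ((0.8, 0.2, 0.1), "Good"),
    ((0.7, 0.1, 0.1), "Good"),
    ((0.5, 0.5, 0.4), "Standard"),
    ((0.3, 0.8, 0.7), "Poor"),
    ((0.2, 0.9, 0.9), "Poor"),
]

new_customer = (0.85, 0.15, 0.05)
predicted = knn_predict(history, new_customer, k=3)
print(predicted)
```

Scaling matters for distance-based models: without it, a feature measured in large units, such as annual salary, would dominate the distance and drown out features like delay days.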
🏆 Final Model Selection: After the training and testing phase with the historical database, the model chosen to predict the data in the novos_clientes.csv file was Random Forest, as it achieved the higher accuracy (82.63%, versus 73.46% for KNN).
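A comparison like this one can be sketched as follows. Synthetic data from `make_classification` stands in for clientes.csv, so the accuracies it prints will not match the figures reported above; the split ratio and default model settings are also assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic three-class dataset standing in for the real customer data.
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=5, n_classes=3, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

candidates = {
    "Random Forest": RandomForestClassifier(random_state=0),
    "KNN": KNeighborsClassifier(),
}

# Train each model on the same split and score it on the same test set.
accuracies = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    accuracies[name] = accuracy_score(y_test, model.predict(X_test))

# Keep the model with the highest test accuracy.
best_name = max(accuracies, key=accuracies.get)
print(accuracies, "->", best_name)
```

Evaluating both models on the same held-out test set keeps the comparison fair, since neither model has seen those rows during training.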
- Clone this repository to your computer.
- Ensure the datasets (clientes.csv and novos_clientes.csv) are in the root directory.
- Set up your virtual environment and install the dependencies in your terminal:

      python3 -m venv venv
      source venv/bin/activate
      pip install pandas scikit-learn ipykernel

- Open the main.ipynb file in your preferred editor (like VS Code), ensure your Python environment (venv) is selected as the kernel, and run the cells.
Project developed as part of the educational material provided by Hashtag Programação.