Skip to content

This is a full ML model to valuate and predict income. This project sets a credit score and a risk lending exposure. The model also builds predictions and visualizes projections based on current and past data.

Notifications You must be signed in to change notification settings

aldo2287/Machine-Learning-Income-Credit-Valuation-and-Visualization-Model-Training

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

6 Commits
 
 
 
 

Repository files navigation

Credit Score Prediction Model

This project aims to build a predictive model to estimate a person's annual income based on various features such as employment details, education, family structure, and more. Using PySpark for data preprocessing, feature engineering, and model building, the project explores both Linear Regression and Random Forest Regressor for predicting the Annual_income.

The dataset, Credit_card.csv, contains various attributes related to individuals' financial and demographic information, which is used to build the predictive model.

Technologies:

Python: Programming language used to write the code. PySpark: Framework for big data processing. JDK 8: Java Development Kit required for PySpark. Matplotlib: Library for plotting results. Pandas: For working with data in DataFrame format.

Installation:

Follow these steps to get the environment up and running: Install Java 8: Required for Spark. Install JDK 8 Install Spark: Download and install Spark by following these instructions.

Install Python dependencies:

bash Copy Edit pip install pyspark matplotlib pandas

Download the dataset:

Ensure you have the Credit_card.csv dataset in the same directory as the script or update the file path in the code accordingly. Dataset Description

The dataset Credit_card.csv contains various attributes related to individuals, including:

Type_Occupation: Occupation type (e.g., business, student, retired).

GENDER: Gender of the individual.

Car_Owner: Whether the individual owns a car (Y/N).

Propert_Owner: Whether the individual owns property (Y/N).

Housing_type: Type of housing (e.g., apartment, house).

Annual_income: Annual income of the individual (target variable).

Family_Members: Number of family members supported by the individual.

Employed_days: The number of days the individual has been employed.

Education: Education level (e.g., higher education, academic degree).

Marital_status: Marital status (e.g., married, single).

Steps to Run the Code:

Setup Environment: Make sure you've installed Java 8, Spark, and Python libraries as described above.

Run the Script:

Open your terminal or command prompt.

Navigate to the directory containing the Python script and dataset.

Execute the script using Python:

bash Copy Edit python credit_score_prediction.py Data Preprocessing and Feature Engineering

The data preprocessing steps include:

Handling Missing Data: Missing values in Type_Occupation are filled with 'NA', and other rows are dropped if any nulls are present in critical columns.

Feature Transformation:

Negative Values: Absolute values are used for Birthday_count and Employed_days to ensure positive values for age and employment.

Age Calculation: The age is calculated using the Birthday_count field and today's date.

Income per Family Member: For those with family members, the Annual_income is divided by the Family_Members column.

Categorical Features Encoding:

Categorical columns like GENDER, Car_Owner, Propert_Owner, and others are encoded using StringIndexer and OneHotEncoder.

Feature Selection: The final model includes features like Income_per_Family_Member, Age, FR_score, and others.

Model Building and Evaluation

Two machine learning models are implemented and evaluated:

Simple Linear Regression (SLR):

The LinearRegression model is trained on the processed features.

Evaluation metric: Root Mean Squared Error (RMSE).

Random Forest Regressor (RF):

The RandomForestRegressor model is trained with 100 trees and a depth of 5.

Evaluation metric: RMSE.

Both models are evaluated on the test set, and the predicted annual income is compared with the actual values.

Results

After training both models, the following results were obtained:

Linear Regression RMSE:

plaintext Copy Edit Root Mean Squared Error (RMSE) on test data for Simple Linear Regression = 55527.11877933004 Random Forest RMSE: The evaluation results of the Random Forest model were visually plotted, showing the correlation between actual and predicted annual income.

Licenses and Credits

PySpark: The project uses the PySpark framework for data processing and machine learning.

Matplotlib: Used for plotting the predictions vs actual values.

Pandas: Used for data manipulation and conversion to DataFrame for ease of handling.

About

This is a full ML model to valuate and predict income. This project sets a credit score and a risk lending exposure. The model also builds predictions and visualizes projections based on current and past data.

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published