GitHub - aldo2287/Machine-Learning-Income-Credit-Valuation-and-Visualization-Model-Training: This is a full ML model to valuate and predict income. This project sets a credit score and a risk lending exposure. The model also builds predictions and visualizes projections based on current and past data.

Credit Score Prediction Model

This project aims to build a predictive model to estimate a person's annual income based on various features such as employment details, education, family structure, and more. Using PySpark for data preprocessing, feature engineering, and model building, the project explores both Linear Regression and Random Forest Regressor for predicting the Annual_income.

The dataset, Credit_card.csv, contains various attributes related to individuals' financial and demographic information, which is used to build the predictive model.

Technologies:

Python: Programming language used to write the code. PySpark: Framework for big data processing. JDK 8: Java Development Kit required for PySpark. Matplotlib: Library for plotting results. Pandas: For working with data in DataFrame format.

Installation:

Follow these steps to get the environment up and running: Install Java 8: Required for Spark. Install JDK 8 Install Spark: Download and install Spark by following these instructions.

Install Python dependencies:

bash Copy Edit pip install pyspark matplotlib pandas

Download the dataset:

Ensure you have the Credit_card.csv dataset in the same directory as the script or update the file path in the code accordingly. Dataset Description

The dataset Credit_card.csv contains various attributes related to individuals, including:

Type_Occupation: Occupation type (e.g., business, student, retired).

GENDER: Gender of the individual.

Car_Owner: Whether the individual owns a car (Y/N).

Propert_Owner: Whether the individual owns property (Y/N).

Housing_type: Type of housing (e.g., apartment, house).

Annual_income: Annual income of the individual (target variable).

Family_Members: Number of family members supported by the individual.

Employed_days: The number of days the individual has been employed.

Education: Education level (e.g., higher education, academic degree).

Marital_status: Marital status (e.g., married, single).

Steps to Run the Code:

Setup Environment: Make sure you've installed Java 8, Spark, and Python libraries as described above.

Run the Script:

Open your terminal or command prompt.

Navigate to the directory containing the Python script and dataset.

Execute the script using Python:

bash Copy Edit python credit_score_prediction.py Data Preprocessing and Feature Engineering

The data preprocessing steps include:

Handling Missing Data: Missing values in Type_Occupation are filled with 'NA', and other rows are dropped if any nulls are present in critical columns.

Feature Transformation:

Negative Values: Absolute values are used for Birthday_count and Employed_days to ensure positive values for age and employment.

Age Calculation: The age is calculated using the Birthday_count field and today's date.

Income per Family Member: For those with family members, the Annual_income is divided by the Family_Members column.

Categorical Features Encoding:

Categorical columns like GENDER, Car_Owner, Propert_Owner, and others are encoded using StringIndexer and OneHotEncoder.

Feature Selection: The final model includes features like Income_per_Family_Member, Age, FR_score, and others.

Model Building and Evaluation

Two machine learning models are implemented and evaluated:

Simple Linear Regression (SLR):

The LinearRegression model is trained on the processed features.

Evaluation metric: Root Mean Squared Error (RMSE).

Random Forest Regressor (RF):

The RandomForestRegressor model is trained with 100 trees and a depth of 5.

Evaluation metric: RMSE.

Both models are evaluated on the test set, and the predicted annual income is compared with the actual values.

Results

After training both models, the following results were obtained:

Linear Regression RMSE:

plaintext Copy Edit Root Mean Squared Error (RMSE) on test data for Simple Linear Regression = 55527.11877933004 Random Forest RMSE: The evaluation results of the Random Forest model were visually plotted, showing the correlation between actual and predicted annual income.

Licenses and Credits

PySpark: The project uses the PySpark framework for data processing and machine learning.

Matplotlib: Used for plotting the predictions vs actual values.

Pandas: Used for data manipulation and conversion to DataFrame for ease of handling.

Name		Name	Last commit message	Last commit date
Latest commit History 6 Commits
ML_CC_Project_file.ipynb		ML_CC_Project_file.ipynb
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

About

Uh oh!

Releases

Packages

Languages

aldo2287/Machine-Learning-Income-Credit-Valuation-and-Visualization-Model-Training

Folders and files

Latest commit

History

Repository files navigation

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages