This project aims to predict individuals' personality types using Natural Language Processing (NLP) techniques and Machine Learning (ML) algorithms. The models are trained and tested on the Myers-Briggs Type Indicator (MBTI) dataset, which contains posts written by members of the PersonalityCafe forum along with each author's personality type under the MBTI framework.
The project report can be read here
The dataset was preprocessed by performing the following steps:
- Converting all text to lowercase
- Removing URLs, mentions, special characters, and stop words
- Stemming and lemmatization
- Vectorizing the text using the Term Frequency-Inverse Document Frequency (TF-IDF) technique
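The cleaning steps above can be sketched roughly as follows. This is a minimal illustration using only the standard library, not the project's actual code: the `clean_post` function, the regular expressions, and the tiny `STOP_WORDS` set are all assumptions made for the example. Stemming, lemmatization, and TF-IDF vectorization are left out here because they depend on external libraries.

```python
import re

# Tiny illustrative stop-word list (an assumption; a real run would use a fuller list).
STOP_WORDS = {"a", "an", "and", "the", "is", "are", "to", "of", "in", "i", "it", "this"}

URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
NON_LETTER_RE = re.compile(r"[^a-z\s]")


def clean_post(text: str) -> str:
    """Lowercase, strip URLs, mentions, and special characters, then drop stop words."""
    text = text.lower()
    text = URL_RE.sub(" ", text)         # remove URLs before punctuation is stripped
    text = MENTION_RE.sub(" ", text)     # remove @mentions
    text = NON_LETTER_RE.sub(" ", text)  # keep only lowercase letters and whitespace
    tokens = [tok for tok in text.split() if tok not in STOP_WORDS]
    return " ".join(tokens)


sample = "Check THIS out @friend: https://example.com/video I love puzzles!!!"
cleaned = clean_post(sample)
print(cleaned)  # check out love puzzles
```

URLs and mentions are removed before special characters so that their punctuation does not leave stray word fragments behind.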
The MBTI dataset is imbalanced: some personality types have far fewer samples than others. To handle this, undersampling, oversampling, and SMOTE (Synthetic Minority Over-sampling Technique) were used to balance the data.
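As a rough illustration of one of these strategies, the sketch below implements plain random oversampling with the standard library only. It is not the project's code and it is not SMOTE, which synthesizes new feature vectors instead of duplicating existing samples. The `random_oversample` helper and the toy posts and labels are hypothetical.

```python
import random
from collections import Counter


def random_oversample(texts, labels, seed=0):
    """Duplicate minority-class samples at random until every class matches the largest one."""
    rng = random.Random(seed)

    # Group the samples by class label.
    by_label = {}
    for text, label in zip(texts, labels):
        by_label.setdefault(label, []).append(text)

    target = max(len(items) for items in by_label.values())
    out_texts, out_labels = [], []
    for label, items in by_label.items():
        # Draw extra copies (with replacement) to reach the target size.
        extra = [rng.choice(items) for _ in range(target - len(items))]
        for text in items + extra:
            out_texts.append(text)
            out_labels.append(label)
    return out_texts, out_labels


# Toy, made-up data: four INFP posts and a single ESTJ post.
texts = ["post a", "post b", "post c", "post d", "post e"]
labels = ["INFP", "INFP", "INFP", "INFP", "ESTJ"]
bal_texts, bal_labels = random_oversample(texts, labels)
print(Counter(bal_labels))
```

Resampling of this kind is normally applied to the training split only, so that duplicated samples do not leak into the test set.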
Six different models were trained on the preprocessed dataset:
- Linear SVC
- SVC
- KNN
- Random Forest
- Multinomial Naive Bayes
- Logistic Regression
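A minimal sketch of how a few of the listed models could be trained on TF-IDF features with scikit-learn is shown below. It assumes scikit-learn is installed. The toy posts, the labels, and the default hyperparameters are made up for illustration and do not reflect the project's actual configuration or results.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy, made-up training data; the real project uses the MBTI forum posts.
train_texts = [
    "quiet evening planning systems and theories",
    "reading alone and building long term plans",
    "love meeting new people at parties",
    "spontaneous trips with lots of friends",
]
train_labels = ["INTJ", "INTJ", "ENFP", "ENFP"]

# A subset of the models listed above, with mostly default settings (an assumption).
candidates = {
    "linear_svc": LinearSVC(),
    "multinomial_nb": MultinomialNB(),
    "logistic_regression": LogisticRegression(max_iter=1000),
}

predictions = {}
for name, clf in candidates.items():
    # Each pipeline vectorizes the raw text with TF-IDF, then fits the classifier.
    model = make_pipeline(TfidfVectorizer(), clf)
    model.fit(train_texts, train_labels)
    predictions[name] = model.predict(["planning theories alone"])[0]

print(predictions)
```

Wrapping the vectorizer and classifier in a single pipeline keeps the TF-IDF vocabulary fitted on the training data only, which matters once cross-validation or a held-out test set is involved.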
A simple web-based graphical user interface (GUI) was built using Flask, which allows users to input a text sample and receive a predicted personality type based on the trained models.
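The general shape of such an endpoint might look like the sketch below. It is not the project's `app.py`: the `/predict` route, the JSON field names, and the `predict_type` stand-in, which always returns a fixed label, are hypothetical. A real app would load a trained model and call it instead.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)


def predict_type(text: str) -> str:
    """Hypothetical stand-in for the trained model; always returns a fixed label."""
    return "INTJ"


@app.route("/predict", methods=["POST"])
def predict():
    # Expect a JSON object such as {"text": "some writing sample"}.
    payload = request.get_json(silent=True)
    text = payload.get("text", "") if isinstance(payload, dict) else ""
    if not isinstance(text, str) or not text.strip():
        return jsonify(error="Field 'text' is required."), 400
    return jsonify(personality_type=predict_type(text.strip()))


# To serve this locally, one could call app.run(debug=True) at the end of the file.
```

Flask's built-in test client can exercise the route without starting a server, for example `app.test_client().post("/predict", json={"text": "hello"})`.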
To run the web app locally, first install the dependencies:
pip install -r requirements.txt
Then run the following command:
python app.py
A Kaggle notebook was created to provide a step-by-step guide for the project. It includes the code, visualizations, and explanations of the various techniques used.
This project was created by:
The dataset used in this project was obtained from Kaggle and can be found here.
If you have any questions or feedback, feel free to open an issue or contact me at:
Email: mohdazeemkhan64@gmail.com
Thank you for checking out this project!
This project is licensed under the MIT License - see the LICENSE file for details.