This project aims to cluster jokes into different categories using unsupervised learning techniques.
In this project, I performed the following tasks:
- Data Cleaning and Preprocessing
- Feature Extraction using TF-IDF Vectorizer
- Clustering using KMeans Algorithm
- Visualization using Network Graph and Parallel Coordinates
The dataset used in this project is in dataset.csv. It contains a list of jokes in plain text format. Thanks to Arya Shah for building this dataset.
The project was implemented using Python 3.9.
To run the project, run the your-dad-joked-once.ipynb notebook in the root directory. It preprocesses the data, performs clustering and topic modeling, and generates the visualizations.
Alternatively, open the notebook on Kaggle and run it there: https://www.kaggle.com/code/nihirshah/your-dad-joked-once Upvote if you believe in God, and upvote if you don't.
The KMeans algorithm grouped the jokes into 5 clusters, each with different attributes and features.
- Explore other clustering algorithms and compare their performance
- Topic Modeling using Latent Dirichlet Allocation (LDA)
- Expand the dataset to include more jokes and categories