This project was implemented as a side project during my graduate program at the University of San Diego (USD).
- Project Status: Completed
This project classifies whether the word 'mouse' in a sentence refers to the animal or to a computer mouse. The dataset consists of separate texts, each containing the word 'mouse'. The texts are first preprocessed with NLP techniques: tokenization, lemmatization, and stop-word removal. They are then converted into feature vectors using three different text vectorization approaches (Count Vectorizer, TF-IDF, and N-gram level). The feature vectors are fed into two classifiers (Logistic Regression and Naive Bayes), and the trained models are used to classify the validation texts. Finally, an ensemble model combines the three best-performing models to improve performance.
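The workflow described above can be sketched with scikit-learn roughly as follows. This is a minimal illustration, not the notebook's actual code: the toy sentences, the helper names `candidate_models` and `fit_top3_ensemble`, the `ngram_range=(1, 2)` setting, and the use of hard voting are all assumptions. The vectorizers' built-in tokenizer and English stop-word list stand in for the NLTK preprocessing, and lemmatization is omitted to keep the sketch free of extra downloads.

```python
from itertools import product

from sklearn.ensemble import VotingClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy stand-in data (hypothetical, NOT taken from the Kaggle dataset).
train_texts = [
    "The mouse ran across the barn floor and hid in the hay.",
    "A cat chased the mouse through the field.",
    "The mouse nibbled cheese and squeaked in its nest.",
    "Owls hunt the field mouse at night.",
    "I plugged the wireless mouse into the USB port.",
    "Click the left mouse button to open the file.",
    "My mouse cursor froze on the screen.",
    "The gaming mouse has a USB cable and extra buttons.",
]
train_labels = ["animal"] * 4 + ["computer"] * 4
val_texts = [
    "The mouse ate cheese in the barn.",
    "The USB mouse button stopped working.",
]
val_labels = ["animal", "computer"]


def candidate_models():
    """Build one pipeline per (vectorizer, classifier) pair: 3 x 2 = 6 models."""
    vectorizers = {
        "count": lambda: CountVectorizer(stop_words="english"),
        "tfidf": lambda: TfidfVectorizer(stop_words="english"),
        # Assumption: "N-gram level" is taken to mean TF-IDF over uni- and bigrams.
        "ngram": lambda: TfidfVectorizer(stop_words="english", ngram_range=(1, 2)),
    }
    classifiers = {
        "logreg": lambda: LogisticRegression(max_iter=1000),
        "nb": lambda: MultinomialNB(),
    }
    return {
        f"{vec_name}_{clf_name}": make_pipeline(make_vec(), make_clf())
        for (vec_name, make_vec), (clf_name, make_clf) in product(
            vectorizers.items(), classifiers.items()
        )
    }


def fit_top3_ensemble(train_x, train_y, val_x, val_y):
    """Score every candidate on the validation texts and vote over the top three."""
    scored = []
    for name, model in candidate_models().items():
        model.fit(train_x, train_y)
        scored.append((model.score(val_x, val_y), name, model))
    # Stable sort: ties keep their original order, so the result is deterministic.
    scored.sort(key=lambda item: item[0], reverse=True)
    top3 = [(name, model) for _, name, model in scored[:3]]
    # VotingClassifier clones and refits each selected pipeline on the training data.
    ensemble = VotingClassifier(estimators=top3, voting="hard")
    ensemble.fit(train_x, train_y)
    return ensemble


ensemble = fit_top3_ensemble(train_texts, train_labels, val_texts, val_labels)
predictions = ensemble.predict(val_texts)
print(list(predictions))
```

Because the same validation texts are used both to pick the top three models and to report results, the ensemble's score in a setup like this is optimistic; a separate held-out test split would give a fairer estimate.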
- NLP
- Word Embedding
- Text Classification
- Machine Learning
- Inferential Statistics
- Python
- NLTK
- Numpy
- Scipy
- Scikit-learn
- Pandas
- Matplotlib
Clone the repository, or download the zip file and extract it. Then open the Jupyter notebook file (.ipynb) from that folder, either in Jupyter on your local computer or in a cloud-hosted Jupyter environment, and run each cell in order.
Dataset: https://www.kaggle.com/werty12121/animal-mouse-vs-computer-mouse-text-dataset#animal.csv