Marathi Text Classification is a Natural Language Processing (NLP) project designed to automatically classify text written in the Marathi language into predefined categories.
The project combines stopword removal and TF-IDF feature extraction with machine learning models including Naive Bayes, Logistic Regression, SVM, and Random Forest. It also includes a Streamlit user interface for real-time predictions using the best-performing model.
This project demonstrates an NLP workflow built with scikit-learn and shows how to build and deploy a full text classification pipeline for the Marathi language.
- Marathi stopword removal and text preprocessing
- Feature extraction using TF-IDF Vectorizer
- Multiple models tested: Naive Bayes, Logistic Regression, SVM, Random Forest
- Final model: Naive Bayes (best performing)
- Frontend: Streamlit used for live text classification
- Models and vectorizer saved using pickle
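The features above can be sketched end to end as follows. This is a minimal illustration, not the project's training script: the stopword list, the toy sentences, the `sports`/`politics` labels, and the helper names `clean_text` and `make_vectorizer` are all assumptions made for the example, and the real training data is not included in this repository.

```python
import pickle
import re

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import LabelEncoder
from sklearn.svm import LinearSVC

# Hypothetical stopword list for illustration; not the project's actual list.
MARATHI_STOPWORDS = {"आणि", "आहे", "हा", "ही", "हे", "या", "तो", "ते", "व"}


def clean_text(text):
    """Keep characters from the Devanagari block, then drop stopwords."""
    # Replace everything outside the Devanagari block, plus danda marks, with a space.
    text = re.sub(r"[^\u0900-\u097F\s]|[\u0964\u0965]", " ", text)
    return " ".join(t for t in text.split() if t not in MARATHI_STOPWORDS)


def make_vectorizer():
    # Tokenize on whitespace: the default token pattern can split
    # Devanagari words at vowel signs.
    return TfidfVectorizer(token_pattern=r"\S+")


# Invented toy data with assumed labels, only to make the sketch runnable.
texts = [
    "संघ आज सामना जिंकला",
    "खेळाडू मैदानात सामना खेळला",
    "क्रिकेट संघ आणि खेळाडू तयार आहे",
    "कर्णधार म्हणाला संघ सामना जिंकेल",
    "मंत्री आज निवडणूक जिंकले",
    "सरकार नवीन धोरण जाहीर करणार",
    "निवडणूक आणि सरकार चर्चेत आहे",
    "मंत्री म्हणाले सरकार धोरण बदलेल",
]
labels = ["sports"] * 4 + ["politics"] * 4

cleaned = [clean_text(t) for t in texts]
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(labels)

candidates = {
    "Naive Bayes": MultinomialNB(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "SVM": LinearSVC(),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

# Compare the candidates with 2-fold cross-validation on the toy data.
scores = {}
for name, model in candidates.items():
    pipeline = make_pipeline(make_vectorizer(), model)
    scores[name] = cross_val_score(pipeline, cleaned, y, cv=2).mean()

best_name = max(scores, key=scores.get)

# Refit the vectorizer and the winning model on all data, then serialize them.
vectorizer = make_vectorizer()
X = vectorizer.fit_transform(cleaned)
best_model = candidates[best_name].fit(X, y)
artifacts = pickle.dumps(
    {"vectorizer": vectorizer, "model": best_model, "label_encoder": label_encoder}
)
```

The sketch serializes to an in-memory byte string to stay self-contained; writing the same objects to disk with `pickle.dump` would be the file-based equivalent. Scores on eight toy sentences say nothing about which model wins on real data.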
- User inputs Marathi text
- Text is cleaned and stopwords are removed
- TF-IDF vectorization is applied
- Trained model predicts the category
- Label encoder returns the final category name
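The prediction steps above could look roughly like the sketch below. The function name `predict_category`, the stopword list, and the tiny stand-in model are assumptions for illustration; in the real app the vectorizer, model, and label encoder would be loaded from the saved pickle files, and the input text would come from the Streamlit interface rather than a hard-coded string.

```python
import re

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import LabelEncoder

# Hypothetical stopword list for illustration; not the project's actual list.
MARATHI_STOPWORDS = {"आणि", "आहे", "हा", "ही", "हे", "या", "तो", "ते", "व"}


def clean_text(text):
    """Keep characters from the Devanagari block, then drop stopwords."""
    text = re.sub(r"[^\u0900-\u097F\s]|[\u0964\u0965]", " ", text)
    return " ".join(t for t in text.split() if t not in MARATHI_STOPWORDS)


def predict_category(text, vectorizer, model, label_encoder):
    """Run one input through cleaning, TF-IDF, the model, and label decoding."""
    cleaned = clean_text(text)                     # clean and remove stopwords
    features = vectorizer.transform([cleaned])     # apply the fitted TF-IDF vectorizer
    encoded = model.predict(features)              # predict the encoded category
    return str(label_encoder.inverse_transform(encoded)[0])  # decode to a name


# Tiny stand-in model trained on invented sentences, so the sketch runs on its own.
train_texts = [
    "संघ सामना जिंकला",
    "खेळाडू सामना खेळला",
    "मंत्री निवडणूक जिंकले",
    "सरकार धोरण जाहीर",
]
train_labels = ["sports", "sports", "politics", "politics"]

label_encoder = LabelEncoder()
y = label_encoder.fit_transform(train_labels)
# Whitespace tokens: the default token pattern can split Devanagari words at vowel signs.
vectorizer = TfidfVectorizer(token_pattern=r"\S+")
X = vectorizer.fit_transform([clean_text(t) for t in train_texts])
model = MultinomialNB().fit(X, y)

prediction = predict_category("खेळाडू आज सामना खेळला!", vectorizer, model, label_encoder)
print(prediction)
```

Passing the fitted objects in as arguments keeps the function independent of how they were loaded, which makes it straightforward to call from a UI callback or a test.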
Note: The file marathi_sample.csv contains 20 synthetic sample rows.
It is intended for demo/testing only and is not the actual dataset used during model training.
- Python 3.13
- Streamlit
- Scikit-Learn
- Pandas
- NLTK (for stopword removal)
- TF-IDF Vectorizer
- Pickle (for model saving/loading)
This project was originally developed as part of a group academic project.
Original Contributors:
- Amil Gauri (Maintainer)
- Vikas Pandit
- Ajay Chaurasiya
Project restructured, documented, and maintained by Amil Gauri for public release.
This project is open-source under the MIT License.
You are free to use, modify, and distribute it with proper credit.
See the LICENSE file for full details.