GitHub - pandey-amrit/Toxic-Comments-Classification: Python project to classify toxic comments

Toxic Comment Classification

This repository contains a comprehensive project on toxic comment classification using both traditional machine learning methods and deep learning approaches. The project aims to detect toxicity in user-generated text comments across multiple categories, including toxic, severe toxic, obscene, threat, insult, and identity hate.

Key Features

Dataset:
- Includes 159,571 comments with multilabel annotations.
- Utilized tokenization and TF-IDF transformation for feature extraction.
Machine Learning Models:
- Multinomial Naive Bayes, Logistic Regression, and Linear SVC.
- Achieved high accuracy, with Linear SVC as the best-performing model.
Voting Classifier:
- Combines predictions from multiple models to improve robustness.
Deep Learning with LSTM:
- Leverages sequential data processing and word embeddings for nuanced toxicity detection.
- Models saved in .h5 and .keras formats for compatibility and future use.
Evaluation Metrics:
- Precision, recall, and accuracy for assessing model performance.
- Explored correlations between toxicity categories for better interpretability.
Data Handling:
- Addressed memory issues using Jupyter Notebook for efficient processing.
- Incorporated sequence padding and embeddings for LSTM input compatibility.
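
The voting step listed above can be sketched in plain Python. This is a minimal illustration of majority ("hard") voting over per-label binary predictions, not the notebook's actual code: the README does not say whether hard or soft voting is used, and the label names, the `hard_vote` helper, and the example predictions are all assumptions made for illustration.

```python
# Label names are assumed here; the README lists the six categories in prose only.
LABELS = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]


def hard_vote(predictions):
    """Majority vote over per-model binary predictions for one comment.

    predictions: one list of 0/1 flags per model, aligned with LABELS.
    A label is flagged only if more than half of the models flag it,
    so ties resolve to 0 (not flagged).
    """
    n_models = len(predictions)
    voted = []
    for label_idx in range(len(LABELS)):
        positives = sum(model_pred[label_idx] for model_pred in predictions)
        voted.append(1 if positives * 2 > n_models else 0)
    return voted


# Hypothetical outputs of three models for a single comment.
nb_pred = [1, 0, 1, 0, 0, 0]   # e.g. Multinomial Naive Bayes
lr_pred = [1, 0, 0, 0, 1, 0]   # e.g. Logistic Regression
svc_pred = [1, 0, 1, 0, 1, 0]  # e.g. Linear SVC

combined = hard_vote([nb_pred, lr_pred, svc_pred])
flagged = [name for name, flag in zip(LABELS, combined) if flag]
print(flagged)
```

Voting per label keeps the multilabel structure intact: a comment can be flagged as both toxic and insult even when the individual models disagree on one of the two.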

Results

Linear SVC demonstrated a balance between precision and recall, excelling in high-dimensional TF-IDF feature handling.
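
For reference, the precision, recall, and accuracy mentioned here can be computed per label as in the sketch below. The `binary_metrics` helper and the example values are hypothetical and do not reproduce the project's reported results.

```python
def binary_metrics(y_true, y_pred):
    """Return (precision, recall, accuracy) for one label's 0/1 predictions."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    correct = sum(1 for t, p in pairs if t == p)

    # Guard against division by zero when a label is never predicted or never present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    accuracy = correct / len(pairs) if pairs else 0.0
    return precision, recall, accuracy


# Hypothetical ground truth and predictions for one label on eight comments.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]

precision, recall, accuracy = binary_metrics(y_true, y_pred)
print(f"precision={precision:.2f} recall={recall:.2f} accuracy={accuracy:.2f}")
```

Reporting precision and recall alongside accuracy matters when a label is rare, because a model that never predicts the label can still score a high accuracy while its recall is zero.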
The LSTM model captured context effectively, aided by word embeddings and sequence padding.
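
Sequence padding can be illustrated without any deep learning library. The sketch below maps tokens to integer ids and pads or truncates each sequence to a fixed length so that every comment has the same shape. The helper names, the whitespace tokenizer, and the choice to pad and truncate at the end are assumptions for illustration; the README does not state which settings the notebook uses.

```python
def build_vocab(texts):
    """Assign each distinct token an integer id, reserving 0 for padding."""
    vocab = {}
    for text in texts:
        for token in text.lower().split():
            if token not in vocab:
                vocab[token] = len(vocab) + 1
    return vocab


def encode(text, vocab):
    """Convert a comment into a list of token ids, skipping unknown tokens."""
    return [vocab[token] for token in text.lower().split() if token in vocab]


def pad(sequence, max_len, pad_value=0):
    """Truncate to max_len, then fill the remainder with pad_value at the end."""
    trimmed = sequence[:max_len]
    return trimmed + [pad_value] * (max_len - len(trimmed))


# Hypothetical comments of different lengths.
comments = ["you are great", "this is a very long and rude comment"]
vocab = build_vocab(comments)
padded = [pad(encode(comment, vocab), max_len=5) for comment in comments]

for row in padded:
    print(row)
```

The fixed-length integer rows are what an embedding layer would consume: each id is looked up as a dense vector, and the padding id 0 marks positions that carry no token.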

Repository Content

Notebook: The main code in .ipynb format for step-by-step execution.
Dataset: Used for training and validation (not included directly due to size constraints).
Pre-trained Models: Saved in both .h5 and .keras formats.
Report: Detailed analysis of modeling and feature engineering [PDF].

Conclusion

This project highlights the importance of combining feature-based learning with context-aware deep learning techniques to tackle complex challenges in content moderation. The use of both traditional and deep learning models ensures robust and reliable toxicity detection.

Folders and files (3 commits):
- README.md
- amritdev_project2.ipynb
- correlation matrix.png
- linear SVC confusion matrix.png
- logistic regression confusion matrix.png
- model accuracy comparision.png
- multinomial naive bayes confusion matrix.png
- pairwaise label correlation.png
- testing toxicity comments.png
- train lable frequence.png
