A comprehensive project for fake news classification using various machine learning methods and natural language processing techniques.
- 1. data_exploration.ipynb - Data exploration and analysis
- 2.1 enhanced_classifier.ipynb - Enhanced classifier implementation
- 2.2 embeddings_classifier.ipynb - Classifier with word embeddings
- 2.3 embeddings_advanced_classifier.ipynb - Advanced classifier with embeddings
- 2.4 enhanced_classifier_minus_reuters.ipynb - Classifier excluding Reuters data
- 2.5 Final_classifier_and_XGBoost.ipynb - Final classifier with XGBoost
- 3. models_comparison.ipynb - Model comparison and evaluation
- setup.sh - Environment setup script
- start_notebook.sh - Jupyter Notebook startup script
- Clone the repository:
git clone <repository-url>
cd project-nlp-challenge
- Create a virtual environment:
python -m venv venv
source venv/bin/activate # On macOS/Linux
# or
venv\Scripts\activate # On Windows
- Install dependencies:
pip install -r requirements.txt
- Start Jupyter Notebook:
jupyter notebook
This project presents a comprehensive study of fake news classification methods, including:
- Text data analysis and preprocessing
- Feature extraction using TF-IDF
- Word embeddings implementation
- Application of various machine learning algorithms
- Performance comparison of different approaches
- Data Exploration: Comprehensive analysis of the dataset structure and characteristics
- Feature Engineering: Multiple approaches to text feature extraction
- Model Variety: Implementation of traditional ML and advanced techniques
- Performance Evaluation: Detailed comparison of model performance metrics
- Reproducible Research: Well-documented notebooks with clear methodology
The project follows a systematic approach:
- Data Analysis: Understanding the dataset structure and quality
- Preprocessing: Text cleaning, tokenization, and normalization
- Feature Extraction: TF-IDF, word embeddings, and custom features
- Model Training: Multiple algorithms including Logistic Regression, SVM, and XGBoost
- Evaluation: Comprehensive performance metrics and comparison
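The preprocessing, feature extraction, and training steps above can be sketched as a single scikit-learn pipeline. This is a minimal illustration, not the code from the notebooks: the `clean_text` helper, the toy sentences, the label encoding (0 = real, 1 = fake), and the vectorizer settings are all assumptions chosen for the example.

```python
import re

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline


def clean_text(text):
    """Hypothetical cleaning step: lowercase, drop URLs and non-letters."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)   # remove URLs
    text = re.sub(r"[^a-z\s]", " ", text)       # keep only letters and whitespace
    return re.sub(r"\s+", " ", text).strip()    # collapse repeated whitespace


# Toy examples invented for illustration; not taken from the project's dataset.
texts = [
    "Government confirms new budget figures in official report",
    "Central bank publishes quarterly inflation statistics",
    "Shocking secret cure doctors do not want you to know",
    "You will not believe what this celebrity secretly did",
]
labels = [0, 0, 1, 1]  # assumed encoding: 0 = real, 1 = fake

# TF-IDF features over unigrams and bigrams, followed by Logistic Regression.
model = make_pipeline(
    TfidfVectorizer(preprocessor=clean_text, stop_words="english", ngram_range=(1, 2)),
    LogisticRegression(max_iter=1000),
)
model.fit(texts, labels)

new_texts = [
    "Official report confirms statistics",
    "Secret cure you will not believe",
]
predictions = model.predict(new_texts)
print(predictions)
```

Wrapping the vectorizer and the classifier in one pipeline keeps the TF-IDF vocabulary fitted on training data only, which helps avoid leakage when the same object is later used with cross-validation. Swapping `LogisticRegression` for another estimator such as an SVM should only require changing the last pipeline step.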
The project demonstrates various approaches to fake news classification and compares their effectiveness across different metrics including accuracy, precision, recall, and F1-score.
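To make the four metrics concrete, the following sketch computes them by hand for a binary task. The function name `classification_metrics`, the label encoding (1 = fake as the positive class), and the example labels are assumptions made for illustration; in practice scikit-learn's `sklearn.metrics` module offers equivalent functions.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Return accuracy, precision, recall, and F1 for a binary task."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)

    accuracy = correct / len(pairs)
    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Invented labels for illustration only (1 = fake, 0 = real).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

Accuracy alone can be misleading when the classes are imbalanced, which is one reason to report precision, recall, and F1 alongside it.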
- Python 3.x - Core programming language
- Jupyter Notebook - Interactive development environment
- scikit-learn - Machine learning library
- pandas - Data manipulation and analysis
- numpy - Numerical computing
- matplotlib & seaborn - Data visualization
- XGBoost - Gradient boosting framework
- NLTK - Natural language processing toolkit
- spaCy - Advanced NLP library
The project uses a curated dataset of news articles labeled as real or fake, providing a solid foundation for training and evaluation of classification models.
This is a research project showcasing various NLP and ML techniques for fake news detection. Feel free to explore the notebooks and adapt the methods for your own use cases.
Sergej
This project is for educational and research purposes.