Author: Khanh Le D.
- Overview
- Business Objective
- Dataset
- Installation
- Project Pipeline
- Methodology
- Results
- Conclusion
- Key Findings
This project develops a Semantic Classification model for detecting fake news articles. The model uses the Word2Vec method to extract semantic relations from text and applies supervised machine learning algorithms to classify news articles as either true or fake.
The project demonstrates how understanding textual meaning (rather than just syntax) plays a critical role in making accurate decisions for misinformation detection.
The spread of fake news has become a significant challenge in today's digital world. With the massive volume of news articles published daily, it's becoming increasingly difficult to distinguish between credible and misleading information.
Goal: Build an automated system that can classify news articles as either fake or true, helping to:
- Reduce misinformation spread
- Protect public trust
- Enable efficient decision-making at scale
The project uses two datasets containing news articles:
| Dataset | Description | Records |
|---|---|---|
| True.csv | Verified true news articles | 21,417 |
| Fake.csv | Fake/misleading news articles | 23,502 |
| Total | Combined dataset | 44,919 |
Each dataset contains three columns:
- `title`: Title of the news article
- `text`: Full text content of the news article
- `date`: Date of article publication
After preprocessing (removing null values), the final dataset contains 44,898 records.
```bash
pip install numpy==1.26.4
pip install pandas==2.2.2
pip install nltk==3.9.1
pip install spacy==3.7.5
pip install scipy==1.12
pip install pydantic==2.10.5
pip install wordcloud==1.9.4
pip install scikit-learn
pip install gensim
pip install matplotlib
pip install seaborn
pip install plotly

# Download spaCy English model
python -m spacy download en_core_web_sm
```

```python
import nltk

nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger_eng')
nltk.download('punkt_tab')
```

```
1. Data Preparation
        │
        ▼
2. Text Preprocessing
        │
        ▼
3. Train-Validation Split (70/30)
        │
        ▼
4. Exploratory Data Analysis (EDA)
        │
        ▼
5. Feature Extraction (Word2Vec)
        │
        ▼
6. Model Training & Evaluation
        │
        ▼
7. Conclusion & Best Model Selection
```
- Loaded `True.csv` and `Fake.csv` datasets
- Added `news_label` column (1 = True, 0 = Fake)
- Merged both DataFrames
- Handled null values (dropped 42 rows with missing data)
- Combined `title` and `text` into a single `news_text` column
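A minimal sketch of these preparation steps, assuming pandas and the column names listed earlier (not necessarily the exact notebook code):

```python
import pandas as pd

# Load both datasets and tag them with the news_label column
true_df = pd.read_csv("True.csv")
fake_df = pd.read_csv("Fake.csv")
true_df["news_label"] = 1   # 1 = True news
fake_df["news_label"] = 0   # 0 = Fake news

# Merge, drop rows with missing values, and build the combined text field
df = pd.concat([true_df, fake_df], ignore_index=True)
df = df.dropna()
df["news_text"] = df["title"] + " " + df["text"]
```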
Text cleaning operations performed:
- Convert text to lowercase
- Remove text in square brackets
- Remove punctuation
- Remove words containing numbers
- POS Tagging: Keep only nouns (NN, NNS tags)
- Lemmatization: Reduce words to base form
- Stopword Removal: Filter out common English stopwords
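The first four cleaning steps are implemented in the project's `clean_text` function shown below. The POS-tagging, lemmatization, and stopword-removal steps are not included in that snippet, so here is a hedged sketch (assuming NLTK's `pos_tag`, `WordNetLemmatizer`, and English stopword list, rather than the exact notebook code):

```python
from nltk import pos_tag, word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))

def extract_nouns(text):
    # Keep only nouns (NN, NNS tags), lemmatize them, and drop stopwords
    tokens = pos_tag(word_tokenize(text))
    nouns = [word for word, tag in tokens if tag in ('NN', 'NNS')]
    lemmas = [lemmatizer.lemmatize(word) for word in nouns]
    return ' '.join([word for word in lemmas if word not in stop_words])
```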
```python
import re
import string

def clean_text(text):
    # Lowercase, strip bracketed text and punctuation, drop number-bearing words
    text = text.lower()
    text = re.sub(r'\[.*?\]', '', text)
    text = text.translate(str.maketrans('', '', string.punctuation))
    text = ' '.join([word for word in text.split() if not re.search(r'\d', word)])
    return text
```

- Training Set: 70% (31,428 samples)
- Validation Set: 30% (13,470 samples)
- Stratified split to maintain class distribution
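A minimal sketch of the stratified split, assuming scikit-learn's `train_test_split` and the `df` built in the earlier sketch (the `random_state` is illustrative):

```python
from sklearn.model_selection import train_test_split

# Stratify on the label so both splits keep the same True/Fake ratio
X_train_text, X_val_text, y_train, y_val = train_test_split(
    df["news_text"], df["news_label"],
    test_size=0.30, stratify=df["news_label"], random_state=42,
)
```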
- Visualized character lengths of cleaned vs. lemmatized text
- Observed significant reduction in text length after lemmatization
True News:
- Dominant words: state, government, trump, president, year, official, policy, country, people, party, election
Fake News:
- Dominant words: trump, people, president, state, clinton, time, year, news, image, obama, america, government
Top Unigrams in True News:
| Rank | Unigram | Frequency |
|---|---|---|
| 1 | trump | 33,433 |
| 2 | state | 25,471 |
| 3 | president | 19,238 |
| 4 | reuters | 16,626 |
| 5 | government | 13,963 |
Top Unigrams in Fake News:
| Rank | Unigram | Frequency |
|---|---|---|
| 1 | trump | 46,879 |
| 2 | president | 18,957 |
| 3 | people | 18,319 |
| 4 | state | 14,733 |
| 5 | clinton | 12,589 |
Top Bigrams:
- True News: "donald trump", "barack obama", "washington reuters", "president barack"
- Fake News: "donald trump", "president trump", "president obama", "trump campaign"
Top Trigrams:
- True News: "president barack obama", "president donald trump", "washington reuters president"
- Fake News: "news century wire", "president barack obama", "donald trump realdonaldtrump"
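These frequency tables can be reproduced with a simple n-gram counter; a hedged sketch using scikit-learn's `CountVectorizer` (not necessarily the notebook's exact approach):

```python
from sklearn.feature_extraction.text import CountVectorizer

def top_ngrams(texts, n=1, k=5):
    # Count all n-grams of length n and return the k most frequent
    vec = CountVectorizer(ngram_range=(n, n))
    counts = vec.fit_transform(texts)
    totals = counts.sum(axis=0).A1
    ranked = sorted(zip(vec.get_feature_names_out(), totals),
                    key=lambda pair: -pair[1])
    return ranked[:k]

# e.g. top_ngrams(true_texts, n=2, k=5) for the bigram table above,
# where true_texts is the cleaned text of the true-news subset
```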
Used Google's pre-trained Word2Vec model (`word2vec-google-news-300`):
- 300-dimensional word vectors
- Captures semantic relationships between words
- Document vectors created by averaging word vectors
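Loading the pretrained vectors via gensim's downloader API might look like this sketch (the download is roughly 1.6 GB on first use); the resulting model is then passed to the feature-extraction function below:

```python
import gensim.downloader as api

# Returns a KeyedVectors instance with 300-dimensional word vectors
w2v_model = api.load("word2vec-google-news-300")
```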
```python
import numpy as np

def get_word2vec_features(text_series, model):
    # Represent each document as the mean of its in-vocabulary word vectors
    features = []
    for text in text_series:
        words = [word for word in text.split() if word in model.key_to_index]
        if words:
            features.append(np.mean(model[words], axis=0))
        else:
            # No recognized words: fall back to a zero vector
            features.append(np.zeros(model.vector_size))
    return np.array(features)
```

Feature Dimensions:
- X_train: (31,428, 300)
- X_val: (13,470, 300)
Three supervised learning models were trained and evaluated:
- Logistic Regression (solver='liblinear')
- Decision Tree Classifier
- Random Forest Classifier
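A hedged sketch of the training loop, assuming the Word2Vec feature matrices and labels from the previous steps (the `random_state` values are illustrative):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

models = {
    "Logistic Regression": LogisticRegression(solver='liblinear'),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(random_state=42),
}

# Fit each model on the training features and report validation metrics
for name, model in models.items():
    model.fit(X_train, y_train)
    print(name)
    print(classification_report(y_val, model.predict(X_val)))
```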
| Model | Accuracy | Precision | Recall | F1-Score |
|---|---|---|---|---|
| Logistic Regression | 0.9333 | 0.9246 | 0.9365 | 0.9305 |
| Decision Tree | 0.8444 | 0.8477 | 0.8213 | 0.8343 |
| Random Forest | 0.9307 | 0.9371 | 0.9163 | 0.9266 |
Classification Report:
```
              precision    recall  f1-score   support

           0       0.94      0.93      0.94      7045
           1       0.92      0.94      0.93      6425

    accuracy                           0.93     13470
   macro avg       0.93      0.93      0.93     13470
weighted avg       0.93      0.93      0.93     13470
```
Performance Highlights:
- Achieved 93.33% accuracy on validation data
- F1-Score of 0.9305 - balanced performance between precision and recall
- Strong performance for both classes (True and Fake news)
True News Characteristics:
- Focus on official political entities and governmental processes
- Frequently sourced from established news agencies (Reuters)
- Common terms: government officials, state departments, official policies
- Bigrams/trigrams reference formal political figures and institutions
Fake News Characteristics:
- More emphasis on individuals and broader societal impacts
- Higher frequency of sensational or visually driven terms ("image", "video")
- Terms like "people", "time", "news" appear more frequently
- Often focuses on political figures in a more personal/opinionated context
The Word2Vec approach proved effective because:
- Captures contextual meaning - not just keyword presence
- Understands semantic relationships between words
- Identifies subtle patterns in how language is used differently in true vs. fake news
- Robust to vocabulary variations - similar concepts mapped to similar vectors
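A quick way to observe this semantic behavior, assuming the pretrained `w2v_model` loaded earlier:

```python
# Related words map to nearby vectors, so cosine similarity is high
print(w2v_model.similarity("government", "administration"))

# Nearest neighbors reflect meaning, not spelling
print(w2v_model.most_similar("president", topn=3))
```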
Logistic Regression was selected as the best model based on:
- Highest Accuracy (93.33%) among all models
- Best F1-Score (0.9305) - crucial for balanced fake news detection
- Balanced Precision & Recall - minimizes both false positives and false negatives
- Computational Efficiency - faster training and inference than ensemble methods
This automated fake news detection system can:
- Significantly reduce misinformation spread
- Help users make more informed decisions
- Contribute to a healthier online information ecosystem
- Scale to handle large volumes of news articles efficiently
- Word2Vec semantic features effectively capture the linguistic differences between true and fake news
- Logistic Regression outperforms more complex models (Decision Tree, Random Forest) for this task
- Text preprocessing (lemmatization, POS tagging for nouns) significantly improves model performance
- True news tends to use more formal, institutional language while fake news uses more sensational, personal language
- The model achieves >93% accuracy with balanced performance across both classes
```
├── True.csv                              # True news dataset
├── Fake.csv                              # Fake news dataset
├── clean_df.csv                          # Preprocessed data (optional save)
├── Fake_News_Detection_LeDuyKhanh.pdf    # Full notebook/report
└── README.md                             # This file
```
- Experiment with BERT or Transformer-based embeddings for better semantic understanding
- Implement deep learning models (LSTM, CNN) for text classification
- Add ensemble methods combining multiple approaches
- Include additional features (source credibility, publication patterns)
- Deploy as a real-time API for news verification
- Word2Vec: Google's Word2Vec
- Dataset: True.csv and Fake.csv news article collections
- Libraries: scikit-learn, gensim, NLTK, spaCy
License: Academic/Educational Use
Contact: Khanh Le D.