Fake News Detection using Semantic Classification

Author: Khanh Le D.

Overview

This project develops a Semantic Classification model for detecting fake news articles. The model uses the Word2Vec method to extract semantic relations from text and applies supervised machine learning algorithms to classify news articles as either true or fake.

The project shows that modeling what a text means, rather than only its surface form, helps separate credible reporting from misinformation.

Business Objective

The spread of fake news has become a significant challenge in today's digital world. With the massive volume of news articles published daily, it's becoming increasingly difficult to distinguish between credible and misleading information.

Goal: Build an automated system that can classify news articles as either fake or true, helping to:

Reduce misinformation spread
Protect public trust
Enable efficient decision-making at scale

Dataset

The project uses two datasets containing news articles:

Dataset	Description	Records
`True.csv`	Verified true news articles	21,417
`Fake.csv`	Fake/misleading news articles	23,502
Total	Combined dataset	44,919

Data Dictionary

Each dataset contains three columns:

title: Title of the news article
text: Full text content of the news article
date: Date of article publication

After preprocessing (removing null values), the final dataset contains 44,898 records.

Installation

Required Libraries

pip install numpy==1.26.4
pip install pandas==2.2.2
pip install nltk==3.9.1
pip install spacy==3.7.5
pip install scipy==1.12
pip install pydantic==2.10.5
pip install wordcloud==1.9.4
pip install scikit-learn
pip install gensim
pip install matplotlib
pip install seaborn
pip install plotly

# Download spaCy English model
python -m spacy download en_core_web_sm

NLTK Downloads

import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger_eng')
nltk.download('punkt_tab')

Project Pipeline

1. Data Preparation
       │
       ▼
2. Text Preprocessing
       │
       ▼
3. Train-Validation Split (70/30)
       │
       ▼
4. Exploratory Data Analysis (EDA)
       │
       ▼
5. Feature Extraction (Word2Vec)
       │
       ▼
6. Model Training & Evaluation
       │
       ▼
7. Conclusion & Best Model Selection

Methodology

1. Data Preparation

Loaded True.csv and Fake.csv datasets
Added news_label column (1 = True, 0 = Fake)
Merged both DataFrames
Handled null values by dropping rows with missing data (from 44,919 to 44,898 records, a difference of 21 rows)
Combined title and text into single news_text column
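
The preparation steps above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the function name `prepare_news_df` is hypothetical, the small in-memory frames stand in for `pd.read_csv("True.csv")` and `pd.read_csv("Fake.csv")`, and dropping rows only when `title` or `text` is missing is an assumption about how nulls were handled.

```python
import pandas as pd


def prepare_news_df(true_df: pd.DataFrame, fake_df: pd.DataFrame) -> pd.DataFrame:
    """Label, merge, and clean the two article DataFrames."""
    # Work on copies so the caller's frames are not modified.
    true_df = true_df.copy()
    fake_df = fake_df.copy()

    # 1 = true news, 0 = fake news (the labelling used in this README).
    true_df["news_label"] = 1
    fake_df["news_label"] = 0

    # Stack both frames and renumber the rows.
    df = pd.concat([true_df, fake_df], ignore_index=True)

    # Assumption: a row is dropped when its title or text is missing.
    df = df.dropna(subset=["title", "text"]).reset_index(drop=True)

    # Combine title and body into one field for downstream cleaning.
    df["news_text"] = df["title"] + " " + df["text"]
    return df


# Tiny stand-ins for the real CSV files.
true_sample = pd.DataFrame({
    "title": ["Senate passes bill"],
    "text": ["The vote was 60-40."],
    "date": ["2017-01-01"],
})
fake_sample = pd.DataFrame({
    "title": ["Shocking claim", None],
    "text": ["You won't believe it.", "No title here."],
    "date": ["2017-01-02", "2017-01-03"],
})
sample_df = prepare_news_df(true_sample, fake_sample)
print(sample_df[["news_label", "news_text"]])
```

The row with the missing title is removed, so the sample ends up with one true and one fake article.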

2. Text Preprocessing

Text cleaning operations performed:

Convert text to lowercase
Remove text in square brackets
Remove punctuation
Remove words containing numbers
POS Tagging: Keep only nouns (NN, NNS tags)
Lemmatization: Reduce words to base form
Stopword Removal: Filter out common English stopwords

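import re
import string

def clean_text(text):
    text = text.lower()                    # lowercase
    text = re.sub(r'\[.*?\]', '', text)    # drop text in square brackets
    # strip punctuation
    text = text.translate(str.maketrans('', '', string.punctuation))
    # drop any word that contains a digit
    text = ' '.join(word for word in text.split() if not re.search(r'\d', word))
    return text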
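
The function above covers the first four cleaning operations. The remaining three (noun filtering, lemmatization, stopword removal) might look roughly like the sketch below. It assumes NLTK is the library used for all three steps, which this README does not confirm (spaCy is also installed). The names `keep_noun_lemmas` and `nltk_noun_lemmas` are hypothetical, and the toy tagger and lemmatizer exist only so the logic can be run without downloading NLTK data.

```python
from typing import Callable, Iterable, List, Tuple


def keep_noun_lemmas(
    tokens: List[str],
    pos_tag_fn: Callable[[List[str]], List[Tuple[str, str]]],
    lemmatize_fn: Callable[[str], str],
    stop_words: Iterable[str],
) -> str:
    """Keep NN/NNS tokens, lemmatize them, and drop stopwords."""
    stop_set = set(stop_words)
    tagged = pos_tag_fn(tokens)
    lemmas = [lemmatize_fn(word) for word, tag in tagged if tag in {"NN", "NNS"}]
    return " ".join(lemma for lemma in lemmas if lemma not in stop_set)


def nltk_noun_lemmas(text: str) -> str:
    """Same step wired to NLTK; needs the downloads listed under Installation."""
    import nltk
    from nltk.corpus import stopwords
    from nltk.stem import WordNetLemmatizer

    lemmatizer = WordNetLemmatizer()
    return keep_noun_lemmas(
        nltk.word_tokenize(text),
        nltk.pos_tag,
        lambda word: lemmatizer.lemmatize(word, pos="n"),
        stopwords.words("english"),
    )


# Toy tagger and lemmatizer (illustration only, not real NLTK output).
toy_tags = {"officials": "NNS", "said": "VBD", "the": "DT", "policy": "NN", "time": "NN"}
result = keep_noun_lemmas(
    ["officials", "said", "the", "policy", "time"],
    lambda toks: [(t, toy_tags[t]) for t in toks],
    lambda w: w[:-1] if w.endswith("s") else w,
    {"time"},
)
print(result)
```

In the toy run, the verb and determiner are filtered out by the tag check, "officials" is reduced to its singular form, and "time" is removed as a stopword.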

3. Train-Validation Split

Training Set: 70% (31,428 samples)
Validation Set: 30% (13,470 samples)
Stratified split to maintain class distribution
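
A stratified 70/30 split of this kind is commonly done with scikit-learn's `train_test_split`. The sketch below uses a toy corpus of 100 labels rather than the real data, and the `random_state` value is a hypothetical seed chosen only for reproducibility.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Toy corpus: 60 fake (0) and 40 true (1) documents standing in for news_text.
texts = [f"doc {i}" for i in range(100)]
labels = [0] * 60 + [1] * 40

X_train, X_val, y_train, y_val = train_test_split(
    texts,
    labels,
    test_size=0.3,     # 70/30 split, as described above
    stratify=labels,   # keep the fake/true ratio in both subsets
    random_state=42,   # hypothetical seed, only for reproducibility
)

train_counts = Counter(y_train)
val_counts = Counter(y_val)
print("train:", dict(train_counts), "validation:", dict(val_counts))
```

Because of `stratify=labels`, both subsets keep the 60/40 class ratio of the toy corpus, which is the property the stratified split is meant to guarantee.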

4. Exploratory Data Analysis

Character Length Distribution

Visualized character lengths of cleaned vs. lemmatized text
Observed a large reduction in text length after lemmatization, which is expected if the lemmatized text also reflects the noun filter and stopword removal

Word Cloud Analysis (Top 40 Words)

True News:

Dominant words: state, government, trump, president, year, official, policy, country, people, party, election

Fake News:

Dominant words: trump, people, president, state, clinton, time, year, news, image, obama, america, government

N-gram Analysis

Top Unigrams in True News:

Rank	Unigram	Frequency
1	trump	33,433
2	state	25,471
3	president	19,238
4	reuters	16,626
5	government	13,963

Top Unigrams in Fake News:

Rank	Unigram	Frequency
1	trump	46,879
2	president	18,957
3	people	18,319
4	state	14,733
5	clinton	12,589

Top Bigrams:

True News: "donald trump", "barack obama", "washington reuters", "president barack"
Fake News: "donald trump", "president trump", "president obama", "trump campaign"

Top Trigrams:

True News: "president barack obama", "president donald trump", "washington reuters president"
Fake News: "news century wire", "president barack obama", "donald trump realdonaldtrump"
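
Frequency tables like the ones above can be produced with a simple sliding-window count. The sketch below uses only the standard library and a three-sentence toy corpus. It is not necessarily how the notebook computed its n-grams (a vectorizer from scikit-learn would be a common alternative), and `top_ngrams` is a hypothetical helper name.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def top_ngrams(docs: Iterable[str], n: int, k: int = 10) -> List[Tuple[str, int]]:
    """Return the k most frequent word n-grams across whitespace-tokenized docs."""
    counts: Counter = Counter()
    for doc in docs:
        tokens = doc.split()
        # Slide a window of n tokens over the document.
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts.most_common(k)


docs = [
    "president donald trump statement",
    "president donald trump campaign",
    "donald trump campaign rally",
]
top_unigrams = top_ngrams(docs, n=1, k=2)
top_bigrams = top_ngrams(docs, n=2, k=1)
top_trigrams = top_ngrams(docs, n=3, k=1)
print(top_unigrams, top_bigrams, top_trigrams)
```

Running it on the already cleaned, lemmatized text (rather than raw text) is what would yield noun-only n-grams like those reported here.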

5. Feature Extraction (Word2Vec)

Used Google's pre-trained Word2Vec model (word2vec-google-news-300):

300-dimensional word vectors
Captures semantic relationships between words
Document vectors created by averaging word vectors

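import numpy as np

def get_word2vec_features(text_series, model):
    """Average the Word2Vec vectors of each document's in-vocabulary words."""
    features = []
    for text in text_series:
        # Keep only tokens the pre-trained model knows
        words = [word for word in text.split() if word in model.key_to_index]
        if words:
            features.append(np.mean(model[words], axis=0))
        else:
            # No known words: fall back to a zero vector
            features.append(np.zeros(model.vector_size))
    return np.array(features)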

Feature Dimensions:

X_train: (31,428, 300)
X_val: (13,470, 300)
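
To make the averaging step concrete, here is a small worked example with 2-dimensional toy vectors instead of the 300-dimensional pre-trained ones. `ToyVectors` is a hypothetical stand-in that mimics only the three things the feature function relies on (`key_to_index`, `vector_size`, and list indexing), and `average_vector` is a single-document version of the same logic.

```python
import numpy as np


class ToyVectors:
    """Minimal stand-in exposing the KeyedVectors attributes used above."""

    def __init__(self, table: dict):
        self.key_to_index = {word: i for i, word in enumerate(table)}
        self.vector_size = len(next(iter(table.values())))
        self._matrix = np.array(list(table.values()), dtype=float)

    def __getitem__(self, words):
        # Return one row per requested word, as a 2-D array.
        return self._matrix[[self.key_to_index[w] for w in words]]


def average_vector(text: str, model) -> np.ndarray:
    """Mean of the in-vocabulary word vectors, or zeros if none are known."""
    words = [w for w in text.split() if w in model.key_to_index]
    if not words:
        return np.zeros(model.vector_size)
    return np.mean(model[words], axis=0)


toy = ToyVectors({"government": [1.0, 0.0], "policy": [0.0, 1.0]})
known = average_vector("government policy blorp", toy)  # "blorp" is skipped
unknown = average_vector("blorp", toy)                  # falls back to zeros
print(known, unknown)
```

With the real model, one plausible way to obtain the vectors is `gensim.downloader.load("word2vec-google-news-300")`, which returns a `KeyedVectors` object. Whether the notebook loads it this way is an assumption.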

6. Model Training

Three supervised learning models were trained and evaluated:

Logistic Regression (solver='liblinear')
Decision Tree Classifier
Random Forest Classifier
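
A training and evaluation loop for these three models could look like the sketch below. It runs on synthetic data from `make_classification` rather than the Word2Vec features, so its scores say nothing about the results reported in this README. Apart from `solver='liblinear'`, all hyperparameters and the `random_state` values are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the 300-dimensional document vectors.
X, y = make_classification(
    n_samples=400, n_features=20, n_informative=10, random_state=0
)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

models = {
    "Logistic Regression": LogisticRegression(solver="liblinear"),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(random_state=0),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_val)
    # Precision, recall, and F1 are computed for the positive class (label 1).
    results[name] = {
        "accuracy": accuracy_score(y_val, y_pred),
        "precision": precision_score(y_val, y_pred),
        "recall": recall_score(y_val, y_pred),
        "f1": f1_score(y_val, y_pred),
    }

for name, scores in results.items():
    print(name, {metric: round(value, 4) for metric, value in scores.items()})
```

Keeping the metrics in one dictionary per model makes it straightforward to build a comparison table afterwards.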

Results

Model Performance Comparison

Model	Accuracy	Precision	Recall	F1-Score
Logistic Regression	0.9333	0.9246	0.9365	0.9305
Decision Tree	0.8444	0.8477	0.8213	0.8343
Random Forest	0.9307	0.9371	0.9163	0.9266
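
As a quick consistency check, the F1-Score column can be recomputed from the Precision and Recall columns, since F1 is their harmonic mean. The helper name `f1_from` is hypothetical; the numbers are copied from the table above.

```python
def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from the comparison table.
reported = {
    "Logistic Regression": (0.9246, 0.9365, 0.9305),
    "Decision Tree": (0.8477, 0.8213, 0.8343),
    "Random Forest": (0.9371, 0.9163, 0.9266),
}

recomputed = {name: f1_from(p, r) for name, (p, r, _) in reported.items()}
for name, (_, _, f1) in reported.items():
    print(f"{name}: reported {f1:.4f}, recomputed {recomputed[name]:.4f}")
```

All three recomputed values agree with the reported F1-Scores to four decimal places.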

Best Model: Logistic Regression

Classification Report:

              precision    recall  f1-score   support

           0       0.94      0.93      0.94      7045
           1       0.92      0.94      0.93      6425

    accuracy                           0.93     13470
   macro avg       0.93      0.93      0.93     13470
weighted avg       0.93      0.93      0.93     13470

Performance Highlights:

Achieved 93.33% accuracy on validation data
F1-Score of 0.9305 - balanced performance between precision and recall
Strong performance for both classes (True and Fake news)

Conclusion

Patterns Observed

True News Characteristics:

Focus on official political entities and governmental processes
Frequently sourced from established news agencies (Reuters)
Common terms: government officials, state departments, official policies
Bigrams/trigrams reference formal political figures and institutions

Fake News Characteristics:

More emphasis on individuals and broader societal impacts
Higher frequency of sensational or visually driven terms ("image", "video")
Terms like "people", "time", "news" appear more frequently
Often focuses on political figures in a more personal/opinionated context

Why Semantic Classification Works

The Word2Vec approach proved effective because:

Captures word meaning learned from context - not just keyword presence
Understands semantic relationships between words
Identifies subtle patterns in how language is used differently in true vs. fake news
Robust to vocabulary variations - similar concepts mapped to similar vectors

Best Model Selection Rationale

Logistic Regression was selected as the best model based on:

Highest Accuracy (93.33%) among the three models
Best F1-Score (0.9305) - crucial for balanced fake news detection
Balanced Precision & Recall (0.9246 / 0.9365) - keeps both false positives and false negatives low, although Random Forest has slightly higher precision (0.9371)
Computational Efficiency - faster training and inference than ensemble methods

Impact

This automated fake news detection system can:

Significantly reduce misinformation spread
Help users make more informed decisions
Contribute to a healthier online information ecosystem
Scale to handle large volumes of news articles efficiently

Key Findings

Word2Vec semantic features effectively capture the linguistic differences between true and fake news
Logistic Regression narrowly edges out Random Forest (0.9333 vs. 0.9307 accuracy) and clearly beats the Decision Tree (0.8444) on this task
Text preprocessing (lemmatization, POS tagging for nouns) significantly improves model performance
True news tends to use more formal, institutional language while fake news uses more sensational, personal language
The model achieves >93% accuracy with balanced performance across both classes

File Structure

├── True.csv                    # True news dataset
├── Fake.csv                    # Fake news dataset
├── clean_df.csv                # Preprocessed data (optional save)
├── Fake_News_Detection_LeDuyKhanh.pdf  # Full notebook/report
└── README.md                   # This file

Future Improvements

Experiment with BERT or Transformer-based embeddings for better semantic understanding
Implement deep learning models (LSTM, CNN) for text classification
Add ensemble methods combining multiple approaches
Include additional features (source credibility, publication patterns)
Deploy as a real-time API for news verification

References

Word2Vec: Google's pre-trained word2vec-google-news-300 vectors
Dataset: True.csv and Fake.csv news article collections
Libraries: scikit-learn, gensim, NLTK, spaCy

License: Academic/Educational Use

Contact: Khanh Le D.
