khanhney/semantic-classification

Fake News Detection using Semantic Classification

Author: Khanh Le D.


Overview

This project develops a Semantic Classification model for detecting fake news articles. The model uses the Word2Vec method to extract semantic relations from text and applies supervised machine learning algorithms to classify news articles as either true or fake.

The project demonstrates how understanding textual meaning (rather than just syntax) plays a critical role in making accurate decisions for misinformation detection.


Business Objective

The spread of fake news has become a significant challenge in today's digital world. With the massive volume of news articles published daily, it's becoming increasingly difficult to distinguish between credible and misleading information.

Goal: Build an automated system that can classify news articles as either fake or true, helping to:

  • Reduce misinformation spread
  • Protect public trust
  • Enable efficient decision-making at scale

Dataset

The project uses two datasets containing news articles:

Dataset     Description                      Records
True.csv    Verified true news articles      21,417
Fake.csv    Fake/misleading news articles    23,502
Total       Combined dataset                 44,919

Data Dictionary

Each dataset contains three columns:

  • title: Title of the news article
  • text: Full text content of the news article
  • date: Date of article publication

After preprocessing (removing null values), the final dataset contains 44,898 records.


Installation

Required Libraries

pip install numpy==1.26.4
pip install pandas==2.2.2
pip install nltk==3.9.1
pip install spacy==3.7.5
pip install scipy==1.12
pip install pydantic==2.10.5
pip install wordcloud==1.9.4
pip install scikit-learn
pip install gensim
pip install matplotlib
pip install seaborn
pip install plotly

# Download spaCy English model
python -m spacy download en_core_web_sm

NLTK Downloads

import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger_eng')
nltk.download('punkt_tab')

Project Pipeline

1. Data Preparation
       │
       ▼
2. Text Preprocessing
       │
       ▼
3. Train-Validation Split (70/30)
       │
       ▼
4. Exploratory Data Analysis (EDA)
       │
       ▼
5. Feature Extraction (Word2Vec)
       │
       ▼
6. Model Training & Evaluation
       │
       ▼
7. Conclusion & Best Model Selection

Methodology

1. Data Preparation

  • Loaded True.csv and Fake.csv datasets
  • Added news_label column (1 = True, 0 = Fake)
  • Merged both DataFrames
  • Handled null values by dropping rows with missing data (from 44,919 to 44,898 records, a difference of 21 rows)
  • Combined title and text into single news_text column
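
The preparation steps above can be sketched with pandas as follows. This is a minimal illustration, not the notebook's actual code: the function name prepare_data, the space used to join title and text, and the in-memory sample rows are all assumptions made for the example.

```python
import pandas as pd


def prepare_data(true_df: pd.DataFrame, fake_df: pd.DataFrame) -> pd.DataFrame:
    """Label, merge, drop nulls, and build a combined text column."""
    true_df = true_df.assign(news_label=1)  # 1 = True
    fake_df = fake_df.assign(news_label=0)  # 0 = Fake
    df = pd.concat([true_df, fake_df], ignore_index=True)
    # Drop any row that has a missing value in any column
    df = df.dropna().reset_index(drop=True)
    # Assumption: title and text are joined with a single space
    df["news_text"] = df["title"] + " " + df["text"]
    return df


# Made-up rows standing in for True.csv and Fake.csv
true_sample = pd.DataFrame(
    {"title": ["Budget passes"], "text": ["The senate voted."], "date": ["2017-01-01"]}
)
fake_sample = pd.DataFrame(
    {
        "title": ["Shock claim", None],
        "text": ["Unverified story.", "No title here."],
        "date": ["2017-01-02", "2017-01-03"],
    }
)
clean_df = prepare_data(true_sample, fake_sample)
```

With the real files, the same function would presumably be called as prepare_data(pd.read_csv("True.csv"), pd.read_csv("Fake.csv")).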

2. Text Preprocessing

Text cleaning operations performed:

  1. Convert text to lowercase
  2. Remove text in square brackets
  3. Remove punctuation
  4. Remove words containing numbers
  5. POS Tagging: Keep only nouns (NN, NNS tags)
  6. Lemmatization: Reduce words to base form
  7. Stopword Removal: Filter out common English stopwords
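
Steps 1-4 are implemented in clean_text:

import re
import string

def clean_text(text):
    text = text.lower()
    # Remove text in square brackets (non-greedy match)
    text = re.sub(r'\[.*?\]', '', text)
    text = text.translate(str.maketrans('', '', string.punctuation))
    # Drop every whitespace-separated token that contains a digit
    text = ' '.join(word for word in text.split() if not re.search(r'\d', word))
    return text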
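
Steps 5-7 (noun filtering, lemmatization, stopword removal) are not shown in the README. A rough sketch of that stage is given below. To keep it runnable without downloading NLTK data, the tagger output, the lemmatizer, and the stopword set are passed in as arguments; the function name nouns_lemmatized and the toy lemmatizer are hypothetical and only illustrate the order of operations.

```python
from typing import Callable, Iterable, Set, Tuple

NOUN_TAGS = {"NN", "NNS"}  # tags kept in step 5


def nouns_lemmatized(
    tagged: Iterable[Tuple[str, str]],
    lemmatize: Callable[[str], str],
    stop_words: Set[str],
) -> str:
    """Keep nouns, lemmatize them, then drop stopwords."""
    kept = []
    for word, tag in tagged:
        if tag not in NOUN_TAGS:
            continue  # step 5: discard non-nouns
        lemma = lemmatize(word)  # step 6: reduce to base form
        if lemma not in stop_words:  # step 7: filter stopwords
            kept.append(lemma)
    return " ".join(kept)


def toy_lemmatize(word: str) -> str:
    """Crude stand-in for a real lemmatizer: strips a trailing 's'."""
    return word[:-1] if word.endswith("s") else word


# Hand-written (word, tag) pairs in the shape a POS tagger returns
toy_tagged = [
    ("officials", "NNS"),
    ("said", "VBD"),
    ("the", "DT"),
    ("policy", "NN"),
    ("time", "NN"),
]
result = nouns_lemmatized(toy_tagged, toy_lemmatize, {"time"})
```

In a real run the three arguments would likely come from NLTK: nltk.pos_tag(nltk.word_tokenize(text)) for the tagged tokens, WordNetLemmatizer().lemmatize for the lemmatizer, and set(stopwords.words('english')) for the stopword set.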

3. Train-Validation Split

  • Training Set: 70% (31,428 samples)
  • Validation Set: 30% (13,470 samples)
  • Stratified split to maintain class distribution
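
A stratified 70/30 split of this kind can be expressed with scikit-learn's train_test_split. The sketch below uses toy data, and random_state=42 is an assumed value, since the README does not state the seed.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 100 documents, 60 labelled fake (0) and 40 labelled true (1)
texts = np.array([f"doc {i}" for i in range(100)])
labels = np.array([0] * 60 + [1] * 40)

# stratify=labels keeps the 60/40 class ratio in both partitions
X_train, X_val, y_train, y_val = train_test_split(
    texts, labels, test_size=0.30, stratify=labels, random_state=42
)
```

Stratifying matters here because the two classes are not the same size; without it, the validation set's class balance would vary with the random seed.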

4. Exploratory Data Analysis

Character Length Distribution

  • Visualized character lengths of cleaned vs. lemmatized text
  • Observed significant reduction in text length after lemmatization

Word Cloud Analysis (Top 40 Words)

True News:

  • Dominant words: state, government, trump, president, year, official, policy, country, people, party, election

Fake News:

  • Dominant words: trump, people, president, state, clinton, time, year, news, image, obama, america, government

N-gram Analysis

Top Unigrams in True News:

Rank    Unigram       Frequency
1       trump         33,433
2       state         25,471
3       president     19,238
4       reuters       16,626
5       government    13,963

Top Unigrams in Fake News:

Rank    Unigram       Frequency
1       trump         46,879
2       president     18,957
3       people        18,319
4       state         14,733
5       clinton       12,589

Top Bigrams:

  • True News: "donald trump", "barack obama", "washington reuters", "president barack"
  • Fake News: "donald trump", "president trump", "president obama", "trump campaign"

Top Trigrams:

  • True News: "president barack obama", "president donald trump", "washington reuters president"
  • Fake News: "news century wire", "president barack obama", "donald trump realdonaldtrump"
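
N-gram frequencies like those above can be counted with the standard library alone. The helper below is a minimal sketch; top_ngrams is a hypothetical name and the three toy documents are made up, so the counts do not correspond to the dataset.

```python
from collections import Counter
from typing import List, Tuple


def top_ngrams(docs: List[str], n: int, k: int) -> List[Tuple[str, int]]:
    """Return the k most frequent word n-grams across all documents."""
    counts: Counter = Counter()
    for doc in docs:
        tokens = doc.split()
        # Slide a window of n tokens over the document
        counts.update(
            " ".join(tokens[i : i + n]) for i in range(len(tokens) - n + 1)
        )
    return counts.most_common(k)


toy_docs = [
    "president donald trump",
    "donald trump campaign",
    "president barack obama",
]
top_bigrams = top_ngrams(toy_docs, n=2, k=1)
```

Running it separately on the true and fake subsets, with n set to 1, 2, and 3, would produce the three kinds of listings shown in this section.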

5. Feature Extraction (Word2Vec)

Used Google's pre-trained Word2Vec model (word2vec-google-news-300):

  • 300-dimensional word vectors
  • Captures semantic relationships between words
  • Document vectors created by averaging word vectors
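
import numpy as np

def get_word2vec_features(text_series, model):
    """Average the Word2Vec vectors of each document's in-vocabulary words."""
    features = []
    for text in text_series:
        # Keep only words the pre-trained model has a vector for
        words = [word for word in text.split() if word in model.key_to_index]
        if words:
            features.append(np.mean(model[words], axis=0))
        else:
            # No known words: fall back to a zero vector
            features.append(np.zeros(model.vector_size))
    return np.array(features)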
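
To make the averaging concrete, the toy example below applies the same idea to made-up 3-dimensional vectors stored in a plain dictionary. The dictionary stands in for the pre-trained model, and average_vector is a hypothetical helper written only for this illustration.

```python
import numpy as np

# Made-up 3-d vectors; the real model uses 300 dimensions
toy_vectors = {
    "government": np.array([1.0, 0.0, 0.0]),
    "policy": np.array([0.0, 1.0, 0.0]),
}


def average_vector(text: str, vectors: dict, dim: int) -> np.ndarray:
    """Mean of the known word vectors, or zeros if no word is known."""
    known = [vectors[w] for w in text.split() if w in vectors]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)


# "zzz" is out of vocabulary and is ignored in the average
doc_vec = average_vector("government policy zzz", toy_vectors, dim=3)
# No known words at all: the zero-vector fallback applies
empty_vec = average_vector("zzz qqq", toy_vectors, dim=3)
```

Here doc_vec is the midpoint of the two known vectors, and empty_vec is all zeros. The actual model named above is available through gensim's downloader, for example gensim.downloader.load("word2vec-google-news-300"), although the README does not say how it was loaded.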

Feature Dimensions:

  • X_train: (31,428, 300)
  • X_val: (13,470, 300)

6. Model Training

Three supervised learning models were trained and evaluated:

  1. Logistic Regression (solver='liblinear')
  2. Decision Tree Classifier
  3. Random Forest Classifier
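
A training and evaluation loop over these three models might look like the sketch below. It runs on synthetic features from make_classification rather than the real document vectors, and every hyperparameter other than solver='liblinear' (including random_state=0) is an assumption, so the scores it produces are unrelated to the reported results.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the 300-d Word2Vec document vectors
X, y = make_classification(
    n_samples=400, n_features=20, n_informative=10, random_state=0
)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

models = {
    "Logistic Regression": LogisticRegression(solver="liblinear"),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(random_state=0),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_val)
    # Same four metrics as the comparison table in the Results section
    results[name] = {
        "accuracy": accuracy_score(y_val, pred),
        "precision": precision_score(y_val, pred),
        "recall": recall_score(y_val, pred),
        "f1": f1_score(y_val, pred),
    }
```

Keeping the models in a dictionary makes it straightforward to add another classifier and compare all of them on the same validation split.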

Results

Model Performance Comparison

Model                  Accuracy    Precision    Recall    F1-Score
Logistic Regression    0.9333      0.9246       0.9365    0.9305
Decision Tree          0.8444      0.8477       0.8213    0.8343
Random Forest          0.9307      0.9371       0.9163    0.9266
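
The F1-Score column is the harmonic mean of the precision and recall columns, F1 = 2PR / (P + R). As a quick consistency check, the snippet below recomputes it from the reported precision and recall; the helper name f1_from is introduced here only for illustration.

```python
def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from the comparison table
reported = {
    "Logistic Regression": (0.9246, 0.9365, 0.9305),
    "Decision Tree": (0.8477, 0.8213, 0.8343),
    "Random Forest": (0.9371, 0.9163, 0.9266),
}

recomputed = {
    name: round(f1_from(p, r), 4) for name, (p, r, _) in reported.items()
}
```

For all three models the recomputed value agrees with the reported F1-Score to four decimal places.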

Best Model: Logistic Regression

Classification Report:

              precision    recall  f1-score   support

           0       0.94      0.93      0.94      7045
           1       0.92      0.94      0.93      6425

    accuracy                           0.93     13470
   macro avg       0.93      0.93      0.93     13470
weighted avg       0.93      0.93      0.93     13470

Performance Highlights:

  • Achieved 93.33% accuracy on validation data
  • F1-Score of 0.9305 - balanced performance between precision and recall
  • Strong performance for both classes (True and Fake news)

Conclusion

Patterns Observed

True News Characteristics:

  • Focus on official political entities and governmental processes
  • Frequently sourced from established news agencies (Reuters)
  • Common terms: government officials, state departments, official policies
  • Bigrams/trigrams reference formal political figures and institutions

Fake News Characteristics:

  • More emphasis on individuals and broader societal impacts
  • Higher frequency of sensational or visually-driven terms ("image", "video")
  • Terms like "people", "time", "news" appear more frequently
  • Often focuses on political figures in a more personal/opinionated context

Why Semantic Classification Works

The Word2Vec approach proved effective because:

  1. Captures meaning learned from the contexts in which words appear - not just keyword presence
  2. Understands semantic relationships between words
  3. Identifies subtle patterns in how language is used differently in true vs. fake news
  4. Robust to vocabulary variations - similar concepts mapped to similar vectors
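
The claim that similar concepts map to similar vectors is usually measured with cosine similarity. The sketch below shows the computation on made-up 3-dimensional vectors; the numbers are invented for illustration and are not taken from the pre-trained model.

```python
import numpy as np


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors (1.0 = same direction)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Invented vectors: two related terms point in nearly the same direction
government = np.array([0.9, 0.1, 0.0])
administration = np.array([0.8, 0.2, 0.1])
image = np.array([0.0, 0.2, 0.9])

sim_related = cosine_similarity(government, administration)
sim_unrelated = cosine_similarity(government, image)
```

With these toy values sim_related is much larger than sim_unrelated. With a gensim KeyedVectors model, the equivalent call would be model.similarity(word1, word2).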

Best Model Selection Rationale

Logistic Regression was selected as the best model based on:

  • Highest Accuracy (93.33%) among all models
  • Best F1-Score (0.9305) - crucial for balanced fake news detection
  • Balanced Precision & Recall - minimizes both false positives and false negatives
  • Computational Efficiency - faster training and inference than ensemble methods

Impact

This automated fake news detection system can:

  • Significantly reduce misinformation spread
  • Help users make more informed decisions
  • Contribute to a healthier online information ecosystem
  • Scale to handle large volumes of news articles efficiently

Key Findings

  1. Word2Vec semantic features effectively capture the linguistic differences between true and fake news
  2. Logistic Regression outperforms more complex models (Decision Tree, Random Forest) for this task
  3. Text preprocessing (lemmatization, POS tagging for nouns) significantly improves model performance
  4. True news tends to use more formal, institutional language while fake news uses more sensational, personal language
  5. The model achieves >93% accuracy with balanced performance across both classes

File Structure

├── True.csv                    # True news dataset
├── Fake.csv                    # Fake news dataset
├── clean_df.csv                # Preprocessed data (optional save)
├── Fake_News_Detection_LeDuyKhanh.pdf  # Full notebook/report
└── README.md                   # This file

Future Improvements

  1. Experiment with BERT or Transformer-based embeddings for better semantic understanding
  2. Implement deep learning models (LSTM, CNN) for text classification
  3. Add ensemble methods combining multiple approaches
  4. Include additional features (source credibility, publication patterns)
  5. Deploy as a real-time API for news verification

References

  • Word2Vec: Google's Word2Vec
  • Dataset: True.csv and Fake.csv news article collections
  • Libraries: scikit-learn, gensim, NLTK, spaCy

License: Academic/Educational Use

Contact: Khanh Le D.
