This project classifies news articles into four categories — World, Sports, Business, and Sci/Tech — using the AG News dataset. It features text preprocessing, TF-IDF vectorization, and classification using Logistic Regression and a feedforward Neural Network.
- Text preprocessing: tokenization, stopword removal, lemmatization
- TF-IDF vectorization for feature extraction
- Classification using Logistic Regression and a feedforward Neural Network
- Visualizations: class distribution and word clouds per category
- Model evaluation with accuracy, classification report, and confusion matrix
The goal is to build a classifier to predict the category of a news article based on its title and description.
- Data cleaning and preprocessing include tokenization, stopword removal, and lemmatization.
- Feature extraction using TF-IDF vectorization (uni-grams and bi-grams).
- Classification using:
  - Logistic Regression
  - Feedforward Neural Network with dropout and L2 regularization
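The preprocessing steps listed above (tokenization, stopword removal, lemmatization) can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the `preprocess` function, the tiny `STOP_WORDS` set, and the `identity` lemmatizer are hypothetical stand-ins so the snippet runs without any downloaded NLTK data.

```python
import re

# Tiny placeholder stopword list (an assumption for this sketch); the project
# presumably relies on a full English stopword list such as NLTK's.
STOP_WORDS = {"a", "an", "the", "of", "in", "on", "to", "and", "is", "are", "for"}


def identity(token):
    """Stand-in lemmatizer that returns the token unchanged."""
    return token


def preprocess(text, stop_words=STOP_WORDS, lemmatize=identity):
    """Lowercase, tokenize, drop stopwords, and lemmatize one document."""
    tokens = re.findall(r"[a-z]+", text.lower())         # tokenization
    tokens = [t for t in tokens if t not in stop_words]  # stopword removal
    return " ".join(lemmatize(t) for t in tokens)        # lemmatization


cleaned = preprocess("The markets are rallying in Asia")
print(cleaned)  # markets rallying asia
```

To get real lemmatization, a callable such as `nltk.stem.WordNetLemmatizer().lemmatize` could be passed as `lemmatize`, provided the WordNet corpus has been downloaded first.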
The dataset consists of CSV files (`train.csv` and `test.csv`) with the following columns:
- `class_index`: Numeric class label (1 to 4)
- `title`: News article title
- `description`: News article description
Class mapping:
| class_index | category_name |
|---|---|
| 1 | World |
| 2 | Sports |
| 3 | Business |
| 4 | Sci/Tech |
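As a rough sketch of how this mapping might be applied with pandas, the snippet below reads a two-row in-memory sample (made-up rows standing in for `train.csv`), attaches category names, and joins the title and description into a single text field. The names `CATEGORY_NAMES`, `text`, and `label` are assumptions for illustration. The zero-based `label` column is also an assumption: it is one common way to make the 1 to 4 labels usable with a sparse categorical crossentropy loss.

```python
import io

import pandas as pd

# Class mapping from the table above
CATEGORY_NAMES = {1: "World", 2: "Sports", 3: "Business", 4: "Sci/Tech"}

# Two made-up rows standing in for train.csv
sample_csv = io.StringIO(
    "class_index,title,description\n"
    "2,Home team wins final,A late goal decided the match.\n"
    "3,Shares climb,Investors welcomed the earnings report.\n"
)

df = pd.read_csv(sample_csv)
df["category_name"] = df["class_index"].map(CATEGORY_NAMES)
df["text"] = df["title"] + " " + df["description"]  # model input: title + description
df["label"] = df["class_index"] - 1                 # zero-based labels (assumption)

print(df[["class_index", "category_name", "label"]])
```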
1. Clone the repository:
git clone https://github.com/your-username/news-category-classification.git
cd news-category-classification
2. (Optional) Create and activate a virtual environment:
python -m venv venv
# On Linux/macOS
source venv/bin/activate
# On Windows
venv\Scripts\activate
3. Install the required packages directly with pip:
pip install pandas numpy matplotlib seaborn wordcloud nltk scikit-learn tensorflow
- Update the dataset file paths inside the Python script:
Open the main script file (`main.py` or your script filename) and replace the following variables with the paths to your local dataset files:
train_path = r"YOUR_LOCAL_PATH_TO_train.csv"
test_path = r"YOUR_LOCAL_PATH_TO_test.csv"
- Trained with TF-IDF features (up to 10,000 features, uni-grams and bi-grams).
- Maximum iterations set to 1000.
- Uses the scikit-learn implementation with a fixed random seed for reproducibility.
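The Logistic Regression settings above could be wired together along these lines. This is a sketch under assumptions rather than the script's actual code: the toy `texts` and `labels` are invented, the seed value `42` is a guess (the README only says the seed is fixed), and using `make_pipeline` is a convenience chosen for this example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Invented toy corpus, two short documents per class
texts = [
    "president meets foreign leaders at summit",
    "team wins championship after extra time",
    "stock market rises as profits grow",
    "new smartphone chip boosts software speed",
    "united nations debates peace treaty",
    "striker scores twice in cup final",
    "central bank raises interest rates",
    "scientists launch satellite into orbit",
]
labels = [1, 2, 3, 4, 1, 2, 3, 4]

logreg_pipeline = make_pipeline(
    # Up to 10,000 features, uni-grams and bi-grams
    TfidfVectorizer(max_features=10_000, ngram_range=(1, 2)),
    # 1000 iterations; the seed value 42 is an assumption
    LogisticRegression(max_iter=1000, random_state=42),
)
logreg_pipeline.fit(texts, labels)

predictions = logreg_pipeline.predict(["stock market profits rise"])
print(predictions)
```

Keeping the vectorizer and classifier in one pipeline means the test set is transformed with the vocabulary learned from the training set only.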
- Input layer size matches TF-IDF feature size.
- Two hidden layers with 512 and 256 neurons respectively.
- Uses ReLU activation functions.
- Includes Dropout (0.5) and L2 regularization (0.01) to reduce overfitting.
- Output layer with softmax activation for multi-class classification.
- Optimizer: Adam.
- Loss: Sparse categorical crossentropy.
- Early stopping with patience of 3 epochs.
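A Keras model matching the description above might look like the following sketch. Several details are assumptions, since the README does not state them: the input size of 10,000 (the TF-IDF feature cap), Dropout after each hidden layer, L2 on the hidden layers' kernels, and `val_loss` as the early-stopping monitor. `build_model` is a hypothetical helper name.

```python
from tensorflow import keras
from tensorflow.keras import layers, regularizers

NUM_FEATURES = 10_000  # assumed to equal the TF-IDF feature cap
NUM_CLASSES = 4        # World, Sports, Business, Sci/Tech


def build_model(num_features=NUM_FEATURES, num_classes=NUM_CLASSES):
    """Feedforward network: 512 -> 256 -> softmax, with Dropout and L2."""
    model = keras.Sequential([
        keras.Input(shape=(num_features,)),
        layers.Dense(512, activation="relu",
                     kernel_regularizer=regularizers.l2(0.01)),
        layers.Dropout(0.5),
        layers.Dense(256, activation="relu",
                     kernel_regularizer=regularizers.l2(0.01)),
        layers.Dropout(0.5),
        layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(
        optimizer="adam",
        loss="sparse_categorical_crossentropy",
        metrics=["accuracy"],
    )
    return model


model = build_model()

# Stop training once the monitored metric has not improved for 3 epochs
early_stopping = keras.callbacks.EarlyStopping(monitor="val_loss", patience=3)
```

The callback would be passed to `model.fit(..., callbacks=[early_stopping])` together with validation data. With four softmax outputs and a sparse categorical loss, the integer labels need to lie in the range 0 to 3, so the dataset's 1 to 4 labels would have to be shifted by one.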
The models produced the following outputs and metrics:
- Class Distribution
- Logistic Regression Accuracy
- Logistic Regression Confusion Matrix
- Neural Network Accuracy Over Epochs
- Neural Network Training History
- Successful Predictions
- Word Cloud - Business
- Word Cloud - Sci/Tech
- Word Cloud - Sports
- Word Cloud - World
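The accuracy, classification report, and confusion matrix can be computed with scikit-learn roughly as shown below. The `y_true` and `y_pred` values are invented for illustration and are not results from this project.

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

CLASS_IDS = [1, 2, 3, 4]
CLASS_NAMES = ["World", "Sports", "Business", "Sci/Tech"]

# Invented labels: one World article is misclassified as Sports
y_true = [1, 2, 3, 4, 1, 2]
y_pred = [1, 2, 3, 4, 2, 2]

accuracy = accuracy_score(y_true, y_pred)
matrix = confusion_matrix(y_true, y_pred, labels=CLASS_IDS)
report = classification_report(
    y_true, y_pred, labels=CLASS_IDS, target_names=CLASS_NAMES, zero_division=0
)

print(f"Accuracy: {accuracy:.3f}")  # Accuracy: 0.833
print(matrix)  # rows are true classes, columns are predicted classes
print(report)
```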
- Bar plot for class distribution (`class_distribution.png`).
- Word clouds for each category (`wordcloud_<category>.png`).
- Confusion matrix heatmap for Logistic Regression (`logreg_confusion_matrix.png`).
- Neural network training accuracy and loss over epochs (`nn_training_history.png`).
All visualizations are saved automatically when you run the script.
This project is licensed under the MIT License - see the LICENSE file for details.