
Text Classification 🤖📥

This repository hosts the code for the Text Classifier used in our Study Bot Implementation.

The BERT-based classifiers are published under the nlpchallenges organization on Hugging Face. You can find the models specific to this project at nlpchallenges/Text-Classification and nlpchallenges/Text-Classification-Synthetic-Dataset.

Repository Structure

The structure of this repository is organized as follows:

└── 📁Text-Classification
    ├── README.md
    ├── requirements.txt
    ├── 📁assets [ℹ️ Assets used in README.md/notebooks]
    ├── 📁classifier-for-bot [ℹ️ The classifier built for the Data Chatbot]
    │   ├── 📁data
    │   └── 📁src
    │       ├── BERT-classification-synthetic.ipynb [ℹ️ Notebook for training the BERT-based classifier on the synthetic dataset]
    │       └── build_synthetic_dataset.py [ℹ️ Script to build the synthetic dataset for the classifier]
    ├── 📁data
    └── 📁src
        ├── 📁poc [ℹ️ Proof of concept for each of the 3 classifier variants]
        ├── build_custom_dataset.py
        ├── build_dataset.py [ℹ️ Script to build the dataset for the classifiers]
        └── classification.ipynb [ℹ️ Notebook for the 3 classifier experiments (NPR MC1)]

Setup

Prerequisites

  1. Clone the Repository: Clone this repository to your local machine.
  2. Python Environment: Create a new virtual Python environment to ensure isolated package management (we used Python 3.11.6).
  3. Installation: Navigate to the repository's root directory and install the required packages:
    pip install -r requirements.txt
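Assuming a Unix-like shell and the repository name NLP-Challenges/Text-Classification on GitHub, the steps above might look like this (adjust paths and the Python interpreter to your system):

```shell
# Clone the repository and enter it
git clone https://github.com/NLP-Challenges/Text-Classification.git
cd Text-Classification

# Create and activate an isolated virtual environment (we used Python 3.11.6)
python3 -m venv .venv
source .venv/bin/activate

# Install the required packages
pip install -r requirements.txt
```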

Testing Different Classifiers

First, we implemented three different classifier approaches and tested their performance on our constructed dataset. The three approaches are:

  • Linear-SVC
  • LSTM-CNN
  • BERT

For details on the implementation and the results, please refer to the classification.ipynb notebook.
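To illustrate the shape of the simplest of the three approaches, here is a minimal Linear-SVC baseline using scikit-learn. The toy texts and the `question`/`concern` labels are illustrative placeholders, not the project's actual dataset; the real experiments are in classification.ipynb.

```python
# Minimal sketch of a Linear-SVC text classifier: TF-IDF features
# fed into a linear support vector classifier. Toy data only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

texts = [
    "How do I compute gradients in PyTorch?",
    "What is a confusion matrix?",
    "I feel overwhelmed by the course workload.",
    "I am stressed about the upcoming exam.",
]
labels = ["question", "question", "concern", "concern"]

# Pipeline: raw text -> TF-IDF vectors -> linear SVM
clf = make_pipeline(TfidfVectorizer(), LinearSVC())
clf.fit(texts, labels)

print(clf.predict(["What does overfitting mean?"])[0])
```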

Building the Classifier for the Data Chatbot

Based on the results obtained comparing the three approaches, we decided to use BERT as the classifier for our Data Chatbot. We trained the classifier on a synthetic dataset, which we constructed using the build_synthetic_dataset.py script. The notebook BERT-classification-synthetic.ipynb contains the code for training the classifier on the synthetic dataset.
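A common way to generate such a synthetic dataset is to expand label-specific templates with varying fill-ins. The sketch below shows that pattern in pure Python; the templates, labels, and topics are invented for illustration, and the actual generation logic lives in build_synthetic_dataset.py.

```python
# Hedged sketch of template-based synthetic data generation.
# TEMPLATES and TOPICS are illustrative, not the project's real ones.
import itertools

TEMPLATES = {
    "question": ["What is {topic}?", "How does {topic} work?"],
    "concern": ["I am worried about {topic}.", "I feel stressed because of {topic}."],
}
TOPICS = ["the exam", "gradient descent", "the project deadline"]

def build_rows():
    """Cross every template with every topic to produce labeled examples."""
    rows = []
    for label, templates in TEMPLATES.items():
        for tpl, topic in itertools.product(templates, TOPICS):
            rows.append({"text": tpl.format(topic=topic), "label": label})
    return rows

rows = build_rows()
print(len(rows))  # 2 labels x 2 templates x 3 topics = 12
```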

The files hf_concern.txt, hf_question.txt, and hf_harm.txt can be used to add examples that failed during production use back into the training data. Whenever the classifier is trained, these files are included in the training data in addition to the synthetic dataset.
