This repository hosts the code for the Text Classifier used in our Study Bot Implementation.
The BERT-based classifiers are published under the nlpchallenges organization on Hugging Face. You can find the models specific to this project at nlpchallenges/Text-Classification and nlpchallenges/Text-Classification-Synthetic-Dataset.
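If you only want to use the published model, a minimal inference sketch with the transformers pipeline might look like this (the returned label names depend on the model's configuration on the Hub):

```python
# Minimal sketch: load the published classifier from the Hugging Face Hub.
# The label names in the output depend on the model's config on the Hub.
from transformers import pipeline

classifier = pipeline("text-classification", model="nlpchallenges/Text-Classification")
print(classifier("How do I prepare for the statistics exam?"))
```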
This repository is organized as follows:
```
└── 📁Text-Classification
    └── README.md
    └── requirements.txt
    └── 📁assets [ℹ️ Assets used in README.md/Notebooks]
    └── 📁classifier-for-bot [ℹ️ The Classifier built for the Data Chatbot]
        └── 📁data
        └── 📁src
            └── BERT-classification-synthetic.ipynb [ℹ️ Notebook for training BERT-based classifier on the synthetic dataset]
            └── build_synthetic_dataset.py [ℹ️ Script to build the synthetic dataset for the classifier]
    └── 📁data
    └── 📁src
        └── 📁poc [ℹ️ Proof of Concept for each of the 3 classifier variants]
        └── build_custom_dataset.py
        └── build_dataset.py [ℹ️ Script to build the dataset for the classifiers]
        └── classification.ipynb [ℹ️ Notebook for 3 classifier experiments (NPR MC1)]
```
- Clone the Repository: Clone this repository to your local machine.
- Python Environment: Create a new virtual Python environment to ensure isolated package management (we used Python 3.11.6).
- Installation: Navigate to the repository's root directory and install the required packages:

  ```bash
  pip install -r requirements.txt
  ```
First, we implemented three different classifier approaches and evaluated their performance on the dataset we constructed. The three approaches are:
- Linear-SVC
- LSTM-CNN
- BERT
For details on the implementation and the results, please refer to the classification.ipynb notebook.
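As a rough illustration of the simplest variant, a Linear-SVC baseline typically follows the pattern below. This is a hedged sketch, not the repository's implementation (see classification.ipynb for that); the dataset path and column names are assumptions.

```python
# Hypothetical Linear-SVC baseline: TF-IDF features feeding a linear SVM.
# The file path and the "text"/"label" column names are assumptions.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

df = pd.read_csv("data/dataset.csv")  # assumed dataset layout
X_train, X_test, y_train, y_test = train_test_split(
    df["text"], df["label"], test_size=0.2, random_state=42, stratify=df["label"]
)

# Unigram + bigram TF-IDF features, classified with a linear SVC.
model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2), min_df=2), LinearSVC())
model.fit(X_train, y_train)
print(classification_report(y_test, model.predict(X_test)))
```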
Based on the results of this comparison, we decided to use BERT as the classifier for our Data Chatbot. We trained the classifier on a synthetic dataset, constructed with the build_synthetic_dataset.py script. The notebook BERT-classification-synthetic.ipynb contains the code for training the classifier on this dataset.
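For orientation, fine-tuning a BERT classifier with the Hugging Face Trainer generally follows the pattern below. This is a hedged sketch, not the notebook's exact code: the base checkpoint, file path, column names, label count, and hyperparameters are all assumptions.

```python
# Hedged sketch of fine-tuning a BERT classifier on a synthetic dataset.
# Checkpoint, paths, column names, label count, and hyperparameters are assumptions.
import pandas as pd
from datasets import Dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

# Assumed layout: a "text" column and a "label" column holding integer class ids.
df = pd.read_csv("data/synthetic_dataset.csv")
dataset = Dataset.from_pandas(df).train_test_split(test_size=0.2, seed=42)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)

dataset = dataset.map(tokenize, batched=True)

model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=3)
args = TrainingArguments(
    output_dir="bert-synthetic",
    num_train_epochs=3,
    per_device_train_batch_size=16,
)
Trainer(
    model=model,
    args=args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
).train()
```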
The files hf_concern.txt, hf_question.txt, and hf_harm.txt can be used to add further examples that failed during production use to the training data. Whenever the classifier is trained, these files are included in the training data in addition to the synthetic dataset.
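As a hedged illustration of how such a merge could work, the sketch below assumes each file holds one example per line and maps to the label implied by its filename; the file format, label names, and dataset path are assumptions, not confirmed by the repository.

```python
# Hypothetical merge of the hf_*.txt feedback files into the training data.
# One-example-per-line format and the label-per-file mapping are assumptions.
from pathlib import Path

import pandas as pd

FEEDBACK_FILES = {
    "hf_concern.txt": "concern",
    "hf_question.txt": "question",
    "hf_harm.txt": "harm",
}

def load_feedback(directory: str = ".") -> pd.DataFrame:
    """Read each feedback file and tag its lines with the label implied by the filename."""
    rows = []
    for filename, label in FEEDBACK_FILES.items():
        path = Path(directory) / filename
        if not path.exists():
            continue
        for line in path.read_text(encoding="utf-8").splitlines():
            if line.strip():
                rows.append({"text": line.strip(), "label": label})
    return pd.DataFrame(rows)

# Append the feedback examples to the synthetic dataset before training.
synthetic = pd.read_csv("data/synthetic_dataset.csv")  # assumed path
train_df = pd.concat([synthetic, load_feedback()], ignore_index=True)
```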