# Fact-Check Classifier Training

Fine-tune a ModernBERT/BERT model with LoRA to classify prompts as `FACT_CHECK_NEEDED` or `NO_FACT_CHECK_NEEDED`.

## Quick Start: Training

### Prerequisites

```bash
# Install dependencies
pip install transformers datasets peft torch accelerate scikit-learn
```
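
The training time quoted below assumes a CUDA GPU. A quick, project-independent way to confirm PyTorch can see one before starting a long run:

```python
# Sanity check: is a CUDA device visible to PyTorch?
import torch

print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"Device: {torch.cuda.get_device_name(0)}")
```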

### Optional: Pre-download datasets

```bash
# Pre-download script-based datasets for faster training
./setup_datasets.sh ./datasets_cache
```

### Training Command

```bash
# Train with full dataset (50k samples, ~10 min on GPU)
python fact_check_bert_finetuning_lora.py \
  --mode train \
  --model modernbert-base \
  --max-samples 50000 \
  --epochs 3 \
  --batch-size 32 \
  --data-dir ./datasets_cache
```

## Training Options

| Option | Default | Description |
|--------|---------|-------------|
| `--model` | `bert-base-uncased` | Model: `modernbert-base`, `bert-base-uncased`, or `roberta-base` |
| `--max-samples` | 2000 | Total number of training samples (use 50000 for full training) |
| `--epochs` | 5 | Training epochs (3 is usually sufficient) |
| `--batch-size` | 16 | Batch size (use 32 for faster training if you have enough VRAM) |
| `--lora-rank` | 16 | LoRA rank |
| `--data-dir` | None | Path to cached datasets from `setup_datasets.sh` |
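
For orientation, what `--model` and `--lora-rank` control inside the training script looks roughly like the sketch below. This is a hedged reconstruction from the flags above, not a copy of `fact_check_bert_finetuning_lora.py`; the base-model ID, target modules, `lora_alpha`, dropout, and label order are assumptions.

```python
# Minimal sketch of a LoRA sequence-classification setup with PEFT.
# Assumptions: base model ID, target modules, alpha/dropout, and label order
# are illustrative; the actual training script may differ.
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from peft import LoraConfig, TaskType, get_peft_model

MODEL_NAME = "answerdotai/ModernBERT-base"  # assumed mapping for `--model modernbert-base`
LABELS = ["NO_FACT_CHECK_NEEDED", "FACT_CHECK_NEEDED"]  # assumed label order

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME,
    num_labels=len(LABELS),
    id2label=dict(enumerate(LABELS)),
    label2id={label: i for i, label in enumerate(LABELS)},
)

# LoRA adapters on the attention projections; rank matches `--lora-rank 16`.
lora_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,
    r=16,
    lora_alpha=32,
    lora_dropout=0.1,
    target_modules=["query", "key", "value"],  # assumption; module names vary by architecture
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapters and classifier head are trainable
```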

## Testing

### Test Command

```bash
# Test the trained model
python fact_check_bert_finetuning_lora.py \
  --mode test \
  --model-path lora_fact_check_classifier_modernbert-base_r16_model_rust
```
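
The saved adapter directory can also be loaded directly for single-prompt inference. A minimal sketch, assuming the adapter was trained with PEFT on top of ModernBERT-base and that labels follow the order used during training (both assumptions, not confirmed by the script):

```python
# Minimal sketch: load the trained LoRA adapter and classify one prompt.
# Assumptions: base model name and label order; adjust to match your training run.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from peft import PeftModel

BASE_MODEL = "answerdotai/ModernBERT-base"  # assumption: base used during training
ADAPTER_PATH = "lora_fact_check_classifier_modernbert-base_r16_model_rust"
LABELS = ["NO_FACT_CHECK_NEEDED", "FACT_CHECK_NEEDED"]  # assumption: training label order

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
base = AutoModelForSequenceClassification.from_pretrained(BASE_MODEL, num_labels=len(LABELS))
model = PeftModel.from_pretrained(base, ADAPTER_PATH)
model.eval()

prompt = "Who won the 2018 FIFA World Cup?"
inputs = tokenizer(prompt, return_tensors="pt", truncation=True)
with torch.no_grad():
    logits = model(**inputs).logits
print(LABELS[logits.argmax(dim=-1).item()])  # expected: FACT_CHECK_NEEDED
```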

## Datasets Used

### FACT_CHECK_NEEDED

- **NISQ-ISQ** - Information-Seeking Questions (gold-standard dataset, ACL LREC 2024)
- **HaluEval** - QA questions from hallucination benchmark (ACL EMNLP 2023)
- **FaithDial** - Information-seeking dialogue questions (TACL 2022)
- **FactCHD** - Fact-conflicting hallucination queries (Chen et al., 2024)
- **RAG** - Questions for retrieval-augmented generation (neural-bridge/rag-dataset-12000)
- **SQuAD** - Stanford Question Answering Dataset (100k+ Wikipedia fact questions)
- **TriviaQA** - Factual trivia questions (650k question-answer-evidence triples)
- **TruthfulQA** - High-risk factual queries about common misconceptions
- **HotpotQA** - Multi-hop factual reasoning questions
- **CoQA** - Conversational factual questions (127k questions across domains)
- **QASPER** - Information-seeking questions over research papers (NAACL 2021)
- **ELI5** - "Explain Like I'm 5" factual explanation questions
- **Natural Questions** - Google Natural Questions (real user queries)

### NO_FACT_CHECK_NEEDED

- **NISQ-NonISQ** - Non-Information-Seeking Questions (gold-standard dataset)
- **Dolly** - Creative writing, brainstorming, opinion (helps with edge cases)
- **WritingPrompts** - Creative writing prompts from Reddit (300k prompts)
- **Alpaca** - Non-factual instructions (coding, creative, math, opinion)
- **CodeSearchNet** - Programming/technical requests (code documentation)
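
All of these sources reduce to one binary labeling rule: prompts from the first group are labeled `FACT_CHECK_NEEDED`, prompts from the second group `NO_FACT_CHECK_NEEDED`. A hedged sketch of combining two of the public sources above this way (field and category names come from the public dataset cards; the actual script's sampling, cleaning, and balancing will differ):

```python
# Illustrative only: build a small binary-labeled set from two of the sources above.
# SQuAD questions -> FACT_CHECK_NEEDED (1); Dolly creative/brainstorming prompts -> NO_FACT_CHECK_NEEDED (0).
from datasets import Dataset, load_dataset

squad = load_dataset("squad", split="train[:1000]")
dolly = load_dataset("databricks/databricks-dolly-15k", split="train")

examples = [{"text": question, "label": 1} for question in squad["question"]]
examples += [
    {"text": row["instruction"], "label": 0}
    for row in dolly
    if row["category"] in ("creative_writing", "brainstorming")
][:1000]

dataset = Dataset.from_list(examples).shuffle(seed=42)
print(dataset[0])
```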