Commit 249812f — authored by yuezhu1 and rootfs

Add fact_check_fine_tuning_lora directory (#810)

- Add LoRA fine-tuning module for binary fact-check classification
- Includes training script (fact_check_bert_finetuning_lora.py) with BERT/RoBERTa/ModernBERT support
- Add dataset setup script (setup_datasets.sh) for pre-downloading datasets
- Add README with usage instructions and dataset documentation
- Classifies prompts as FACT_CHECK_NEEDED or NO_FACT_CHECK_NEEDED

Signed-off-by: Yue Zhu <[email protected]>
Co-authored-by: Huamin Chen <[email protected]>

1 parent: b63f368

File tree: 3 files changed, +1743 −0 lines

README: 80 additions, 0 deletions
# Fact-Check Classifier Training

Fine-tune a ModernBERT/BERT model with LoRA to classify prompts as `FACT_CHECK_NEEDED` or `NO_FACT_CHECK_NEEDED`.

## Quick Start: Training

### Prerequisites

```bash
# Install dependencies
pip install transformers datasets peft torch accelerate scikit-learn
```

### Optional: Pre-download Datasets

```bash
# Pre-download script-based datasets for faster training
./setup_datasets.sh ./datasets_cache
```

### Training Command

```bash
# Train with the full dataset (50k samples, ~10 min on a GPU)
python fact_check_bert_finetuning_lora.py \
    --mode train \
    --model modernbert-base \
    --max-samples 50000 \
    --epochs 3 \
    --batch-size 32 \
    --data-dir ./datasets_cache
```
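As a rough sanity check on run length, the command above works out to about 1,500 optimizer steps per epoch (assuming all 50k samples go to training; the script may hold some out for evaluation):

```python
import math

samples = 50_000   # --max-samples
batch_size = 32    # --batch-size
epochs = 3         # --epochs

steps_per_epoch = math.ceil(samples / batch_size)  # 1563
total_steps = steps_per_epoch * epochs             # 4689
print(steps_per_epoch, total_steps)
```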

## Training Options

| Option | Default | Description |
|--------|---------|-------------|
| `--model` | `bert-base-uncased` | Base model: `modernbert-base`, `bert-base-uncased`, or `roberta-base` |
| `--max-samples` | 2000 | Total number of samples (use 50000 for a full training run) |
| `--epochs` | 5 | Number of training epochs (3 is usually sufficient) |
| `--batch-size` | 16 | Batch size (32 trains faster given enough VRAM) |
| `--lora-rank` | 16 | Rank of the LoRA update matrices |
| `--data-dir` | None | Path to datasets cached by `setup_datasets.sh` |
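For intuition on `--lora-rank`: LoRA freezes each pretrained weight matrix `W` and learns a low-rank update `B @ A` in its place, which is why a small rank trains so few parameters. A minimal NumPy sketch (the 768 hidden size matches BERT-base; the `alpha` scaling and zero-init of `B` follow the standard LoRA recipe, not necessarily this script's exact settings):

```python
import numpy as np

d_out, d_in, rank, alpha = 768, 768, 16, 32
W = np.random.randn(d_out, d_in)         # frozen pretrained weight
A = np.random.randn(rank, d_in) * 0.01   # trainable, small random init
B = np.zeros((d_out, rank))              # trainable, zero init

delta = (alpha / rank) * B @ A           # low-rank update, zero at init
W_adapted = W + delta                    # identical to W before training

# Trainable parameters drop from d_out*d_in to rank*(d_out + d_in):
full_params = d_out * d_in               # 589,824
lora_params = rank * (d_out + d_in)      # 24,576 (~4% of full)
```

At initialization `W_adapted == W`, so fine-tuning starts exactly from the pretrained model; only `A` and `B` receive gradients.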

## Testing

### Test Command

```bash
# Test the trained model
python fact_check_bert_finetuning_lora.py \
    --mode test \
    --model-path lora_fact_check_classifier_modernbert-base_r16_model_rust
```
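At inference time the model reduces each prompt to two logits, one per class. A minimal sketch of that final decision step (the label order here is an assumption; the authoritative id-to-label mapping lives in the training script):

```python
import math

LABELS = ["NO_FACT_CHECK_NEEDED", "FACT_CHECK_NEEDED"]  # assumed id order

def classify(logits):
    """Softmax over the two class logits; return (label, confidence)."""
    exps = [math.exp(x - max(logits)) for x in logits]  # subtract max for stability
    total = sum(exps)
    probs = [e / total for e in exps]
    idx = probs.index(max(probs))
    return LABELS[idx], probs[idx]

label, conf = classify([-1.2, 2.3])  # example logits from the model head
```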

## Datasets Used

### FACT_CHECK_NEEDED

- **NISQ-ISQ** - Information-Seeking Questions (gold-standard dataset, LREC 2024)
- **HaluEval** - QA questions from hallucination benchmark (EMNLP 2023)
- **FaithDial** - Information-seeking dialogue questions (TACL 2022)
- **FactCHD** - Fact-conflicting hallucination queries (Chen et al., 2024)
- **RAG** - Questions for retrieval-augmented generation (neural-bridge/rag-dataset-12000)
- **SQuAD** - Stanford Question Answering Dataset (100k+ Wikipedia fact questions)
- **TriviaQA** - Factual trivia questions (650k question-answer-evidence triples)
- **TruthfulQA** - High-risk factual queries about common misconceptions
- **HotpotQA** - Multi-hop factual reasoning questions
- **CoQA** - Conversational factual questions (127k questions across domains)
- **QASPER** - Information-seeking questions over research papers (NAACL 2021)
- **ELI5** - "Explain Like I'm 5" factual explanation questions
- **Natural Questions** - Google Natural Questions (real user queries)

### NO_FACT_CHECK_NEEDED

- **NISQ-NonISQ** - Non-Information-Seeking Questions (gold-standard dataset)
- **Dolly** - Creative writing, brainstorming, and opinion prompts (helps with edge cases)
- **WritingPrompts** - Creative writing prompts from Reddit (300k prompts)
- **Alpaca** - Non-factual instructions (coding, creative, math, opinion)
- **CodeSearchNet** - Programming/technical requests (code documentation)
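The training set is formed by pooling prompts from the two groups above under their respective labels, then shuffling and capping at `--max-samples`. A minimal sketch of that labeling step (function name and record layout are illustrative, not the script's actual loader):

```python
import random

def build_labeled_dataset(fact_check_prompts, no_fact_check_prompts,
                          max_samples, seed=42):
    """Pool prompts from both groups, attach labels, shuffle, cap the size."""
    examples = (
        [{"text": p, "label": "FACT_CHECK_NEEDED"} for p in fact_check_prompts]
        + [{"text": p, "label": "NO_FACT_CHECK_NEEDED"} for p in no_fact_check_prompts]
    )
    random.Random(seed).shuffle(examples)  # fixed seed for reproducibility
    return examples[:max_samples]

data = build_labeled_dataset(
    ["Who wrote Hamlet?", "When did World War II end?"],
    ["Write a poem about rain.", "Brainstorm startup names."],
    max_samples=3,
)
```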
