# Dingo Factuality Assessment - Complete Guide

This guide explains how to use Dingo's integrated factuality assessment features to evaluate the factual accuracy of LLM-generated content.

## 🎯 Feature Overview

Factuality assessment evaluates whether LLM-generated responses contain factual errors or unverifiable claims. It is particularly useful for:

- **Content Quality Control**: Verify accuracy of generated content
- **Knowledge Base Validation**: Ensure knowledge base information is accurate
- **Training Data Filtering**: Filter out factually incorrect training samples
- **Real-time Output Verification**: Check factual accuracy of model outputs

## 🔧 Core Principles

### Evaluation Process

1. **Claim Extraction**: Break down the response into independent factual claims
2. **Fact Verification**: Verify each claim against reference materials or a knowledge base
3. **Score Calculation**: Calculate an overall factuality score
4. **Issue Identification**: Identify specific factual errors

### Scoring Mechanism

- **Score Range**: 0.0 - 10.0
- **Score Meaning**:
  - 8.0-10.0 = High factual accuracy
  - 5.0-7.9 = Moderate accuracy, some errors
  - 0.0-4.9 = Low accuracy, significant errors
- **Default Threshold**: 5.0 (configurable)

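For intuition, the sketch below shows how a claim-based score of this kind could be computed and compared against the threshold. The `Claim` structure and the simple verified-claims ratio are illustrative assumptions for this guide, not Dingo's internal implementation.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Claim:
    text: str
    verified: bool  # True if the claim is supported by the reference materials

def factuality_score(claims: List[Claim]) -> float:
    """Illustrative scoring: fraction of verified claims, scaled to 0-10."""
    if not claims:
        return 0.0
    return 10.0 * sum(c.verified for c in claims) / len(claims)

claims = [
    Claim("Python was first released in 1991.", verified=True),
    Claim("Python was created by Guido van Rossum.", verified=True),
    Claim("Python was created at MIT.", verified=False),
]

score = factuality_score(claims)
threshold = 5.0  # default threshold
print(f"Score: {score:.1f}, passed: {score >= threshold}")  # Score: 6.7, passed: True
```
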
## 📋 Usage Requirements

### Data Format Requirements

```python
from dingo.io.input import Data

data = Data(
    data_id="test_1",
    prompt="User's question",  # Original question (optional)
    content="LLM's response",  # Response to assess
    context=["Reference material 1", "Reference material 2"]  # Reference materials (optional but recommended)
)
```

## 🚀 Quick Start

### SDK Mode - Single Assessment

```python
import os
from dingo.config.input_args import EvaluatorLLMArgs
from dingo.io.input import Data
from dingo.model.llm.llm_factcheck import LLMFactCheck

# Configure LLM
LLMFactCheck.dynamic_config = EvaluatorLLMArgs(
    key=os.getenv("OPENAI_API_KEY"),
    api_url=os.getenv("OPENAI_BASE_URL", "https://api.openai.com/v1"),
    model=os.getenv("OPENAI_MODEL", "gpt-4o-mini"),
    parameters={"threshold": 5.0}
)

# Prepare data
data = Data(
    data_id="test_1",
    prompt="When was Python released?",
    content="Python was released in 1991 by Guido van Rossum.",
    context=["Python was created by Guido van Rossum.", "Python was first released in 1991."]
)

# Execute assessment
result = LLMFactCheck.eval(data)

# View results
print(f"Score: {result.score}/10")
print(f"Has issues: {result.status}")  # True = below threshold, False = passed
print(f"Reason: {result.reason[0]}")
```

### Dataset Mode - Batch Assessment

```python
from dingo.config import InputArgs
from dingo.exec import Executor

input_data = {
    "task_name": "factuality_assessment",
    "input_path": "test/data/responses.jsonl",
    "output_path": "outputs/",
    "dataset": {"source": "local", "format": "jsonl"},
    "executor": {
        "max_workers": 10,
        "result_save": {"good": True, "bad": True, "all_labels": True}
    },
    "evaluator": [
        {
            "fields": {
                "prompt": "question",
                "content": "response",
                "context": "references"
            },
            "evals": [
                {
                    "name": "LLMFactCheck",
                    "config": {
                        "model": "gpt-4o-mini",
                        "key": "YOUR_API_KEY",
                        "api_url": "https://api.openai.com/v1",
                        "parameters": {"threshold": 5.0}
                    }
                }
            ]
        }
    ]
}

input_args = InputArgs(**input_data)
executor = Executor.exec_map["local"](input_args)
summary = executor.execute()

print(f"Total: {summary.total}")
print(f"Passed: {summary.num_good}")
print(f"Issues: {summary.num_bad}")
print(f"Pass rate: {summary.score}%")
```

### Data File Format (JSONL)

```jsonl
{"question": "When was Python released?", "response": "Python was released in 1991 by Guido van Rossum.", "references": ["Python was created by Guido van Rossum.", "Python first appeared in 1991."]}
{"question": "What is the capital of France?", "response": "The capital of France is Paris.", "references": ["Paris is the capital and largest city of France."]}
```

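If you build this file programmatically, one JSON object per line is all that is needed. Below is a minimal sketch using only the Python standard library; the records mirror the example above, and the keys (`question`, `response`, `references`) must match the `fields` mapping in the evaluator configuration.

```python
import json

records = [
    {
        "question": "When was Python released?",
        "response": "Python was released in 1991 by Guido van Rossum.",
        "references": ["Python was created by Guido van Rossum.", "Python first appeared in 1991."],
    },
    {
        "question": "What is the capital of France?",
        "response": "The capital of France is Paris.",
        "references": ["Paris is the capital and largest city of France."],
    },
]

# Write one JSON object per line (JSONL)
with open("test/data/responses.jsonl", "w", encoding="utf-8") as f:
    for record in records:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
```
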
## ⚙️ Configuration Options

### Threshold Adjustment

```python
LLMFactCheck.dynamic_config = EvaluatorLLMArgs(
    key="YOUR_API_KEY",
    api_url="https://api.openai.com/v1",
    model="gpt-4o-mini",
    parameters={"threshold": 5.0}  # Range: 0.0-10.0
)
```

**Threshold Recommendations**:
- **Strict scenarios** (medical, legal): threshold 7.0-8.0
- **General scenarios** (Q&A, documentation): threshold 5.0-6.0
- **Loose scenarios** (creative content, brainstorming): threshold 3.0-4.0

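One way to apply these recommendations in code is a simple scenario-to-threshold mapping. The scenario names and values below are illustrative choices for this guide, not part of Dingo's API:

```python
from dingo.config.input_args import EvaluatorLLMArgs
from dingo.model.llm.llm_factcheck import LLMFactCheck

# Illustrative mapping from use case to recommended threshold
SCENARIO_THRESHOLDS = {
    "medical": 8.0,        # strict
    "documentation": 5.5,  # general
    "creative": 3.5,       # loose
}

scenario = "documentation"
LLMFactCheck.dynamic_config = EvaluatorLLMArgs(
    key="YOUR_API_KEY",
    api_url="https://api.openai.com/v1",
    model="gpt-4o-mini",
    parameters={"threshold": SCENARIO_THRESHOLDS[scenario]}
)
```
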
### Model Selection

```python
# Option 1: GPT-4o (highest accuracy, higher cost)
LLMFactCheck.dynamic_config = EvaluatorLLMArgs(
    model="gpt-4o",
    key="YOUR_API_KEY",
    api_url="https://api.openai.com/v1"
)

# Option 2: GPT-4o-mini (balanced, recommended)
LLMFactCheck.dynamic_config = EvaluatorLLMArgs(
    model="gpt-4o-mini",
    key="YOUR_API_KEY",
    api_url="https://api.openai.com/v1"
)

# Option 3: Alternative LLM (DeepSeek, etc.)
LLMFactCheck.dynamic_config = EvaluatorLLMArgs(
    model="deepseek-chat",
    key="YOUR_API_KEY",
    api_url="https://api.deepseek.com"
)
```

## 📊 Output Format

### SDK Mode Output

```python
result = LLMFactCheck.eval(data)

# Basic information
result.score   # Score: 0.0-10.0
result.status  # Has issues: True (below threshold) / False (passed)
result.label   # Labels: ["QUALITY_GOOD.FACTCHECK_PASS"] or ["QUALITY_BAD.FACTCHECK_FAIL"]
result.reason  # Detailed reasons
result.metric  # Metric name: "LLMFactCheck"
```

**Output Example (Passed)**:
```python
result.score = 8.5
result.status = False  # False = passed
result.label = ["QUALITY_GOOD.FACTCHECK_PASS"]
result.reason = ["Factual accuracy assessment passed (score: 8.5/10). All claims verified: Python was released in 1991, Creator is Guido van Rossum."]
```

**Output Example (Failed)**:
```python
result.score = 3.2
result.status = True  # True = failed
result.label = ["QUALITY_BAD.FACTCHECK_FAIL"]
result.reason = ["Factual accuracy assessment failed (score: 3.2/10). Errors detected: Python was not released in 1995 (correct: 1991)"]
```

## 🌟 Best Practices

### 1. Provide High-Quality Reference Materials

**Good References**:
```python
context = [
    "Python was created by Guido van Rossum and first released in February 1991.",
    "Python is an interpreted, high-level programming language.",
    "Python 2.0 was released in 2000, Python 3.0 was released in 2008."
]
```

**Poor References**:
```python
context = [
    "Python",  # Too brief
    "Python is a programming language"  # Lacks details
]
```

### 2. Suitable Use Cases

**✅ Suitable for**:
- Verifiable factual claims (dates, names, numbers, events)
- Historical facts
- Technical specifications
- Statistical data

**❌ Not suitable for**:
- Subjective opinions
- Future predictions
- Creative content
- Open-ended questions

### 3. Combined Use with Other Metrics

```python
"evaluator": [
    {
        "fields": {
            "prompt": "user_input",
            "content": "response",
            "context": "retrieved_contexts"
        },
        "evals": [
            {"name": "LLMRAGFaithfulness"},     # Answer faithfulness
            {"name": "LLMFactCheck"},           # Factual accuracy
            {"name": "RuleHallucinationHHEM"}   # Hallucination detection
        ]
    }
]
```

### 4. Iterative Optimization

1. **Initial Testing**: Use the default threshold (5.0)
2. **Analyze Results**: Review false positives and false negatives
3. **Adjust Threshold**: Fine-tune based on business requirements (see the threshold-sweep sketch below)
4. **Re-validate**: Test with the new threshold

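The sweep itself can be done offline once evaluation scores are collected. The sketch below assumes you have gathered `(score, is_actually_correct)` pairs from a labeled validation set; it is illustrative and not part of Dingo's API:

```python
# (score, is_actually_correct) pairs collected from a labeled validation set
validation = [(8.5, True), (6.2, True), (4.1, False), (7.0, False), (2.5, False)]

def evaluate_threshold(threshold: float):
    # False positive: flagged as having issues (score below threshold) although actually correct
    false_positives = sum(1 for score, correct in validation if score < threshold and correct)
    # False negative: passed (score at or above threshold) although actually wrong
    false_negatives = sum(1 for score, correct in validation if score >= threshold and not correct)
    return false_positives, false_negatives

for threshold in (3.0, 5.0, 7.0):
    fp, fn = evaluate_threshold(threshold)
    print(f"threshold={threshold}: flagged-but-correct={fp}, passed-but-wrong={fn}")
```
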
## 📈 Metric Comparison

| Metric | Purpose | Score Range | Requires Reference | Best For |
|--------|---------|-------------|--------------------|----------|
| **Factuality** | Verify factual accuracy | 0-10 | Optional (recommended) | Fact verification, knowledge base validation |
| **Faithfulness** | Check if based on context | 0-10 | Yes | RAG systems, prevent hallucinations |
| **Hallucination** | Detect contradictions with context | 0-1 | Yes | Fast hallucination detection |

**Recommendations**:
- **RAG evaluation**: Combine Faithfulness + Hallucination + Factuality
- **Content generation**: Use Factuality alone
- **Real-time verification**: Prioritize Hallucination (fast) or Faithfulness

## ❓ FAQ

### Q1: What is the difference between Factuality and Faithfulness?

- **Factuality**: Verifies whether content is factually correct (can use external knowledge)
- **Faithfulness**: Checks whether the response is based on the provided context (only looks at the context-response relationship)

### Q2: What if no reference materials are provided?

The LLM will use its internal knowledge for verification, but accuracy may be lower. **Recommendation**: Always provide reference materials for best results.

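For example, assuming the SDK setup from the Quick Start section, an assessment without references simply omits `context`:

```python
data = Data(
    data_id="no_context_1",
    prompt="When was Python released?",
    content="Python was released in 1991 by Guido van Rossum."
    # No context: verification relies on the evaluator LLM's internal knowledge
)

result = LLMFactCheck.eval(data)
print(result.score, result.reason[0])
```
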
### Q3: How to handle domain-specific facts?

1. Provide domain-specific reference materials in `context`
2. Use domain-specific LLM models
3. Lower the threshold to reduce false positives

### Q4: How to interpret scores?

- **8.0-10.0**: High accuracy, all or most facts verified
- **5.0-7.9**: Moderate accuracy, some errors or unverifiable claims
- **3.0-4.9**: Low accuracy, multiple errors
- **0.0-2.9**: Very low accuracy, serious factual errors

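If you want to turn these bands into labels for reporting or filtering, a small hypothetical helper could look like this:

```python
def interpret_score(score: float) -> str:
    """Map a 0-10 factuality score to the bands described above (illustrative helper)."""
    if score >= 8.0:
        return "high accuracy"
    if score >= 5.0:
        return "moderate accuracy"
    if score >= 3.0:
        return "low accuracy"
    return "very low accuracy"

print(interpret_score(8.5))  # high accuracy
print(interpret_score(3.2))  # low accuracy
```
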
## 📖 Related Documents

- [RAG Evaluation Metrics Guide](rag_evaluation_metrics.md)
- [Hallucination Detection Guide](hallucination_detection_guide.md)
- [Response Quality Evaluation](../README.md#evaluation-metrics)

## 📝 Example Scenarios

### Scenario 1: Verify Historical Facts

```python
data = Data(
    content="Python was released in 1991 by Guido van Rossum at CWI in the Netherlands.",
    context=[
        "Python was created by Guido van Rossum.",
        "Python was first released in February 1991.",
        "Guido van Rossum began working on Python at CWI."
    ]
)

result = LLMFactCheck.eval(data)
# Expected: High score (>8.0), all facts verified
```

### Scenario 2: Detect Factual Errors

```python
data = Data(
    content="Python was released in 1995 by James Gosling.",  # Wrong year and author
    context=[
        "Python was created by Guido van Rossum.",
        "Python was first released in 1991."
    ]
)

result = LLMFactCheck.eval(data)
# Expected: Low score (<4.0), multiple errors detected
```

### Scenario 3: Assess Partially Correct Content

```python
data = Data(
    content="Python 3.0 was released in 2008. It introduced many breaking changes and removed backward compatibility with Python 2.x.",
    context=[
        "Python 3.0 was released on December 3, 2008.",
        "Python 3.0 was not backward compatible with Python 2.x series."
    ]
)

result = LLMFactCheck.eval(data)
# Expected: High score (7-9), facts mostly correct with minor imprecisions
```

### Scenario 4: Handle Unverifiable Claims

```python
data = Data(
    content="Python will become the most popular programming language in 2030.",  # Future prediction
    context=["Python is currently one of the most popular programming languages."]
)

result = LLMFactCheck.eval(data)
# Expected: Moderate score (4-6), future prediction cannot be verified
```