Commit cd29c96

committed
update readme
1 parent 252a950 commit cd29c96

File tree

4 files changed: +1181, -3 lines changed

README.md

Lines changed: 3 additions & 3 deletions
@@ -298,9 +298,9 @@ Dingo provides **70+ evaluation metrics** across multiple dimensions, combining
  | **Security** | PII detection, Perspective API toxicity | Privacy and safety |

  📊 **[View Complete Metrics Documentation →](docs/metrics.md)**
- 📖 **[RAG Evaluation Guide (中文) →](docs/rag_evaluation_metrics_zh.md)**
- 🔍 **[Hallucination Detection Guide (中文) →](docs/hallucination_guide.md)**
- **[Factuality Assessment Guide (中文) →](docs/factcheck_guide.md)**
+ 📖 **[RAG Evaluation Guide ](docs/rag_evaluation_metrics.md)** | **[中文版](docs/rag_evaluation_metrics_zh.md)**
+ 🔍 **[Hallucination Detection Guide ](docs/hallucination_detection_guide.md)** | **[中文版](docs/hallucination_guide.md)**
+ **[Factuality Assessment Guide ](docs/factuality_assessment_guide.md)** | **[中文版](docs/factcheck_guide.md)**

  Most metrics are backed by academic research to ensure scientific rigor.

Lines changed: 372 additions & 0 deletions
@@ -0,0 +1,372 @@

# Dingo Factuality Assessment - Complete Guide

This guide explains how to use Dingo's integrated factuality assessment features to evaluate the factual accuracy of LLM-generated content.

## 🎯 Feature Overview

Factuality assessment evaluates whether LLM-generated responses contain factual errors or unverifiable claims. It is particularly useful for:

- **Content Quality Control**: Verify the accuracy of generated content
- **Knowledge Base Validation**: Ensure knowledge base information is accurate
- **Training Data Filtering**: Filter out factually incorrect training samples
- **Real-time Output Verification**: Check the factual accuracy of model outputs

## 🔧 Core Principles

### Evaluation Process

1. **Claim Extraction**: Break the response down into independent factual claims
2. **Fact Verification**: Verify each claim against the reference materials or knowledge base
3. **Score Calculation**: Calculate an overall factuality score
4. **Issue Identification**: Identify the specific factual errors (see the sketch after this list)
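
The sketch below is purely illustrative of these four steps: the real metric delegates claim extraction and verification to the configured judge LLM, so the two helper functions here are naive stand-ins, and `ClaimVerdict` is a hypothetical structure rather than a Dingo type.

```python
from dataclasses import dataclass


@dataclass
class ClaimVerdict:
    claim: str
    supported: bool


def extract_claims(response: str) -> list[str]:
    # Step 1 (stand-in): split the response into sentence-level claims.
    return [s.strip() for s in response.split(".") if s.strip()]


def verify_claim(claim: str, references: list[str]) -> bool:
    # Step 2 (stand-in): crude word-overlap check against the references;
    # the actual metric asks the judge LLM to verify each claim.
    words = set(claim.lower().split())
    return any(len(words & set(ref.lower().split())) >= 3 for ref in references)


def factuality_sketch(response: str, references: list[str]) -> tuple[float, list[ClaimVerdict]]:
    verdicts = [ClaimVerdict(c, verify_claim(c, references)) for c in extract_claims(response)]
    if not verdicts:
        return 0.0, []
    # Step 3: scale the share of supported claims onto the 0-10 range.
    score = 10.0 * sum(v.supported for v in verdicts) / len(verdicts)
    # Step 4: report the unsupported claims as issues.
    return score, [v for v in verdicts if not v.supported]


score, issues = factuality_sketch(
    "Python was released in 1991 by Guido van Rossum.",
    ["Python was created by Guido van Rossum.", "Python was first released in 1991."],
)
print(round(score, 1), [v.claim for v in issues])
```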

### Scoring Mechanism

- **Score Range**: 0.0 - 10.0
- **Score Meaning**:
  - 8.0-10.0 = High factual accuracy
  - 5.0-7.9 = Moderate accuracy, some errors
  - 0.0-4.9 = Low accuracy, significant errors
- **Default Threshold**: 5.0 (configurable; a small helper sketch follows below)
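
For reference, a small hypothetical helper (not part of Dingo) that maps a score onto these bands and onto the pass/flag decision implied by the threshold; exact boundary handling inside Dingo may differ:

```python
# Hypothetical convenience function mirroring the rubric above.
def describe_score(score: float, threshold: float = 5.0) -> str:
    if score >= 8.0:
        band = "high factual accuracy"
    elif score >= 5.0:
        band = "moderate accuracy, some errors"
    else:
        band = "low accuracy, significant errors"
    verdict = "passes" if score >= threshold else "is flagged"
    return f"{score:.1f}/10 ({band}) -- {verdict} at threshold {threshold}"


print(describe_score(8.5))  # 8.5/10 (high factual accuracy) -- passes at threshold 5.0
print(describe_score(3.2))  # 3.2/10 (low accuracy, significant errors) -- is flagged at threshold 5.0
```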

## 📋 Usage Requirements

### Data Format Requirements

```python
from dingo.io.input import Data

data = Data(
    data_id="test_1",
    prompt="User's question",    # Original question (optional)
    content="LLM's response",    # Response to assess
    context=["Reference material 1", "Reference material 2"]  # Reference materials (optional but recommended)
)
```

## 🚀 Quick Start

### SDK Mode - Single Assessment

```python
import os

from dingo.config.input_args import EvaluatorLLMArgs
from dingo.io.input import Data
from dingo.model.llm.llm_factcheck import LLMFactCheck

# Configure LLM
LLMFactCheck.dynamic_config = EvaluatorLLMArgs(
    key=os.getenv("OPENAI_API_KEY"),
    api_url=os.getenv("OPENAI_BASE_URL", "https://api.openai.com/v1"),
    model=os.getenv("OPENAI_MODEL", "gpt-4o-mini"),
    parameters={"threshold": 5.0}
)

# Prepare data
data = Data(
    data_id="test_1",
    prompt="When was Python released?",
    content="Python was released in 1991 by Guido van Rossum.",
    context=["Python was created by Guido van Rossum.", "Python was first released in 1991."]
)

# Execute assessment
result = LLMFactCheck.eval(data)

# View results
print(f"Score: {result.score}/10")
print(f"Has issues: {result.status}")  # True = below threshold, False = passed
print(f"Reason: {result.reason[0]}")
```

### Dataset Mode - Batch Assessment

```python
from dingo.config import InputArgs
from dingo.exec import Executor

input_data = {
    "task_name": "factuality_assessment",
    "input_path": "test/data/responses.jsonl",
    "output_path": "outputs/",
    "dataset": {"source": "local", "format": "jsonl"},
    "executor": {
        "max_workers": 10,
        "result_save": {"good": True, "bad": True, "all_labels": True}
    },
    "evaluator": [
        {
            "fields": {
                "prompt": "question",
                "content": "response",
                "context": "references"
            },
            "evals": [
                {
                    "name": "LLMFactCheck",
                    "config": {
                        "model": "gpt-4o-mini",
                        "key": "YOUR_API_KEY",
                        "api_url": "https://api.openai.com/v1",
                        "parameters": {"threshold": 5.0}
                    }
                }
            ]
        }
    ]
}

input_args = InputArgs(**input_data)
executor = Executor.exec_map["local"](input_args)
summary = executor.execute()

print(f"Total: {summary.total}")
print(f"Passed: {summary.num_good}")
print(f"Issues: {summary.num_bad}")
print(f"Pass rate: {summary.score}%")
```

### Data File Format (JSONL)

```jsonl
{"question": "When was Python released?", "response": "Python was released in 1991 by Guido van Rossum.", "references": ["Python was created by Guido van Rossum.", "Python first appeared in 1991."]}
{"question": "What is the capital of France?", "response": "The capital of France is Paris.", "references": ["Paris is the capital and largest city of France."]}
```

## ⚙️ Configuration Options

### Threshold Adjustment

```python
LLMFactCheck.dynamic_config = EvaluatorLLMArgs(
    key="YOUR_API_KEY",
    api_url="https://api.openai.com/v1",
    model="gpt-4o-mini",
    parameters={"threshold": 5.0}  # Range: 0.0-10.0
)
```

**Threshold Recommendations**:
- **Strict scenarios** (medical, legal): threshold 7.0-8.0
- **General scenarios** (Q&A, documentation): threshold 5.0-6.0
- **Lenient scenarios** (creative content, brainstorming): threshold 3.0-4.0

### Model Selection

```python
# Option 1: GPT-4o (highest accuracy, higher cost)
LLMFactCheck.dynamic_config = EvaluatorLLMArgs(
    model="gpt-4o",
    key="YOUR_API_KEY",
    api_url="https://api.openai.com/v1"
)

# Option 2: GPT-4o-mini (balanced, recommended)
LLMFactCheck.dynamic_config = EvaluatorLLMArgs(
    model="gpt-4o-mini",
    key="YOUR_API_KEY",
    api_url="https://api.openai.com/v1"
)

# Option 3: Alternative LLM (DeepSeek, etc.)
LLMFactCheck.dynamic_config = EvaluatorLLMArgs(
    model="deepseek-chat",
    key="YOUR_API_KEY",
    api_url="https://api.deepseek.com"
)
```

## 📊 Output Format

### SDK Mode Output

```python
result = LLMFactCheck.eval(data)

# Basic information
result.score   # Score: 0.0-10.0
result.status  # Has issues: True (below threshold) / False (passed)
result.label   # Labels: ["QUALITY_GOOD.FACTCHECK_PASS"] or ["QUALITY_BAD.FACTCHECK_FAIL"]
result.reason  # Detailed reasons
result.metric  # Metric name: "LLMFactCheck"
```

**Output Example (Passed)**:
```python
result.score = 8.5
result.status = False  # False = passed
result.label = ["QUALITY_GOOD.FACTCHECK_PASS"]
result.reason = ["Factual accuracy assessment passed (score: 8.5/10). All claims verified: Python was released in 1991, Creator is Guido van Rossum."]
```

**Output Example (Failed)**:
```python
result.score = 3.2
result.status = True  # True = failed
result.label = ["QUALITY_BAD.FACTCHECK_FAIL"]
result.reason = ["Factual accuracy assessment failed (score: 3.2/10). Errors detected: Python was not released in 1995 (correct: 1991)"]
```

## 🌟 Best Practices

### 1. Provide High-quality Reference Materials

**Good References**:
```python
context = [
    "Python was created by Guido van Rossum and first released in February 1991.",
    "Python is an interpreted, high-level programming language.",
    "Python 2.0 was released in 2000, and Python 3.0 was released in 2008."
]
```

**Poor References**:
```python
context = [
    "Python",  # Too brief
    "Python is a programming language"  # Lacks details
]
```

### 2. Suitable Use Cases

**✅ Suitable for**:
- Verifiable factual claims (dates, names, numbers, events)
- Historical facts
- Technical specifications
- Statistical data

**❌ Not suitable for**:
- Subjective opinions
- Future predictions
- Creative content
- Open-ended questions

### 3. Combined Use with Other Metrics

```python
"evaluator": [
    {
        "fields": {
            "prompt": "user_input",
            "content": "response",
            "context": "retrieved_contexts"
        },
        "evals": [
            {"name": "LLMRAGFaithfulness"},    # Answer faithfulness
            {"name": "LLMFactCheck"},          # Factual accuracy
            {"name": "RuleHallucinationHHEM"}  # Hallucination detection
        ]
    }
]
```

### 4. Iterative Optimization

1. **Initial Testing**: Use the default threshold (5.0)
2. **Analyze Results**: Review false positives and false negatives
3. **Adjust Threshold**: Fine-tune based on business requirements (see the sweep sketch below)
4. **Re-validate**: Test with the new threshold
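
A minimal sketch of such a sweep, assuming `LLMFactCheck` is configured as in Quick Start. The calibration samples are illustrative stand-ins for records you label yourself; "false positive" here means a record flagged (`status = True`) that you consider factually fine, and "false negative" a factual error that slipped through.

```python
import os

from dingo.config.input_args import EvaluatorLLMArgs
from dingo.io.input import Data
from dingo.model.llm.llm_factcheck import LLMFactCheck

# Hand-labelled calibration set: (record, expected_bad) pairs you curate yourself.
labelled_samples = [
    (Data(data_id="cal_1",
          content="Python was released in 1991 by Guido van Rossum.",
          context=["Python was first released in 1991."]), False),
    (Data(data_id="cal_2",
          content="Python was released in 1995 by James Gosling.",
          context=["Python was first released in 1991."]), True),
]

for threshold in (4.0, 5.0, 6.0, 7.0):
    LLMFactCheck.dynamic_config = EvaluatorLLMArgs(
        key=os.getenv("OPENAI_API_KEY"),
        api_url=os.getenv("OPENAI_BASE_URL", "https://api.openai.com/v1"),
        model=os.getenv("OPENAI_MODEL", "gpt-4o-mini"),
        parameters={"threshold": threshold},
    )
    outcomes = [(LLMFactCheck.eval(d).status, expected) for d, expected in labelled_samples]
    false_pos = sum(got and not exp for got, exp in outcomes)  # flagged but actually fine
    false_neg = sum(exp and not got for got, exp in outcomes)  # factual errors that slipped through
    print(f"threshold={threshold}: false positives={false_pos}, false negatives={false_neg}")
```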

## 📈 Metric Comparison

| Metric | Purpose | Score Range | Requires Reference | Best For |
|--------|---------|-------------|--------------------|----------|
| **Factuality** | Verify factual accuracy | 0-10 | Optional (recommended) | Fact verification, knowledge base validation |
| **Faithfulness** | Check if based on context | 0-10 | Yes | RAG systems, prevent hallucinations |
| **Hallucination** | Detect contradictions with context | 0-1 | Yes | Fast hallucination detection |

**Recommendations**:
- **RAG evaluation**: Combine Faithfulness + Hallucination + Factuality
- **Content generation**: Use Factuality alone
- **Real-time verification**: Prioritize Hallucination (fast) or Faithfulness

## ❓ FAQ

### Q1: What is the difference between Factuality and Faithfulness?

- **Factuality**: Verifies whether the content is factually correct (can use external knowledge)
- **Faithfulness**: Checks whether the response is grounded in the provided context (only looks at the context-response relationship, as illustrated below)
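
A hedged illustration of the distinction, reusing the `LLMFactCheck` setup from Quick Start (the faithfulness verdict is described in comments rather than executed here):

```python
# A response that is faithful to its context but factually wrong.
data = Data(
    data_id="faith_vs_fact",
    content="Python was released in 1995.",    # follows the context exactly...
    context=["Python was released in 1995."],  # ...but the reference itself is incorrect
)

# Faithfulness would accept this (the response is grounded in the given context).
# Factuality may still flag it, since the judge LLM can draw on external knowledge
# of the correct release year (1991).
result = LLMFactCheck.eval(data)
print(result.score, result.status)
```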

### Q2: What if no reference materials are provided?

The LLM will use its internal knowledge for verification, but accuracy may be lower. **Recommendation**: Always provide reference materials for best results.
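
For example, a record can be assessed without references simply by omitting `context`; the judge then relies on its own knowledge:

```python
# No references supplied: the judge LLM falls back to internal knowledge,
# so treat the verdict with more caution.
data = Data(
    data_id="no_context_1",
    prompt="When was Python released?",
    content="Python was released in 1991 by Guido van Rossum.",
)

result = LLMFactCheck.eval(data)
print(f"Score: {result.score}/10")
```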

### Q3: How to handle domain-specific facts?

1. Provide domain-specific reference materials in `context` (see the sketch below)
2. Use a domain-specific LLM model as the judge
3. Lower the threshold to reduce false positives
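
A minimal sketch combining those three adjustments; the medical snippet and reference are illustrative, and you would substitute your own domain corpus and judge model:

```python
import os

from dingo.config.input_args import EvaluatorLLMArgs
from dingo.io.input import Data
from dingo.model.llm.llm_factcheck import LLMFactCheck

LLMFactCheck.dynamic_config = EvaluatorLLMArgs(
    key=os.getenv("OPENAI_API_KEY"),
    api_url=os.getenv("OPENAI_BASE_URL", "https://api.openai.com/v1"),
    model=os.getenv("OPENAI_MODEL", "gpt-4o-mini"),  # swap in a domain-tuned judge model if available
    parameters={"threshold": 4.0},                   # lower threshold to reduce false positives on niche facts
)

data = Data(
    data_id="domain_1",
    content="Amoxicillin is a beta-lactam antibiotic in the penicillin class.",
    context=["Amoxicillin is a penicillin-class (beta-lactam) antibiotic used to treat bacterial infections."],
)

result = LLMFactCheck.eval(data)
```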

### Q4: How to interpret scores?

- **8.0-10.0**: High accuracy, all or most facts verified
- **5.0-7.9**: Moderate accuracy, some errors or unverifiable claims
- **3.0-4.9**: Low accuracy, multiple errors
- **0.0-2.9**: Very low accuracy, serious factual errors

## 📖 Related Documents

- [RAG Evaluation Metrics Guide](rag_evaluation_metrics.md)
- [Hallucination Detection Guide](hallucination_detection_guide.md)
- [Response Quality Evaluation](../README.md#evaluation-metrics)

## 📝 Example Scenarios

### Scenario 1: Verify Historical Facts

```python
data = Data(
    content="Python was released in 1991 by Guido van Rossum at CWI in the Netherlands.",
    context=[
        "Python was created by Guido van Rossum.",
        "Python was first released in February 1991.",
        "Guido van Rossum began working on Python at CWI."
    ]
)

result = LLMFactCheck.eval(data)
# Expected: High score (>8.0), all facts verified
```

### Scenario 2: Detect Factual Errors

```python
data = Data(
    content="Python was released in 1995 by James Gosling.",  # Wrong year and author
    context=[
        "Python was created by Guido van Rossum.",
        "Python was first released in 1991."
    ]
)

result = LLMFactCheck.eval(data)
# Expected: Low score (<4.0), multiple errors detected
```

### Scenario 3: Assess Partially Correct Content

```python
data = Data(
    content="Python 3.0 was released in 2008. It introduced many breaking changes and removed backward compatibility with Python 2.x.",
    context=[
        "Python 3.0 was released on December 3, 2008.",
        "Python 3.0 was not backward compatible with Python 2.x series."
    ]
)

result = LLMFactCheck.eval(data)
# Expected: High score (7-9), facts mostly correct with minor imprecisions
```

### Scenario 4: Handle Unverifiable Claims

```python
data = Data(
    content="Python will become the most popular programming language in 2030.",  # Future prediction
    context=["Python is currently one of the most popular programming languages."]
)

result = LLMFactCheck.eval(data)
# Expected: Moderate score (4-6), future prediction cannot be verified
```
