# Dingo Factuality Assessment - Complete Guide

This guide explains how to use Dingo's integrated factuality assessment features to evaluate the factual accuracy of LLM-generated content.

## 🎯 Feature Overview

Factuality assessment evaluates whether LLM-generated responses contain factual errors or unverifiable claims. It is particularly useful for:

- **Content Quality Control**: Verify accuracy of generated content
- **Knowledge Base Validation**: Ensure knowledge base information is accurate
- **Training Data Filtering**: Filter out factually incorrect training samples
- **Real-time Output Verification**: Check factual accuracy of model outputs

## 🔧 Core Principles

### Evaluation Process

1. **Claim Extraction**: Break down the response into independent factual claims
2. **Fact Verification**: Verify each claim against reference materials or a knowledge base
3. **Score Calculation**: Calculate an overall factuality score
4. **Issue Identification**: Identify specific factual errors

### Scoring Mechanism

- **Score Range**: 0.0 - 10.0
- **Score Meaning**:
  - 8.0-10.0 = High factual accuracy
  - 5.0-7.9 = Moderate accuracy, some errors
  - 0.0-4.9 = Low accuracy, significant errors
- **Default Threshold**: 5.0 (configurable)

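For intuition, the sketch below shows how a claim-based score of this kind could be computed and compared against the threshold. The `Claim` structure and the simple verified-claims ratio are illustrative assumptions for this guide, not Dingo's internal implementation.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Claim:
    text: str
    verified: bool  # True if the claim is supported by the reference materials

def factuality_score(claims: List[Claim]) -> float:
    """Illustrative scoring: fraction of verified claims, scaled to 0-10."""
    if not claims:
        return 0.0
    return 10.0 * sum(c.verified for c in claims) / len(claims)

claims = [
    Claim("Python was first released in 1991.", verified=True),
    Claim("Python was created by Guido van Rossum.", verified=True),
    Claim("Python was created at MIT.", verified=False),
]

score = factuality_score(claims)
threshold = 5.0  # default threshold
print(f"Score: {score:.1f}, passed: {score >= threshold}")  # Score: 6.7, passed: True
```
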
## 📋 Usage Requirements

### Data Format Requirements

```python
from dingo.io.input import Data

data = Data(
    data_id="test_1",
    prompt="User's question",  # Original question (optional)
    content="LLM's response",  # Response to assess
    context=["Reference material 1", "Reference material 2"]  # Reference materials (optional but recommended)
)
```

## 🚀 Quick Start

### SDK Mode - Single Assessment

```python
import os
from dingo.config.input_args import EvaluatorLLMArgs
from dingo.io.input import Data
from dingo.model.llm.llm_factcheck import LLMFactCheck

# Configure LLM
LLMFactCheck.dynamic_config = EvaluatorLLMArgs(
    key=os.getenv("OPENAI_API_KEY"),
    api_url=os.getenv("OPENAI_BASE_URL", "https://api.openai.com/v1"),
    model=os.getenv("OPENAI_MODEL", "gpt-4o-mini"),
    parameters={"threshold": 5.0}
)

# Prepare data
data = Data(
    data_id="test_1",
    prompt="When was Python released?",
    content="Python was released in 1991 by Guido van Rossum.",
    context=["Python was created by Guido van Rossum.", "Python was first released in 1991."]
)

# Execute assessment
result = LLMFactCheck.eval(data)

# View results
print(f"Score: {result.score}/10")
print(f"Has issues: {result.status}")  # True = below threshold, False = passed
print(f"Reason: {result.reason[0]}")
```

### Dataset Mode - Batch Assessment

```python
from dingo.config import InputArgs
from dingo.exec import Executor

input_data = {
    "task_name": "factuality_assessment",
    "input_path": "test/data/responses.jsonl",
    "output_path": "outputs/",
    "dataset": {"source": "local", "format": "jsonl"},
    "executor": {
        "max_workers": 10,
        "result_save": {"good": True, "bad": True, "all_labels": True}
    },
    "evaluator": [
        {
            "fields": {
                "prompt": "question",
                "content": "response",
                "context": "references"
            },
            "evals": [
                {
                    "name": "LLMFactCheck",
                    "config": {
                        "model": "gpt-4o-mini",
                        "key": "YOUR_API_KEY",
                        "api_url": "https://api.openai.com/v1",
                        "parameters": {"threshold": 5.0}
                    }
                }
            ]
        }
    ]
}

input_args = InputArgs(**input_data)
executor = Executor.exec_map["local"](input_args)
summary = executor.execute()

print(f"Total: {summary.total}")
print(f"Passed: {summary.num_good}")
print(f"Issues: {summary.num_bad}")
print(f"Pass rate: {summary.score}%")
```

### Data File Format (JSONL)

```jsonl
{"question": "When was Python released?", "response": "Python was released in 1991 by Guido van Rossum.", "references": ["Python was created by Guido van Rossum.", "Python first appeared in 1991."]}
{"question": "What is the capital of France?", "response": "The capital of France is Paris.", "references": ["Paris is the capital and largest city of France."]}
```

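If you build this file programmatically, one JSON object per line is all that is needed. Below is a minimal sketch using only the Python standard library; the records mirror the example above, and the keys (`question`, `response`, `references`) must match the `fields` mapping in the evaluator configuration.

```python
import json

records = [
    {
        "question": "When was Python released?",
        "response": "Python was released in 1991 by Guido van Rossum.",
        "references": ["Python was created by Guido van Rossum.", "Python first appeared in 1991."],
    },
    {
        "question": "What is the capital of France?",
        "response": "The capital of France is Paris.",
        "references": ["Paris is the capital and largest city of France."],
    },
]

# Write one JSON object per line (JSONL)
with open("test/data/responses.jsonl", "w", encoding="utf-8") as f:
    for record in records:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
```
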
## ⚙️ Configuration Options

### Threshold Adjustment

```python
LLMFactCheck.dynamic_config = EvaluatorLLMArgs(
    key="YOUR_API_KEY",
    api_url="https://api.openai.com/v1",
    model="gpt-4o-mini",
    parameters={"threshold": 5.0}  # Range: 0.0-10.0
)
```

**Threshold Recommendations**:
- **Strict scenarios** (medical, legal): threshold 7.0-8.0
- **General scenarios** (Q&A, documentation): threshold 5.0-6.0
- **Loose scenarios** (creative content, brainstorming): threshold 3.0-4.0

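One way to apply these recommendations in code is a simple scenario-to-threshold mapping. The scenario names and values below are illustrative choices for this guide, not part of Dingo's API:

```python
from dingo.config.input_args import EvaluatorLLMArgs
from dingo.model.llm.llm_factcheck import LLMFactCheck

# Illustrative mapping from use case to recommended threshold
SCENARIO_THRESHOLDS = {
    "medical": 8.0,        # strict
    "documentation": 5.5,  # general
    "creative": 3.5,       # loose
}

scenario = "documentation"
LLMFactCheck.dynamic_config = EvaluatorLLMArgs(
    key="YOUR_API_KEY",
    api_url="https://api.openai.com/v1",
    model="gpt-4o-mini",
    parameters={"threshold": SCENARIO_THRESHOLDS[scenario]}
)
```
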
### Model Selection

```python
# Option 1: GPT-4o (highest accuracy, higher cost)
LLMFactCheck.dynamic_config = EvaluatorLLMArgs(
    model="gpt-4o",
    key="YOUR_API_KEY",
    api_url="https://api.openai.com/v1"
)

# Option 2: GPT-4o-mini (balanced, recommended)
LLMFactCheck.dynamic_config = EvaluatorLLMArgs(
    model="gpt-4o-mini",
    key="YOUR_API_KEY",
    api_url="https://api.openai.com/v1"
)

# Option 3: Alternative LLM (DeepSeek, etc.)
LLMFactCheck.dynamic_config = EvaluatorLLMArgs(
    model="deepseek-chat",
    key="YOUR_API_KEY",
    api_url="https://api.deepseek.com"
)
```

## 📊 Output Format

### SDK Mode Output

```python
result = LLMFactCheck.eval(data)

# Basic information
result.score   # Score: 0.0-10.0
result.status  # Has issues: True (below threshold) / False (passed)
result.label   # Labels: ["QUALITY_GOOD.FACTCHECK_PASS"] or ["QUALITY_BAD.FACTCHECK_FAIL"]
result.reason  # Detailed reasons
result.metric  # Metric name: "LLMFactCheck"
```

**Output Example (Passed)**:
```python
result.score = 8.5
result.status = False  # False = passed
result.label = ["QUALITY_GOOD.FACTCHECK_PASS"]
result.reason = ["Factual accuracy assessment passed (score: 8.5/10). All claims verified: Python was released in 1991, Creator is Guido van Rossum."]
```

**Output Example (Failed)**:
```python
result.score = 3.2
result.status = True  # True = failed
result.label = ["QUALITY_BAD.FACTCHECK_FAIL"]
result.reason = ["Factual accuracy assessment failed (score: 3.2/10). Errors detected: Python was not released in 1995 (correct: 1991)"]
```

## 🌟 Best Practices

### 1. Provide High-Quality Reference Materials

**Good References**:
```python
context = [
    "Python was created by Guido van Rossum and first released in February 1991.",
    "Python is an interpreted, high-level programming language.",
    "Python 2.0 was released in 2000, Python 3.0 was released in 2008."
]
```

**Poor References**:
```python
context = [
    "Python",  # Too brief
    "Python is a programming language"  # Lacks details
]
```

### 2. Suitable Use Cases

**✅ Suitable for**:
- Verifiable factual claims (dates, names, numbers, events)
- Historical facts
- Technical specifications
- Statistical data

**❌ Not suitable for**:
- Subjective opinions
- Future predictions
- Creative content
- Open-ended questions

### 3. Combined Use with Other Metrics

```python
"evaluator": [
    {
        "fields": {
            "prompt": "user_input",
            "content": "response",
            "context": "retrieved_contexts"
        },
        "evals": [
            {"name": "LLMRAGFaithfulness"},     # Answer faithfulness
            {"name": "LLMFactCheck"},           # Factual accuracy
            {"name": "RuleHallucinationHHEM"}   # Hallucination detection
        ]
    }
]
```

### 4. Iterative Optimization

1. **Initial Testing**: Use the default threshold (5.0)
2. **Analyze Results**: Review false positives and false negatives
3. **Adjust Threshold**: Fine-tune based on business requirements (see the threshold-sweep sketch below)
4. **Re-validate**: Test with the new threshold

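The sweep itself can be done offline once evaluation scores are collected. The sketch below assumes you have gathered `(score, is_actually_correct)` pairs from a labeled validation set; it is illustrative and not part of Dingo's API:

```python
# (score, is_actually_correct) pairs collected from a labeled validation set
validation = [(8.5, True), (6.2, True), (4.1, False), (7.0, False), (2.5, False)]

def evaluate_threshold(threshold: float):
    # False positive: flagged as having issues (score below threshold) although actually correct
    false_positives = sum(1 for score, correct in validation if score < threshold and correct)
    # False negative: passed (score at or above threshold) although actually wrong
    false_negatives = sum(1 for score, correct in validation if score >= threshold and not correct)
    return false_positives, false_negatives

for threshold in (3.0, 5.0, 7.0):
    fp, fn = evaluate_threshold(threshold)
    print(f"threshold={threshold}: flagged-but-correct={fp}, passed-but-wrong={fn}")
```
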
## 📈 Metric Comparison

| Metric | Purpose | Score Range | Requires Reference | Best For |
|--------|---------|-------------|--------------------|----------|
| **Factuality** | Verify factual accuracy | 0-10 | Optional (recommended) | Fact verification, knowledge base validation |
| **Faithfulness** | Check if based on context | 0-10 | Yes | RAG systems, prevent hallucinations |
| **Hallucination** | Detect contradictions with context | 0-1 | Yes | Fast hallucination detection |

**Recommendations**:
- **RAG evaluation**: Combine Faithfulness + Hallucination + Factuality
- **Content generation**: Use Factuality alone
- **Real-time verification**: Prioritize Hallucination (fast) or Faithfulness

## ❓ FAQ

### Q1: What is the difference between Factuality and Faithfulness?

- **Factuality**: Verifies whether content is factually correct (can use external knowledge)
- **Faithfulness**: Checks whether the response is based on the provided context (only looks at the context-response relationship)

### Q2: What if no reference materials are provided?

The LLM will use its internal knowledge for verification, but accuracy may be lower. **Recommendation**: Always provide reference materials for best results.

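For example, assuming the SDK setup from the Quick Start section, an assessment without references simply omits `context`:

```python
data = Data(
    data_id="no_context_1",
    prompt="When was Python released?",
    content="Python was released in 1991 by Guido van Rossum."
    # No context: verification relies on the evaluator LLM's internal knowledge
)

result = LLMFactCheck.eval(data)
print(result.score, result.reason[0])
```
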
### Q3: How to handle domain-specific facts?

1. Provide domain-specific reference materials in `context`
2. Use domain-specific LLM models
3. Lower the threshold to reduce false positives

### Q4: How to interpret scores?

- **8.0-10.0**: High accuracy, all or most facts verified
- **5.0-7.9**: Moderate accuracy, some errors or unverifiable claims
- **3.0-4.9**: Low accuracy, multiple errors
- **0.0-2.9**: Very low accuracy, serious factual errors

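If you want to turn these bands into labels for reporting or filtering, a small hypothetical helper could look like this:

```python
def interpret_score(score: float) -> str:
    """Map a 0-10 factuality score to the bands described above (illustrative helper)."""
    if score >= 8.0:
        return "high accuracy"
    if score >= 5.0:
        return "moderate accuracy"
    if score >= 3.0:
        return "low accuracy"
    return "very low accuracy"

print(interpret_score(8.5))  # high accuracy
print(interpret_score(3.2))  # low accuracy
```
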
## 📖 Related Documents

- [RAG Evaluation Metrics Guide](rag_evaluation_metrics.md)
- [Hallucination Detection Guide](hallucination_detection_guide.md)
- [Response Quality Evaluation](../README.md#evaluation-metrics)

## 📝 Example Scenarios

### Scenario 1: Verify Historical Facts

```python
data = Data(
    content="Python was released in 1991 by Guido van Rossum at CWI in the Netherlands.",
    context=[
        "Python was created by Guido van Rossum.",
        "Python was first released in February 1991.",
        "Guido van Rossum began working on Python at CWI."
    ]
)

result = LLMFactCheck.eval(data)
# Expected: High score (>8.0), all facts verified
```

### Scenario 2: Detect Factual Errors

```python
data = Data(
    content="Python was released in 1995 by James Gosling.",  # Wrong year and author
    context=[
        "Python was created by Guido van Rossum.",
        "Python was first released in 1991."
    ]
)

result = LLMFactCheck.eval(data)
# Expected: Low score (<4.0), multiple errors detected
```

### Scenario 3: Assess Partially Correct Content

```python
data = Data(
    content="Python 3.0 was released in 2008. It introduced many breaking changes and removed backward compatibility with Python 2.x.",
    context=[
        "Python 3.0 was released on December 3, 2008.",
        "Python 3.0 was not backward compatible with Python 2.x series."
    ]
)

result = LLMFactCheck.eval(data)
# Expected: High score (7-9), facts mostly correct with minor imprecisions
```

### Scenario 4: Handle Unverifiable Claims

```python
data = Data(
    content="Python will become the most popular programming language in 2030.",  # Future prediction
    context=["Python is currently one of the most popular programming languages."]
)

result = LLMFactCheck.eval(data)
# Expected: Moderate score (4-6), future prediction cannot be verified
```