This guide introduces Dingo's integrated hallucination detection features, which support two detection methods: the HHEM-2.1-Open local model (recommended) and GPT-based cloud detection.
Hallucination detection evaluates whether an LLM-generated response factually contradicts the provided reference context. It is particularly useful for:
- RAG System Evaluation: Detect consistency between generated responses and retrieved documents
- SFT Data Quality Assessment: Verify factual accuracy of responses in training data
- LLM Output Verification: Real-time detection of hallucination issues in model outputs
The detection workflow has four steps:
1. Data Preparation: provide the response to detect and the reference context
2. Consistency Analysis: judge whether the response is consistent with each context passage
3. Score Calculation: compute an overall hallucination score
4. Threshold Judgment: flag the response based on the configured threshold
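These steps can be sketched end-to-end in plain Python. This is a hypothetical illustration of the flow, not Dingo's internal code: the word-overlap consistency function and the averaging aggregation below are crude stand-ins for the trained model.

```python
def detect_hallucination(response, contexts, consistency_fn, threshold=0.5):
    """Steps 2-4: score the response against each context, aggregate, compare to threshold."""
    # Step 2: consistency of the response with each context (1.0 = fully consistent)
    consistencies = [consistency_fn(response, ctx) for ctx in contexts]
    # Step 3: overall hallucination score (higher = more hallucination); mean is an assumption
    score = 1.0 - sum(consistencies) / len(consistencies)
    # Step 4: flag when the score reaches the threshold
    return score, score >= threshold

# Stand-in consistency function: naive word overlap (a real system uses a trained model)
def word_overlap(response, ctx):
    r, c = set(response.lower().split()), set(ctx.lower().split())
    return len(r & c) / max(len(r), 1)

score, flagged = detect_hallucination(
    "Paris is in Asia.",
    ["Paris is the capital of France."],
    word_overlap,
)
print(score, flagged)  # 0.5 True
```

A real consistency model also handles paraphrase and negation, which word overlap cannot; the sketch only shows how the per-context scores feed the final decision.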
- Score Range: 0.0 - 1.0
- Score Meaning:
  - 0.0 = No hallucination
  - 1.0 = Complete hallucination
- Default Threshold: 0.5 (configurable)
```python
from dingo.io.input import Data

data = Data(
    data_id="test_1",
    prompt="User's question",  # Original question (optional)
    content="LLM's response",  # Response to detect
    context=["Reference context 1", "Reference context 2"]  # Reference context (required)
)
```

The `context` field accepts three formats:

```python
# Method 1: String list
context = ["Context 1", "Context 2", "Context 3"]

# Method 2: Single string
context = "Complete context text"

# Method 3: Dict with a "passages" key
context = {"passages": ["Context 1", "Context 2"]}
```

Method 1: HHEM-2.1-Open local model (recommended)

Advantages:
- ✅ Fast speed
- ✅ No API costs
- ✅ Data privacy
- ✅ Can run offline
Installation:
```shell
# Install extra dependencies (quote the extras so the shell does not expand the brackets)
pip install "dingo-python[hhem]"

# Or install dependencies manually
pip install sentence-transformers torch
```

Usage:
```python
from dingo.config.input_args import EvaluatorRuleArgs
from dingo.io.input import Data
from dingo.model.rule.rule_hallucination_hhem import RuleHallucinationHHEM

# Configure (first run will auto-download the model, ~400MB)
RuleHallucinationHHEM.dynamic_config = EvaluatorRuleArgs(
    threshold=0.5  # Hallucination threshold; lower = stricter (flags more responses)
)

# Prepare data
data = Data(
    data_id="test_1",
    content="Paris is the capital of Germany.",  # Response to detect
    context=["Paris is the capital of France."]  # Reference context
)

# Execute detection
result = RuleHallucinationHHEM.eval(data)

# View results
print(f"Score: {result.score}")        # 0.0-1.0, higher = more hallucination
print(f"Has issues: {result.status}")  # True = hallucination, False = no hallucination
print(f"Reason: {result.reason}")
```

Output Example:

```
Score: 0.85
Has issues: True
Reason: ['Hallucination detected (score: 0.85, threshold: 0.5). Inconsistent parts: Paris is capital of Germany (context states: Paris is capital of France)']
```
Method 2: GPT-based cloud detection

Advantages:
- ✅ No local model download needed
- ✅ High-quality detection with powerful LLM
- ✅ Easy integration
Usage:
```python
import os

from dingo.config.input_args import EvaluatorLLMArgs
from dingo.io.input import Data
from dingo.model.llm.llm_hallucination import LLMHallucination

# Configure LLM
LLMHallucination.dynamic_config = EvaluatorLLMArgs(
    key=os.getenv("OPENAI_API_KEY"),
    api_url=os.getenv("OPENAI_BASE_URL", "https://api.openai.com/v1"),
    model=os.getenv("OPENAI_MODEL", "gpt-4o-mini"),
    parameters={"threshold": 0.5}
)

# Prepare data
data = Data(
    data_id="test_1",
    content="Paris is the capital of Germany.",
    context=["Paris is the capital of France."]
)

# Execute detection
result = LLMHallucination.eval(data)

# View results
print(f"Score: {result.score}")
print(f"Has issues: {result.status}")
print(f"Reason: {result.reason}")
```

For batch evaluation, configure an executor with `InputArgs`:

```python
from dingo.config import InputArgs
from dingo.exec import Executor

input_data = {
    "task_name": "hallucination_detection",
    "input_path": "test/data/rag_responses.jsonl",
    "output_path": "outputs/",
    "dataset": {"source": "local", "format": "jsonl"},
    "executor": {
        "max_workers": 10,
        "result_save": {
            "good": True,
            "bad": True,
            "all_labels": True
        }
    },
    "evaluator": [
        {
            "fields": {
                "content": "response",
                "context": "retrieved_contexts"
            },
            "evals": [
                {
                    "name": "RuleHallucinationHHEM",  # Or "LLMHallucination"
                    "config": {"threshold": 0.5}
                }
            ]
        }
    ]
}

input_args = InputArgs(**input_data)
executor = Executor.exec_map["local"](input_args)
summary = executor.execute()

print(f"Total: {summary.total}")
print(f"Issues: {summary.num_bad}")
print(f"Pass rate: {summary.score}%")
```

Example input JSONL (one sample per line):

```jsonl
{"response": "Paris is the capital of France.", "retrieved_contexts": ["Paris is the capital of France.", "France is in Western Europe."]}
{"response": "Python was created by Guido van Rossum.", "retrieved_contexts": ["Python was designed by Guido van Rossum.", "Python was first released in 1991."]}
```

Threshold configuration for the two methods:

```python
# Method 1: Rule-based (HHEM)
RuleHallucinationHHEM.dynamic_config = EvaluatorRuleArgs(
    threshold=0.5  # Range: 0.0-1.0
)

# Method 2: LLM-based
LLMHallucination.dynamic_config = EvaluatorLLMArgs(
    key="YOUR_API_KEY",
    api_url="https://api.openai.com/v1",
    model="gpt-4o-mini",
    parameters={"threshold": 0.5}  # Range: 0.0-1.0
)
```

Threshold Recommendations:
- Strict scenarios (finance, medical): 0.3-0.4
- General scenarios (Q&A systems): 0.5-0.6
- Loose scenarios (creative content): 0.7-0.8
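The recommended ranges above can be kept in a small lookup table. The scenario names and midpoint values below are illustrative choices for a starting point, not Dingo defaults:

```python
# Midpoints of the recommended ranges; tune for your deployment
SCENARIO_THRESHOLDS = {
    "strict": 0.35,   # finance, medical (0.3-0.4)
    "general": 0.55,  # Q&A systems (0.5-0.6)
    "loose": 0.75,    # creative content (0.7-0.8)
}

def threshold_for(scenario: str) -> float:
    """Look up a starting threshold, falling back to the general setting."""
    return SCENARIO_THRESHOLDS.get(scenario, SCENARIO_THRESHOLDS["general"])

print(threshold_for("strict"))  # 0.35
```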
Device configuration for HHEM:

```python
# Auto-select (default: uses GPU if available)
RuleHallucinationHHEM.dynamic_config = EvaluatorRuleArgs()

# Force CPU
RuleHallucinationHHEM.device = "cpu"

# Force GPU
RuleHallucinationHHEM.device = "cuda"

# Specific GPU
RuleHallucinationHHEM.device = "cuda:0"
```

| Feature | HHEM-2.1-Open | GPT-based |
|---|---|---|
| Speed | Fast (~50ms/sample) | Slower (~1-2s/sample) |
| Cost | Free | API costs |
| Accuracy | High (F1: 0.84) | Very High |
| Privacy | Local, secure | Data sent to API |
| Deployment | Needs model download (~400MB) | Needs API key |
| Offline | ✅ Supported | ❌ Requires network |
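The per-sample figures in the table give a rough way to budget a batch run. This is illustrative arithmetic only: real throughput depends on hardware, batching, worker count, and API rate limits, and 1.5 s is just the midpoint of the quoted GPT range.

```python
def batch_seconds(num_samples: int, seconds_per_sample: float) -> float:
    """Sequential wall-clock estimate at a fixed per-sample latency."""
    return num_samples * seconds_per_sample

samples = 10_000
print(f"HHEM (~0.05 s/sample): {batch_seconds(samples, 0.05):,.0f} s")  # 500 s
print(f"GPT  (~1.5 s/sample):  {batch_seconds(samples, 1.5):,.0f} s")   # 15,000 s
```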
Recommendations:
- Production environment: HHEM-2.1-Open (fast, free, private)
- High-precision scenarios: GPT-based (highest accuracy)
- Offline scenarios: HHEM-2.1-Open (can run completely offline)
Good Context:

```python
context = [
    "Paris is the capital of France, located in northern France.",
    "France is a country in Western Europe with a population of about 67 million."
]
```

Poor Context:

```python
context = [
    "Paris",                   # Too short, lacks information
    "France has many cities."  # Too vague
]
```

When multiple contexts exist, the system automatically analyzes consistency with each:

```python
data = Data(
    content="Paris is the capital of France and the largest city in France.",
    context=[
        "Paris is the capital of France.",      # Supports the first half
        "Paris is the largest city in France."  # Supports the second half
    ]
)
```

Threshold tuning workflow:
1. Initial testing: use the default threshold (0.5)
2. Analyze results: check for false positives and false negatives
3. Adjust threshold: refine it based on business needs
4. Verify effects: re-test with the new threshold
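Step 2 of this loop can be made concrete with a handful of human-labeled samples: count false positives and false negatives at each candidate threshold. A minimal sketch, where the scores and labels are invented for illustration:

```python
def fp_fn_counts(scored, threshold):
    """Count false positives/negatives from (score, human_label) pairs.

    human_label is True when the response really is a hallucination.
    """
    fp = sum(1 for score, label in scored if score >= threshold and not label)
    fn = sum(1 for score, label in scored if score < threshold and label)
    return fp, fn

# (hallucination score, human label) - invented evaluation data
scored = [(0.9, True), (0.6, True), (0.55, False), (0.4, True), (0.2, False)]

for t in (0.3, 0.5, 0.7):
    fp, fn = fp_fn_counts(scored, t)
    print(f"threshold={t}: FP={fp}, FN={fn}")
```

Raising the threshold trades false positives for false negatives; pick the point whose error mix matches your scenario's tolerance.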
"evaluator": [
{
"fields": {
"prompt": "user_input",
"content": "response",
"context": "retrieved_contexts"
},
"evals": [
{"name": "LLMRAGFaithfulness"}, # Faithfulness (based on LLM)
{"name": "RuleHallucinationHHEM"}, # Hallucination (model-based)
{"name": "LLMRAGAnswerRelevancy"} # Answer relevance
]
}
]- Production/large-scale: HHEM (fast, free, private)
- High-precision evaluation: GPT-based (highest accuracy, but has costs)
- Offline scenarios: HHEM (can run completely offline)
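These selection rules can be written down as a tiny decision helper; the function name and boolean flags are illustrative, only the two class names come from this guide:

```python
def pick_method(needs_offline: bool, needs_highest_accuracy: bool) -> str:
    """Choose a detector per the recommendations above."""
    if needs_offline:
        return "RuleHallucinationHHEM"  # the only option that runs fully offline
    if needs_highest_accuracy:
        return "LLMHallucination"       # GPT-based: highest accuracy, but has API costs
    return "RuleHallucinationHHEM"      # default: fast, free, private

print(pick_method(needs_offline=False, needs_highest_accuracy=True))  # LLMHallucination
```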
HHEM uses a Sentence-Transformers model (~400MB) that is automatically downloaded and cached on the first run. Subsequent runs load it directly from the cache, so no re-download is needed.
```shell
# Manually download the model
huggingface-cli download vectara/hallucination_evaluation_model --local-dir ~/.cache/huggingface/hub/models--vectara--hallucination_evaluation_model

# Or use a mirror
export HF_ENDPOINT=https://hf-mirror.com
```

Score interpretation:
- 0.0-0.3: Low hallucination risk; response is highly consistent with the context
- 0.3-0.5: Moderate risk; some parts may be inconsistent and need attention
- 0.5-0.7: High risk; significant inconsistencies that need review
- 0.7-1.0: Severe hallucination; response seriously contradicts the context
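The bands above map directly onto a small classifier. The band names come from this guide; treating each boundary value as belonging to the higher band is a choice made here:

```python
def risk_band(score: float) -> str:
    """Map a hallucination score to the risk band used in this guide."""
    if score < 0.3:
        return "low"
    if score < 0.5:
        return "moderate"
    if score < 0.7:
        return "high"
    return "severe"

print(risk_band(0.85))  # severe
print(risk_band(0.1))   # low
```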
Case 1: Factual errors

```python
data = Data(
    content="Python was released in 1995 by James Gosling.",  # Wrong year and author
    context=["Python was created by Guido van Rossum and first released in 1991."]
)
result = RuleHallucinationHHEM.eval(data)
# Expected: high score (>0.7), hallucination detected
```

Case 2: Partially correct response

```python
data = Data(
    content="Machine learning is a branch of AI. It was invented in the 1950s by Alan Turing.",  # First sentence correct, second incorrect
    context=["Machine learning is a subfield of artificial intelligence."]
)
result = RuleHallucinationHHEM.eval(data)
# Expected: moderate score (0.4-0.6), partial hallucination
```

Case 3: Consistent response

```python
data = Data(
    content="Deep learning is a subset of machine learning that uses multi-layer neural networks.",
    context=["Deep learning is part of machine learning, characterized by using multi-layer neural networks."]
)
result = RuleHallucinationHHEM.eval(data)
# Expected: low score (<0.3), no hallucination
```