This guide explains how to create custom agent-based evaluators and tools in Dingo. Agent-based evaluation enhances traditional rule and LLM evaluators by adding multi-step reasoning, tool usage, and adaptive context gathering.
- Architecture Overview
- Agent Implementation Patterns
- Creating Custom Tools
- Creating Custom Agents
- Configuration
- Testing
- Best Practices
- Examples
Agents extend Dingo's evaluation capabilities:
Traditional Evaluation:
Data → Rule/LLM → EvalDetail
Agent-Based Evaluation:
Data → Agent → [Tool 1, Tool 2, ...] → LLM Reasoning → EvalDetail
Key Components:
- BaseAgent: Abstract base class for all agents (extends `BaseOpenAI`)
- Tool Registry: Manages available tools for agents
- BaseTool: Abstract interface for tool implementations
- Auto-Discovery: Agents registered via the `@Model.llm_register()` decorator
Execution Model:
- Agents run in ThreadPoolExecutor (same as LLMs) for I/O-bound operations
- Tools are called synchronously within the agent's execution
- Configuration injected via the `dynamic_config` attribute
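The execution model above can be sketched in plain Python. The class and tool names below are illustrative stand-ins, not Dingo's actual API:

```python
from concurrent.futures import ThreadPoolExecutor

# Illustrative sketch of the execution model described above;
# fake_tool and run_agent are hypothetical, not Dingo's real API.
def fake_tool(query: str) -> dict:
    # Tools return a dict with a 'success' flag (see "Creating Custom Tools")
    return {"success": True, "result": f"results for {query}"}

def run_agent(content: str) -> str:
    # Tool calls happen synchronously inside the agent's execution
    tool_result = fake_tool(content)
    return "GOOD" if tool_result["success"] else "ERROR"

# Agents run in a thread pool, like LLM evaluators (I/O-bound work)
with ThreadPoolExecutor(max_workers=4) as pool:
    futures = [pool.submit(run_agent, text) for text in ["a", "b"]]
    statuses = [f.result() for f in futures]

print(statuses)
```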
Dingo supports three complementary patterns for implementing agent-based evaluators. All patterns share the same configuration interface and are transparent to users, allowing you to choose the approach that best fits your needs.
| Aspect | LangChain-Based | Custom Workflow | Agent-First + Context |
|---|---|---|---|
| Control | Framework-driven | Developer-driven | Framework + override |
| Complexity | Simple (declarative) | Moderate (imperative) | Moderate (hybrid) |
| Flexibility | Limited to LangChain | Unlimited | LangChain + artifacts |
| Code Volume | Low (~100 lines) | Medium (~200 lines) | High (~500+ lines) |
| Best For | Multi-step reasoning | Workflow composition | Article-level verification |
| Example | AgentFactCheck | AgentHallucination | ArticleFactChecker |
Philosophy: Let the framework handle orchestration, you focus on the task.
✅ **Complex multi-step reasoning required**: The agent needs to make multiple decisions and tool calls adaptively
✅ **Benefit from LangChain's battle-tested patterns**: Leverage proven agent orchestration and error handling
✅ **Prefer declarative over imperative style**: Define what the agent should do, not how to do it step-by-step
✅ **Want rapid prototyping**: Get a working agent with minimal code
❌ **Need fine-grained control over every step**: You want to control exactly when and how tools are called
❌ **Want to compose with existing Dingo evaluators**: You need to call other evaluators as part of the workflow
❌ **Have domain-specific workflow requirements**: Your workflow doesn't fit the ReAct pattern well
- Set `use_agent_executor = True` to enable the LangChain path
- Override `_format_agent_input()` to structure input for the agent
- Override `_get_system_prompt()` to provide task-specific instructions
- Implement `aggregate_results()` to parse agent output into EvalDetail
- Return an empty list in `plan_execution()` (not used with the LangChain path)
from dingo.model import Model
from dingo.model.llm.agent.base_agent import BaseAgent
from dingo.io import Data
from dingo.io.output.eval_detail import EvalDetail
from typing import Any, Dict, List
@Model.llm_register("AgentFactCheck")
class AgentFactCheck(BaseAgent):
"""LangChain-based fact-checking agent."""
use_agent_executor = True # Enable LangChain agent mode
available_tools = ["tavily_search"]
max_iterations = 5
@classmethod
def _format_agent_input(cls, input_data: Data) -> str:
"""Structure input for the agent."""
parts = []
if hasattr(input_data, 'prompt') and input_data.prompt:
parts.append(f"**Question:**\n{input_data.prompt}")
parts.append(f"**Response to Evaluate:**\\n{input_data.content}")
if hasattr(input_data, 'context') and input_data.context:
parts.append(f"**Context:**\\n{input_data.context}")
else:
parts.append("**Context:** None - use web search to verify")
return "\\n\\n".join(parts)
@classmethod
def _get_system_prompt(cls, input_data: Data) -> str:
"""Provide task-specific instructions."""
has_context = hasattr(input_data, 'context') and input_data.context
base = """You are a fact-checking agent with web search capabilities.
Your task:
1. Analyze the Question and Response provided"""
context_instruction = (
"\n2. Context is provided - evaluate the Response against it"
"\n3. You MAY use web search for additional verification if needed"
if has_context else
"\n2. NO Context is available - you MUST use web search to verify facts"
"\n3. Search for reliable sources to fact-check the response"
)
output_format = """
**Output Format:**
HALLUCINATION_DETECTED: [YES or NO]
EXPLANATION: [Your analysis]
EVIDENCE: [Supporting facts]
SOURCES: [URLs, one per line with - prefix]
Be precise. Start with "HALLUCINATION_DETECTED:" followed by YES or NO."""
return base + context_instruction + output_format
@classmethod
def aggregate_results(cls, input_data: Data, results: List[Any]) -> EvalDetail:
"""Parse agent output into EvalDetail."""
if not results:
return cls._create_error_result("No results from agent")
agent_result = results[0]
output = agent_result.get('output', '')
# Parse hallucination status
has_hallucination = cls._detect_hallucination_from_output(output)
# Build result
result = EvalDetail(metric=cls.__name__)
result.status = has_hallucination
result.label = ["BAD:HALLUCINATION" if has_hallucination else "GOOD"]
result.reason = [f"Agent Analysis:\\n{output}"]
return result
@classmethod
def plan_execution(cls, input_data: Data) -> List[Dict]:
"""Not used with LangChain agent (agent handles planning)."""
return []

Pros:
- ✅ Less code to write and maintain
- ✅ Framework handles tool orchestration automatically
- ✅ Automatic retry and error handling
- ✅ Battle-tested ReAct pattern from LangChain
Cons:
- ❌ Limited to LangChain's agent patterns
- ❌ Less control over execution flow
- ❌ Debugging can be harder (framework abstraction)
- ❌ Cannot compose with existing Dingo evaluators
Philosophy: Explicit control over every step, compose with existing evaluators.
✅ **Need fine-grained workflow control**: You want to control exactly what happens at each step
✅ **Want to compose with existing Dingo evaluators**: Reuse evaluators like LLMHallucination within your workflow
✅ **Prefer explicit over implicit behavior**: You want to see and control every tool call and LLM interaction
✅ **Have domain-specific requirements**: Your workflow has unique steps that don't fit standard patterns
✅ **Need conditional logic between steps**: Different paths based on intermediate results
❌ **Want framework-managed multi-step reasoning**: You prefer the agent to figure out the steps autonomously
❌ **Prefer minimal code**: You want a quick solution without manual orchestration
❌ **Need rapid prototyping**: You don't want to write explicit workflow logic
❌ **Complex reasoning benefits from ReAct**: Your task requires adaptive multi-step reasoning
- Implement a custom `eval()` method with explicit workflow logic
- Manually call `execute_tool()` for each tool operation
- Manually call `send_messages()` for LLM interactions
- Optionally delegate to existing evaluators (e.g., LLMHallucination)
- Return `EvalDetail` directly from `eval()`
from dingo.model import Model
from dingo.model.llm.agent.base_agent import BaseAgent
from dingo.io import Data
from dingo.io.output.eval_detail import EvalDetail
from typing import Dict, List
@Model.llm_register("AgentHallucination")
class AgentHallucination(BaseAgent):
"""Custom workflow hallucination detector."""
available_tools = ["tavily_search"]
max_iterations = 3
@classmethod
def eval(cls, input_data: Data) -> EvalDetail:
"""Main evaluation method with custom workflow."""
cls.create_client() # Initialize LLM client
# Step 1: Check if context is available
has_context = cls._has_context(input_data)
if has_context:
# Path A: Use existing evaluator
return cls._eval_with_context(input_data)
else:
# Path B: Custom workflow with web search
return cls._eval_with_web_search(input_data)
@classmethod
def _eval_with_web_search(cls, input_data: Data) -> EvalDetail:
"""Execute custom workflow: extract claims → search → evaluate."""
# Step 2: Extract factual claims (manual LLM call)
claims = cls._extract_claims(input_data)
if not claims:
return cls._create_result(
status=False,
reason="No factual claims found to verify"
)
# Step 3: Search web for each claim (manual tool calls)
search_results = []
for claim in claims:
result = cls.execute_tool('tavily_search', query=claim)
if result.get('success'):
search_results.append(result['result'])
# Step 4: Synthesize context from search results
context = cls._synthesize_context(search_results)
# Step 5: Evaluate with synthesized context (delegate to evaluator)
data_with_context = Data(
content=input_data.content,
context=context
)
return cls._eval_with_context(data_with_context)
@classmethod
def _extract_claims(cls, input_data: Data) -> List[str]:
"""Extract factual claims using LLM."""
prompt = f"""Extract all factual claims from this text:
{input_data.content}
Return a JSON list of claims."""
messages = [{"role": "user", "content": prompt}]
response = cls.send_messages(messages)
# Parse claims from response
import json
try:
claims = json.loads(response)
return claims if isinstance(claims, list) else []
except json.JSONDecodeError:
return []
@classmethod
def _synthesize_context(cls, search_results: List[Dict]) -> str:
"""Synthesize context from search results using LLM."""
results_text = "\n".join([
f"Source: {r.get('title', 'Unknown')}\n{r.get('content', '')}"
for r in search_results
])
prompt = f"""Synthesize the following search results into a coherent context:
{results_text}
Provide a concise summary of the key facts."""
messages = [{"role": "user", "content": prompt}]
return cls.send_messages(messages)
@classmethod
def plan_execution(cls, input_data: Data) -> List[Dict]:
"""Not used with custom eval() method."""
return []

Pros:
- ✅ Full control over execution flow
- ✅ Can compose with existing Dingo evaluators
- ✅ Explicit error handling at each step
- ✅ Easy to debug (no framework magic)
- ✅ Can implement complex conditional logic
Cons:
- ❌ More code to write and maintain
- ❌ Manual tool orchestration required
- ❌ Need to handle retries and errors manually
- ❌ More imperative, less declarative
Philosophy: Use LangChain's ReAct pattern for autonomous reasoning, override eval() and aggregate_results() for context tracking and artifact saving.
- Article-level comprehensive verification (many claims)
- Need intermediate artifacts (claims list, per-claim details, structured report)
- Want dual-layer output: human-readable text + structured data
- Benefit from thread-safe concurrent evaluation
- Set `use_agent_executor = True` (same as Pattern 1)
- Override `eval()` with a two-phase async architecture:
  - Save article content to the output directory
  - Call `asyncio.run(cls._async_eval(input_data, ...))` (bypasses `_eval_with_langchain_agent`)
  - Phase 1: Direct `ClaimsExtractor.execute()` call (no agent overhead)
  - Phase 2: Per-claim verification via `asyncio.gather()` + `Semaphore(max_concurrent_claims)`
- Each claim gets its own independent LangChain mini-agent: `_async_verify_single_claim()` invokes `AgentWrapper.async_invoke_and_format()`
- Results parsed by `_parse_claim_json_robust()` (3-tier robust parser)
- Aggregation via `_aggregate_parallel_results()` and `_recalculate_summary()`
- Save artifacts (claims_extracted.jsonl, claims_verification.jsonl, report.json)
- Return EvalDetail with a dual-layer reason: `[text_summary, report_dict]`
import asyncio
import os
import threading
import time
import uuid
from typing import Optional
class ArticleFactChecker(BaseAgent):
_thread_local = threading.local()
_claims_extractor_lock = threading.Lock() # Thread-safe config mutation
@classmethod
def eval(cls, input_data: Data) -> EvalDetail:
start_time = time.time()
output_dir = cls._get_output_dir()
if output_dir and input_data.content:
cls._save_article_content(output_dir, input_data.content)
try:
return asyncio.run(cls._async_eval(input_data, start_time, output_dir))
except RuntimeError:
# Fallback for already-running event loop (e.g., Jupyter)
loop = asyncio.new_event_loop()
return loop.run_until_complete(cls._async_eval(input_data, start_time, output_dir))
@classmethod
async def _async_eval(cls, input_data, start_time, output_dir) -> EvalDetail:
claims = await cls._async_extract_claims(input_data)
semaphore = asyncio.Semaphore(cls._get_max_concurrent_claims())
tasks = [cls._async_verify_single_claim(c, semaphore, ...) for c in claims]
results = await asyncio.gather(*tasks, return_exceptions=True)
return cls._build_eval_detail(results, start_time, output_dir, input_data)

`_get_output_dir()` uses a three-priority chain (highest to lowest):
- Explicit path: `agent_config.output_path` is set → use it (backward-compatible)
- Opt-out: `agent_config.save_artifacts=false` → return `None`, skip saving
- Auto-generate (default behaviour): `outputs/article_factcheck_<timestamp>_<uuid>/`; override the base directory with `agent_config.base_output_path`
@classmethod
def _get_output_dir(cls) -> Optional[str]:
"""
Get output directory for artifact files (three-priority chain).
Returns output dir path (created if needed), or None if saving disabled.
"""
params = cls.dynamic_config.parameters or {}
agent_cfg = params.get('agent_config') or {}
explicit_path = agent_cfg.get('output_path')
if explicit_path:
os.makedirs(explicit_path, exist_ok=True)
return explicit_path
if agent_cfg.get('save_artifacts') is False:
return None # Opted out of artifact saving
base_output = agent_cfg.get('base_output_path') or 'outputs'
create_time = time.strftime("%Y%m%d_%H%M%S", time.localtime())
auto_path = os.path.join(base_output, f"article_factcheck_{create_time}_{uuid.uuid4().hex[:6]}")
os.makedirs(auto_path, exist_ok=True)
return auto_path

# reason[0]: Human-readable text summary (str)
# reason[1]: Structured report dict (JSON-serializable, optional)
result.reason = [text_summary]
if report:
result.reason.append(report)  # Dict, not str

This ensures the Dingo standard output contains both readable summaries and full structured data.
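A downstream consumer can recover each layer by checking element types. A minimal sketch, using a hand-built reason list shaped as described above (the values are made up for illustration):

```python
# Sketch: consuming the dual-layer reason field described above.
reason = [
    "3/5 claims verified, 1 failed",      # reason[0]: human-readable summary
    {"total_claims": 5, "verified": 3},   # reason[1]: structured report (optional)
]

text_summary = reason[0]
# The structured report, if present, is the first dict element
report = next((r for r in reason if isinstance(r, dict)), None)

print(text_summary)
print(report["total_claims"] if report else "no structured report")
```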
Full implementation: dingo/model/llm/agent/agent_article_fact_checker.py
Tests: test/scripts/model/llm/agent/test_article_fact_checker.py (88 tests),
test/scripts/model/llm/agent/test_async_article_fact_checker.py (30 tests)
Guide: docs/article_fact_checking_guide.md
Start
|
+- Do you need intermediate artifact saving (claims, reports)?
| +- Yes -> Use Agent-First + Context (ArticleFactChecker style)
| +- No -> Continue
|
+- Do you need to compose with existing Dingo evaluators?
| +- Yes -> Use Custom Pattern (AgentHallucination style)
| +- No -> Continue
|
+- Is your workflow highly domain-specific?
| +- Yes -> Use Custom Pattern
| +- No -> Continue
|
+- Do you prefer explicit control over every step?
| +- Yes -> Use Custom Pattern
| +- No -> Continue
|
+- Default -> Use LangChain Pattern (AgentFactCheck style)
Simpler, less code, battle-tested
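Purely as an illustration, the decision tree above can be encoded as a small helper function (the flag names are made up for this sketch):

```python
# Illustrative encoding of the pattern-selection decision tree above.
def choose_pattern(needs_artifacts: bool, composes_evaluators: bool,
                   domain_specific: bool, wants_explicit_control: bool) -> str:
    if needs_artifacts:
        return "Agent-First + Context (ArticleFactChecker style)"
    if composes_evaluators or domain_specific or wants_explicit_control:
        return "Custom Pattern (AgentHallucination style)"
    # Default: simpler, less code, battle-tested
    return "LangChain Pattern (AgentFactCheck style)"

print(choose_pattern(False, False, False, False))
```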
Yes! You can use both patterns in the same project:
{
"evaluator": [{
"fields": {"content": "content"},
"evals": [
{"name": "AgentFactCheck"}, // LangChain-based
{"name": "AgentHallucination"} // Custom workflow
]
}]
}

Users don't need to know which pattern you used - both share the same configuration interface and are transparent at the user level.
- Set `use_agent_executor = True`
- Move workflow logic from `eval()` to `_get_system_prompt()`
- Implement `aggregate_results()` to parse agent output
- Remove the custom `eval()` implementation
- Remove the `use_agent_executor` flag (or set it to False)
- Implement a custom `eval()` method with workflow logic
- Manually call `execute_tool()` and `send_messages()`
- Keep `plan_execution()` returning an empty list
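As a rough sketch, the two migration directions amount to swapping which hooks you implement. The skeletons below use placeholder bodies and are not working evaluators:

```python
# Hypothetical minimal skeletons contrasting the two patterns.
# Bodies are placeholders, not real Dingo evaluators.

class LangChainStyleAgent:
    use_agent_executor = True            # framework-driven path

    @classmethod
    def _get_system_prompt(cls, input_data):
        return "task instructions"        # workflow logic lives in the prompt

    @classmethod
    def aggregate_results(cls, input_data, results):
        return {"parsed": results}        # parse agent output yourself

class CustomWorkflowAgent:
    use_agent_executor = False           # explicit/custom path

    @classmethod
    def eval(cls, input_data):
        # explicit orchestration: tool calls + LLM calls + composition
        return {"status": "done"}

print(LangChainStyleAgent.use_agent_executor, CustomWorkflowAgent.use_agent_executor)
```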
Create a Pydantic model for type-safe configuration:
from pydantic import BaseModel, Field
from typing import Optional
class MyToolConfig(BaseModel):
"""Configuration for MyTool"""
api_key: Optional[str] = None
max_results: int = Field(default=10, ge=1, le=100)
timeout: int = Field(default=30, ge=1)

from typing import Dict, Any
from dingo.model.llm.agent.tools.base_tool import BaseTool
from dingo.model.llm.agent.tools.tool_registry import tool_register
@tool_register
class MyTool(BaseTool):
"""
Brief description of what your tool does.
This tool provides... [detailed description]
Configuration:
api_key: API key for the service
max_results: Maximum number of results
timeout: Request timeout in seconds
"""
name = "my_tool" # Unique tool identifier
description = "Brief one-line description for agents"
config: MyToolConfig = MyToolConfig() # Default config
@classmethod
def execute(cls, **kwargs) -> Dict[str, Any]:
"""
Execute the tool with given parameters.
Args:
**kwargs: Tool-specific parameters
Returns:
Dict with:
- success: bool indicating if tool succeeded
- result: Tool output (format depends on tool)
- error: Error message if success=False
"""
try:
# Validate inputs
if not kwargs.get('query'):
return {
'success': False,
'error': 'Query parameter is required'
}
# Access configuration
api_key = cls.config.api_key
max_results = cls.config.max_results
# Execute tool logic
result = cls._perform_operation(kwargs['query'], api_key, max_results)
return {
'success': True,
'result': result,
'metadata': {
'query': kwargs['query'],
'timestamp': '...'
}
}
except Exception as e:
return {
'success': False,
'error': str(e),
'error_type': type(e).__name__
}
@classmethod
def _perform_operation(cls, query: str, api_key: str, max_results: int):
"""Private helper method for core logic"""
# Implementation details...
pass

- Error Handling: Always return `{'success': False, 'error': ...}` rather than raising exceptions
- Validation: Validate inputs early and return clear error messages
- Configuration: Use Pydantic models with sensible defaults and validation
- Documentation: Include docstrings explaining parameters and return format
- Testing: Write comprehensive unit tests (see examples)
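The "never raise, always return a dict" contract above is easy to cover with unit tests. A sketch using a toy tool in place of a real `BaseTool` subclass (EchoTool is hypothetical):

```python
# Sketch: testing the success/error dict contract of a tool.
# EchoTool is a toy stand-in, not a real Dingo tool.
class EchoTool:
    @classmethod
    def execute(cls, **kwargs):
        # Validate early, report failure via the dict, never raise
        if not kwargs.get("query"):
            return {"success": False, "error": "Query parameter is required"}
        return {"success": True, "result": kwargs["query"].upper()}

# Contract checks: both paths return a dict with a 'success' flag
ok = EchoTool.execute(query="hello")
bad = EchoTool.execute()

print(ok, bad)
```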
from typing import List, Dict, Any
from dingo.io import Data
from dingo.io.output.eval_detail import EvalDetail, QualityLabel
from dingo.model import Model
from dingo.model.llm.agent.base_agent import BaseAgent
from dingo.utils import log
@Model.llm_register("MyAgent")
class MyAgent(BaseAgent):
"""
Brief description of your agent's purpose.
This agent evaluates... [detailed description]
Features:
- Feature 1
- Feature 2
- Feature 3
Configuration Example:
{
"name": "MyAgent",
"config": {
"key": "openai-api-key",
"api_url": "https://api.openai.com/v1",
"model": "gpt-4",
"parameters": {
"agent_config": {
"max_iterations": 3,
"tools": {
"my_tool": {
"api_key": "tool-api-key",
"max_results": 5
}
}
}
}
}
}
"""
# Metadata for documentation
_metric_info = {
"category": "Your Category",
"metric_name": "MyAgent",
"description": "Brief description",
"features": [
"Feature 1",
"Feature 2"
]
}
# Tools this agent can use
available_tools = ["my_tool", "another_tool"]
# Maximum reasoning iterations
max_iterations = 5
# Optional: Evaluation threshold
threshold = 0.5
@classmethod
def eval(cls, input_data: Data) -> EvalDetail:
"""
Main evaluation method.
Args:
input_data: Data object with content and optional fields
Returns:
EvalDetail with evaluation results
"""
try:
# Step 1: Initialize
cls.create_client()
# Step 2: Execute agent logic
result = cls._execute_workflow(input_data)
# Step 3: Return evaluation
return result
except Exception as e:
log.error(f"{cls.__name__} failed: {e}")
result = EvalDetail(metric=cls.__name__)
result.status = True # Error condition
result.label = [f"{QualityLabel.QUALITY_BAD_PREFIX}AGENT_ERROR"]
result.reason = [f"Agent workflow failed: {str(e)}"]
return result
@classmethod
def _execute_workflow(cls, input_data: Data) -> EvalDetail:
"""
Core workflow implementation.
This is where you implement your agent's reasoning logic.
"""
# Example workflow:
# 1. Analyze input
analysis = cls._analyze_input(input_data)
# 2. Use tools if needed
if analysis['needs_tool']:
tool_result = cls.execute_tool('my_tool', query=analysis['query'])
if not tool_result['success']:
# Handle tool failure
result = EvalDetail(metric=cls.__name__)
result.status = True
result.label = [f"{QualityLabel.QUALITY_BAD_PREFIX}TOOL_FAILED"]
result.reason = [f"Tool execution failed: {tool_result['error']}"]
return result
# 3. Make final decision using LLM
final_decision = cls._make_decision(input_data, tool_result)
# 4. Format result
result = EvalDetail(metric=cls.__name__)
result.status = final_decision['is_bad']
result.label = final_decision['labels']
result.reason = final_decision['reasons']
return result
@classmethod
def _analyze_input(cls, input_data: Data) -> Dict[str, Any]:
"""Analyze input to determine next steps"""
# Use LLM to analyze
prompt = f"Analyze this content: {input_data.content}"
messages = [{"role": "user", "content": prompt}]
response = cls.send_messages(messages)
# Parse response
return {'needs_tool': True, 'query': '...'}
@classmethod
def _make_decision(cls, input_data: Data, tool_result: Dict) -> Dict[str, Any]:
"""Make final evaluation decision"""
# Combine all information and decide
return {
'is_bad': False,
'labels': [QualityLabel.QUALITY_GOOD],
'reasons': ["Evaluation passed"]
}
@classmethod
def plan_execution(cls, input_data: Data) -> List[Dict[str, Any]]:
"""
Optional: Define execution plan for complex workflows.
Not required if you implement eval() directly.
"""
return []
@classmethod
def aggregate_results(cls, input_data: Data, results: List[Any]) -> EvalDetail:
"""
Optional: Aggregate results from plan_execution.
Not required if you implement eval() directly.
"""
return EvalDetail(metric=cls.__name__)

@classmethod
def eval(cls, input_data: Data) -> EvalDetail:
# Check preconditions
if cls._has_required_data(input_data):
# Direct path
return cls._simple_evaluation(input_data)
else:
# Agent workflow with tools
return cls._agent_workflow(input_data)

@classmethod
def eval(cls, input_data: Data) -> EvalDetail:
steps = []
for i in range(cls.max_iterations):
# Analyze current state
analysis = cls._analyze_state(input_data, steps)
# Decide next action
action = cls._decide_action(analysis)
# Execute action (may call tools)
result = cls._execute_action(action)
steps.append(result)
# Check if done
if result['is_final']:
break
return cls._synthesize_result(steps)

@classmethod
def eval(cls, input_data: Data) -> EvalDetail:
# Use existing evaluator when appropriate
if cls._can_use_existing(input_data):
from dingo.model.llm.existing_model import ExistingModel
result = ExistingModel.eval(input_data)
# Add metadata
result.reason.append("Delegated to ExistingModel")
return result
# Otherwise use agent workflow
return cls._agent_workflow(input_data)

{
"evaluator": [{
"fields": {
"content": "response",
"prompt": "question",
"context": "contexts"
},
"evals": [{
"name": "MyAgent",
"config": {
"key": "openai-api-key",
"api_url": "https://api.openai.com/v1",
"model": "gpt-4-turbo",
"parameters": {
"temperature": 0.1,
"agent_config": {
"max_iterations": 3,
"tools": {
"my_tool": {
"api_key": "my-tool-api-key",
"max_results": 10,
"timeout": 30
},
"another_tool": {
"config_key": "value"
}
}
}
}
}
}]
}]
}

# In your agent class
@classmethod
def some_method(cls):
# Access LLM configuration
model = cls.dynamic_config.model # "gpt-4-turbo"
temperature = cls.dynamic_config.parameters.get('temperature', 0)
# Access agent-specific configuration
agent_config = cls.dynamic_config.parameters.get('agent_config', {})
max_iterations = agent_config.get('max_iterations', 5)
# Get tool configuration
tool_config = cls.get_tool_config('my_tool')
# Returns: {"api_key": "...", "max_results": 10, "timeout": 30}

# Configuration is injected automatically via config attribute
@classmethod
def execute(cls, **kwargs):
api_key = cls.config.api_key # From tool's config model
max_results = cls.config.max_results
# Use configuration...

Dingo supports two execution paths for agents:
- Legacy Path (default): Manual loop with `plan_execution()` and `aggregate_results()`
- LangChain Path: Uses LangChain 1.0's `create_agent` (enable with `use_agent_executor = True`)
In LangChain 1.0, the max_iterations parameter is automatically converted to recursion_limit at runtime:
class MyAgent(BaseAgent):
use_agent_executor = True # Enable LangChain path
max_iterations = 10 # Converted to recursion_limit=10
_metric_info = {"metric_name": "MyAgent", "description": "..."}

Configuration in JSON:
{
"name": "MyAgent",
"config": {
"parameters": {
"agent_config": {
"max_iterations": 10
}
}
}
}

How it works:
- `max_iterations` in config → passed as `recursion_limit` to LangChain
- Default: 25 iterations (LangChain default)
- Range: 1-100 (adjust based on task complexity)
Note: LangChain 1.0 uses "recursion_limit" internally, but Dingo maintains the max_iterations terminology for consistency across both execution paths.
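The conversion itself is simple to express. A hedged sketch of how such a mapping might look (the function name and clamping helper are assumptions for illustration, not Dingo's actual code):

```python
# Sketch: translating Dingo's max_iterations setting into LangChain's
# recursion_limit config key. The clamp mirrors the documented 1-100 range.
def build_invoke_config(agent_config: dict) -> dict:
    max_iterations = agent_config.get("max_iterations", 25)  # LangChain default
    max_iterations = max(1, min(100, max_iterations))        # documented range 1-100
    return {"recursion_limit": max_iterations}

print(build_invoke_config({"max_iterations": 10}))
print(build_invoke_config({}))
```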
When using LangChain agents (use_agent_executor = True), you can customize how input data is formatted before being sent to the agent. This is essential for agents that need to work with structured data like prompt, content, and context together.
By default, BaseAgent passes only input_data.content to LangChain agents:
# Default implementation in BaseAgent
@classmethod
def _format_agent_input(cls, input_data: Data) -> str:
"""Format input data into text for LangChain agent."""
return input_data.content

To include additional fields (prompt, context, etc.), override `_format_agent_input` in your agent:
from dingo.model.llm.agent.base_agent import BaseAgent
from dingo.io import Data
class MyCustomAgent(BaseAgent):
use_agent_executor = True
available_tools = ["tavily_search"]
@classmethod
def _format_agent_input(cls, input_data: Data) -> str:
"""Format prompt + content + context for agent."""
parts = []
# Include prompt if available
if hasattr(input_data, 'prompt') and input_data.prompt:
parts.append(f"**Question:**\n{input_data.prompt}")
# Always include content
parts.append(f"**Response to Evaluate:**\n{input_data.content}")
# Include context if available
if hasattr(input_data, 'context') and input_data.context:
if isinstance(input_data.context, list):
context_str = "\n".join(f"- {c}" for c in input_data.context)
else:
context_str = str(input_data.context)
parts.append(f"**Context:**\n{context_str}")
else:
parts.append("**Context:** None provided")
return "\n\n".join(parts)-
Safe Attribute Access: Use
hasattr()and check for truthinessif hasattr(input_data, 'prompt') and input_data.prompt: # Safe to use input_data.prompt
-
Clear Structure: Use markdown-style headers for readability
parts.append(f"**Section Name:**\n{content}")
-
Handle Multiple Types: Context might be string or list
if isinstance(input_data.context, list): context_str = "\n".join(f"- {c}" for c in input_data.context) else: context_str = str(input_data.context)
-
Provide Guidance: Tell the agent what to do when data is missing
parts.append("**Context:** None provided - use web search to verify")
AgentFactCheck demonstrates a production-ready implementation using _format_agent_input with structured output parsing following LangChain 2025 best practices.
- Autonomous Search Control: Agent decides when to use web search based on context availability
- Structured Output: Uses explicit format instructions for reliable parsing
- Robust Error Handling: Multi-layer fallback for parsing agent responses
- Context-Aware Prompts: System prompt adapts based on input data
- Enhanced Evidence Citation: Extracts and displays source URLs for verification (v1.1)
from typing import Any, Dict, List
import re
from dingo.io import Data
from dingo.io.input.required_field import RequiredField
from dingo.io.output.eval_detail import EvalDetail, QualityLabel
from dingo.model import Model
from dingo.model.llm.agent.base_agent import BaseAgent
@Model.llm_register("AgentFactCheck")
class AgentFactCheck(BaseAgent):
"""
LangChain-based fact-checking agent with autonomous search control.
- With context: Agent MAY use web search for additional verification
- Without context: Agent MUST use web search to verify facts
"""
use_agent_executor = True # Enable LangChain agent
available_tools = ["tavily_search"]
max_iterations = 5
_required_fields = [RequiredField.PROMPT, RequiredField.CONTENT]
# Note: CONTEXT is optional - agent adapts
@classmethod
def _format_agent_input(cls, input_data: Data) -> str:
"""Format prompt + content + context for agent."""
parts = []
if hasattr(input_data, 'prompt') and input_data.prompt:
parts.append(f"**Question:**\n{input_data.prompt}")
parts.append(f"**Response to Evaluate:**\n{input_data.content}")
if hasattr(input_data, 'context') and input_data.context:
if isinstance(input_data.context, list):
context_str = "\n".join(f"- {c}" for c in input_data.context)
else:
context_str = str(input_data.context)
parts.append(f"**Context:**\n{context_str}")
else:
parts.append("**Context:** None provided - use web search to verify")
return "\n\n".join(parts)
@classmethod
def _get_system_prompt(cls, input_data: Data) -> str:
"""System prompt adapts based on context availability."""
has_context = hasattr(input_data, 'context') and input_data.context
base_instructions = """You are a fact-checking agent with web search capabilities.
Your task:
1. Analyze the Question and Response provided"""
if has_context:
context_instruction = """
2. Context is provided - evaluate the Response against it
3. You MAY use web search for additional verification if needed
4. Make your own decision about whether web search is necessary"""
else:
context_instruction = """
2. NO Context is available - you MUST use web search to verify facts
3. Search for reliable sources to fact-check the response"""
# Following LangChain best practices: explicit output format
output_format = """
**IMPORTANT: You must return your analysis in exactly this format:**
HALLUCINATION_DETECTED: [YES or NO]
EXPLANATION: [Your detailed analysis]
EVIDENCE: [Supporting sources or facts]
SOURCES: [List of URLs consulted, one per line with - prefix]
Example:
HALLUCINATION_DETECTED: YES
EXPLANATION: The response claims incorrect information.
EVIDENCE: According to reliable sources, this is false.
SOURCES:
- https://example.com/source1
- https://example.com/source2
Be precise and clear. Start your response with "HALLUCINATION_DETECTED:" followed by YES or NO.
Always include SOURCES with specific URLs when you perform web searches."""
        return base_instructions + context_instruction + output_format

    @classmethod
    def aggregate_results(cls, input_data: Data, results: List[Any]) -> EvalDetail:
        """Parse agent output to determine hallucination status."""
        if not results:
            return cls._create_error_result("No results from agent")

        agent_result = results[0]
        if not agent_result.get('success', True):
            error_msg = agent_result.get('error', 'Unknown error')
            return cls._create_error_result(error_msg)

        output = agent_result.get('output', '')
        if not output or not output.strip():
            return cls._create_error_result("Agent returned empty output")

        # Parse structured output
        has_hallucination = cls._detect_hallucination_from_output(output)

        result = EvalDetail(metric=cls.__name__)
        result.status = has_hallucination
        result.label = [
            f"{QualityLabel.QUALITY_BAD_PREFIX}HALLUCINATION"
            if has_hallucination
            else QualityLabel.QUALITY_GOOD
        ]
        result.reason = [
            f"Agent Analysis:\n{output}",
            f"🔍 Web searches: {len(agent_result.get('tool_calls', []))}",
            f"🤖 Reasoning steps: {agent_result.get('reasoning_steps', 0)}"
        ]
        return result

    @classmethod
    def _detect_hallucination_from_output(cls, output: str) -> bool:
        """
        Parse agent output using structured format.

        Strategy:
        1. Regex match for "HALLUCINATION_DETECTED: YES/NO"
        2. Check response start for marker
        3. Fallback to keyword detection
        """
        if not output:
            return False

        # Primary: Regex match
        match = re.search(r'HALLUCINATION_DETECTED:\s*(YES|NO)', output, re.IGNORECASE)
        if match:
            return match.group(1).upper() == 'YES'

        # Fallback: Keyword detection (check negatives first!)
        output_lower = output.lower()
        if any(kw in output_lower for kw in ['no hallucination detected', 'factually accurate']):
            return False
        if any(kw in output_lower for kw in ['hallucination detected', 'factual error']):
            return True
        return False  # Default to no hallucination

    @classmethod
    def _create_error_result(cls, error_message: str) -> EvalDetail:
        """Create error result."""
        result = EvalDetail(metric=cls.__name__)
        result.status = True
        result.label = [f"{QualityLabel.QUALITY_BAD_PREFIX}AGENT_ERROR"]
        result.reason = [f"Agent evaluation failed: {error_message}"]
        return result

    @classmethod
    def plan_execution(cls, input_data: Data) -> List[Dict[str, Any]]:
        """Not used with LangChain agent (agent handles planning)."""
        return []

- Structured Output Format: Explicitly defines the expected format in the system prompt
- Regex Parsing: Reliable primary parsing method
- Fallback Layers: Keyword detection as safety net
- Error Handling: Returns error status rather than crashing
- Context Awareness: Adapts behavior based on available data
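The layered parsing strategy can be exercised in isolation. The following is a minimal sketch of the same logic as a module-level function (not the actual classmethod from `AgentHallucination`), showing why the negative keywords must be checked first:

```python
import re

def detect_hallucination(output: str) -> bool:
    """Mirror of the layered strategy: regex marker first, keywords as fallback."""
    if not output:
        return False
    # Primary: structured marker emitted by the system prompt
    match = re.search(r'HALLUCINATION_DETECTED:\s*(YES|NO)', output, re.IGNORECASE)
    if match:
        return match.group(1).upper() == 'YES'
    # Fallback: negatives first, since "hallucination detected" is a
    # substring of "no hallucination detected"
    lower = output.lower()
    if any(kw in lower for kw in ('no hallucination detected', 'factually accurate')):
        return False
    if any(kw in lower for kw in ('hallucination detected', 'factual error')):
        return True
    return False

print(detect_hallucination("HALLUCINATION_DETECTED: YES\nEXPLANATION: ..."))  # → True
print(detect_hallucination("No hallucination detected in this text."))        # → False
```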
{
  "name": "AgentFactCheck",
  "config": {
    "key": "your-openai-api-key",
    "api_url": "https://api.openai.com/v1",
    "model": "gpt-4-turbo",
    "parameters": {
      "temperature": 0.1,
      "max_tokens": 16384,
      "agent_config": {
        "max_iterations": 5,
        "tools": {
          "tavily_search": {
            "api_key": "your-tavily-api-key",
            "max_results": 5,
            "search_depth": "advanced"
          }
        }
      }
    }
  }
}

from dingo.io import Data
from dingo.model.llm.agent.agent_fact_check import AgentFactCheck

# Test with context
data_with_context = Data(
    prompt="What is the capital of France?",
    content="The capital is Berlin",
    context="France's capital is Paris"
)

# Test without context
data_without_context = Data(
    prompt="What year was Python created?",
    content="Python was created in 1995"
)

# Agent will adapt behavior automatically
result1 = AgentFactCheck.eval(data_with_context)
result2 = AgentFactCheck.eval(data_without_context)

Full implementation: `dingo/model/llm/agent/agent_fact_check.py`
Tests: `test/scripts/model/llm/agent/test_agent_fact_check.py` (35 tests)
AgentFactCheck includes a feature to extract and display source URLs from the agent's output, making fact-checking results more transparent and verifiable.
How it works:
- System Prompt: Agent is instructed to include a SOURCES section with URLs
- Extraction: `_extract_sources_from_output()` parses the SOURCES section
- Display: Sources are appended to the result's reason field
Implementation:
@classmethod
def _extract_sources_from_output(cls, output: str) -> List[str]:
    """Extract source URLs from agent output."""
    sources = []
    in_sources_section = False
    for line in output.split('\n'):
        line = line.strip()
        if line.upper().startswith('SOURCES:'):
            in_sources_section = True
            continue
        if in_sources_section:
            # Check if we've reached a new section
            if line and ':' in line:
                section_header = line.split(':')[0].upper()
                if section_header in ['EXPLANATION', 'EVIDENCE', 'HALLUCINATION_DETECTED']:
                    break
            # Extract URL (with - or • prefix, or direct URL)
            if line.startswith(('- ', '• ', 'http://', 'https://')):
                url = line.lstrip('- •').strip()
                if url:
                    sources.append(url)
    return sources

Usage in `aggregate_results`:
# Extract sources from output
sources = cls._extract_sources_from_output(output)

# Add sources section to result
result.reason.append("")
if sources:
    result.reason.append("📚 Sources consulted:")
    for source in sources:
        result.reason.append(f"  • {source}")
else:
    result.reason.append("📚 Sources: None explicitly cited")

Benefits:
- ✅ Increases transparency of agent's fact-checking process
- ✅ Allows users to verify the agent's judgment independently
- ✅ Provides attribution for evidence used in evaluation
- ✅ Meets academic and professional citation standards
Example Output:
Agent Analysis:
HALLUCINATION_DETECTED: YES
EXPLANATION: The response claims the Eiffel Tower is 450 meters tall, but it is actually 330 meters.
EVIDENCE: According to the official Eiffel Tower website, the height is 330 meters including antennas.
SOURCES:
- https://www.toureiffel.paris/en/the-monument
- https://en.wikipedia.org/wiki/Eiffel_Tower
🔍 Web searches performed: 2
🤖 Reasoning steps: 4
⚙️ Agent autonomously decided: Use web search
📚 Sources consulted:
• https://www.toureiffel.paris/en/the-monument
• https://en.wikipedia.org/wiki/Eiffel_Tower
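The extraction step above can be tried standalone against output shaped like this example. The following is a module-level sketch of the same strategy (not the actual `AgentFactCheck` classmethod), with a shortened sample output:

```python
from typing import List

def extract_sources(output: str) -> List[str]:
    """Collect URLs listed under a SOURCES: header, stopping at the next section."""
    sources = []
    in_sources = False
    for line in output.split('\n'):
        line = line.strip()
        if line.upper().startswith('SOURCES:'):
            in_sources = True
            continue
        if in_sources:
            # A new section header (EXPLANATION:, EVIDENCE:, ...) ends the list;
            # "https:" in a URL does not match any known header, so URLs survive
            if line and ':' in line:
                header = line.split(':')[0].upper()
                if header in ('EXPLANATION', 'EVIDENCE', 'HALLUCINATION_DETECTED'):
                    break
            if line.startswith(('- ', '• ', 'http://', 'https://')):
                url = line.lstrip('- •').strip()
                if url:
                    sources.append(url)
    return sources

sample = """HALLUCINATION_DETECTED: YES
EXPLANATION: Height is wrong.
SOURCES:
- https://www.toureiffel.paris/en/the-monument
- https://en.wikipedia.org/wiki/Eiffel_Tower
"""
print(extract_sources(sample))
# → ['https://www.toureiffel.paris/en/the-monument', 'https://en.wikipedia.org/wiki/Eiffel_Tower']
```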
import pytest
from unittest.mock import patch, MagicMock
from my_tool import MyTool, MyToolConfig

class TestMyTool:
    def setup_method(self):
        """Setup for each test"""
        MyTool.config = MyToolConfig(api_key="test_key")

    def test_successful_execution(self):
        """Test successful tool execution"""
        result = MyTool.execute(query="test query")
        assert result['success'] is True
        assert 'result' in result

    def test_missing_query(self):
        """Test error handling for missing query"""
        result = MyTool.execute()
        assert result['success'] is False
        assert 'Query parameter is required' in result['error']

    @patch('external_api.Client')
    def test_with_mocked_api(self, mock_client):
        """Test with mocked external API"""
        mock_response = {"data": "test"}
        mock_client_instance = MagicMock()
        mock_client_instance.search.return_value = mock_response
        mock_client.return_value = mock_client_instance
        result = MyTool.execute(query="test")
        assert result['success'] is True
        mock_client_instance.search.assert_called_once()

import pytest
from unittest.mock import patch
from dingo.io import Data
from my_agent import MyAgent
from dingo.config.input_args import EvaluatorLLMArgs
class TestMyAgent:
    def setup_method(self):
        """Setup for each test"""
        MyAgent.dynamic_config = EvaluatorLLMArgs(
            key="test_key",
            api_url="https://api.test.com",
            model="gpt-4"
        )

    def test_agent_registration(self):
        """Test that agent is properly registered"""
        from dingo.model import Model
        Model.load_model()
        assert "MyAgent" in Model.llm_name_map

    @patch.object(MyAgent, 'execute_tool')
    @patch.object(MyAgent, 'send_messages')
    def test_workflow_execution(self, mock_send, mock_tool):
        """Test complete agent workflow"""
        # Mock LLM responses
        mock_send.return_value = "Analysis result"
        # Mock tool responses
        mock_tool.return_value = {
            'success': True,
            'result': 'Tool output'
        }
        # Execute
        data = Data(content="Test content")
        result = MyAgent.eval(data)
        # Verify
        assert result.status is not None
        assert mock_send.called
        assert mock_tool.called

- Start Simple: Begin with a basic workflow, add complexity as needed
- Error Handling: Wrap workflow in try/except, return meaningful error messages
- Logging: Use `log.info()`, `log.warning()`, `log.error()` for debugging
- Delegation: Reuse existing evaluators when possible
- Documentation: Include comprehensive docstrings and configuration examples
- Metadata: Add `_metric_info` for documentation generation
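The error-handling practice above can be sketched as follows. `run_workflow` and the result shape are hypothetical stand-ins for illustration, not Dingo APIs:

```python
from typing import Any, Dict

def run_workflow(content: str) -> str:
    """Hypothetical multi-step agent workflow that may fail."""
    if not content:
        raise ValueError("empty content")
    return f"analysed {len(content)} chars"

def safe_eval(content: str) -> Dict[str, Any]:
    """Wrap the workflow so one bad record cannot crash the whole run."""
    try:
        analysis = run_workflow(content)
        return {"success": True, "result": analysis}
    except Exception as exc:
        # Return a meaningful error message instead of propagating the exception
        return {"success": False, "error": f"Agent evaluation failed: {exc}"}

print(safe_eval("hello"))  # → {'success': True, 'result': 'analysed 5 chars'}
print(safe_eval(""))       # → {'success': False, 'error': 'Agent evaluation failed: empty content'}
```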
- Single Responsibility: Each tool should do one thing well
- Configuration: Use Pydantic models with validation
- Return Format: Always return a dict with a `success` boolean
- Testing: Write unit tests covering success and error cases
- Limit Iterations: Set a reasonable `max_iterations` to prevent infinite loops
- Batch Operations: If calling a tool multiple times, consider batching
- Caching: Consider caching expensive operations
- Timeouts: Set appropriate timeouts for external API calls
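The caching suggestion can be as simple as memoizing a pure lookup with the standard library, so repeated tool calls with the same query skip the expensive work. `expensive_lookup` is a hypothetical stand-in for an external API call:

```python
from functools import lru_cache

CALLS = {"count": 0}  # track how often the real work actually runs

@lru_cache(maxsize=256)
def expensive_lookup(query: str) -> str:
    CALLS["count"] += 1
    return f"result for {query}"

expensive_lookup("eiffel tower height")
expensive_lookup("eiffel tower height")  # second call is served from the cache
print(CALLS["count"])  # → 1
```

Note that `lru_cache` only suits deterministic lookups; results that go stale (live web searches, news) need a cache with expiry instead.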
- API Keys: Never hardcode API keys, use configuration
- Input Validation: Validate all inputs before passing to external services
- Rate Limiting: Respect API rate limits in tools
- Error Information: Don't expose sensitive information in error messages
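A common way to honor the "never hardcode API keys" rule is to read keys from the environment and fail fast with a message that leaks nothing sensitive. The variable name `TAVILY_API_KEY` is assumed here for illustration:

```python
import os

def load_api_key(env_var: str = "TAVILY_API_KEY") -> str:
    """Read an API key from the environment, failing fast if unset."""
    key = os.environ.get(env_var)
    if not key:
        # Name the variable, but never echo partial keys or other secrets
        raise RuntimeError(f"{env_var} is not set; configure it before running the tool")
    return key

os.environ["TAVILY_API_KEY"] = "example-key"  # demo only; set this in your shell
print(load_api_key() == "example-key")  # → True
```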
- AgentHallucination: `dingo/model/llm/agent/agent_hallucination.py` - Production agent with web search
- AgentFactCheck: `examples/agent/agent_executor_example.py` - LangChain 1.0 agent example
- ArticleFactChecker: `dingo/model/llm/agent/agent_article_fact_checker.py` - Agent-First with context tracking and artifact saving
- ArticleFactChecker Example: `examples/agent/agent_article_fact_checking_example.py` - Full article fact-checking example
- TavilySearch Tool: `dingo/model/llm/agent/tools/tavily_search.py` - Web search tool implementation
- ClaimsExtractor Tool: `dingo/model/llm/agent/tools/claims_extractor.py` - LLM-based claims extraction tool
- ArxivSearch Tool: `dingo/model/llm/agent/tools/arxiv_search.py` - Academic paper search tool
Note: For complete implementation examples, refer to the files above. They demonstrate real-world patterns for agent and tool development.
from dingo.model.llm.agent.base_agent import BaseAgent
from dingo.model import Model
from dingo.io import Data
from dingo.io.output.eval_detail import EvalDetail

@Model.llm_register("FactChecker")
class FactChecker(BaseAgent):
    """Simple fact checker using web search"""
    available_tools = ["tavily_search"]
    max_iterations = 1

    @classmethod
    def eval(cls, input_data: Data) -> EvalDetail:
        cls.create_client()

        # Search for facts
        search_result = cls.execute_tool(
            'tavily_search',
            query=input_data.content
        )
        if not search_result['success']:
            return cls._create_error_result("Search failed")

        # Verify with LLM
        prompt = f"""
Content: {input_data.content}
Search Results: {search_result['answer']}
Are there any factual errors? Respond with YES or NO.
"""
        response = cls.send_messages([
            {"role": "user", "content": prompt}
        ])

        result = EvalDetail(metric="FactChecker")
        result.status = "YES" in response.upper()
        result.reason = [f"Verification: {response}"]
        return result

from dingo.config import InputArgs
from dingo.exec import Executor
config = {
    "input_path": "data.jsonl",
    "output_path": "outputs/",
    "dataset": {"source": "local", "format": "jsonl"},
    "evaluator": [{
        "fields": {"content": "text"},
        "evals": [{
            "name": "FactChecker",
            "config": {
                "key": "openai-key",
                "api_url": "https://api.openai.com/v1",
                "model": "gpt-4",
                "parameters": {
                    "agent_config": {
                        "tools": {
                            "tavily_search": {"api_key": "tavily-key"}
                        }
                    }
                }
            }
        }]
    }]
}

input_args = InputArgs(**config)
executor = Executor.exec_map["local"](input_args)
summary = executor.execute()

Agent not found:
- Ensure the file is in the `dingo/model/llm/agent/` directory
- Check that the `@Model.llm_register("Name")` decorator is present
- Run `Model.load_model()` to trigger auto-discovery

Tool not found:
- Ensure the `@tool_register` decorator is present
- Check that the tool name matches the string in `available_tools`
- Verify the tool file is imported in `dingo/model/llm/agent/tools/__init__.py`

Configuration not working:
- Check that the JSON structure matches the expected format
- Verify the `parameters.agent_config.tools.{tool_name}` structure
- Use Pydantic validation to catch config errors early

Tests failing:
- Patch at the correct import path (where the object is used, not where it is defined)
- Mock external APIs to avoid network calls
- Check test isolation (use `setup_method` to reset state)
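The "patch where the object is used" rule is the most common source of failing agent tests. The sketch below builds two tiny in-memory modules (hypothetical names, no real package needed) to show why patching the defining module does not affect a consumer that already imported the name:

```python
import sys
import types
from unittest.mock import patch

# Fake library: external_api.Client
external_api = types.ModuleType("external_api")
external_api.Client = lambda: "real client"
sys.modules["external_api"] = external_api

# Fake consumer: my_agent does `from external_api import Client`,
# binding the name into its own namespace
my_agent = types.ModuleType("my_agent")
exec(
    "from external_api import Client\n"
    "def make_client():\n"
    "    return Client()\n",
    my_agent.__dict__,
)
sys.modules["my_agent"] = my_agent

# Correct: patch the name where it is *used*
with patch("my_agent.Client", return_value="mock client"):
    correct = my_agent.make_client()

# Wrong: patching where it is *defined* leaves my_agent's binding untouched
with patch("external_api.Client", return_value="mock client"):
    wrong = my_agent.make_client()

print(correct)  # → mock client
print(wrong)    # → real client
```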
- AgentHallucination Implementation
- ArticleFactChecker Implementation
- BaseAgent Source
- Tool Registry Source
- Tavily Search Example
- Claims Extractor
- ArxivSearch
- Example Usage
- Article Fact-Checking Example
- Article Fact-Checking Guide
When contributing new agents or tools:
- Follow existing code style (flake8, isort)
- Add comprehensive tests (aim for >80% coverage)
- Include docstrings and type hints
- Update this guide if adding new patterns
- Add examples in `examples/agent/`
- Update the metrics documentation in `docs/metrics.md`
For questions or suggestions, please open an issue on GitHub.