Agent-Based Evaluation Development Guide

Overview

This guide explains how to create custom agent-based evaluators and tools in Dingo. Agent-based evaluation enhances traditional rule and LLM evaluators by adding multi-step reasoning, tool usage, and adaptive context gathering.

Architecture Overview
Agent Implementation Patterns
Creating Custom Tools
Creating Custom Agents
Configuration
Testing
Best Practices
Examples

Architecture Overview

How Agents Fit in Dingo

Agents extend Dingo's evaluation capabilities:

Traditional Evaluation:
Data → Rule/LLM → EvalDetail

Agent-Based Evaluation:
Data → Agent → [Tool 1, Tool 2, ...] → LLM Reasoning → EvalDetail

Key Components:

BaseAgent: Abstract base class for all agents (extends BaseOpenAI)
Tool Registry: Manages available tools for agents
BaseTool: Abstract interface for tool implementations
Auto-Discovery: Agents registered via @Model.llm_register() decorator

Execution Model:

Agents run in ThreadPoolExecutor (same as LLMs) for I/O-bound operations
Tools are called synchronously within the agent's execution
Configuration injected via dynamic_config attribute

Agent Implementation Patterns

Dingo supports three complementary patterns for implementing agent-based evaluators. All patterns share the same configuration interface and are transparent to users, allowing you to choose the approach that best fits your needs.

Pattern Comparison

Aspect	LangChain-Based	Custom Workflow	Agent-First + Context
Control	Framework-driven	Developer-driven	Framework + override
Complexity	Simple (declarative)	Moderate (imperative)	Moderate (hybrid)
Flexibility	Limited to LangChain	Unlimited	LangChain + artifacts
Code Volume	Low (~100 lines)	Medium (~200 lines)	High (~500+ lines)
Best For	Multi-step reasoning	Workflow composition	Article-level verification
Example	AgentFactCheck	AgentHallucination	ArticleFactChecker

Pattern 1: LangChain-Based Agents (Framework-Driven)

Philosophy: Let the framework handle orchestration, you focus on the task.

When to Use

✅ Complex multi-step reasoning required The agent needs to make multiple decisions and tool calls adaptively

✅ Benefit from LangChain's battle-tested patterns Leverage proven agent orchestration and error handling

✅ Prefer declarative over imperative style Define what the agent should do, not how to do it step-by-step

✅ Want rapid prototyping Get a working agent with minimal code

When NOT to Use

❌ Need fine-grained control over every step You want to control exactly when and how tools are called

❌ Want to compose with existing Dingo evaluators You need to call other evaluators as part of the workflow

❌ Have domain-specific workflow requirements Your workflow doesn't fit the ReAct pattern well

Key Implementation Steps

Set use_agent_executor = True to enable LangChain path
Override _format_agent_input() to structure input for the agent
Override _get_system_prompt() to provide task-specific instructions
Implement aggregate_results() to parse agent output into EvalDetail
Return empty list in plan_execution() (not used with LangChain path)

Example: AgentFactCheck

from dingo.model import Model
from dingo.model.llm.agent.base_agent import BaseAgent
from dingo.io import Data
from dingo.io.output.eval_detail import EvalDetail
from typing import Any, List

@Model.llm_register("AgentFactCheck")
class AgentFactCheck(BaseAgent):
    """LangChain-based fact-checking agent."""

    use_agent_executor = True  # Enable LangChain agent mode
    available_tools = ["tavily_search"]
    max_iterations = 5

    @classmethod
    def _format_agent_input(cls, input_data: Data) -> str:
        """Structure input for the agent."""
        parts = []

        if hasattr(input_data, 'prompt') and input_data.prompt:
            parts.append(f"**Question:**\n{input_data.prompt}")

        parts.append(f"**Response to Evaluate:**\\n{input_data.content}")

        if hasattr(input_data, 'context') and input_data.context:
            parts.append(f"**Context:**\\n{input_data.context}")
        else:
            parts.append("**Context:** None - use web search to verify")

        return "\\n\\n".join(parts)

    @classmethod
    def _get_system_prompt(cls, input_data: Data) -> str:
        """Provide task-specific instructions."""
        has_context = hasattr(input_data, 'context') and input_data.context

        base = """You are a fact-checking agent with web search capabilities.

Your task:
1. Analyze the Question and Response provided"""

        context_instruction = (
            "\\n2. Context is provided - evaluate the Response against it"
            "\\n3. You MAY use web search for additional verification if needed"
            if has_context else
            "\\n2. NO Context is available - you MUST use web search to verify facts"
            "\\n3. Search for reliable sources to fact-check the response"
        )

        output_format = """

**Output Format:**
HALLUCINATION_DETECTED: [YES or NO]
EXPLANATION: [Your analysis]
EVIDENCE: [Supporting facts]
SOURCES: [URLs, one per line with - prefix]

Be precise. Start with "HALLUCINATION_DETECTED:" followed by YES or NO."""

        return base + context_instruction + output_format

    @classmethod
    def aggregate_results(cls, input_data: Data, results: List[Any]) -> EvalDetail:
        """Parse agent output into EvalDetail."""
        if not results:
            return cls._create_error_result("No results from agent")

        agent_result = results[0]
        output = agent_result.get('output', '')

        # Parse hallucination status
        has_hallucination = cls._detect_hallucination_from_output(output)

        # Build result
        result = EvalDetail(metric=cls.__name__)
        result.status = has_hallucination
        result.label = ["BAD:HALLUCINATION" if has_hallucination else "GOOD"]
        result.reason = [f"Agent Analysis:\\n{output}"]

        return result

    @classmethod
    def plan_execution(cls, input_data: Data) -> List[Dict]:
        """Not used with LangChain agent (agent handles planning)."""
        return []

Pros and Cons

Pros:

✅ Less code to write and maintain
✅ Framework handles tool orchestration automatically
✅ Automatic retry and error handling
✅ Battle-tested ReAct pattern from LangChain

Cons:

❌ Limited to LangChain's agent patterns
❌ Less control over execution flow
❌ Debugging can be harder (framework abstraction)
❌ Cannot compose with existing Dingo evaluators

Pattern 2: Custom Workflow Agents (Imperative)

Philosophy: Explicit control over every step, compose with existing evaluators.

When to Use

✅ Need fine-grained workflow control You want to control exactly what happens at each step

✅ Want to compose with existing Dingo evaluators Reuse evaluators like LLMHallucination within your workflow

✅ Prefer explicit over implicit behavior You want to see and control every tool call and LLM interaction

✅ Have domain-specific requirements Your workflow has unique steps that don't fit standard patterns

✅ Need conditional logic between steps Different paths based on intermediate results

When NOT to Use

❌ Want framework-managed multi-step reasoning You prefer the agent to figure out the steps autonomously

❌ Prefer minimal code You want a quick solution without manual orchestration

❌ Need rapid prototyping You don't want to write explicit workflow logic

❌ Complex reasoning benefits from ReAct Your task requires adaptive multi-step reasoning

Key Implementation Steps

Implement custom eval() method with explicit workflow logic
Manually call execute_tool() for each tool operation
Manually call send_messages() for LLM interactions
Optionally delegate to existing evaluators (e.g., LLMHallucination)
Return EvalDetail directly from eval()

Example: AgentHallucination

from dingo.model import Model
from dingo.model.llm.agent.base_agent import BaseAgent
from dingo.io import Data
from dingo.io.output.eval_detail import EvalDetail
from typing import List

@Model.llm_register("AgentHallucination")
class AgentHallucination(BaseAgent):
    """Custom workflow hallucination detector."""

    available_tools = ["tavily_search"]
    max_iterations = 3

    @classmethod
    def eval(cls, input_data: Data) -> EvalDetail:
        """Main evaluation method with custom workflow."""
        cls.create_client()  # Initialize LLM client

        # Step 1: Check if context is available
        has_context = cls._has_context(input_data)

        if has_context:
            # Path A: Use existing evaluator
            return cls._eval_with_context(input_data)
        else:
            # Path B: Custom workflow with web search
            return cls._eval_with_web_search(input_data)

    @classmethod
    def _eval_with_web_search(cls, input_data: Data) -> EvalDetail:
        """Execute custom workflow: extract claims → search → evaluate."""

        # Step 2: Extract factual claims (manual LLM call)
        claims = cls._extract_claims(input_data)

        if not claims:
            return cls._create_result(
                status=False,
                reason="No factual claims found to verify"
            )

        # Step 3: Search web for each claim (manual tool calls)
        search_results = []
        for claim in claims:
            result = cls.execute_tool('tavily_search', query=claim)
            if result.get('success'):
                search_results.append(result['result'])

        # Step 4: Synthesize context from search results
        context = cls._synthesize_context(search_results)

        # Step 5: Evaluate with synthesized context (delegate to evaluator)
        data_with_context = Data(
            content=input_data.content,
            context=context
        )
        return cls._eval_with_context(data_with_context)

    @classmethod
    def _extract_claims(cls, input_data: Data) -> List[str]:
        """Extract factual claims using LLM."""
        prompt = f"""Extract all factual claims from this text:
{input_data.content}

Return a JSON list of claims."""

        messages = [{"role": "user", "content": prompt}]
        response = cls.send_messages(messages)

        # Parse claims from response
        import json
        try:
            claims = json.loads(response)
            return claims if isinstance(claims, list) else []
        except json.JSONDecodeError:
            return []

    @classmethod
    def _synthesize_context(cls, search_results: List[Dict]) -> str:
        """Synthesize context from search results using LLM."""
        results_text = "\\n".join([
            f"Source: {r.get('title', 'Unknown')}\\n{r.get('content', '')}"
            for r in search_results
        ])

        prompt = f"""Synthesize the following search results into a coherent context:

{results_text}

Provide a concise summary of the key facts."""

        messages = [{"role": "user", "content": prompt}]
        return cls.send_messages(messages)

    @classmethod
    def plan_execution(cls, input_data: Data) -> List[Dict]:
        """Not used with custom eval() method."""
        return []

Pros and Cons

Pros:

✅ Full control over execution flow
✅ Can compose with existing Dingo evaluators
✅ Explicit error handling at each step
✅ Easy to debug (no framework magic)
✅ Can implement complex conditional logic

Cons:

❌ More code to write and maintain
❌ Manual tool orchestration required
❌ Need to handle retries and errors manually
❌ More imperative, less declarative

Pattern 3: Agent-First with Context Tracking (ArticleFactChecker)

Philosophy: Use LangChain's ReAct pattern for autonomous reasoning, override eval() and aggregate_results() for context tracking and artifact saving.

When to Use

Article-level comprehensive verification (many claims)
Need intermediate artifacts (claims list, per-claim details, structured report)
Want dual-layer output: human-readable text + structured data
Benefit from thread-safe concurrent evaluation

Key Implementation Steps

Set use_agent_executor = True (same as Pattern 1)
Override eval() with a two-phase async architecture:
- Save article content to output directory
- Call asyncio.run(cls._async_eval(input_data, ...)) (bypasses _eval_with_langchain_agent)
- Phase 1: Direct ClaimsExtractor.execute() call (no agent overhead)
- Phase 2: Per-claim verification via asyncio.gather() + Semaphore(max_concurrent_claims)
Each claim gets its own independent LangChain mini-agent:
- _async_verify_single_claim() invokes AgentWrapper.async_invoke_and_format()
- Results parsed by _parse_claim_json_robust() (3-tier robust parser)
Aggregation via _aggregate_parallel_results() and _recalculate_summary()
- Save artifacts (claims_extracted.jsonl, claims_verification.jsonl, report.json)
- Return EvalDetail with dual-layer reason: [text_summary, report_dict]

Async Parallel Execution Pattern

import asyncio
import threading

class ArticleFactChecker(BaseAgent):
    _thread_local = threading.local()
    _claims_extractor_lock = threading.Lock()  # Thread-safe config mutation

    @classmethod
    def eval(cls, input_data: Data) -> EvalDetail:
        start_time = time.time()
        output_dir = cls._get_output_dir()
        if output_dir and input_data.content:
            cls._save_article_content(output_dir, input_data.content)
        try:
            return asyncio.run(cls._async_eval(input_data, start_time, output_dir))
        except RuntimeError:
            # Fallback for already-running event loop (e.g., Jupyter)
            loop = asyncio.new_event_loop()
            return loop.run_until_complete(cls._async_eval(input_data, start_time, output_dir))

    @classmethod
    async def _async_eval(cls, input_data, start_time, output_dir) -> EvalDetail:
        claims = await cls._async_extract_claims(input_data)
        semaphore = asyncio.Semaphore(cls._get_max_concurrent_claims())
        tasks = [cls._async_verify_single_claim(c, semaphore, ...) for c in claims]
        results = await asyncio.gather(*tasks, return_exceptions=True)
        return cls._build_eval_detail(results, start_time, output_dir, input_data)

Output Path Access Pattern

_get_output_dir() uses a three-priority chain (highest to lowest):

Explicit path – agent_config.output_path is set → use it (backward-compatible)
Opt-out – agent_config.save_artifacts=false → return None, skip saving
Auto-generate – default behaviour: outputs/article_factcheck_<timestamp>_<uuid>/
- Override the base directory with agent_config.base_output_path

@classmethod
def _get_output_dir(cls) -> Optional[str]:
    """
    Get output directory for artifact files (three-priority chain).
    Returns output dir path (created if needed), or None if saving disabled.
    """
    params = cls.dynamic_config.parameters or {}
    agent_cfg = params.get('agent_config') or {}

    explicit_path = agent_cfg.get('output_path')
    if explicit_path:
        os.makedirs(explicit_path, exist_ok=True)
        return explicit_path

    if agent_cfg.get('save_artifacts') is False:
        return None  # Opted out of artifact saving

    base_output = agent_cfg.get('base_output_path') or 'outputs'
    create_time = time.strftime("%Y%m%d_%H%M%S", time.localtime())
    auto_path = os.path.join(base_output, f"article_factcheck_{create_time}_{uuid.uuid4().hex[:6]}")
    os.makedirs(auto_path, exist_ok=True)
    return auto_path

Dual-Layer EvalDetail.reason

# reason[0]: Human-readable text summary (str)
# reason[1]: Structured report dict (JSON-serializable, optional)
result.reason = [text_summary]
if report:
    result.reason.append(report)  # Dict, not str

This ensures the Dingo standard output contains both readable summaries and full structured data.

Full implementation: dingo/model/llm/agent/agent_article_fact_checker.py Tests: test/scripts/model/llm/agent/test_article_fact_checker.py (88 tests), test/scripts/model/llm/agent/test_async_article_fact_checker.py (30 tests) Guide: docs/article_fact_checking_guide.md

Decision Tree: Which Pattern Should I Use?

Start
  |
  +- Do you need intermediate artifact saving (claims, reports)?
  |    +- Yes -> Use Agent-First + Context (ArticleFactChecker style)
  |    +- No  -> Continue
  |
  +- Do you need to compose with existing Dingo evaluators?
  |    +- Yes -> Use Custom Pattern (AgentHallucination style)
  |    +- No  -> Continue
  |
  +- Is your workflow highly domain-specific?
  |    +- Yes -> Use Custom Pattern
  |    +- No  -> Continue
  |
  +- Do you prefer explicit control over every step?
  |    +- Yes -> Use Custom Pattern
  |    +- No  -> Continue
  |
  +- Default -> Use LangChain Pattern (AgentFactCheck style)
       Simpler, less code, battle-tested

Can I Mix Both Patterns?

Yes! You can use both patterns in the same project:

{
  "evaluator": [{
    "fields": {"content": "content"},
    "evals": [
      {"name": "AgentFactCheck"},      // LangChain-based
      {"name": "AgentHallucination"}   // Custom workflow
    ]
  }]
}

Users don't need to know which pattern you used - both share the same configuration interface and are transparent at the user level.

Migration Path

From Custom to LangChain

Set use_agent_executor = True
Move workflow logic from eval() to _get_system_prompt()
Implement aggregate_results() to parse agent output
Remove custom eval() implementation

From LangChain to Custom

Remove use_agent_executor flag (or set to False)
Implement custom eval() method with workflow logic
Manually call execute_tool() and send_messages()
Keep plan_execution() returning empty list

Creating Custom Tools

Step 1: Define Tool Configuration

Create a Pydantic model for type-safe configuration:

from pydantic import BaseModel, Field
from typing import Optional

class MyToolConfig(BaseModel):
    """Configuration for MyTool"""
    api_key: Optional[str] = None
    max_results: int = Field(default=10, ge=1, le=100)
    timeout: int = Field(default=30, ge=1)

Step 2: Implement Tool Class

from typing import Dict, Any
from dingo.model.llm.agent.tools.base_tool import BaseTool
from dingo.model.llm.agent.tools.tool_registry import tool_register

@tool_register
class MyTool(BaseTool):
    """
    Brief description of what your tool does.

    This tool provides... [detailed description]

    Configuration:
        api_key: API key for the service
        max_results: Maximum number of results
        timeout: Request timeout in seconds
    """

    name = "my_tool"  # Unique tool identifier
    description = "Brief one-line description for agents"
    config: MyToolConfig = MyToolConfig()  # Default config

    @classmethod
    def execute(cls, **kwargs) -> Dict[str, Any]:
        """
        Execute the tool with given parameters.

        Args:
            **kwargs: Tool-specific parameters

        Returns:
            Dict with:
                - success: bool indicating if tool succeeded
                - result: Tool output (format depends on tool)
                - error: Error message if success=False
        """
        try:
            # Validate inputs
            if not kwargs.get('query'):
                return {
                    'success': False,
                    'error': 'Query parameter is required'
                }

            # Access configuration
            api_key = cls.config.api_key
            max_results = cls.config.max_results

            # Execute tool logic
            result = cls._perform_operation(kwargs['query'], api_key, max_results)

            return {
                'success': True,
                'result': result,
                'metadata': {
                    'query': kwargs['query'],
                    'timestamp': '...'
                }
            }

        except Exception as e:
            return {
                'success': False,
                'error': str(e),
                'error_type': type(e).__name__
            }

    @classmethod
    def _perform_operation(cls, query: str, api_key: str, max_results: int):
        """Private helper method for core logic"""
        # Implementation details...
        pass

Tool Best Practices

Error Handling: Always return {'success': False, 'error': ...} rather than raising exceptions
Validation: Validate inputs early and return clear error messages
Configuration: Use Pydantic models with sensible defaults and validation
Documentation: Include docstrings explaining parameters and return format
Testing: Write comprehensive unit tests (see examples)

Creating Custom Agents

Step 1: Create Agent Class

from typing import List, Dict, Any
from dingo.io import Data
from dingo.io.output.eval_detail import EvalDetail, QualityLabel
from dingo.model import Model
from dingo.model.llm.agent.base_agent import BaseAgent
from dingo.utils import log

@Model.llm_register("MyAgent")
class MyAgent(BaseAgent):
    """
    Brief description of your agent's purpose.

    This agent evaluates... [detailed description]

    Features:
        - Feature 1
        - Feature 2
        - Feature 3

    Configuration Example:
    {
        "name": "MyAgent",
        "config": {
            "key": "openai-api-key",
            "api_url": "https://api.openai.com/v1",
            "model": "gpt-4",
            "parameters": {
                "agent_config": {
                    "max_iterations": 3,
                    "tools": {
                        "my_tool": {
                            "api_key": "tool-api-key",
                            "max_results": 5
                        }
                    }
                }
            }
        }
    }
    """

    # Metadata for documentation
    _metric_info = {
        "category": "Your Category",
        "metric_name": "MyAgent",
        "description": "Brief description",
        "features": [
            "Feature 1",
            "Feature 2"
        ]
    }

    # Tools this agent can use
    available_tools = ["my_tool", "another_tool"]

    # Maximum reasoning iterations
    max_iterations = 5

    # Optional: Evaluation threshold
    threshold = 0.5

    @classmethod
    def eval(cls, input_data: Data) -> EvalDetail:
        """
        Main evaluation method.

        Args:
            input_data: Data object with content and optional fields

        Returns:
            EvalDetail with evaluation results
        """
        try:
            # Step 1: Initialize
            cls.create_client()

            # Step 2: Execute agent logic
            result = cls._execute_workflow(input_data)

            # Step 3: Return evaluation
            return result

        except Exception as e:
            log.error(f"{cls.__name__} failed: {e}")
            result = EvalDetail(metric=cls.__name__)
            result.status = True  # Error condition
            result.label = [f"{QualityLabel.QUALITY_BAD_PREFIX}AGENT_ERROR"]
            result.reason = [f"Agent workflow failed: {str(e)}"]
            return result

    @classmethod
    def _execute_workflow(cls, input_data: Data) -> EvalDetail:
        """
        Core workflow implementation.

        This is where you implement your agent's reasoning logic.
        """
        # Example workflow:
        # 1. Analyze input
        analysis = cls._analyze_input(input_data)

        # 2. Use tools if needed
        if analysis['needs_tool']:
            tool_result = cls.execute_tool('my_tool', query=analysis['query'])

            if not tool_result['success']:
                # Handle tool failure
                result = EvalDetail(metric=cls.__name__)
                result.status = True
                result.label = [f"{QualityLabel.QUALITY_BAD_PREFIX}TOOL_FAILED"]
                result.reason = [f"Tool execution failed: {tool_result['error']}"]
                return result

        # 3. Make final decision using LLM
        final_decision = cls._make_decision(input_data, tool_result)

        # 4. Format result
        result = EvalDetail(metric=cls.__name__)
        result.status = final_decision['is_bad']
        result.label = final_decision['labels']
        result.reason = final_decision['reasons']

        return result

    @classmethod
    def _analyze_input(cls, input_data: Data) -> Dict[str, Any]:
        """Analyze input to determine next steps"""
        # Use LLM to analyze
        prompt = f"Analyze this content: {input_data.content}"
        messages = [{"role": "user", "content": prompt}]
        response = cls.send_messages(messages)

        # Parse response
        return {'needs_tool': True, 'query': '...'}

    @classmethod
    def _make_decision(cls, input_data: Data, tool_result: Dict) -> Dict[str, Any]:
        """Make final evaluation decision"""
        # Combine all information and decide
        return {
            'is_bad': False,
            'labels': [QualityLabel.QUALITY_GOOD],
            'reasons': ["Evaluation passed"]
        }

    @classmethod
    def plan_execution(cls, input_data: Data) -> List[Dict[str, Any]]:
        """
        Optional: Define execution plan for complex workflows.

        Not required if you implement eval() directly.
        """
        return []

    @classmethod
    def aggregate_results(cls, input_data: Data, results: List[Any]) -> EvalDetail:
        """
        Optional: Aggregate results from plan_execution.

        Not required if you implement eval() directly.
        """
        return EvalDetail(metric=cls.__name__)

Agent Design Patterns

Pattern 1: Simple Workflow (Like AgentHallucination)

@classmethod
def eval(cls, input_data: Data) -> EvalDetail:
    # Check preconditions
    if cls._has_required_data(input_data):
        # Direct path
        return cls._simple_evaluation(input_data)
    else:
        # Agent workflow with tools
        return cls._agent_workflow(input_data)

Pattern 2: Multi-Step Reasoning

@classmethod
def eval(cls, input_data: Data) -> EvalDetail:
    steps = []

    for i in range(cls.max_iterations):
        # Analyze current state
        analysis = cls._analyze_state(input_data, steps)

        # Decide next action
        action = cls._decide_action(analysis)

        # Execute action (may call tools)
        result = cls._execute_action(action)
        steps.append(result)

        # Check if done
        if result['is_final']:
            break

    return cls._synthesize_result(steps)

Pattern 3: Delegation Pattern

@classmethod
def eval(cls, input_data: Data) -> EvalDetail:
    # Use existing evaluator when appropriate
    if cls._can_use_existing(input_data):
        from dingo.model.llm.existing_model import ExistingModel
        result = ExistingModel.eval(input_data)
        # Add metadata
        result.reason.append("Delegated to ExistingModel")
        return result

    # Otherwise use agent workflow
    return cls._agent_workflow(input_data)

Configuration

Agent Configuration Structure

{
  "evaluator": [{
    "fields": {
      "content": "response",
      "prompt": "question",
      "context": "contexts"
    },
    "evals": [{
      "name": "MyAgent",
      "config": {
        "key": "openai-api-key",
        "api_url": "https://api.openai.com/v1",
        "model": "gpt-4-turbo",
        "parameters": {
          "temperature": 0.1,
          "agent_config": {
            "max_iterations": 3,
            "tools": {
              "my_tool": {
                "api_key": "my-tool-api-key",
                "max_results": 10,
                "timeout": 30
              },
              "another_tool": {
                "config_key": "value"
              }
            }
          }
        }
      }
    }]
  }]
}

Accessing Configuration in Agent

# In your agent class
@classmethod
def some_method(cls):
    # Access LLM configuration
    model = cls.dynamic_config.model  # "gpt-4-turbo"
    temperature = cls.dynamic_config.parameters.get('temperature', 0)

    # Access agent-specific configuration
    agent_config = cls.dynamic_config.parameters.get('agent_config', {})
    max_iterations = agent_config.get('max_iterations', 5)

    # Get tool configuration
    tool_config = cls.get_tool_config('my_tool')
    # Returns: {"api_key": "...", "max_results": 10, "timeout": 30}

Accessing Configuration in Tool

# Configuration is injected automatically via config attribute
@classmethod
def execute(cls, **kwargs):
    api_key = cls.config.api_key  # From tool's config model
    max_results = cls.config.max_results

    # Use configuration...

LangChain 1.0 Agent Configuration

Dingo supports two execution paths for agents:

Legacy Path (default): Manual loop with plan_execution() and aggregate_results()
LangChain Path: Uses LangChain 1.0's create_agent (enable with use_agent_executor = True)

Iteration Limits in LangChain 1.0

In LangChain 1.0, the max_iterations parameter is automatically converted to recursion_limit at runtime:

class MyAgent(BaseAgent):
    use_agent_executor = True  # Enable LangChain path
    max_iterations = 10  # Converted to recursion_limit=10

    _metric_info = {"metric_name": "MyAgent", "description": "..."}

Configuration in JSON:

{
  "name": "MyAgent",
  "config": {
    "parameters": {
      "agent_config": {
        "max_iterations": 10
      }
    }
  }
}

How it works:

max_iterations in config → passed as recursion_limit to LangChain
Default: 25 iterations (LangChain default)
Range: 1-100 (adjust based on task complexity)

Note: LangChain 1.0 uses "recursion_limit" internally, but Dingo maintains the max_iterations terminology for consistency across both execution paths.

Customizing Agent Input: The `_format_agent_input` Extension Point

When using LangChain agents (use_agent_executor = True), you can customize how input data is formatted before being sent to the agent. This is essential for agents that need to work with structured data like prompt, content, and context together.

Default Behavior

By default, BaseAgent passes only input_data.content to LangChain agents:

# Default implementation in BaseAgent
@classmethod
def _format_agent_input(cls, input_data: Data) -> str:
    """Format input data into text for LangChain agent."""
    return input_data.content

Overriding for Custom Formatting

To include additional fields (prompt, context, etc.), override _format_agent_input in your agent:

from dingo.model.llm.agent.base_agent import BaseAgent
from dingo.io import Data

class MyCustomAgent(BaseAgent):
    use_agent_executor = True
    available_tools = ["tavily_search"]

    @classmethod
    def _format_agent_input(cls, input_data: Data) -> str:
        """Format prompt + content + context for agent."""
        parts = []

        # Include prompt if available
        if hasattr(input_data, 'prompt') and input_data.prompt:
            parts.append(f"**Question:**\n{input_data.prompt}")

        # Always include content
        parts.append(f"**Response to Evaluate:**\n{input_data.content}")

        # Include context if available
        if hasattr(input_data, 'context') and input_data.context:
            if isinstance(input_data.context, list):
                context_str = "\n".join(f"- {c}" for c in input_data.context)
            else:
                context_str = str(input_data.context)
            parts.append(f"**Context:**\n{context_str}")
        else:
            parts.append("**Context:** None provided")

        return "\n\n".join(parts)

Best Practices for Input Formatting

Safe Attribute Access: Use hasattr() and check for truthiness

if hasattr(input_data, 'prompt') and input_data.prompt:
    # Safe to use input_data.prompt

Clear Structure: Use markdown-style headers for readability
```
parts.append(f"**Section Name:**\n{content}")
```

Handle Multiple Types: Context might be string or list

if isinstance(input_data.context, list):
    context_str = "\n".join(f"- {c}" for c in input_data.context)
else:
    context_str = str(input_data.context)

Provide Guidance: Tell the agent what to do when data is missing

parts.append("**Context:** None provided - use web search to verify")

Reference Implementation: AgentFactCheck

AgentFactCheck demonstrates a production-ready implementation using _format_agent_input with structured output parsing following LangChain 2025 best practices.

Key Features

Autonomous Search Control: Agent decides when to use web search based on context availability
Structured Output: Uses explicit format instructions for reliable parsing
Robust Error Handling: Multi-layer fallback for parsing agent responses
Context-Aware Prompts: System prompt adapts based on input data
Enhanced Evidence Citation: Extracts and displays source URLs for verification (v1.1)

Implementation Example

from typing import Any, Dict, List
import re
from dingo.io import Data
from dingo.io.input.required_field import RequiredField
from dingo.io.output.eval_detail import EvalDetail, QualityLabel
from dingo.model import Model
from dingo.model.llm.agent.base_agent import BaseAgent

@Model.llm_register("AgentFactCheck")
class AgentFactCheck(BaseAgent):
    """
    LangChain-based fact-checking agent with autonomous search control.

    - With context: Agent MAY use web search for additional verification
    - Without context: Agent MUST use web search to verify facts
    """

    use_agent_executor = True  # Enable LangChain agent
    available_tools = ["tavily_search"]
    max_iterations = 5

    _required_fields = [RequiredField.PROMPT, RequiredField.CONTENT]
    # Note: CONTEXT is optional - agent adapts

    @classmethod
    def _format_agent_input(cls, input_data: Data) -> str:
        """Format prompt + content + context for agent."""
        parts = []

        if hasattr(input_data, 'prompt') and input_data.prompt:
            parts.append(f"**Question:**\n{input_data.prompt}")

        parts.append(f"**Response to Evaluate:**\n{input_data.content}")

        if hasattr(input_data, 'context') and input_data.context:
            if isinstance(input_data.context, list):
                context_str = "\n".join(f"- {c}" for c in input_data.context)
            else:
                context_str = str(input_data.context)
            parts.append(f"**Context:**\n{context_str}")
        else:
            parts.append("**Context:** None provided - use web search to verify")

        return "\n\n".join(parts)

    @classmethod
    def _get_system_prompt(cls, input_data: Data) -> str:
        """System prompt adapts based on context availability."""
        has_context = hasattr(input_data, 'context') and input_data.context

        base_instructions = """You are a fact-checking agent with web search capabilities.

Your task:
1. Analyze the Question and Response provided"""

        if has_context:
            context_instruction = """
2. Context is provided - evaluate the Response against it
3. You MAY use web search for additional verification if needed
4. Make your own decision about whether web search is necessary"""
        else:
            context_instruction = """
2. NO Context is available - you MUST use web search to verify facts
3. Search for reliable sources to fact-check the response"""

        # Following LangChain best practices: explicit output format
        output_format = """

**IMPORTANT: You must return your analysis in exactly this format:**

HALLUCINATION_DETECTED: [YES or NO]
EXPLANATION: [Your detailed analysis]
EVIDENCE: [Supporting sources or facts]
SOURCES: [List of URLs consulted, one per line with - prefix]

Example:
HALLUCINATION_DETECTED: YES
EXPLANATION: The response claims incorrect information.
EVIDENCE: According to reliable sources, this is false.
SOURCES:
- https://example.com/source1
- https://example.com/source2

Be precise and clear. Start your response with "HALLUCINATION_DETECTED:" followed by YES or NO.
Always include SOURCES with specific URLs when you perform web searches."""

        return base_instructions + context_instruction + output_format

    @classmethod
    def aggregate_results(cls, input_data: Data, results: List[Any]) -> EvalDetail:
        """Parse agent output to determine hallucination status."""
        if not results:
            return cls._create_error_result("No results from agent")

        agent_result = results[0]

        if not agent_result.get('success', True):
            error_msg = agent_result.get('error', 'Unknown error')
            return cls._create_error_result(error_msg)

        output = agent_result.get('output', '')

        if not output or not output.strip():
            return cls._create_error_result("Agent returned empty output")

        # Parse structured output
        has_hallucination = cls._detect_hallucination_from_output(output)

        result = EvalDetail(metric=cls.__name__)
        result.status = has_hallucination
        result.label = [
            f"{QualityLabel.QUALITY_BAD_PREFIX}HALLUCINATION"
            if has_hallucination
            else QualityLabel.QUALITY_GOOD
        ]
        result.reason = [
            f"Agent Analysis:\n{output}",
            f"🔍 Web searches: {len(agent_result.get('tool_calls', []))}",
            f"🤖 Reasoning steps: {agent_result.get('reasoning_steps', 0)}"
        ]

        return result

    @classmethod
    def _detect_hallucination_from_output(cls, output: str) -> bool:
        """
        Parse agent output using structured format.

        Strategy:
        1. Regex match for "HALLUCINATION_DETECTED: YES/NO"
        2. Check response start for marker
        3. Fallback to keyword detection
        """
        if not output:
            return False

        # Primary: Regex match
        match = re.search(r'HALLUCINATION_DETECTED:\s*(YES|NO)', output, re.IGNORECASE)
        if match:
            return match.group(1).upper() == 'YES'

        # Fallback: Keyword detection (check negatives first!)
        output_lower = output.lower()

        if any(kw in output_lower for kw in ['no hallucination detected', 'factually accurate']):
            return False
        if any(kw in output_lower for kw in ['hallucination detected', 'factual error']):
            return True

        return False  # Default to no hallucination

    @classmethod
    def _create_error_result(cls, error_message: str) -> EvalDetail:
        """Create error result."""
        result = EvalDetail(metric=cls.__name__)
        result.status = True
        result.label = [f"{QualityLabel.QUALITY_BAD_PREFIX}AGENT_ERROR"]
        result.reason = [f"Agent evaluation failed: {error_message}"]
        return result

    @classmethod
    def plan_execution(cls, input_data: Data) -> List[Dict[str, Any]]:
        """Not used with LangChain agent (agent handles planning)."""
        return []

Why This Pattern Works

Structured Output Format: Explicitly defines expected format in system prompt
Regex Parsing: Reliable primary parsing method
Fallback Layers: Keyword detection as safety net
Error Handling: Returns error status rather than crashing
Context Awareness: Adapts behavior based on available data

Configuration Example

{
  "name": "AgentFactCheck",
  "config": {
    "key": "your-openai-api-key",
    "api_url": "https://api.openai.com/v1",
    "model": "gpt-4-turbo",
    "parameters": {
      "temperature": 0.1,
      "max_tokens": 16384,
      "agent_config": {
        "max_iterations": 5,
        "tools": {
          "tavily_search": {
            "api_key": "your-tavily-api-key",
            "max_results": 5,
            "search_depth": "advanced"
          }
        }
      }
    }
  }
}

Testing AgentFactCheck

from dingo.io import Data
from dingo.model.llm.agent.agent_fact_check import AgentFactCheck

# Test with context
data_with_context = Data(
    prompt="What is the capital of France?",
    content="The capital is Berlin",
    context="France's capital is Paris"
)

# Test without context
data_without_context = Data(
    prompt="What year was Python created?",
    content="Python was created in 1995"
)

# Agent will adapt behavior automatically
result1 = AgentFactCheck.eval(data_with_context)
result2 = AgentFactCheck.eval(data_without_context)

Full implementation: dingo/model/llm/agent/agent_fact_check.py Tests: test/scripts/model/llm/agent/test_agent_fact_check.py (35 tests)

Enhanced Evidence Citation (v1.1)

AgentFactCheck includes a feature to extract and display source URLs from the agent's output, making fact-checking results more transparent and verifiable.

How it works:

System Prompt: Agent is instructed to include a SOURCES section with URLs
Extraction: _extract_sources_from_output() parses the SOURCES section
Display: Sources are appended to the result's reason field

Implementation:

@classmethod
def _extract_sources_from_output(cls, output: str) -> List[str]:
    """Extract source URLs from agent output."""
    sources = []
    in_sources_section = False

    for line in output.split('\n'):
        line = line.strip()

        if line.upper().startswith('SOURCES:'):
            in_sources_section = True
            continue

        if in_sources_section:
            # Check if we've reached a new section
            if line and ':' in line:
                section_header = line.split(':')[0].upper()
                if section_header in ['EXPLANATION', 'EVIDENCE', 'HALLUCINATION_DETECTED']:
                    break

            # Extract URL (with - or • prefix, or direct URL)
            if line.startswith(('- ', '• ', 'http://', 'https://')):
                url = line.lstrip('- •').strip()
                if url:
                    sources.append(url)

    return sources

Usage in aggregate_results:

# Extract sources from output
sources = cls._extract_sources_from_output(output)

# Add sources section to result
result.reason.append("")
if sources:
    result.reason.append("📚 Sources consulted:")
    for source in sources:
        result.reason.append(f"   • {source}")
else:
    result.reason.append("📚 Sources: None explicitly cited")

Benefits:

✅ Increases transparency of agent's fact-checking process
✅ Allows users to verify the agent's judgment independently
✅ Provides attribution for evidence used in evaluation
✅ Meets academic and professional citation standards

Example Output:

Agent Analysis:
HALLUCINATION_DETECTED: YES
EXPLANATION: The response claims the Eiffel Tower is 450 meters tall, but it is actually 330 meters.
EVIDENCE: According to the official Eiffel Tower website, the height is 330 meters including antennas.
SOURCES:
- https://www.toureiffel.paris/en/the-monument
- https://en.wikipedia.org/wiki/Eiffel_Tower

🔍 Web searches performed: 2
🤖 Reasoning steps: 4
⚙️  Agent autonomously decided: Use web search

📚 Sources consulted:
   • https://www.toureiffel.paris/en/the-monument
   • https://en.wikipedia.org/wiki/Eiffel_Tower

Testing

Testing Custom Tools

import pytest
from unittest.mock import patch, MagicMock
from my_tool import MyTool, MyToolConfig

class TestMyTool:

    def setup_method(self):
        """Setup for each test"""
        MyTool.config = MyToolConfig(api_key="test_key")

    def test_successful_execution(self):
        """Test successful tool execution"""
        result = MyTool.execute(query="test query")

        assert result['success'] is True
        assert 'result' in result

    def test_missing_query(self):
        """Test error handling for missing query"""
        result = MyTool.execute()

        assert result['success'] is False
        assert 'Query parameter is required' in result['error']

    @patch('external_api.Client')
    def test_with_mocked_api(self, mock_client):
        """Test with mocked external API"""
        mock_response = {"data": "test"}
        mock_client_instance = MagicMock()
        mock_client_instance.search.return_value = mock_response
        mock_client.return_value = mock_client_instance

        result = MyTool.execute(query="test")

        assert result['success'] is True
        mock_client_instance.search.assert_called_once()

Testing Custom Agents

import pytest
from unittest.mock import patch
from dingo.io import Data
from my_agent import MyAgent
from dingo.config.input_args import EvaluatorLLMArgs

class TestMyAgent:

    def setup_method(self):
        """Setup for each test"""
        MyAgent.dynamic_config = EvaluatorLLMArgs(
            key="test_key",
            api_url="https://api.test.com",
            model="gpt-4"
        )

    def test_agent_registration(self):
        """Test that agent is properly registered"""
        from dingo.model import Model
        Model.load_model()
        assert "MyAgent" in Model.llm_name_map

    @patch.object(MyAgent, 'execute_tool')
    @patch.object(MyAgent, 'send_messages')
    def test_workflow_execution(self, mock_send, mock_tool):
        """Test complete agent workflow"""
        # Mock LLM responses
        mock_send.return_value = "Analysis result"

        # Mock tool responses
        mock_tool.return_value = {
            'success': True,
            'result': 'Tool output'
        }

        # Execute
        data = Data(content="Test content")
        result = MyAgent.eval(data)

        # Verify
        assert result.status is not None
        assert mock_send.called
        assert mock_tool.called

Best Practices

Agent Development

Start Simple: Begin with basic workflow, add complexity as needed
Error Handling: Wrap workflow in try/except, return meaningful error messages
Logging: Use log.info(), log.warning(), log.error() for debugging
Delegation: Reuse existing evaluators when possible
Documentation: Include comprehensive docstrings and configuration examples
Metadata: Add _metric_info for documentation generation

Tool Development

Single Responsibility: Each tool should do one thing well
Configuration: Use Pydantic models with validation
Return Format: Always return dict with success boolean
Error Messages: Provide actionable error messages
Testing: Write unit tests covering success and error cases

Performance

Limit Iterations: Set reasonable max_iterations to prevent infinite loops
Batch Operations: If calling tool multiple times, consider batching
Caching: Consider caching expensive operations
Timeouts: Set appropriate timeouts for external API calls

Security

API Keys: Never hardcode API keys, use configuration
Input Validation: Validate all inputs before passing to external services
Rate Limiting: Respect API rate limits in tools
Error Information: Don't expose sensitive information in error messages

Examples

Complete Example Files

AgentHallucination: dingo/model/llm/agent/agent_hallucination.py - Production agent with web search
AgentFactCheck: examples/agent/agent_executor_example.py - LangChain 1.0 agent example
ArticleFactChecker: dingo/model/llm/agent/agent_article_fact_checker.py - Agent-First with context tracking and artifact saving
ArticleFactChecker Example: examples/agent/agent_article_fact_checking_example.py - Full article fact-checking example
TavilySearch Tool: dingo/model/llm/agent/tools/tavily_search.py - Web search tool implementation
ClaimsExtractor Tool: dingo/model/llm/agent/tools/claims_extractor.py - LLM-based claims extraction tool
ArxivSearch Tool: dingo/model/llm/agent/tools/arxiv_search.py - Academic paper search tool

Note: For complete implementation examples, refer to the files above. They demonstrate real-world patterns for agent and tool development.

Quick Start: Custom Fact Checker

from dingo.model.llm.agent.base_agent import BaseAgent
from dingo.model import Model
from dingo.io import Data
from dingo.io.output.eval_detail import EvalDetail

@Model.llm_register("FactChecker")
class FactChecker(BaseAgent):
    """Simple fact checker using web search"""

    available_tools = ["tavily_search"]
    max_iterations = 1

    @classmethod
    def eval(cls, input_data: Data) -> EvalDetail:
        cls.create_client()

        # Search for facts
        search_result = cls.execute_tool(
            'tavily_search',
            query=input_data.content
        )

        if not search_result['success']:
            return cls._create_error_result("Search failed")

        # Verify with LLM
        prompt = f"""
        Content: {input_data.content}
        Search Results: {search_result['answer']}

        Are there any factual errors? Respond with YES or NO.
        """

        response = cls.send_messages([
            {"role": "user", "content": prompt}
        ])

        result = EvalDetail(metric="FactChecker")
        result.status = "YES" in response.upper()
        result.reason = [f"Verification: {response}"]

        return result

Running Your Agent

from dingo.config import InputArgs
from dingo.exec import Executor

config = {
    "input_path": "data.jsonl",
    "output_path": "outputs/",
    "dataset": {"source": "local", "format": "jsonl"},
    "evaluator": [{
        "fields": {"content": "text"},
        "evals": [{
            "name": "FactChecker",
            "config": {
                "key": "openai-key",
                "api_url": "https://api.openai.com/v1",
                "model": "gpt-4",
                "parameters": {
                    "agent_config": {
                        "tools": {
                            "tavily_search": {"api_key": "tavily-key"}
                        }
                    }
                }
            }
        }]
    }]
}

input_args = InputArgs(**config)
executor = Executor.exec_map["local"](input_args)
summary = executor.execute()

Troubleshooting

Common Issues

Agent not found:

Ensure file is in dingo/model/llm/agent/ directory
Check @Model.llm_register("Name") decorator is present
Run Model.load_model() to trigger auto-discovery

Tool not found:

Ensure @tool_register decorator is present
Check tool name matches string in available_tools
Verify tool file is imported in dingo/model/llm/agent/tools/__init__.py

Configuration not working:

Check JSON structure matches expected format
Verify parameters.agent_config.tools.{tool_name} structure
Use Pydantic validation to catch config errors early

Tests failing:

Patch at correct import path (where object is used, not defined)
Mock external APIs to avoid network calls
Check test isolation (use setup_method to reset state)

Additional Resources

Contributing

When contributing new agents or tools:

Follow existing code style (flake8, isort)
Add comprehensive tests (aim for >80% coverage)
Include docstrings and type hints
Update this guide if adding new patterns
Add examples in examples/agent/
Update metrics documentation in docs/metrics.md

For questions or suggestions, please open an issue on GitHub.

FilesExpand file tree

agent_development_guide.md

Latest commit

History

agent_development_guide.md

File metadata and controls

Agent-Based Evaluation Development Guide

Overview

Table of Contents

Architecture Overview

How Agents Fit in Dingo

Agent Implementation Patterns

Pattern Comparison

Pattern 1: LangChain-Based Agents (Framework-Driven)

When to Use

When NOT to Use

Key Implementation Steps

Example: AgentFactCheck

Pros and Cons

Pattern 2: Custom Workflow Agents (Imperative)

When to Use

When NOT to Use

Key Implementation Steps

Example: AgentHallucination

Pros and Cons

Pattern 3: Agent-First with Context Tracking (ArticleFactChecker)

When to Use

Key Implementation Steps

Async Parallel Execution Pattern

Output Path Access Pattern

Dual-Layer EvalDetail.reason

Decision Tree: Which Pattern Should I Use?

Can I Mix Both Patterns?

Migration Path

From Custom to LangChain

From LangChain to Custom

Creating Custom Tools

Step 1: Define Tool Configuration

Step 2: Implement Tool Class

Tool Best Practices

Creating Custom Agents

Step 1: Create Agent Class

Agent Design Patterns

Pattern 1: Simple Workflow (Like AgentHallucination)

Pattern 2: Multi-Step Reasoning

Pattern 3: Delegation Pattern

Configuration

Agent Configuration Structure

Accessing Configuration in Agent

Accessing Configuration in Tool

LangChain 1.0 Agent Configuration

Iteration Limits in LangChain 1.0

Customizing Agent Input: The _format_agent_input Extension Point

Default Behavior

Overriding for Custom Formatting

Best Practices for Input Formatting

Reference Implementation: AgentFactCheck

Key Features

Implementation Example

Why This Pattern Works

Configuration Example

Testing AgentFactCheck

Enhanced Evidence Citation (v1.1)

Testing

Testing Custom Tools

Testing Custom Agents

Best Practices

Agent Development

Tool Development

Performance

Security

Examples

Complete Example Files

Quick Start: Custom Fact Checker

Running Your Agent

Troubleshooting

Common Issues

Additional Resources

Contributing

Customizing Agent Input: The `_format_agent_input` Extension Point