
[Feature Request]: Agent evaluation framework: tool correctness, instruction adherence, reasoning quality #20862

@debu-sinha

Description


Problem

LlamaIndex has solid evaluation for RAG (AnswerRelevancyEvaluator, FaithfulnessEvaluator, etc.) but nothing for evaluating agent behavior. The instrumentation module already captures agent events (AgentToolCallEvent, AgentRunStepStartEvent/AgentRunStepEndEvent), but no evaluator consumes them.

As more teams build with AgentWorkflow, ReActAgent, and FunctionAgent, there's no built-in way to answer:

  • Did the agent call the right tool with the right arguments?
  • Did it follow system prompt instructions?
  • Was the reasoning chain sound (for ReAct)?
  • Did it stop at the right time instead of looping?

RAGAS and DeepEval both ship agent-specific evaluators (ToolCallAccuracy, TaskCompletion). LlamaIndex could have native equivalents that work directly with its instrumentation and agent types.

Proposal

Add agent evaluators under llama_index.core.evaluation.agent/ that follow the existing BaseEvaluator pattern.

Evaluators

ToolCallCorrectnessEvaluator - Given a user query and expected tool calls, score whether the agent called the correct tools with correct arguments. Works with any agent type.

InstructionAdherenceEvaluator - LLM-judged: did the agent's response follow the system prompt constraints? Uses the same judge LLM pattern as GuidelineEvaluator.

ReasoningQualityEvaluator - For ReAct agents: evaluates whether the reasoning steps (thought/action/observation chain) are logically sound and lead to the correct conclusion.

AgentGoalSuccessEvaluator - End-to-end: given a task description and the agent's final output, did the agent accomplish the goal?
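To make the ToolCallCorrectnessEvaluator concrete, here is a minimal, stdlib-only sketch of the scoring idea. The ToolCall dataclass and the exact-match scoring rule are illustrative assumptions, not a final design; a real implementation would likely support partial argument matching and ordering checks.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolCall:
    # Hypothetical record of one tool invocation: tool name plus its kwargs,
    # stored as sorted (key, value) pairs so instances are hashable.
    tool_name: str
    tool_kwargs: tuple = ()


def tool_call_score(expected: list, actual: list) -> float:
    """Fraction of expected tool calls that appear in the actual trace."""
    if not expected:
        return 1.0
    actual_set = set(actual)
    matched = sum(1 for call in expected if call in actual_set)
    return matched / len(expected)
```

An exact-match rule is the simplest baseline; it is strict about arguments, which is usually what you want when the expected calls come from a labeled dataset.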

Interface

Building on the existing BaseEvaluator:

from typing import Sequence

from llama_index.core.evaluation import BaseEvaluator, EvaluationResult
# ToolCall would be a new lightweight type (tool name + arguments); its exact
# home, e.g. llama_index.core.evaluation.agent, is part of the design discussion.

class ToolCallCorrectnessEvaluator(BaseEvaluator):
    async def aevaluate(
        self,
        query: str | None = None,
        response: str | None = None,
        contexts: Sequence[str] | None = None,
        tool_calls: list[ToolCall] | None = None,
        expected_tool_calls: list[ToolCall] | None = None,
        **kwargs,
    ) -> EvaluationResult:
        ...

The tool_calls and expected_tool_calls parameters extend the base interface; everything else matches the existing signature. The existing BatchEvalRunner should work for running these evaluators at scale.
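For batch execution, the fan-out could mirror BatchEvalRunner's concurrent pattern. A stdlib-only asyncio sketch (the per-case dict shape and exact-match scoring are assumptions for illustration):

```python
import asyncio


async def evaluate_case(case: dict) -> dict:
    # Placeholder per-case evaluation: exact match on (tool_name, kwargs) pairs.
    expected = case["expected_tool_calls"]
    actual = case["tool_calls"]
    hits = sum(1 for call in expected if call in actual)
    score = hits / len(expected) if expected else 1.0
    return {"query": case["query"], "score": score, "passing": score == 1.0}


async def run_batch(cases: list) -> list:
    # Mirrors BatchEvalRunner's fan-out: evaluate all cases concurrently.
    return await asyncio.gather(*(evaluate_case(c) for c in cases))
```

A real integration would return EvaluationResult objects instead of dicts, but the concurrency shape is the same.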

Scope

I'd start with ToolCallCorrectnessEvaluator and AgentGoalSuccessEvaluator as a first PR, since they cover the most common evaluation need (did the agent do the right thing?). The other two can follow.

Happy to take this on. I've been contributing to the repo (security fixes, rate limiting).
