
[Feature Request]: Agent evaluation framework: tool correctness, instruction adherence, reasoning quality #20862

@debu-sinha

Description


Problem

LlamaIndex has solid evaluation for RAG (AnswerRelevancyEvaluator, FaithfulnessEvaluator, etc.) but nothing for evaluating agent behavior. The instrumentation module already captures agent events (AgentToolCallEvent, AgentRunStepStartEvent/AgentRunStepEndEvent), but no evaluator consumes them.

As more teams build with AgentWorkflow, ReActAgent, and FunctionAgent, there's no built-in way to answer:

  • Did the agent call the right tool with the right arguments?
  • Did it follow system prompt instructions?
  • Was the reasoning chain sound (for ReAct)?
  • Did it stop at the right time instead of looping?

RAGAS and DeepEval both ship agent-specific evaluators (ToolCallAccuracy, TaskCompletion). LlamaIndex could have native equivalents that work directly with its instrumentation and agent types.

Proposal

Add agent evaluators under llama_index.core.evaluation.agent/ that follow the existing BaseEvaluator pattern.

Evaluators

ToolCallCorrectnessEvaluator - Given a user query and expected tool calls, score whether the agent called the correct tools with correct arguments. Works with any agent type.

InstructionAdherenceEvaluator - LLM-judged: did the agent's response follow the system prompt constraints? Uses the same judge LLM pattern as GuidelineEvaluator.

ReasoningQualityEvaluator - For ReAct agents: evaluates whether the reasoning steps (thought/action/observation chain) are logically sound and lead to the correct conclusion.

AgentGoalSuccessEvaluator - End-to-end: given a task description and the agent's final output, did the agent accomplish the goal?
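To make the ToolCallCorrectnessEvaluator concrete, here is a minimal, stdlib-only sketch of the scoring idea. The ToolCall dataclass and the exact-match scoring rule are illustrative assumptions, not a final design; a real implementation would likely support partial argument matching and ordering checks.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolCall:
    # Hypothetical record of one tool invocation: tool name plus its kwargs,
    # stored as sorted (key, value) pairs so instances are hashable.
    tool_name: str
    tool_kwargs: tuple = ()


def tool_call_score(expected: list, actual: list) -> float:
    """Fraction of expected tool calls that appear in the actual trace."""
    if not expected:
        return 1.0
    actual_set = set(actual)
    matched = sum(1 for call in expected if call in actual_set)
    return matched / len(expected)
```

An exact-match rule is the simplest baseline; it is strict about arguments, which is usually what you want when the expected calls come from a labeled dataset.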

Interface

Building on the existing BaseEvaluator:

from typing import Sequence

from llama_index.core.evaluation import BaseEvaluator, EvaluationResult
# ToolCall would be a new lightweight type (tool name + arguments); its exact
# home, e.g. llama_index.core.evaluation.agent, is part of the design discussion.

class ToolCallCorrectnessEvaluator(BaseEvaluator):
    async def aevaluate(
        self,
        query: str | None = None,
        response: str | None = None,
        contexts: Sequence[str] | None = None,
        tool_calls: list[ToolCall] | None = None,
        expected_tool_calls: list[ToolCall] | None = None,
        **kwargs,
    ) -> EvaluationResult:
        ...

The tool_calls and expected_tool_calls parameters extend the base interface; everything else matches the existing signature. The existing BatchEvalRunner should work for running these evaluators at scale.
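For batch execution, the fan-out could mirror BatchEvalRunner's concurrent pattern. A stdlib-only asyncio sketch (the per-case dict shape and exact-match scoring are assumptions for illustration):

```python
import asyncio


async def evaluate_case(case: dict) -> dict:
    # Placeholder per-case evaluation: exact match on (tool_name, kwargs) pairs.
    expected = case["expected_tool_calls"]
    actual = case["tool_calls"]
    hits = sum(1 for call in expected if call in actual)
    score = hits / len(expected) if expected else 1.0
    return {"query": case["query"], "score": score, "passing": score == 1.0}


async def run_batch(cases: list) -> list:
    # Mirrors BatchEvalRunner's fan-out: evaluate all cases concurrently.
    return await asyncio.gather(*(evaluate_case(c) for c in cases))
```

A real integration would return EvaluationResult objects instead of dicts, but the concurrency shape is the same.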

Scope

I'd start with ToolCallCorrectnessEvaluator and AgentGoalSuccessEvaluator as a first PR, since they cover the most common evaluation need (did the agent do the right thing?). The other two can follow.

Happy to take this on. I've been contributing to the repo (security fixes, rate limiting).
