Description
Problem
LlamaIndex has solid evaluators for RAG (AnswerRelevancyEvaluator, FaithfulnessEvaluator, etc.) but nothing for evaluating agent behavior. The instrumentation module already captures agent events (AgentToolCallEvent, AgentRunStepStartEvent/EndEvent), but no evaluator consumes them.
As more teams build with AgentWorkflow, ReActAgent, and FunctionAgent, there's no built-in way to answer:
- Did the agent call the right tool with the right arguments?
- Did it follow system prompt instructions?
- Was the reasoning chain sound (for ReAct)?
- Did it stop at the right time instead of looping?
RAGAS and DeepEval both ship agent-specific evaluators (ToolCallAccuracy, TaskCompletion). LlamaIndex could have native equivalents that work directly with its instrumentation and agent types.
Proposal
Add agent evaluators under llama_index.core.evaluation.agent/ that follow the existing BaseEvaluator pattern.
Evaluators
ToolCallCorrectnessEvaluator - Given a user query and expected tool calls, score whether the agent called the correct tools with correct arguments. Works with any agent type.
InstructionAdherenceEvaluator - LLM-judged: did the agent's response follow the system prompt constraints? Uses the same judge LLM pattern as GuidelineEvaluator.
ReasoningQualityEvaluator - For ReAct agents: evaluates whether the reasoning steps (thought/action/observation chain) are logically sound and lead to the correct conclusion.
AgentGoalSuccessEvaluator - End-to-end: given a task description and the agent's final output, did the agent accomplish the goal?
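To make the first evaluator concrete, here is a minimal sketch of the kind of matching ToolCallCorrectnessEvaluator implies. The `ToolCall` dataclass and `score_tool_calls` helper are hypothetical illustrations, not existing LlamaIndex APIs; a real implementation would reuse the payloads from the instrumentation events.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ToolCall:
    # Hypothetical container for illustration; the real proposal would
    # consume tool-call data captured by AgentToolCallEvent.
    tool_name: str
    arguments: dict = field(default_factory=dict)


def score_tool_calls(actual: list[ToolCall], expected: list[ToolCall]) -> float:
    """Fraction of expected calls matched by tool name and exact arguments."""
    if not expected:
        return 1.0  # nothing was required, so nothing is wrong
    remaining = list(actual)
    matched = 0
    for exp in expected:
        for i, act in enumerate(remaining):
            if act.tool_name == exp.tool_name and act.arguments == exp.arguments:
                matched += 1
                del remaining[i]  # each actual call can satisfy one expectation
                break
    return matched / len(expected)
```

Exact argument equality is the simplest policy; a real evaluator would likely also support per-argument tolerances or an LLM judge for fuzzy argument matching.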
Interface
Building on the existing BaseEvaluator:
```python
from typing import Sequence

from llama_index.core.evaluation import BaseEvaluator, EvaluationResult


class ToolCallCorrectnessEvaluator(BaseEvaluator):
    async def aevaluate(
        self,
        query: str | None = None,
        response: str | None = None,
        contexts: Sequence[str] | None = None,
        tool_calls: list[ToolCall] | None = None,
        expected_tool_calls: list[ToolCall] | None = None,
        **kwargs,
    ) -> EvaluationResult:
        ...
```
The tool_calls and expected_tool_calls parameters extend the base interface. The existing BatchEvalRunner would work for running these at scale.
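Since aevaluate is async, callers would await it (or run it through BatchEvalRunner). A self-contained sketch of the end-to-end shape, where the EvaluationResult and ToolCall classes below are simplified stand-ins for the real llama_index.core.evaluation types and the exact-match scoring is only one possible policy:

```python
import asyncio
from dataclasses import dataclass


@dataclass
class EvaluationResult:
    # Simplified stand-in for llama_index's EvaluationResult.
    passing: bool
    score: float


@dataclass(frozen=True)
class ToolCall:
    # Arguments as (key, value) pairs so calls are hashable; illustrative only.
    tool_name: str
    arguments: tuple


class ToolCallCorrectnessEvaluator:
    async def aevaluate(self, query=None, response=None,
                        tool_calls=None, expected_tool_calls=None,
                        **kwargs) -> EvaluationResult:
        actual = set(tool_calls or [])
        expected = set(expected_tool_calls or [])
        score = len(actual & expected) / len(expected) if expected else 1.0
        return EvaluationResult(passing=score == 1.0, score=score)


result = asyncio.run(
    ToolCallCorrectnessEvaluator().aevaluate(
        query="What's the weather in Paris?",
        tool_calls=[ToolCall("get_weather", (("city", "Paris"),))],
        expected_tool_calls=[ToolCall("get_weather", (("city", "Paris"),))],
    )
)
print(result.score)  # 1.0
```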
Scope
I'd start with ToolCallCorrectnessEvaluator and AgentGoalSuccessEvaluator as a first PR, since they cover the most common evaluation need (did the agent do the right thing?). The other two can follow.
Happy to take this on. I've been contributing to the repo (security fixes, rate limiting).