-This directory contains evaluation tests for the ReAct agent using the AgentEvals framework with Graph trajectory LLM-as-judge methodology.
+This directory contains evaluation tests for the ReAct agent using the AgentEvals framework with Graph trajectory LLM-as-judge methodology and LangSmith pytest integration.
-1. **Simple Question** - Direct factual queries that don't require tool usage
-2. **Search Required** - Queries requiring web search for current information
-3. **Multi-step Reasoning** - Complex queries requiring both search and structured analysis
+The evaluation includes these test scenarios with scenario-specific rubrics:
+
+1. **Simple Question** (`simple_question`)
+   - **Query**: "What is the capital of France?"
+   - **Expected**: Direct answer without unnecessary tool usage
+   - **Rubric**: Evaluates efficiency and appropriate confidence for basic facts
+   - **Example Results**:
+     - ❌ **Fail**: [Agent used tools unnecessarily](https://smith.langchain.com/public/cde5921c-48fc-46a7-a8bb-6e8d31821a6f/r) - Trajectory shows `tools` node for basic factual question (Score: 0)
+     - ✅ **Success**: [Agent answered directly](https://smith.langchain.com/public/a965ad02-d4ac-4c87-8cb1-8b717ba3ca97/r) - Trajectory shows only `call_model` without tools (Score: 1)
+
+2. **Search Required** (`search_required`)
+   - **Query**: "What's the latest news about artificial intelligence?"
+   - **Expected**: Uses search tools to find current information
+   - **Rubric**: Evaluates search tool usage and information synthesis
+   - **Example Results**:
+     - ❌ **Fail**: [Agent provided generic content with links](https://smith.langchain.com/public/5b796d70-cf73-441c-a278-ff9d2493ecf2/r) - Used tools but gave generic summaries and link lists instead of specific current news (Score: 0)
+     - ✅ **Success**: [Agent synthesized actual current information](https://smith.langchain.com/public/708fb561-92f1-482a-aef4-f26df874822d/r) - Used tools and provided specific recent developments with concrete details (Score: 1)
+   - **Query**: "What are the pros and cons of renewable energy, and what are the latest developments?"
+   - **Expected**: Search for information and provide structured analysis
+   - **Rubric**: Evaluates complex analytical tasks and comprehensive research
+   - **Example Results**:
+     - ✅ **Success**: [Agent performed search and analytical synthesis](https://smith.langchain.com/public/59157ed9-d185-4e3f-99dd-d898a18a4178/r) - Used tools to gather current information and provided structured pros/cons analysis with recent developments (Score: 1)
+     - ❌ **Potential Failures**: Agents that provide only generic pros/cons without search, or use tools but lack structured analysis of current developments

 ## Evaluation Criteria

-Each agent trajectory is evaluated using the **expert data labeler** methodology:
+Each agent trajectory is evaluated using **scenario-specific rubrics** with LangSmith integration:

-### Rubric
-An accurate trajectory:
-- Makes logical sense between steps
-- Shows clear progression
-- Is relatively efficient, though it does not need to be perfectly efficient
-- Is semantically equivalent to the provided reference trajectory, if present
+### Evaluation Approach
+- **Scenario-specific evaluators**: Each test scenario has custom evaluation criteria
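
To make the idea of scenario-specific evaluators concrete, here is a minimal Python sketch of how each scenario name could be paired with its own query and rubric and rendered into a judge prompt. It is an illustration only, not the repository's implementation and not the AgentEvals or LangSmith API: `Scenario`, `SCENARIOS`, `build_judge_prompt`, and `used_tools` are hypothetical names. The queries, rubric wording, and node names (`call_model`, `tools`) are taken from the README text above. Only the two scenarios whose identifiers appear in the diff are included.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Scenario:
    """One test scenario with its own rubric (hypothetical structure)."""
    name: str
    query: str
    rubric: str


# Scenario names, queries, and rubric text come from the README diff;
# the dictionary layout itself is an assumption for illustration.
SCENARIOS = {
    "simple_question": Scenario(
        name="simple_question",
        query="What is the capital of France?",
        rubric="Evaluates efficiency and appropriate confidence for basic facts",
    ),
    "search_required": Scenario(
        name="search_required",
        query="What's the latest news about artificial intelligence?",
        rubric="Evaluates search tool usage and information synthesis",
    ),
}


def build_judge_prompt(scenario: Scenario, steps: list[str]) -> str:
    """Render the scenario rubric and the visited graph nodes as judge input."""
    trajectory = " -> ".join(steps)
    return (
        f"Scenario: {scenario.name}\n"
        f"Query: {scenario.query}\n"
        f"Rubric: {scenario.rubric}\n"
        f"Trajectory: {trajectory}"
    )


def used_tools(steps: list[str]) -> bool:
    """Cheap deterministic pre-check: did the trajectory visit the `tools` node?"""
    return "tools" in steps
```

In this sketch, the failing `simple_question` example from the README would correspond to a trajectory such as `["call_model", "tools", "call_model"]`, for which `used_tools` returns `True`. The actual 0/1 score in the README comes from an LLM judge, which this check does not replace.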