Commit 5e6b1cd

✨ feat: initialize ReAct agent evaluation system
- Add AgentEvals-based graph trajectory evaluation framework
- Implement parameterized testing (2 models × 3 scenarios = 6 tests)
- Use expert data labeler methodology with boolean scoring
- Add custom JSON output guardrails for structured evaluation
- Support async execution for performance optimization

#### Key Features

- DeepSeek R1 evaluator with custom prompt template
- Boolean pass/fail scoring (assert score == true)
- Trajectory normalization for LangChain message serialization
- Individual test environment validation (no global skipping)
- Comprehensive test coverage across agent models and scenarios

#### Test Coverage

- Agent models: Qwen3-8B, GLM-4-9B-0414 (via SiliconFlow)
- Evaluator: DeepSeek-R1-0528-Qwen3-8B
- Scenarios: simple questions, search-required, multi-step reasoning
- All 6 combinations validate trajectory quality and efficiency

#### Infrastructure

- Add agentevals and openevals dependencies
- Update Makefile with test_evaluations target
- Clean up unused imports and global pytest skipping
- Add comprehensive README documentation
1 parent 469d07e commit 5e6b1cd

File tree

9 files changed, +692 −10 lines changed


Makefile

Lines changed: 5 additions & 1 deletion
@@ -1,4 +1,4 @@
-.PHONY: all format lint test test_unit test_integration test_e2e test_all test_watch test_watch_unit test_watch_integration test_watch_e2e test_profile extended_tests dev dev_ui
+.PHONY: all format lint test test_unit test_integration test_e2e test_all test_evaluations test_watch test_watch_unit test_watch_integration test_watch_e2e test_profile extended_tests dev dev_ui
 
 # Default target executed when no arguments are given to make.
 all: help
@@ -20,6 +20,9 @@ test_integration:
 test_e2e:
 	uv run python -m pytest tests/e2e_tests/
 
+test_evaluations:
+	uv run python -m pytest tests/evaluations/ -v
+
 test_all:
 	uv run python -m pytest tests/
 
@@ -109,6 +112,7 @@ help:
 	@echo 'test_unit - run unit tests only'
 	@echo 'test_integration - run integration tests only'
 	@echo 'test_e2e - run e2e tests only'
+	@echo 'test_evaluations - run graph trajectory evaluation tests'
 	@echo 'test_all - run all tests (unit + integration + e2e)'
 	@echo 'test_watch - run unit tests in watch mode'
 	@echo 'test_watch_unit - run unit tests in watch mode'

pyproject.toml

Lines changed: 3 additions & 1 deletion
@@ -4,7 +4,7 @@ version = "0.1.2"
 description = "Starter template for making a custom Reasoning and Action agent (using tool calling) in LangGraph."
 authors = [
     { name = "Haili Zhang", email = "[email protected]" },
-    { name = "Claude", email = "[email protected]" },
+    { name = "Claude Code", email = "claude-code@anthropic.com" },
 ]
 readme = "README.md"
 license = { text = "MIT" }
@@ -75,4 +75,6 @@ dev = [
     "langgraph-sdk>=0.1.0",
     "mypy>=1.17.1",
     "ruff>=0.9.10",
+    "openevals>=0.1.0",
+    "agentevals>=0.0.9",
 ]

tests/conftest.py

Lines changed: 2 additions & 8 deletions
@@ -1,6 +1,5 @@
 """Pytest configuration and shared fixtures."""
 
-import os
 from pathlib import Path
 
 import pytest
@@ -22,13 +21,8 @@ def load_env():
     if env_file.exists():
         load_dotenv(env_file)
 
-    # Ensure required environment variables are available for tests
-    # You can add fallback values or skip tests if keys are missing
-    required_keys = ["DASHSCOPE_API_KEY", "TAVILY_API_KEY", "SILICONFLOW_API_KEY"]
-    missing_keys = [key for key in required_keys if not os.getenv(key)]
-
-    if missing_keys:
-        pytest.skip(f"Missing required environment variables: {missing_keys}")
+    # Note: Individual tests will check for their specific required keys
+    # and skip appropriately. We don't globally skip all tests here.
 
 
 @pytest.fixture
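
A minimal sketch of the per-test pattern that replaces the global skip (the `require_env` helper is illustrative and not part of this commit; the actual checks live in the individual test modules):

```python
import os

import pytest


def require_env(*keys: str) -> None:
    """Skip the calling test if any required environment variable is unset."""
    missing = [key for key in keys if not os.getenv(key)]
    if missing:
        pytest.skip(f"Missing required environment variables: {missing}")


# Example usage inside an evaluation test:
# require_env("SILICONFLOW_API_KEY", "TAVILY_API_KEY")
```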

tests/evaluations/README.md

Lines changed: 127 additions & 0 deletions

# Graph Trajectory Evaluation Tests

This directory contains evaluation tests for the ReAct agent, using the AgentEvals framework with the graph trajectory LLM-as-judge methodology.

## Overview

The evaluation system tests the ReAct agent's performance across multiple scenarios using:

- **Agent Models**:
  - `siliconflow:Qwen/Qwen3-8B`
  - `siliconflow:THUDM/GLM-4-9B-0414`

- **Evaluator Model**:
  - `siliconflow:deepseek-ai/DeepSeek-R1-0528-Qwen3-8B` (expert data labeler)

- **Evaluation Method**:
  - Graph trajectory LLM-as-judge with async execution (see the sketch after this list)
  - Boolean scoring (`true`/`false`) with an expert data labeler prompt
  - Custom JSON output guardrails for structured evaluation
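As a rough sketch of what wiring up such a judge can look like (this follows the judge factory documented in the agentevals README; the actual `graph.py` uses a custom expert data labeler prompt, JSON output guardrails, and a SiliconFlow-hosted DeepSeek model, so the values below are placeholders):

```python
# Sketch only: assumes agentevals' documented graph-trajectory judge factory.
from agentevals.graph_trajectory.llm import create_graph_trajectory_llm_as_judge

trajectory_judge = create_graph_trajectory_llm_as_judge(
    # Placeholder model string; this project routes to DeepSeek-R1-0528-Qwen3-8B
    # on SiliconFlow and overrides the default prompt.
    model="openai:o3-mini",
)

# A trajectory extracted from the LangGraph thread is then passed to the judge,
# which returns a score and reasoning that the test asserts on.
```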
## Files

- `graph.py` - Main evaluation implementation with async graph trajectory evaluation
- `utils.py` - Utility functions for evaluation helpers and metrics
- `conftest.py` - Pytest configuration and fixtures for evaluation tests
- `README.md` - This documentation file

## Requirements

Before running evaluations, ensure you have the required environment variables set:

```bash
# Required for all evaluations
export TAVILY_API_KEY="your_tavily_api_key"
export SILICONFLOW_API_KEY="your_siliconflow_api_key"  # For both agents and evaluator

# Optional: set region for the SiliconFlow API
export REGION="prc"  # or "international"
```
## Running Evaluations

### Quick Test

Run the evaluation tests via the Makefile target:

```bash
make test_evaluations
```

Or use pytest directly:

```bash
uv run python -m pytest tests/evaluations/ -v
```

### Comprehensive Evaluation

Run the full parameterized suite in `graph.py` across all models and scenarios:

```bash
uv run python -m pytest tests/evaluations/graph.py::test_graph_trajectory_evaluation -v
```

This runs **2 models × 3 scenarios = 6 parameterized test combinations**.
## Test Scenarios

The evaluation includes these test scenarios:

1. **Simple Question** - Direct factual queries that don't require tool usage
2. **Search Required** - Queries requiring web search for current information
3. **Multi-step Reasoning** - Complex queries requiring both search and structured analysis
## Evaluation Criteria

Each agent trajectory is evaluated using the **expert data labeler** methodology:

### Rubric

An accurate trajectory:

- Makes logical sense between steps
- Shows clear progression
- Is relatively efficient, though it does not need to be perfectly efficient
- Is semantically equivalent to the provided reference trajectory, if present

### Scoring

- **Boolean scoring**: `true` (passes evaluation) or `false` (fails evaluation)
- **JSON output**: `{"score": true/false, "reasoning": "explanation"}` (see the sketch below)
- **Assertion**: All tests must return `true` to pass
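For illustration, parsing and asserting on this contract can be as simple as the sketch below (the real parsing and assertion logic lives in `graph.py`/`utils.py`):

```python
# Sketch: turn the judge's guardrailed JSON output into a pass/fail assertion.
import json

raw_output = '{"score": true, "reasoning": "Trajectory is logical and efficient."}'
verdict = json.loads(raw_output)

assert verdict["score"] is True, verdict["reasoning"]
```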
## Configuration

Modify evaluation parameters in `graph.py`:

- `AGENT_MODELS`: List of models to test as agents (currently 2 models)
- `EVALUATOR_MODEL`: Model to use as the LLM judge (DeepSeek R1)
- `TEST_SCENARIOS`: Test cases with queries and expected behaviors (currently 3 scenarios)
## Test Architecture

### Parameterized Testing

- Uses `@pytest.mark.parametrize` to create all model-scenario combinations (see the sketch below)
- **Total combinations**: 2 models × 3 scenarios = 6 unique tests
- Each combination runs exactly once with a unique thread ID
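The shape of that parameterization looks roughly like this sketch (scenario entries are abbreviated to names here; the real definitions with queries and expected behaviors live in `graph.py`):

```python
import pytest

AGENT_MODELS = [
    "siliconflow:Qwen/Qwen3-8B",
    "siliconflow:THUDM/GLM-4-9B-0414",
]

TEST_SCENARIOS = ["simple_question", "search_required", "multi_step_reasoning"]


@pytest.mark.parametrize("scenario", TEST_SCENARIOS)
@pytest.mark.parametrize("agent_model", AGENT_MODELS)
def test_graph_trajectory_evaluation(agent_model: str, scenario: str) -> None:
    # Stacked parametrize decorators yield 2 models x 3 scenarios = 6 test cases.
    ...
```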
### Key Features

- **Custom prompt**: Expert data labeler with JSON output guardrails
- **Boolean assertions**: Each evaluation must return `true` to pass
- **Trajectory normalization**: Handles LangChain message serialization (see the sketch below)
- **Error handling**: Graceful handling of API failures and missing keys
- **Async execution**: Efficient concurrent evaluation
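A minimal sketch of what such normalization can look like, assuming LangChain `BaseMessage` objects (the project's actual implementation lives in `utils.py` and may differ):

```python
from langchain_core.messages import BaseMessage


def normalize_messages(messages: list[BaseMessage]) -> list[dict]:
    """Convert LangChain messages into plain, JSON-serializable dicts."""
    return [{"role": message.type, "content": message.content} for message in messages]
```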
## Output

Evaluation results include:

- **Individual test results**: Boolean pass/fail with reasoning for each model-scenario combination
- **Pytest summary**: Clear pass/fail status for all 6 combinations
- **Execution time**: Total time for the complete evaluation suite
- **Detailed logging**: Model names, scenarios, scores, and reasoning text

## Notes

- **No global test skipping**: Individual tests check their own required environment variables
- **Environment validation**: Tests skip gracefully when API keys are missing
- **Async implementation**: Efficient execution across multiple model-scenario combinations
- **Production ready**: All linting issues resolved; clean codebase

tests/evaluations/__init__.py

Lines changed: 1 addition & 0 deletions
"""Evaluation tests for the ReAct agent using AgentEvals framework."""

tests/evaluations/conftest.py

Lines changed: 32 additions & 0 deletions
"""Configuration for evaluation tests."""

import pytest


def pytest_configure(config):
    """Configure pytest for evaluation tests."""
    # Register custom markers
    config.addinivalue_line(
        "markers",
        "slow: marks tests as slow (deselect with '-m \"not slow\"')"
    )
    config.addinivalue_line(
        "markers",
        "evaluation: marks tests as evaluation tests"
    )


@pytest.fixture(scope="session")
def evaluation_config():
    """Fixture providing evaluation configuration."""
    return {
        "timeout": 300,  # 5 minute timeout for evaluation tests
        "retry_count": 2,  # Retry failed evaluations
    }


@pytest.fixture(scope="function")
def evaluation_thread_id(request):
    """Generate unique thread ID for evaluation tests."""
    test_name = request.node.name
    return f"eval_test_{test_name}_{id(request)}"
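
A hypothetical test showing how these markers and fixtures could be consumed (illustrative only; the real tests live in `graph.py`):

```python
import pytest


@pytest.mark.evaluation
@pytest.mark.slow
def test_uses_evaluation_fixtures(evaluation_config, evaluation_thread_id):
    # Session-scoped config supplies the timeout/retry budget; each test
    # gets its own thread ID for isolating agent state.
    assert evaluation_config["timeout"] == 300
    assert evaluation_thread_id.startswith("eval_test_")
```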
