# Graph Trajectory Evaluation Tests

This directory contains evaluation tests for the ReAct agent, built on the AgentEvals framework's graph trajectory LLM-as-judge methodology.

## Overview

The evaluation system tests the ReAct agent's performance across multiple scenarios using:

- **Agent Models**:
  - `siliconflow:Qwen/Qwen3-8B`
  - `siliconflow:THUDM/GLM-4-9B-0414`

- **Evaluator Model**:
  - `siliconflow:deepseek-ai/DeepSeek-R1-0528-Qwen3-8B` (expert data labeler)

- **Evaluation Method**:
  - Graph trajectory LLM-as-judge with async execution
  - Boolean scoring (true/false) with an expert data labeler prompt
  - Custom JSON output guardrails for structured evaluation

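The exact wiring lives in `graph.py`; the minimal sketch below shows the general approach, assuming the `create_async_graph_trajectory_llm_as_judge` and `aextract_langgraph_trajectory_from_thread` helpers from AgentEvals and an elided judge prompt:

```python
# Sketch only -- see graph.py for the actual implementation.
from agentevals.graph_trajectory.llm import create_async_graph_trajectory_llm_as_judge
from agentevals.graph_trajectory.utils import aextract_langgraph_trajectory_from_thread

# Expert data labeler rubric plus the JSON output guardrail (see "Evaluation Criteria" below);
# the full prompt text lives in graph.py.
JUDGE_PROMPT = "..."

evaluator = create_async_graph_trajectory_llm_as_judge(
    model="siliconflow:deepseek-ai/DeepSeek-R1-0528-Qwen3-8B",
    prompt=JUDGE_PROMPT,
)

async def evaluate_thread(graph, thread_id: str) -> dict:
    # Pull the checkpointed trajectory for this thread and have the judge score it.
    trajectory = await aextract_langgraph_trajectory_from_thread(
        graph, {"configurable": {"thread_id": thread_id}}
    )
    return await evaluator(inputs=trajectory["inputs"], outputs=trajectory["outputs"])
```
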
## Files

- `graph.py` - Main evaluation implementation with async graph trajectory evaluation
- `utils.py` - Utility functions for evaluation helpers and metrics
- `conftest.py` - Pytest configuration and fixtures for evaluation tests
- `README.md` - This documentation file

## Requirements

Before running evaluations, ensure you have the required environment variables set:

```bash
# Required for all evaluations
export TAVILY_API_KEY="your_tavily_api_key"
export SILICONFLOW_API_KEY="your_siliconflow_api_key" # For both agents and evaluator

# Optional: Set region for SiliconFlow API
export REGION="prc" # or "international"
```
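
Individual tests check for these keys at run time and skip when they are missing (see Notes below); a minimal sketch of that pattern, with a hypothetical helper name:

```python
# Sketch of per-test environment validation (the helper name is illustrative).
import os

import pytest

def require_env(*names: str) -> None:
    """Skip the calling test if any required environment variable is missing."""
    missing = [name for name in names if not os.environ.get(name)]
    if missing:
        pytest.skip(f"Missing required environment variables: {', '.join(missing)}")

# Inside a test:
# require_env("TAVILY_API_KEY", "SILICONFLOW_API_KEY")
```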

## Running Evaluations

### Quick Test

Run the evaluation tests:

```bash
make test_evaluations
```

Or use pytest directly:

```bash
uv run python -m pytest tests/evaluations/ -v
```

### Comprehensive Evaluation

Run the full evaluation suite across all models and scenarios (6 total combinations):

```bash
uv run python -m pytest tests/evaluations/graph.py::test_graph_trajectory_evaluation -v
```

This runs **2 models × 3 scenarios = 6 parameterized test combinations**.

## Test Scenarios

The evaluation includes these test scenarios (see the sketch after the list):

1. **Simple Question** - Direct factual queries that don't require tool usage
2. **Search Required** - Queries requiring web search for current information
3. **Multi-step Reasoning** - Complex queries requiring both search and structured analysis

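In `graph.py` these scenarios are defined as plain data that the tests are parameterized over; a hedged sketch of what such entries might look like (field names and queries are illustrative, not the actual test data):

```python
# Illustrative shape of TEST_SCENARIOS (field names and queries are assumptions).
TEST_SCENARIOS = [
    {
        "name": "simple_question",
        "query": "What is the capital of France?",
        "expects_search": False,  # answerable without calling the search tool
    },
    {
        "name": "search_required",
        "query": "What is the latest stable release of LangGraph?",
        "expects_search": True,  # needs a Tavily web search for current information
    },
    {
        "name": "multi_step_reasoning",
        "query": "Compare the two most popular Python web frameworks and recommend one.",
        "expects_search": True,  # search plus structured analysis of the results
    },
]
```
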
## Evaluation Criteria

Each agent trajectory is evaluated using the **expert data labeler** methodology:

### Rubric

An accurate trajectory:

- Makes logical sense between steps
- Shows clear progression
- Is relatively efficient, though it does not need to be perfectly efficient
- Is semantically equivalent to the provided reference trajectory, if present

### Scoring

- **Boolean scoring**: `true` (passes evaluation) or `false` (fails evaluation)
- **JSON output**: `{"score": true/false, "reasoning": "explanation"}`
- **Assertion**: All tests must return `true` to pass

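The test parses the judge's JSON verdict and asserts on the boolean score; a sketch of that step (the helper is illustrative, the actual handling lives in `graph.py`/`utils.py`):

```python
# Illustrative handling of the judge's JSON verdict (helper name is an assumption).
import json

def parse_verdict(raw: str) -> tuple[bool, str]:
    """Extract the boolean score and the reasoning text from the judge's JSON response."""
    verdict = json.loads(raw)
    return bool(verdict["score"]), verdict.get("reasoning", "")

# Inside a test, after the evaluator responds:
# score, reasoning = parse_verdict(response_text)
# assert score is True, f"Trajectory rejected by the judge: {reasoning}"
```
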
## Configuration

Modify evaluation parameters in `graph.py`:

- `AGENT_MODELS`: List of models to test as agents (currently 2 models)
- `EVALUATOR_MODEL`: Model to use as the LLM judge (DeepSeek-R1-0528-Qwen3-8B)
- `TEST_SCENARIOS`: Test cases with queries and expected behaviors (currently 3 scenarios)

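For example, the model constants might look like the following (values mirror the Overview; adding or removing an entry changes the parameterized test matrix accordingly):

```python
# Model constants as they might appear in graph.py (sketch; values from the Overview above).
AGENT_MODELS = [
    "siliconflow:Qwen/Qwen3-8B",
    "siliconflow:THUDM/GLM-4-9B-0414",
]
EVALUATOR_MODEL = "siliconflow:deepseek-ai/DeepSeek-R1-0528-Qwen3-8B"
```
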
## Test Architecture

### Parameterized Testing

- Uses `@pytest.mark.parametrize` to create all model-scenario combinations
- **Total combinations**: 2 models × 3 scenarios = 6 unique tests
- Each combination runs exactly once with unique thread IDs

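A sketch of that parametrization, assuming the `AGENT_MODELS`/`TEST_SCENARIOS` constants sketched above and pytest-asyncio for the async test (the exact decorators in `graph.py` may differ):

```python
# Illustrative cross-product parametrization; stacked parametrize decorators
# produce one test per (scenario, agent_model) pair.
import uuid

import pytest

@pytest.mark.asyncio
@pytest.mark.parametrize("agent_model", AGENT_MODELS)
@pytest.mark.parametrize("scenario", TEST_SCENARIOS, ids=lambda s: s["name"])
async def test_graph_trajectory_evaluation(agent_model: str, scenario: dict) -> None:
    # A unique thread ID keeps each combination's checkpointed trajectory separate.
    thread_id = f"{agent_model}-{scenario['name']}-{uuid.uuid4()}"
    ...  # build the agent, run scenario["query"], extract and judge the trajectory
```
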
### Key Features

- **Custom prompt**: Expert data labeler with JSON output guardrails
- **Boolean assertions**: Each evaluation must return `true` to pass
- **Trajectory normalization**: Handles LangChain message serialization
- **Error handling**: Graceful handling of API failures and missing keys
- **Async execution**: Efficient concurrent evaluation

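Trajectory normalization mainly means turning LangChain message objects into plain, JSON-serializable dictionaries before they are rendered into the judge prompt; a minimal sketch (the helper name is illustrative):

```python
# Illustrative normalization of LangChain messages into JSON-friendly dicts.
from langchain_core.messages import BaseMessage

def normalize_messages(messages: list[BaseMessage]) -> list[dict]:
    """Convert LangChain message objects into plain dicts the judge prompt can render."""
    return [{"role": message.type, "content": message.content} for message in messages]
```
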
## Output

Evaluation results include:

- **Individual test results**: Boolean pass/fail with reasoning for each model-scenario combination
- **Pytest summary**: Clear pass/fail status for all 6 combinations
- **Execution time**: Total time for the complete evaluation suite
- **Detailed logging**: Model names, scenarios, scores, and reasoning text

## Notes

- **No global test skipping**: Individual tests check their required environment variables
- **Environment validation**: Tests skip gracefully when API keys are missing
- **Async implementation**: Efficient execution across multiple model-scenario combinations
- **Production ready**: Linting issues resolved and the codebase kept clean