# Graph Trajectory Evaluation Tests

This directory contains evaluation tests for the ReAct agent, built on the AgentEvals framework's graph trajectory LLM-as-judge methodology.

## Overview

The evaluation system tests the ReAct agent's performance across multiple scenarios using:

- **Agent Models**:
  - `siliconflow:Qwen/Qwen3-8B`
  - `siliconflow:THUDM/GLM-4-9B-0414`

- **Evaluator Model**:
  - `siliconflow:deepseek-ai/DeepSeek-R1-0528-Qwen3-8B` (expert data labeler)

- **Evaluation Method**:
  - Graph trajectory LLM-as-judge with async execution
  - Boolean scoring (true/false) with an expert data labeler prompt
  - Custom JSON output guardrails for structured evaluation

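The exact wiring lives in `graph.py`; the minimal sketch below shows the general approach, assuming the `create_async_graph_trajectory_llm_as_judge` and `aextract_langgraph_trajectory_from_thread` helpers from AgentEvals and an elided judge prompt:

```python
# Sketch only -- see graph.py for the actual implementation.
from agentevals.graph_trajectory.llm import create_async_graph_trajectory_llm_as_judge
from agentevals.graph_trajectory.utils import aextract_langgraph_trajectory_from_thread

# Expert data labeler rubric plus the JSON output guardrail (see "Evaluation Criteria" below);
# the full prompt text lives in graph.py.
JUDGE_PROMPT = "..."

evaluator = create_async_graph_trajectory_llm_as_judge(
    model="siliconflow:deepseek-ai/DeepSeek-R1-0528-Qwen3-8B",
    prompt=JUDGE_PROMPT,
)

async def evaluate_thread(graph, thread_id: str) -> dict:
    # Pull the checkpointed trajectory for this thread and have the judge score it.
    trajectory = await aextract_langgraph_trajectory_from_thread(
        graph, {"configurable": {"thread_id": thread_id}}
    )
    return await evaluator(inputs=trajectory["inputs"], outputs=trajectory["outputs"])
```
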
## Files

- `graph.py` - Main evaluation implementation with async graph trajectory evaluation
- `utils.py` - Utility functions for evaluation helpers and metrics
- `conftest.py` - Pytest configuration and fixtures for evaluation tests
- `README.md` - This documentation file

## Requirements

Before running evaluations, ensure you have the required environment variables set:

```bash
# Required for all evaluations
export TAVILY_API_KEY="your_tavily_api_key"
export SILICONFLOW_API_KEY="your_siliconflow_api_key" # For both agents and evaluator

# Optional: Set region for SiliconFlow API
export REGION="prc" # or "international"
```
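
Individual tests check for these keys at run time and skip when they are missing (see Notes below); a minimal sketch of that pattern, with a hypothetical helper name:

```python
# Sketch of per-test environment validation (the helper name is illustrative).
import os

import pytest

def require_env(*names: str) -> None:
    """Skip the calling test if any required environment variable is missing."""
    missing = [name for name in names if not os.environ.get(name)]
    if missing:
        pytest.skip(f"Missing required environment variables: {', '.join(missing)}")

# Inside a test:
# require_env("TAVILY_API_KEY", "SILICONFLOW_API_KEY")
```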

## Running Evaluations

### Quick Test

Run the evaluation tests:

```bash
make test_evaluations
```

Or use pytest directly:

```bash
uv run python -m pytest tests/evaluations/ -v
```

### Comprehensive Evaluation

Run the full evaluation suite across all models and scenarios (6 total combinations):

```bash
uv run python -m pytest tests/evaluations/graph.py::test_graph_trajectory_evaluation -v
```

This runs **2 models × 3 scenarios = 6 parameterized test combinations**.

## Test Scenarios

The evaluation includes these test scenarios (see the sketch after the list):

1. **Simple Question** - Direct factual queries that don't require tool usage
2. **Search Required** - Queries requiring web search for current information
3. **Multi-step Reasoning** - Complex queries requiring both search and structured analysis

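In `graph.py` these scenarios are defined as plain data that the tests are parameterized over; a hedged sketch of what such entries might look like (field names and queries are illustrative, not the actual test data):

```python
# Illustrative shape of TEST_SCENARIOS (field names and queries are assumptions).
TEST_SCENARIOS = [
    {
        "name": "simple_question",
        "query": "What is the capital of France?",
        "expects_search": False,  # answerable without calling the search tool
    },
    {
        "name": "search_required",
        "query": "What is the latest stable release of LangGraph?",
        "expects_search": True,  # needs a Tavily web search for current information
    },
    {
        "name": "multi_step_reasoning",
        "query": "Compare the two most popular Python web frameworks and recommend one.",
        "expects_search": True,  # search plus structured analysis of the results
    },
]
```
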
## Evaluation Criteria

Each agent trajectory is evaluated using the **expert data labeler** methodology:

### Rubric

An accurate trajectory:

- Makes logical sense between steps
- Shows clear progression
- Is relatively efficient, though it does not need to be perfectly efficient
- Is semantically equivalent to the provided reference trajectory, if present

### Scoring

- **Boolean scoring**: `true` (passes evaluation) or `false` (fails evaluation)
- **JSON output**: `{"score": true/false, "reasoning": "explanation"}`
- **Assertion**: All tests must return `true` to pass

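The test parses the judge's JSON verdict and asserts on the boolean score; a sketch of that step (the helper is illustrative, the actual handling lives in `graph.py`/`utils.py`):

```python
# Illustrative handling of the judge's JSON verdict (helper name is an assumption).
import json

def parse_verdict(raw: str) -> tuple[bool, str]:
    """Extract the boolean score and the reasoning text from the judge's JSON response."""
    verdict = json.loads(raw)
    return bool(verdict["score"]), verdict.get("reasoning", "")

# Inside a test, after the evaluator responds:
# score, reasoning = parse_verdict(response_text)
# assert score is True, f"Trajectory rejected by the judge: {reasoning}"
```
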
## Configuration

Modify evaluation parameters in `graph.py`:

- `AGENT_MODELS`: List of models to test as agents (currently 2 models)
- `EVALUATOR_MODEL`: Model to use as the LLM judge (DeepSeek-R1-0528-Qwen3-8B)
- `TEST_SCENARIOS`: Test cases with queries and expected behaviors (currently 3 scenarios)

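For example, the model constants might look like the following (values mirror the Overview; adding or removing an entry changes the parameterized test matrix accordingly):

```python
# Model constants as they might appear in graph.py (sketch; values from the Overview above).
AGENT_MODELS = [
    "siliconflow:Qwen/Qwen3-8B",
    "siliconflow:THUDM/GLM-4-9B-0414",
]
EVALUATOR_MODEL = "siliconflow:deepseek-ai/DeepSeek-R1-0528-Qwen3-8B"
```
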
## Test Architecture

### Parameterized Testing

- Uses `@pytest.mark.parametrize` to create all model-scenario combinations
- **Total combinations**: 2 models × 3 scenarios = 6 unique tests
- Each combination runs exactly once with unique thread IDs

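A sketch of that parametrization, assuming the `AGENT_MODELS`/`TEST_SCENARIOS` constants sketched above and pytest-asyncio for the async test (the exact decorators in `graph.py` may differ):

```python
# Illustrative cross-product parametrization; stacked parametrize decorators
# produce one test per (scenario, agent_model) pair.
import uuid

import pytest

@pytest.mark.asyncio
@pytest.mark.parametrize("agent_model", AGENT_MODELS)
@pytest.mark.parametrize("scenario", TEST_SCENARIOS, ids=lambda s: s["name"])
async def test_graph_trajectory_evaluation(agent_model: str, scenario: dict) -> None:
    # A unique thread ID keeps each combination's checkpointed trajectory separate.
    thread_id = f"{agent_model}-{scenario['name']}-{uuid.uuid4()}"
    ...  # build the agent, run scenario["query"], extract and judge the trajectory
```
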
### Key Features

- **Custom prompt**: Expert data labeler with JSON output guardrails
- **Boolean assertions**: Each evaluation must return `true` to pass
- **Trajectory normalization**: Handles LangChain message serialization
- **Error handling**: Graceful handling of API failures and missing keys
- **Async execution**: Efficient concurrent evaluation

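Trajectory normalization mainly means turning LangChain message objects into plain, JSON-serializable dictionaries before they are rendered into the judge prompt; a minimal sketch (the helper name is illustrative):

```python
# Illustrative normalization of LangChain messages into JSON-friendly dicts.
from langchain_core.messages import BaseMessage

def normalize_messages(messages: list[BaseMessage]) -> list[dict]:
    """Convert LangChain message objects into plain dicts the judge prompt can render."""
    return [{"role": message.type, "content": message.content} for message in messages]
```
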
## Output

Evaluation results include:

- **Individual test results**: Boolean pass/fail with reasoning for each model-scenario combination
- **Pytest summary**: Clear pass/fail status for all 6 combinations
- **Execution time**: Total time for the complete evaluation suite
- **Detailed logging**: Model names, scenarios, scores, and reasoning text

## Notes

- **No global test skipping**: Individual tests check their required environment variables
- **Environment validation**: Tests skip gracefully when API keys are missing
- **Async implementation**: Efficient execution across multiple model-scenario combinations
- **Production ready**: Linting issues resolved and the codebase kept clean