
Commit de9ed4f

✨ feat: integrate LangSmith pytest for evaluation system
- Refactor graph trajectory evaluation to use LangSmith integration
- Replace score assertions with real performance metrics
- Add scenario-specific evaluators with custom rubrics
- Preserve original async graph trajectory LLM-as-judge approach
- Update documentation with LangSmith integration guide
1 parent b9b1ae3 commit de9ed4f

5 files changed (+392 additions, -119 deletions)


Makefile

Lines changed: 4 additions & 0 deletions

```diff
@@ -23,6 +23,9 @@ test_e2e:
 test_evaluations:
 	uv run python -m pytest tests/evaluations/ -v
 
+test_eval_graph:
+	uv run python -m pytest -n auto tests/evaluations/graph.py -v
+
 test_all:
 	uv run python -m pytest tests/
 
@@ -113,6 +116,7 @@ help:
 	@echo 'test_integration - run integration tests only'
 	@echo 'test_e2e - run e2e tests only'
 	@echo 'test_evaluations - run graph trajectory evaluation tests'
+	@echo 'test_eval_graph - run graph evaluations in parallel (fast)'
 	@echo 'test_all - run all tests (unit + integration + e2e)'
 	@echo 'test_watch - run unit tests in watch mode'
 	@echo 'test_watch_unit - run unit tests in watch mode'
```

pyproject.toml

Lines changed: 2 additions & 0 deletions

```diff
@@ -78,4 +78,6 @@ dev = [
     "ruff>=0.9.10",
     "openevals>=0.1.0",
     "agentevals>=0.0.9",
+    "langsmith[pytest]>=0.4.16",
+    "pytest-xdist>=3.8.0",
 ]
```

tests/evaluations/README.md

Lines changed: 106 additions & 43 deletions

````diff
@@ -1,6 +1,12 @@
 # Graph Trajectory Evaluation Tests
 
-This directory contains evaluation tests for the ReAct agent using the AgentEvals framework with Graph trajectory LLM-as-judge methodology.
+This directory contains evaluation tests for the ReAct agent using the AgentEvals framework with Graph trajectory LLM-as-judge methodology and LangSmith pytest integration.
+
+## References
+
+- [AgentEvals Graph Trajectory LLM-as-Judge](https://github.com/langchain-ai/agentevals/blob/main/README.md#graph-trajectory-llm-as-judge)
+- [AgentEvals LangSmith Integration](https://github.com/langchain-ai/agentevals/blob/main/README.md#langsmith-integration)
+- [LangSmith Pytest Documentation](https://docs.langchain.com/langsmith/pytest)
 
 ## Overview
 
@@ -11,12 +17,13 @@ The evaluation system tests the ReAct agent's performance across multiple scenar
   - `siliconflow:THUDM/GLM-4-9B-0414`
 
 - **Evaluator Model**:
-  - `siliconflow:deepseek-ai/DeepSeek-R1-0528-Qwen3-8B` (expert data labeler)
+  - `siliconflow:THUDM/GLM-Z1-9B-0414` (advanced reasoning model for evaluation)
 
 - **Evaluation Method**:
-  - Graph trajectory LLM-as-judge with async execution
-  - Boolean scoring (true/false) with expert data labeler prompt
-  - Custom JSON output guardrails for structured evaluation
+  - [Graph trajectory LLM-as-judge](https://github.com/langchain-ai/agentevals/blob/main/README.md#graph-trajectory-llm-as-judge) with async execution
+  - [LangSmith pytest integration](https://docs.langchain.com/langsmith/pytest) for comprehensive tracking
+  - Scenario-specific rubrics and evaluation criteria
+  - Real performance metrics instead of pass/fail assertions
 
 ## Files
 
@@ -34,67 +41,107 @@ Before running evaluations, ensure you have the required environment variables s
 export TAVILY_API_KEY="your_tavily_api_key"
 export SILICONFLOW_API_KEY="your_siliconflow_api_key" # For both agents and evaluator
 
+# Required for LangSmith integration
+export LANGSMITH_API_KEY="your_langsmith_api_key"
+export LANGSMITH_TRACING="true"
+
 # Optional: Set region for SiliconFlow API
 export REGION="prc" # or "international"
 ```
 
 ## Running Evaluations
 
-### Quick Test
+### Graph Evaluation Tests
 
-Run a single evaluation test:
+Run the ReAct agent graph trajectory evaluation tests:
 
 ```bash
-make test_evaluations
+# Using Makefile (recommended) - Fast parallel execution
+make test_eval_graph
+
+# Or run directly with pytest
+pytest -n auto tests/evaluations/graph.py -v
 ```
 
-Or use pytest directly:
+### Specific Test Filtering
+
+Run specific scenarios or models:
 
 ```bash
-uv run python -m pytest tests/evaluations/ -v
+# Run specific scenario
+pytest -n auto tests/evaluations/graph.py -k "simple_question" -v
+
+# Run specific model
+pytest -n auto tests/evaluations/graph.py -k "Qwen3-8B" -v
+
+# Run all combinations (default)
+make test_eval_graph
 ```
 
-### Comprehensive Evaluation
+This runs **2 models × 3 scenarios = 6 parameterized test combinations** in parallel (~1-2 minutes).
+
+### LangSmith Integration (Optional)
 
-Run the full evaluation suite across all models and scenarios (6 total combinations):
+For detailed evaluation tracking and analysis:
 
 ```bash
-uv run python -m pytest tests/evaluations/graph.py::test_graph_trajectory_evaluation -v
+# Sequential execution with LangSmith dashboard
+LANGSMITH_TRACING=true pytest tests/evaluations/graph.py --langsmith-output -v
 ```
 
-This runs **2 models × 3 scenarios = 6 parameterized test combinations**.
+**Note**: LangSmith integration requires sequential execution and takes longer (~4-5 minutes).
 
 ## Test Scenarios
 
-The evaluation includes these test scenarios:
-
-1. **Simple Question** - Direct factual queries that don't require tool usage
-2. **Search Required** - Queries requiring web search for current information
-3. **Multi-step Reasoning** - Complex queries requiring both search and structured analysis
+The evaluation includes these test scenarios with scenario-specific rubrics:
+
+1. **Simple Question** (`simple_question`)
+   - **Query**: "What is the capital of France?"
+   - **Expected**: Direct answer without unnecessary tool usage
+   - **Rubric**: Evaluates efficiency and appropriate confidence for basic facts
+   - **Example Results**:
+     - **Fail**: [Agent used tools unnecessarily](https://smith.langchain.com/public/cde5921c-48fc-46a7-a8bb-6e8d31821a6f/r) - Trajectory shows `tools` node for basic factual question (Score: 0)
+     - **Success**: [Agent answered directly](https://smith.langchain.com/public/a965ad02-d4ac-4c87-8cb1-8b717ba3ca97/r) - Trajectory shows only `call_model` without tools (Score: 1)
+
+2. **Search Required** (`search_required`)
+   - **Query**: "What's the latest news about artificial intelligence?"
+   - **Expected**: Uses search tools to find current information
+   - **Rubric**: Evaluates search tool usage and information synthesis
+   - **Example Results**:
+     - **Fail**: [Agent provided generic content with links](https://smith.langchain.com/public/5b796d70-cf73-441c-a278-ff9d2493ecf2/r) - Used tools but gave generic summaries and link lists instead of specific current news (Score: 0)
+     - **Success**: [Agent synthesized actual current information](https://smith.langchain.com/public/708fb561-92f1-482a-aef4-f26df874822d/r) - Used tools and provided specific recent developments with concrete details (Score: 1)
+
+3. **Multi-step Reasoning** (`multi_step_reasoning`)
+   - **Query**: "What are the pros and cons of renewable energy, and what are the latest developments?"
+   - **Expected**: Search for information and provide structured analysis
+   - **Rubric**: Evaluates complex analytical tasks and comprehensive research
+   - **Example Results**:
+     - **Success**: [Agent performed search and analytical synthesis](https://smith.langchain.com/public/59157ed9-d185-4e3f-99dd-d898a18a4178/r) - Used tools to gather current information and provided structured pros/cons analysis with recent developments (Score: 1)
+     - **Potential Failures**: Agents that provide only generic pros/cons without search, or use tools but lack structured analysis of current developments
 
 ## Evaluation Criteria
 
-Each agent trajectory is evaluated using the **expert data labeler** methodology:
+Each agent trajectory is evaluated using **scenario-specific rubrics** with LangSmith integration:
 
-### Rubric
-An accurate trajectory:
-- Makes logical sense between steps
-- Shows clear progression
-- Is relatively efficient, though it does not need to be perfectly efficient
-- Is semantically equivalent to the provided reference trajectory, if present
+### Evaluation Approach
+- **Scenario-specific evaluators**: Each test scenario has custom evaluation criteria
+- **Async graph trajectory evaluation**: Uses [`create_async_graph_trajectory_llm_as_judge`](https://github.com/langchain-ai/agentevals/blob/main/README.md#graph-trajectory-llm-as-judge)
+- **Real performance metrics**: Actual scores (0.0, 0.5, 1.0) instead of pass/fail assertions
+- **LangSmith tracking**: All inputs, outputs, and evaluation results logged via [pytest integration](https://docs.langchain.com/langsmith/pytest)
 
 ### Scoring
-- **Boolean scoring**: `true` (passes evaluation) or `false` (fails evaluation)
-- **JSON output**: `{"score": true/false, "reasoning": "explanation"}`
-- **Assertion**: All tests must return `true` to pass
+- **Real evaluation scores**: Reflects actual agent performance
+- **Detailed reasoning**: Explanation for each evaluation decision
+- **No artificial assertions**: Tests don't fail based on evaluation scores
+- **Comprehensive feedback**: Available in LangSmith dashboard for analysis
 
 ## Configuration
 
 Modify evaluation parameters in `graph.py`:
 
 - `AGENT_MODELS`: List of models to test as agents (currently 2 models)
-- `EVALUATOR_MODEL`: Model to use as the LLM judge (DeepSeek R1)
-- `TEST_SCENARIOS`: Test cases with queries and expected behaviors (currently 3 scenarios)
+- `EVALUATOR_MODEL`: Model to use as the LLM judge (`siliconflow:THUDM/GLM-Z1-9B-0414`)
+- `TEST_SCENARIOS`: Test cases with queries, expected behaviors, and custom rubrics (currently 3 scenarios)
 
 ## Test Architecture
 
@@ -104,24 +151,40 @@ Modify evaluation parameters in `graph.py`:
 - Each combination runs exactly once with unique thread IDs
 
 ### Key Features
-- **Custom prompt**: Expert data labeler with JSON output guardrails
-- **Boolean assertions**: Each evaluation must return `true` to pass
-- **Trajectory normalization**: Handles LangChain message serialization
-- **Error handling**: Graceful handling of API failures and missing keys
-- **Async execution**: Efficient concurrent evaluation
+- **LangSmith Integration**: Full [pytest integration](https://docs.langchain.com/langsmith/pytest) with `@pytest.mark.langsmith`
+- **Scenario-specific rubrics**: Custom evaluation criteria for each test scenario
+- **Real performance metrics**: Actual evaluation scores instead of artificial assertions
+- **Comprehensive logging**: Inputs, outputs, reference outputs, and evaluation feedback
+- **Async execution**: Efficient concurrent evaluation with [AgentEvals graph trajectory approach](https://github.com/langchain-ai/agentevals/blob/main/README.md#graph-trajectory-llm-as-judge)
 
 ## Output
 
 Evaluation results include:
 
-- **Individual test results**: Boolean pass/fail with reasoning for each model-scenario combination
-- **Pytest summary**: Clear pass/fail status for all 6 combinations
-- **Execution time**: Total time for complete evaluation suite
-- **Detailed logging**: Model names, scenarios, scores, and reasoning text
+### LangSmith Dashboard
+- **LangSmith URL**: Generated for each test run with detailed analytics
+- **Comprehensive tracking**: All test inputs, outputs, and evaluation feedback
+- **Performance metrics**: Real evaluation scores and reasoning for analysis
+- **Historical comparison**: Track performance over time across models and scenarios
+
+### Test Output
+- **Individual test results**: Real evaluation scores (0.0-1.0) with detailed reasoning
+- **Test status**: All tests pass (no artificial score assertions)
+- **Execution time**: Sequential ~4-5 minutes, Parallel ~1-2 minutes
+- **Detailed logging**: Model names, scenarios, scores, and evaluation explanations
+- **Input format**: Shows actual question text (e.g., "What is the capital of France?")
+
+## Benefits of LangSmith Integration
+
+- **Real Performance Metrics**: Evaluation scores reflect actual agent capabilities
+- **Comprehensive Tracking**: All test data stored in LangSmith for detailed analysis
+- **Scenario Customization**: Each test scenario has appropriate evaluation criteria
+- **Historical Analysis**: Track performance trends across different model versions
+- **No Artificial Assertions**: Tests focus on measurement rather than pass/fail
 
 ## Notes
 
-- **No global test skipping**: Individual tests check their required environment variables
+- **LangSmith Required**: Tests require `LANGSMITH_API_KEY` environment variable
 - **Environment validation**: Tests skip gracefully when API keys are missing
-- **Async implementation**: Efficient execution across multiple model-scenario combinations
-- **Production ready**: All linting issues resolved, clean codebase
+- **Async implementation**: Uses original async graph trajectory evaluators
+- **Production ready**: All linting issues resolved, comprehensive evaluation system
````
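To make the README's description concrete, here is a minimal, hypothetical sketch of what a LangSmith-integrated graph trajectory test of this shape could look like. It is not the contents of `tests/evaluations/graph.py`: the `react_agent.graph` import, the `pytest-asyncio` marker, the first agent model string, and the trimmed `TEST_SCENARIOS` entries are assumptions made for illustration; only `create_async_graph_trajectory_llm_as_judge`, `extract_langgraph_trajectory_from_thread`, `@pytest.mark.langsmith`, and the `langsmith.testing` logging helpers come from the AgentEvals and LangSmith docs referenced above.

```python
# Hypothetical sketch only; the real tests/evaluations/graph.py may differ.
import uuid

import pytest
from agentevals.graph_trajectory.llm import create_async_graph_trajectory_llm_as_judge
from agentevals.graph_trajectory.utils import extract_langgraph_trajectory_from_thread
from langsmith import testing as t

from react_agent import graph  # assumption: compiled ReAct agent graph, built with a checkpointer

EVALUATOR_MODEL = "siliconflow:THUDM/GLM-Z1-9B-0414"
AGENT_MODELS = [
    "siliconflow:Qwen/Qwen3-8B",        # illustrative; mirror AGENT_MODELS in graph.py
    "siliconflow:THUDM/GLM-4-9B-0414",
]
TEST_SCENARIOS = [
    {"name": "simple_question", "query": "What is the capital of France?"},
    {"name": "search_required", "query": "What's the latest news about artificial intelligence?"},
    # multi_step_reasoning omitted for brevity
]


@pytest.mark.langsmith   # log inputs, outputs, and feedback to LangSmith
@pytest.mark.asyncio     # assumption: pytest-asyncio (or equivalent) runs async tests
@pytest.mark.parametrize("agent_model", AGENT_MODELS)
@pytest.mark.parametrize("scenario", TEST_SCENARIOS, ids=lambda s: s["name"])
async def test_graph_trajectory_evaluation(agent_model: str, scenario: dict) -> None:
    config = {"configurable": {"thread_id": str(uuid.uuid4()), "model": agent_model}}
    t.log_inputs({"question": scenario["query"], "agent_model": agent_model})

    # Run the agent once, then pull its full graph trajectory from the thread history.
    await graph.ainvoke({"messages": [("user", scenario["query"])]}, config)
    trajectory = extract_langgraph_trajectory_from_thread(graph, config)

    # Graph trajectory LLM-as-judge; the real tests fold scenario-specific
    # rubrics into a custom prompt here.
    evaluator = create_async_graph_trajectory_llm_as_judge(model=EVALUATOR_MODEL)
    result = await evaluator(inputs=trajectory["inputs"], outputs=trajectory["outputs"])

    # Record the real score as LangSmith feedback instead of asserting on it.
    t.log_outputs({"trajectory": trajectory["outputs"]})
    t.log_feedback(key="graph_trajectory_accuracy", score=result["score"])
```

Run in parallel via `make test_eval_graph` (pytest-xdist fans the six parameterized combinations across workers), or sequentially with `--langsmith-output` to inspect the logged inputs, outputs, and feedback in the LangSmith dashboard.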
