* feat: add ToolCallSequenceMatchSimpleGrader for precision/recall evaluation
- Add ToolCallSequenceMatchSimpleGrader supporting precision/recall metrics
- Support flexible matching with/without arguments
- Add comprehensive test suite with 21 test cases
- Update documentation in overview.md and agent_graders.md
- Emphasize ToolCallSequenceMatchGrader for multi-step complex scenarios
* chore: add uv.lock to .gitignore
* fix for precommit
* fix: fix line length issue in tool_call_sequence_match_simple.py
* fix for precommit
* refactor: replace tool_call_sequence_match with precision_recall_match and step_sequence_match
* fix for pre-commit
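
For context, the precision/recall matching named in the first commit could work roughly as follows. This is a minimal sketch, not the code from this commit: `call_key` and `precision_recall` are hypothetical names, and the tool call format (a dict with `name` and `arguments`) is an assumption.

```python
import json
from collections import Counter


def call_key(call: dict, match_arguments: bool) -> tuple:
    """Reduce a tool call to a hashable key (hypothetical helper).

    Arguments are part of the key only when match_arguments is True.
    """
    name = call["name"]
    if not match_arguments:
        return (name,)
    # Sort keys so that argument order does not affect equality.
    return (name, json.dumps(call.get("arguments", {}), sort_keys=True))


def precision_recall(predicted: list, reference: list, match_arguments: bool = True) -> tuple:
    """Multiset precision and recall of predicted calls against reference calls."""
    pred = Counter(call_key(c, match_arguments) for c in predicted)
    ref = Counter(call_key(c, match_arguments) for c in reference)
    # Counter intersection keeps the minimum count per key, so duplicates are not over-credited.
    matched = sum((pred & ref).values())
    precision = matched / sum(pred.values()) if pred else 0.0
    recall = matched / sum(ref.values()) if ref else 0.0
    return precision, recall


predicted = [
    {"name": "search_files", "arguments": {"pattern": "*.py", "directory": "src"}},
    {"name": "read_file", "arguments": {"path": "a.py"}},
]
reference = [
    {"name": "search_files", "arguments": {"directory": "src", "pattern": "*.md"}},
]

strict = precision_recall(predicted, reference, match_arguments=True)
loose = precision_recall(predicted, reference, match_arguments=False)
```

With arguments included, the `search_files` calls differ in `pattern` and nothing matches; with names only, the one reference call is found among the two predicted calls.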
Reason: The tool call correctly extracted all required parameters from the user query. The 'pattern' parameter was set to '*.py', which accurately reflects the intent to search for Python files. The 'directory' parameter was set to 'src', matching the specified directory in the query. Both parameters are present, grounded in the query, and formatted correctly as strings. There are no hallucinations or missing parameters, and the data types align with the tool's definition. The tool call is fully executable with correct parameters.
```

-### ToolCallSequenceMatchGrader
+### ToolCallStepSequenceMatchGrader

-Compares agent tool call sequences against reference sequences.
+Evaluates multi-step tool call sequences with step-by-step alignment against reference sequences. Designed for complex multi-turn agent scenarios where tool calls are organized by steps (turns).

**Use this grader for:**

-- Benchmark evaluation against ground truth
-- Trajectory comparison and validation
-- A/B testing different agent implementations
+- **Multi-step agent benchmarks** — Evaluate agents that make multiple tool calls across different turns
+- **Step-aligned trajectory comparison** — Compare tool sequences where order and grouping by steps matter
+- **Complex agentic workflows** — Validate multi-turn conversations with tool calls organized by assistant turns

-**Evaluation criteria:** Strict mode matches name + parameters; loose mode matches name only.
+**Evaluation criteria:** Supports step-by-step matching or Jaccard similarity. Strict mode matches name + parameters; loose mode matches name only.
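
The evaluation criteria in the added line can be illustrated with a small sketch. It is not the grader's implementation: `call_key`, `jaccard`, and `step_match_score` are hypothetical names, the call format (a dict with `name` and `arguments`) is assumed, and the scoring rule for mismatched step counts is one plausible choice among several.

```python
import json


def call_key(call: dict, strict: bool) -> tuple:
    """Strict mode keys on name + parameters; loose mode keys on name only."""
    name = call["name"]
    if not strict:
        return (name,)
    # Sort keys so that parameter order does not affect equality.
    return (name, json.dumps(call.get("arguments", {}), sort_keys=True))


def jaccard(pred_calls: list, ref_calls: list, strict: bool = True) -> float:
    """Jaccard similarity |A & B| / |A | B| over the sets of call keys."""
    pred = {call_key(c, strict) for c in pred_calls}
    ref = {call_key(c, strict) for c in ref_calls}
    if not pred and not ref:
        return 1.0  # two empty sequences are treated as identical
    return len(pred & ref) / len(pred | ref)


def step_match_score(pred_steps: list, ref_steps: list, strict: bool = True) -> float:
    """Fraction of step positions whose calls agree (ignoring order within a step).

    The denominator is the longer of the two step lists, so missing or
    extra steps lower the score (an assumption of this sketch).
    """
    total = max(len(pred_steps), len(ref_steps))
    if total == 0:
        return 1.0
    hits = 0
    for i in range(min(len(pred_steps), len(ref_steps))):
        pred_keys = sorted(call_key(c, strict) for c in pred_steps[i])
        ref_keys = sorted(call_key(c, strict) for c in ref_steps[i])
        if pred_keys == ref_keys:
            hits += 1
    return hits / total


ref_steps = [
    [{"name": "search_files", "arguments": {"pattern": "*.py", "directory": "src"}}],
    [{"name": "read_file", "arguments": {"path": "a.py"}}],
]
pred_steps = [
    [{"name": "search_files", "arguments": {"directory": "src", "pattern": "*.py"}}],
    [{"name": "read_file", "arguments": {"path": "b.py"}}],
]

strict_steps = step_match_score(pred_steps, ref_steps, strict=True)
loose_steps = step_match_score(pred_steps, ref_steps, strict=False)

# Jaccard ignores step boundaries, so flatten the steps first.
flat_pred = [c for step in pred_steps for c in step]
flat_ref = [c for step in ref_steps for c in step]
strict_jaccard = jaccard(flat_pred, flat_ref, strict=True)
loose_jaccard = jaccard(flat_pred, flat_ref, strict=False)
```

In this example the second step differs only in the `path` parameter, so strict mode scores it as a miss while loose mode treats both steps as matched.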