Commit 595f1fd

rename: ToolCallSequenceMatchGrader -> ToolCallStepSequenceMatchGrader (#68)

* feat: add ToolCallSequenceMatchSimpleGrader for precision/recall evaluation
  - Add ToolCallSequenceMatchSimpleGrader supporting precision/recall metrics
  - Support flexible matching with/without arguments
  - Add comprehensive test suite with 21 test cases
  - Update documentation in overview.md and agent_graders.md
  - Emphasize ToolCallSequenceMatchGrader for multi-step complex scenarios
* chore: add uv.lock to .gitignore
* fix for precommit
* fix: fix line length issue in tool_call_sequence_match_simple.py
* fix for precommit
* refactor: replace tool_call_sequence_match with precision_recall_match and step_sequence_match
* fix for pre-commit
1 parent 717fb74 commit 595f1fd

7 files changed: +763 -32 lines changed

docs/built_in_graders/agent_graders.md

Lines changed: 97 additions & 9 deletions
@@ -12,7 +12,8 @@ Evaluate AI agent behavior across actions, tools, memory, planning, reflection,
 | | `ToolCallAccuracyGrader` | Evaluates tool call accuracy | LLM-Based | 1-5 | API-based assistants |
 | | `ToolCallSuccessGrader` | Checks technical execution success | LLM-Based | {0, 1} | Production agent monitoring |
 | | `ToolParameterCheckGrader` | Validates parameter correctness | LLM-Based | {0, 1} | Slot-filling dialogues |
-| | `ToolCallSequenceMatchGrader` | Compares tool call sequences | Code-Based | [0, 1] | Benchmark evaluation |
+| | `ToolCallStepSequenceMatchGrader` | Multi-step tool sequence matching with step alignment | Code-Based | [0, 1] | Complex multi-turn agent benchmarks |
+| | `ToolCallPrecisionRecallMatchGrader` | Simple precision/recall for flat tool call lists | Code-Based | [0, 1] | Single-step tool call evaluation |
 | **Memory** | `MemoryAccuracyGrader` | Validates memory factuality | LLM-Based | {0, 1} | Memory-augmented agents |
 | | `MemoryDetailPreservationGrader` | Checks detail retention | LLM-Based | {0, 1} | Long-horizon tasks |
 | | `MemoryRetrievalEffectivenessGrader` | Assesses memory retrieval | LLM-Based | {0, 1} | RAG-based agents |
@@ -444,17 +445,17 @@ Score: 1.0
 Reason: The tool call correctly extracted all required parameters from the user query. The 'pattern' parameter was set to '*.py', which accurately reflects the intent to search for Python files. The 'directory' parameter was set to 'src', matching the specified directory in the query. Both parameters are present, grounded in the query, and formatted correctly as strings. There are no hallucinations or missing parameters, and the data types align with the tool's definition. The tool call is fully executable with correct parameters.
 ```

-### ToolCallSequenceMatchGrader
+### ToolCallStepSequenceMatchGrader

-Compares agent tool call sequences against reference sequences.
+Evaluates multi-step tool call sequences with step-by-step alignment against reference sequences. Designed for complex multi-turn agent scenarios where tool calls are organized by steps (turns).

 **Use this grader for:**

-- Benchmark evaluation against ground truth
-- Trajectory comparison and validation
-- A/B testing different agent implementations
+- **Multi-step agent benchmarks** — Evaluate agents that make multiple tool calls across different turns
+- **Step-aligned trajectory comparison** — Compare tool sequences where order and grouping by steps matter
+- **Complex agentic workflows** — Validate multi-turn conversations with tool calls organized by assistant turns

-**Evaluation criteria:** Strict mode matches name + parameters; loose mode matches name only.
+**Evaluation criteria:** Supports step-by-step matching or Jaccard similarity. Strict mode matches name + parameters; loose mode matches name only.

 **Parameters:**

@@ -475,10 +476,10 @@ Compares agent tool call sequences against reference sequences.

 ```python
 import asyncio
-from openjudge.graders.agent import ToolCallSequenceMatchGrader
+from openjudge.graders.agent import ToolCallStepSequenceMatchGrader

 async def main():
-    grader = ToolCallSequenceMatchGrader(
+    grader = ToolCallStepSequenceMatchGrader(
         strict_mode=True,
         use_jaccard_similarity=True
     )
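
To make the matching semantics above concrete: the hunk's docs describe strict/loose matching and an optional Jaccard score, but not the internals. The sketch below is an illustrative assumption, not openjudge source; `call_key` and `jaccard_similarity` are hypothetical helpers, and tool calls are assumed to be dicts with `name` and `arguments` keys as in the documented examples.

```python
# Hypothetical sketch of the documented matching semantics -- NOT the
# actual openjudge implementation.
import json
from typing import Any, Dict, List


def call_key(call: Dict[str, Any], strict: bool) -> str:
    # Strict mode keys on name + arguments; loose mode keys on name only.
    if strict:
        return call["name"] + "|" + json.dumps(call.get("arguments", {}), sort_keys=True)
    return call["name"]


def jaccard_similarity(
    predicted: List[Dict[str, Any]],
    reference: List[Dict[str, Any]],
    strict: bool = True,
) -> float:
    # |intersection| / |union| over the two sets of call keys.
    pred = {call_key(c, strict) for c in predicted}
    ref = {call_key(c, strict) for c in reference}
    if not pred and not ref:
        return 1.0  # both empty: treat as perfect agreement
    return len(pred & ref) / len(pred | ref)


calls = [{"name": "search", "arguments": {"query": "python"}}]
print(jaccard_similarity(calls, calls))  # 1.0, as in the sample output below
```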
@@ -514,6 +515,93 @@ Score: 1.0
 Reason: Tool call sequence evaluation (strict mode, jaccard): jaccard_similarity=1.000
 ```

+### ToolCallPrecisionRecallMatchGrader
+
+Computes precision or recall metrics for tool calls against reference.
+
+**Use this grader for:**
+
+- Simple tool call evaluation without step/order consideration
+- Computing precision (correctness of predictions) or recall (coverage of reference)
+- Flexible matching with or without argument comparison
+
+**Evaluation criteria:** Compares predicted tool calls against reference tool calls using set-based matching.
+
+**Parameters:**
+
+| Parameter | Type | Required | Description |
+|-----------|------|----------|-------------|
+| `tool_calls` | List[Dict[str, Any]] | Yes | Predicted tool calls to evaluate |
+| `reference_tool_calls` | List[Dict[str, Any]] | Yes | Ground truth reference tool calls |
+| `metric_type` | str | No | "precision" or "recall" (default: "recall") |
+| `match_arguments` | bool | No | Match name + arguments (True) or name only (False), default: False |
+
+**Scoring:**
+- **Precision**: Correct predictions / Total predictions
+- **Recall**: Correct predictions / Total references
+- Range: 0.0 (no match) to 1.0 (perfect match)
+
+**Example:**
+
+```python
+import asyncio
+from openjudge.graders.agent import ToolCallPrecisionRecallMatchGrader
+
+async def main():
+    # Compute recall with loose matching (name only)
+    grader = ToolCallPrecisionRecallMatchGrader(
+        metric_type="recall",
+        match_arguments=False
+    )
+
+    tool_calls = [
+        {"name": "search", "arguments": {"query": "python"}},
+        {"name": "calculate", "arguments": {"expr": "1+1"}}
+    ]
+
+    reference_tool_calls = [
+        {"name": "search", "arguments": {"query": "test"}},
+        {"name": "calculate", "arguments": {"expr": "2+2"}},
+        {"name": "send_email", "arguments": {"to": "user@example.com"}}
+    ]
+
+    result = await grader.aevaluate(
+        tool_calls=tool_calls,
+        reference_tool_calls=reference_tool_calls
+    )
+
+    print(f"Score: {result.score}")  # 0.667 - 2 out of 3 references matched
+    print(f"Precision: {result.metadata['precision']}")
+    print(f"Recall: {result.metadata['recall']}")
+
+asyncio.run(main())
+```
+
+**Output:**
+
+```
+Score: 0.6666666666666666
+Precision: 1.0
+Recall: 0.6666666666666666
+```
+
+**Strict matching example:**
+
+```python
+# Compute precision with strict matching (name + arguments)
+grader = ToolCallPrecisionRecallMatchGrader(
+    metric_type="precision",
+    match_arguments=True
+)
+
+result = await grader.aevaluate(
+    tool_calls=[{"name": "search", "arguments": {"query": "python"}}],
+    reference_tool_calls=[{"name": "search", "arguments": {"query": "python"}}]
+)
+
+print(f"Score: {result.score}")  # 1.0 - exact match
+```
+

 ## Memory Graders

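The scoring rules in this hunk can be checked with plain set arithmetic. A minimal sketch (assuming the documented name-only matching, i.e. `match_arguments=False`; not openjudge source) that reproduces the example's numbers:

```python
# Worked check of the documented example: predicted calls {search, calculate}
# against references {search, calculate, send_email}, matched by name only.
predicted = ["search", "calculate"]
reference = ["search", "calculate", "send_email"]

matched = len(set(predicted) & set(reference))  # 2 names match
precision = matched / len(predicted)            # 2 / 2 = 1.0
recall = matched / len(reference)               # 2 / 3 = 0.666...

print(precision, recall)  # 1.0 0.6666666666666666, matching the output above
```
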
docs/built_in_graders/overview.md

Lines changed: 2 additions & 1 deletion
@@ -55,7 +55,8 @@ Comprehensive evaluation for AI agents across the entire lifecycle. [→ Detaile
 |--------|-------------|------|-------------|
 | `ToolSelectionGrader` | Evaluates appropriateness of tool selection | LLM-Based | 1-5 |
 | `ToolCallAccuracyGrader` | Checks tool call correctness | LLM-Based | 1-5 |
-| `ToolCallSequenceMatchGrader` | Validates tool call sequence | Code-Based | {0, 1} |
+| `ToolCallStepSequenceMatchGrader` | Multi-step tool sequence matching with step alignment for complex multi-turn agents | Code-Based | [0, 1] |
+| `ToolCallPrecisionRecallMatchGrader` | Simple precision/recall for flat tool call lists (single-step scenarios) | Code-Based | [0, 1] |
 | `ToolCallSuccessGrader` | Checks if tool calls succeeded | LLM-Based | {0, 1} |
 | `ToolParameterCheckGrader` | Validates tool parameters | LLM-Based | {0, 1} |
openjudge/graders/agent/__init__.py

Lines changed: 2 additions & 1 deletion
@@ -23,9 +23,10 @@
 )
 from .reflection.reflection_progress_awareness import ReflectionProgressAwarenessGrader
 from .tool.tool_call_accuracy import ToolCallAccuracyGrader
+from .tool.tool_call_precision_recall_match import ToolCallPrecisionRecallMatchGrader

 # Tool graders
-from .tool.tool_call_sequence_match import ToolCallSequenceMatchGrader
+from .tool.tool_call_step_sequence_match import ToolCallStepSequenceMatchGrader
 from .tool.tool_call_success import ToolCallSuccessGrader
 from .tool.tool_parameter_check import ToolParameterCheckGrader
 from .tool.tool_selection import ToolSelectionGrader

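For downstream code, the practical effect of this file's diff is a rename of the exported grader. A minimal before/after sketch of the import change (the old symbol is gone after this commit):

```python
# Before this commit (removed; importing it now raises ImportError):
# from openjudge.graders.agent import ToolCallSequenceMatchGrader

# After this commit, the renamed grader and the new one are both exported:
from openjudge.graders.agent import (
    ToolCallPrecisionRecallMatchGrader,
    ToolCallStepSequenceMatchGrader,
)
```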