Commit 33db7c4

feat: add AgenticGrader and SearchCorrectnessGrader with tool support (#82)
* feat: add AgenticGrader and SearchCorrectnessGrader with tool support

* fix: improve error handling in agentic_grader and refactor search_correctness tests

* fix: sync test assertion with updated error message in openai_chat_model

* feat(agentic): polish agentic grader code and add cookbook examples

- Add openjudge/agentic module with polished code:
  - tools.py: BaseTool, ToolResult with improved type hints and docstrings
  - agents.py: BaseAgent, AgentResult, ReActAgent with _normalize_tool_call method
  - adapters/function.py: FunctionToolAdapter for wrapping Python functions
  - adapters/langchain.py: LangChainToolAdapter, LangChainAgentAdapter
  - adapters/agentscope.py: AgentScopeToolAdapter, AgentScopeAgentAdapter
- Add agentic_grader cookbook examples:
  - 01_native_react_native_tool.py: Built-in ReActAgent + Native Tool
  - 02_native_react_langchain_tool.py: Built-in ReActAgent + LangChain Tool
  - 03_langchain_agent.py: LangChain Agent (Full Delegation)
  - 04_agentscope_agent.py: AgentScope Agent (Full Delegation)
  - README.md: Documentation with architecture diagram and selection guide
- Update agentic_grader.py to use new agentic module
- Update search_correctness.py to use new agentic module

* fix: rename aevaluate to _aevaluate to match BaseGrader interface

- AgenticGrader._aevaluate: implement abstract method from BaseGrader
- SearchCorrectnessGrader._aevaluate: call parent's _aevaluate instead of aevaluate

* refactor(agentic): unified interface design for AgenticGrader

## Changes

### Core Design: Unified Interface
- AgenticGrader now only accepts a pre-built `agent` parameter
- Removed model/tools parameters from AgenticGrader.__init__
- Agent must be constructed externally (ReActAgent or adapters)

### BaseAgent Enhancement
- Support dict config for model parameter (auto-converts to OpenAIChatModel)
- Updated type hints: Union[BaseChatModel, Dict[str, Any]]

### Adapters Reorganization
- Moved LangChain/AgentScope adapters to cookbooks/agentic_grader/adapters/
- Core library only provides interface definitions (BaseAgent, BaseTool)
- Avoids circular dependencies when submitting PRs to external frameworks

### Cookbook Examples
- Simplified examples with cleaner structure
- All examples follow unified pattern: build agent first, then create grader
- Added ASReActAgent alias to avoid naming conflict with OpenJudge ReActAgent

### Documentation
- Updated docstrings to reflect unified interface design
- Fixed import paths in examples
- Added Note in from_config() explaining it's a convenience method

### Test Updates
- Updated test_search_correctness.py to use grader.agent.model instead of grader.model
1 parent b0b96d0 commit 33db7c4
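
The "Support dict config for model parameter" item in the commit message can be pictured with a small sketch. This is not the OpenJudge implementation: `StubChatModel` and `resolve_model` are hypothetical names introduced here only to illustrate accepting either a ready-made model object or a plain dict of constructor arguments.

```python
from dataclasses import dataclass
from typing import Any, Dict, Union


@dataclass
class StubChatModel:
    """Hypothetical stand-in for a chat model class (not OpenAIChatModel)."""

    model: str
    api_key: str = ""


def resolve_model(model: Union[StubChatModel, Dict[str, Any]]) -> StubChatModel:
    """Return a model instance, building one from a dict config if needed."""
    if isinstance(model, dict):
        # Dict keys are passed through as constructor keyword arguments.
        return StubChatModel(**model)
    return model


from_dict = resolve_model({"model": "qwen3-32b", "api_key": "sk-placeholder"})
from_instance = resolve_model(StubChatModel(model="qwen3-32b"))
print(type(from_dict).__name__, from_dict.model)
```

Normalizing once at construction time means the rest of an agent can assume it always holds a model object, which is presumably why the examples in this commit can pass `model={...}` directly.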

18 files changed: +3196 -5 lines changed

Lines changed: 109 additions & 0 deletions
@@ -0,0 +1,109 @@
# -*- coding: utf-8 -*-
"""
Example 1: Built-in ReActAgent + Native Tool (Zero Dependencies)

Dependencies: pip install openjudge tavily-python
Environment: OPENAI_API_KEY, TAVILY_API_KEY
"""

import asyncio
import os

from tavily import TavilyClient

from openjudge.agentic import BaseTool, ReActAgent, ToolResult
from openjudge.graders.agentic_grader import AgenticGrader


class TavilySearchTool(BaseTool):
    """Web search tool using Tavily API."""

    schema = {
        "type": "function",
        "function": {
            "name": "web_search",
            "description": "Search the web for information to verify facts.",
            "parameters": {
                "type": "object",
                "properties": {
                    "query": {
                        "type": "string",
                        "description": "The search query string.",
                    }
                },
                "required": ["query"],
            },
        },
    }

    def __init__(self):
        self._client = TavilyClient(api_key=os.getenv("TAVILY_API_KEY"))

    async def aexecute(self, query: str, **kwargs) -> ToolResult:
        response = self._client.search(query=query, max_results=3)
        results = [f"[{i}] {r['title']}: {r['content'][:200]}..." for i, r in enumerate(response.get("results", []), 1)]
        return ToolResult(success=True, output="\n".join(results))


# Create Grader
grader = AgenticGrader(
    agent=ReActAgent(
        model={"model": "qwen3-32b", "api_key": os.getenv("OPENAI_API_KEY")},
        tools=[TavilySearchTool()],
        max_iterations=5,
    ),
    template="""
You are a fact-checking assistant. Your task is to verify the factual accuracy of the given response.

**Question:** {query}
**Response to evaluate:** {response}

Instructions:
1. Identify the key factual claims in the response.
2. Use the web_search tool to verify each claim against reliable sources.
3. Compare the search results with the claims in the response.
4. Provide a score from 1-5 based on factual accuracy.

Scoring Criteria:
- 5: All factual claims are verified and accurate.
- 4: Core facts are correct with minor inaccuracies.
- 3: Partially correct, some claims are wrong.
- 2: Core facts contradict search results.
- 1: Completely inaccurate or fabricated.

Output your evaluation in JSON format:
{{"score": <1-5>, "reason": "<your detailed reasoning with search evidence>"}}
""",
    name="native_correctness_grader",
)

# Evaluate
if __name__ == "__main__":
    query = """Please introduce BYD company, including:
1. Full company name and stock code
2. Main business areas and core technologies
3. 2024 NEV sales volume and global ranking
4. Key markets and popular models"""

    response = """BYD Company Limited (Stock: A-share 002594, H-share 1211) is a global leader in new energy vehicles.

**Main Business:**
The company focuses on NEVs, power batteries, and rail transit. It masters core technologies including batteries, motors, and electronic controls, being the only company globally that masters both battery and vehicle manufacturing.

**2024 Performance:**
- NEV sales: ~4.26 million units, #1 globally for two consecutive years
- Revenue: ~620 billion RMB
- Power battery installations: Top 3 globally

**Global Presence & Models:**
- China: Dynasty series (Han, Tang, Song), Ocean series (Seal, Seagull)
- Europe: Germany, France, UK
- Southeast Asia: Thailand, Singapore, Malaysia
- South America: Factory in Brazil

BYD has become a global leader in the NEV industry through innovation and globalization."""

    result = asyncio.run(grader.aevaluate(query=query, response=response))
    print(f"Score: {result.score}")
    print(f"Reason: {result.reason}")
    print(f"Tool calls: {result.metadata.get('tool_calls', 0)}")
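
In Example 1, `aexecute` is declared `async` but calls the synchronous `TavilyClient.search`, so the event loop is blocked for the duration of each request. One possible way to avoid that is sketched below with the standard library's `asyncio.to_thread`. `BlockingSearchClient` and `search_async` are hypothetical stand-ins introduced for illustration; they are not part of the commit and do not call Tavily.

```python
import asyncio
import time


class BlockingSearchClient:
    """Hypothetical stand-in for a synchronous search client."""

    def search(self, query: str, max_results: int = 3) -> dict:
        time.sleep(0.05)  # simulate network latency with a blocking sleep
        return {"results": [{"title": f"Hit for {query}", "content": "example text"}]}


async def search_async(client: BlockingSearchClient, query: str) -> str:
    # Run the blocking call in a worker thread so the event loop stays free.
    response = await asyncio.to_thread(client.search, query=query, max_results=3)
    return "\n".join(
        f"[{i}] {r['title']}: {r['content'][:200]}"
        for i, r in enumerate(response.get("results", []), 1)
    )


async def main() -> list:
    client = BlockingSearchClient()
    # The two searches overlap instead of running back to back.
    return await asyncio.gather(search_async(client, "BYD"), search_async(client, "NEV"))


outputs = asyncio.run(main())
print(outputs)
```

This matters mostly when several evaluations share one event loop; for a single `asyncio.run` call as in the example script, the blocking call is harmless.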
Lines changed: 79 additions & 0 deletions
@@ -0,0 +1,79 @@
# -*- coding: utf-8 -*-
"""
Example 2: Built-in ReActAgent + LangChain Tool

Dependencies: pip install openjudge langchain-tavily
Environment: OPENAI_API_KEY, TAVILY_API_KEY
"""

import asyncio
import os

from langchain_tavily import TavilySearch

from cookbooks.agentic_grader.adapters.langchain import LangChainToolAdapter
from openjudge.agentic import ReActAgent
from openjudge.graders.agentic_grader import AgenticGrader

# Create Grader
grader = AgenticGrader(
    agent=ReActAgent(
        model={"model": "qwen3-32b", "api_key": os.getenv("OPENAI_API_KEY")},
        tools=[LangChainToolAdapter(TavilySearch(max_results=3))],
        max_iterations=5,
    ),
    template="""
You are a fact-checking assistant. Your task is to verify the factual accuracy of the given response.

**Question:** {query}
**Response to evaluate:** {response}

Instructions:
1. Identify the key factual claims in the response.
2. Use the available search tool to verify each claim against reliable sources.
3. Compare the search results with the claims in the response.
4. Provide a score from 1-5 based on factual accuracy.

Scoring Criteria:
- 5: All factual claims are verified and accurate.
- 4: Core facts are correct with minor inaccuracies.
- 3: Partially correct, some claims are wrong.
- 2: Core facts contradict search results.
- 1: Completely inaccurate or fabricated.

Output your evaluation in JSON format:
{{"score": <1-5>, "reason": "<your detailed reasoning with search evidence>"}}
""",
    name="langchain_tool_correctness_grader",
)

# Evaluate
if __name__ == "__main__":
    query = """Please introduce BYD company, including:
1. Full company name and stock code
2. Main business areas and core technologies
3. 2024 NEV sales volume and global ranking
4. Key markets and popular models"""

    response = """BYD Company Limited (Stock: A-share 002594, H-share 1211) is a global leader in new energy vehicles.

**Main Business:**
The company focuses on NEVs, power batteries, and rail transit. It masters core technologies including batteries, motors, and electronic controls, being the only company globally that masters both battery and vehicle manufacturing.

**2024 Performance:**
- NEV sales: ~4.26 million units, #1 globally for two consecutive years
- Revenue: ~620 billion RMB
- Power battery installations: Top 3 globally

**Global Presence & Models:**
- China: Dynasty series (Han, Tang, Song), Ocean series (Seal, Seagull)
- Europe: Germany, France, UK
- Southeast Asia: Thailand, Singapore, Malaysia
- South America: Factory in Brazil

BYD has become a global leader in the NEV industry through innovation and globalization."""

    result = asyncio.run(grader.aevaluate(query=query, response=response))
    print(f"Score: {result.score}")
    print(f"Reason: {result.reason}")
    print(f"Tool calls: {result.metadata.get('tool_calls', 0)}")
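
Example 2 relies on an adapter that lets a third-party tool stand in for a native one. The cookbook's `LangChainToolAdapter` source is not shown in this section, so the following is only a sketch of the general adapter pattern under stated assumptions: `FakeFrameworkTool`, `ToolAdapter`, and `Result` are hypothetical names, the schema is hard-coded to a single `query` string, and no LangChain code is involved.

```python
import asyncio
from dataclasses import dataclass
from typing import Any, Dict


@dataclass
class Result:
    """Hypothetical result type with the two fields used in this commit's examples."""

    success: bool
    output: str


class FakeFrameworkTool:
    """Hypothetical third-party tool exposing name, description, and ainvoke."""

    name = "web_search"
    description = "Search the web for information to verify facts."

    async def ainvoke(self, tool_input: Dict[str, Any]) -> str:
        return f"results for {tool_input['query']}"


class ToolAdapter:
    """Wrap a framework tool behind a schema + aexecute interface."""

    def __init__(self, tool: Any) -> None:
        self._tool = tool
        # Build an OpenAI-style function schema from the wrapped tool's metadata.
        self.schema = {
            "type": "function",
            "function": {
                "name": tool.name,
                "description": tool.description,
                "parameters": {
                    "type": "object",
                    "properties": {"query": {"type": "string"}},
                    "required": ["query"],
                },
            },
        }

    async def aexecute(self, **kwargs: Any) -> Result:
        try:
            output = await self._tool.ainvoke(kwargs)
        except Exception as exc:  # report failures as a result instead of raising
            return Result(success=False, output=str(exc))
        return Result(success=True, output=str(output))


adapter = ToolAdapter(FakeFrameworkTool())
result = asyncio.run(adapter.aexecute(query="BYD 2024 NEV sales"))
print(result)
```

Returning a failed result rather than raising keeps a tool error visible to the agent loop as an observation, which is one common design choice for ReAct-style agents.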
Lines changed: 87 additions & 0 deletions
@@ -0,0 +1,87 @@
# -*- coding: utf-8 -*-
"""
Example 3: LangChain Agent (Full Delegation)

Dependencies: pip install openjudge langchain langchain-openai langchain-tavily
Environment: OPENAI_API_KEY, TAVILY_API_KEY
"""

import asyncio
import os

from langchain.agents import create_agent
from langchain_openai import ChatOpenAI
from langchain_tavily import TavilySearch

from cookbooks.agentic_grader.adapters.langchain import LangChainAgentAdapter
from openjudge.graders.agentic_grader import AgenticGrader

# Create LangChain Agent
# Note: thinking mode must be disabled for qwen3 models
lc_agent = create_agent(
    ChatOpenAI(
        model="qwen3-32b",
        api_key=os.getenv("OPENAI_API_KEY"),
        extra_body={"enable_thinking": False},
    ),
    [TavilySearch(max_results=3)],
)

# Create Grader
grader = AgenticGrader(
    agent=LangChainAgentAdapter(lc_agent),
    template="""
You are a fact-checking assistant. Your task is to verify the factual accuracy of the given response.

**Question:** {query}
**Response to evaluate:** {response}

Instructions:
1. Identify the key factual claims in the response.
2. Use the available search tool to verify each claim against reliable sources.
3. Compare the search results with the claims in the response.
4. Provide a score from 1-5 based on factual accuracy.

Scoring Criteria:
- 5: All factual claims are verified and accurate.
- 4: Core facts are correct with minor inaccuracies.
- 3: Partially correct, some claims are wrong.
- 2: Core facts contradict search results.
- 1: Completely inaccurate or fabricated.

Output your evaluation in JSON format:
{{"score": <1-5>, "reason": "<your detailed reasoning with search evidence>"}}
""",
    name="langchain_agent_correctness_grader",
)

# Evaluate
if __name__ == "__main__":
    query = """Please introduce BYD company, including:
1. Full company name and stock code
2. Main business areas and core technologies
3. 2024 NEV sales volume and global ranking
4. Key markets and popular models"""

    response = """BYD Company Limited (Stock: A-share 002594, H-share 1211) is a global leader in new energy vehicles.

**Main Business:**
The company focuses on NEVs, power batteries, and rail transit. It masters core technologies including batteries, motors, and electronic controls, being the only company globally that masters both battery and vehicle manufacturing.

**2024 Performance:**
- NEV sales: ~4.26 million units, #1 globally for two consecutive years
- Revenue: ~620 billion RMB
- Power battery installations: Top 3 globally

**Global Presence & Models:**
- China: Dynasty series (Han, Tang, Song), Ocean series (Seal, Seagull)
- Europe: Germany, France, UK
- Southeast Asia: Thailand, Singapore, Malaysia
- South America: Factory in Brazil

BYD has become a global leader in the NEV industry through innovation and globalization."""

    result = asyncio.run(grader.aevaluate(query=query, response=response))
    print(f"Score: {result.score}")
    print(f"Reason: {result.reason}")
    print(f"Tool calls: {result.metadata.get('tool_calls', 0)}")
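
The grader templates in these examples write `{query}` and `{response}` with single braces but double the braces around the JSON output format. That pattern is what Python's `str.format` requires for literal braces. Whether `AgenticGrader` actually fills the template with `str.format` is an assumption here, suggested by the doubled braces; the shortened template below is illustrative only.

```python
template = (
    "**Question:** {query}\n"
    "**Response to evaluate:** {response}\n"
    "Output your evaluation in JSON format:\n"
    '{{"score": <1-5>, "reason": "<your reasoning>"}}'
)

# Doubled braces collapse to single literal braces; named fields are substituted.
prompt = template.format(query="What is BYD's stock code?", response="002594")
print(prompt)

# With single braces, str.format treats the JSON object as a replacement field.
error = None
try:
    '{"score": 1}'.format(query="q")
except KeyError as exc:
    error = exc
print(type(error).__name__)
```

When writing a custom template, any literal `{` or `}` outside the placeholders would need the same doubling under this assumption.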
Lines changed: 108 additions & 0 deletions
@@ -0,0 +1,108 @@
# -*- coding: utf-8 -*-
"""
Example 4: AgentScope Agent (Full Delegation)

Dependencies: pip install openjudge agentscope tavily-python
Environment: OPENAI_API_KEY, TAVILY_API_KEY
"""

import asyncio
import os

from agentscope.agent import ReActAgent as ASReActAgent
from agentscope.formatter import OpenAIChatFormatter
from agentscope.model import OpenAIChatModel
from agentscope.tool import Toolkit, ToolResponse
from tavily import TavilyClient

from cookbooks.agentic_grader.adapters.agentscope import AgentScopeAgentAdapter
from openjudge.graders.agentic_grader import AgenticGrader

# Create AgentScope tool
toolkit = Toolkit()
_tavily_client = TavilyClient(api_key=os.getenv("TAVILY_API_KEY"))


@toolkit.register_tool_function
def web_search(query: str) -> ToolResponse:
    """Search the web for information to verify facts.

    Args:
        query: The search query string.

    Returns:
        Search results from the web.
    """
    response = _tavily_client.search(query=query, max_results=3)
    results = [f"[{i}] {r['title']}: {r['content'][:200]}..." for i, r in enumerate(response.get("results", []), 1)]
    return ToolResponse(content="\n".join(results))


# Create AgentScope Agent
as_agent = ASReActAgent(
    name="fact_checker",
    sys_prompt="You are a fact-checking assistant.",
    model=OpenAIChatModel(api_key=os.getenv("OPENAI_API_KEY"), model_name="qwen3-32b"),
    formatter=OpenAIChatFormatter(),
    toolkit=toolkit,
    max_iters=5,
)

# Create Grader
grader = AgenticGrader(
    agent=AgentScopeAgentAdapter(as_agent),
    template="""
You are a fact-checking assistant. Your task is to verify the factual accuracy of the given response.

**Question:** {query}
**Response to evaluate:** {response}

Instructions:
1. Identify the key factual claims in the response.
2. Use the available search tool to verify each claim against reliable sources.
3. Compare the search results with the claims in the response.
4. Provide a score from 1-5 based on factual accuracy.

Scoring Criteria:
- 5: All factual claims are verified and accurate.
- 4: Core facts are correct with minor inaccuracies.
- 3: Partially correct, some claims are wrong.
- 2: Core facts contradict search results.
- 1: Completely inaccurate or fabricated.

Output your evaluation in JSON format:
{{"score": <1-5>, "reason": "<your detailed reasoning with search evidence>"}}
""",
    name="agentscope_agent_correctness_grader",
)

# Evaluate
if __name__ == "__main__":
    query = """Please introduce BYD company, including:
1. Full company name and stock code
2. Main business areas and core technologies
3. 2024 NEV sales volume and global ranking
4. Key markets and popular models"""

    response = """BYD Company Limited (Stock: A-share 002594, H-share 1211) is a global leader in new energy vehicles.

**Main Business:**
The company focuses on NEVs, power batteries, and rail transit. It masters core technologies including batteries, motors, and electronic controls, being the only company globally that masters both battery and vehicle manufacturing.

**2024 Performance:**
- NEV sales: ~4.26 million units, #1 globally for two consecutive years
- Revenue: ~620 billion RMB
- Power battery installations: Top 3 globally

**Global Presence & Models:**
- China: Dynasty series (Han, Tang, Song), Ocean series (Seal, Seagull)
- Europe: Germany, France, UK
- Southeast Asia: Thailand, Singapore, Malaysia
- South America: Factory in Brazil

BYD has become a global leader in the NEV industry through innovation and globalization."""

    result = asyncio.run(grader.aevaluate(query=query, response=response))
    print(f"Score: {result.score}")
    print(f"Reason: {result.reason}")
    print(f"Tool calls: {result.metadata.get('tool_calls', 0)}")
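
All four examples ask the agent to finish with a JSON object holding `score` and `reason`, and then read `result.score` and `result.reason`. How `AgenticGrader` extracts those values is not shown in this section. The sketch below is one plausible way to pull such a verdict out of free-form agent text using only the standard library; `parse_verdict` is a hypothetical helper, not the grader's actual parser.

```python
import json
from typing import Optional, Tuple


def parse_verdict(text: str) -> Optional[Tuple[int, str]]:
    """Return (score, reason) from the first valid JSON verdict in text, else None."""
    decoder = json.JSONDecoder()
    for start, char in enumerate(text):
        if char != "{":
            continue
        try:
            # raw_decode parses one JSON value starting at `start` and ignores what follows.
            data, _ = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            continue
        if not isinstance(data, dict):
            continue
        score, reason = data.get("score"), data.get("reason")
        # Accept only integer scores inside the 1-5 range used by the templates.
        if isinstance(score, int) and 1 <= score <= 5 and isinstance(reason, str):
            return score, reason
    return None


reply = 'Final answer: {"score": 4, "reason": "Sales figure matches {source}."} Done.'
verdict = parse_verdict(reply)
print(verdict)
```

Scanning with `raw_decode` instead of a regular expression tolerates braces inside the reason string and surrounding prose from the model.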
