Commit b0b96d0
Feat/ref hallucination arena (#120)

- **docs**: add Reference Hallucination Arena benchmark documentation and examples. Add comprehensive documentation for the Reference Hallucination Arena benchmark, which evaluates LLM reference recommendation accuracy by verifying citations against Crossref, PubMed, arXiv, and DBLP. The dataset is hosted on HuggingFace (OpenJudge/ref-hallucination-arena).
  - Add docs/validating_graders/ref_hallucination_arena.md with a full guide
  - Add cookbooks/ref_hallucination_arena/examples/ with config templates (config.yaml, minimal_config.yaml, queries_example.json)
  - Update mkdocs.yml to include the new doc in navigation
  - Update validating_graders/overview.md with cross-references
- **feat**: add Reference Hallucination Arena cookbook implementation. Add the complete ref_hallucination_arena cookbook with pipeline, verifiers, collectors, scoring, and reporting modules for evaluating LLM reference recommendation accuracy against Crossref, PubMed, arXiv, and DBLP.
- **refactor**: simplify the ref_hallucination_arena cookbook and update docs. Streamline the collectors, verifiers, scoring, reporting, and pipeline modules by removing redundant code; update documentation accordingly.
- **fix**: address code review issues in ref_hallucination_arena
  - Fix the buggy BibTeX field regex in bib_extractor.py: split it into separate patterns for brace-delimited, quote-delimited, and unquoted numeric values
  - Fix dead fallback logic in pipeline.py: use `not` instead of `is None`, since responses are initialized as empty dicts, not None
  - Eliminate duplicate config loading in __main__.py: load the config once in main() and pass the object directly to _run_evaluation()
  - Remove the unused variable `completed_queries` in response_collector.py
  - Simplify match_detail access in objective_scorer.py: use direct attribute access instead of redundant dict/getattr branching
- **fix**: resolve pre-commit hook failures
  - Import RefArenaConfig in __main__.py to fix the flake8 F821 undefined-name error
  - Reformat bib_extractor.py to satisfy black code style
- **style**: reformat bib_extractor.py with black 25.9.0 (line-length=120)
- **docs**: add Reference Hallucination Arena news to README. Add a news entry for the Reference Hallucination Arena benchmark in both the English and Chinese READMEs, and remove the outdated v0.2.0 release note from the English README.

Co-authored-by: Cursor <cursoragent@cursor.com>
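The BibTeX field-regex fix above splits one pattern into three, tried in order. A minimal, self-contained sketch of that approach (`extract_field` here is a simplified stand-in for the inner helper of the same name in bib_extractor.py; the sample `entry` string is invented for illustration):

```python
import re


def extract_field(name: str, fields_str: str):
    """Try brace-delimited, quote-delimited, then bare numeric values, in order."""
    # field = {value} -- braces can wrap values containing commas
    m = re.search(rf"{name}\s*=\s*\{{(.*?)\}}", fields_str, re.IGNORECASE | re.DOTALL)
    if m:
        return m.group(1).strip()
    # field = "value"
    m = re.search(rf'{name}\s*=\s*"(.*?)"', fields_str, re.IGNORECASE | re.DOTALL)
    if m:
        return m.group(1).strip()
    # field = 2023 (unquoted numeric, common for year)
    m = re.search(rf"{name}\s*=\s*(\d+)", fields_str, re.IGNORECASE)
    if m:
        return m.group(1).strip()
    return None


entry = 'title = {Attention Is All You Need}, journal = "NeurIPS", year = 2017'
print(extract_field("title", entry))    # Attention Is All You Need
print(extract_field("journal", entry))  # NeurIPS
print(extract_field("year", entry))     # 2017
```

Folding all three delimiter styles into a single alternation is what made the original regex buggy; three small patterns are easier to verify independently.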
1 parent 267b047 · commit b0b96d0 · 29 files changed: +4465 −3 lines

README.md

Lines changed: 2 additions & 2 deletions

```diff
@@ -47,14 +47,14 @@ It can also convert grading results into **reward signals** to help you **fine-t
 
 ## News
 
+- **2026-02-12** - 📚 **Reference Hallucination Arena** - Benchmark for evaluating LLM academic reference hallucination. 👉 [Documentation](./docs/validating_graders/ref_hallucination_arena.md)
+
 - **2026-01-27** - 🆕 **Paper Review** - Automatically review academic papers using LLM-powered evaluation. 👉 [Documentation](https://agentscope-ai.github.io/OpenJudge/applications/paper_review/)
 
 - **2026-01-27** - 🖥️ **OpenJudge UI** - A Streamlit-based visual interface for grader testing and Auto Arena. Run `streamlit run ui/app.py` to get started.
 
 - **2026-01-05** - 🏟️ **Auto Arena** - Automatically evaluate and compare multiple models without pre-existing test data. 👉 [Documentation](https://agentscope-ai.github.io/OpenJudge/applications/auto_arena/)
 
-- **2025-12-26** - Released OpenJudge v0.2.0 on [PyPI](https://pypi.org/project/py-openjudge/)[migration-guide](#migration-guide-v01x--v020)
-
 ---
 
 ## ✨ Key Features
```

README_zh.md

Lines changed: 2 additions & 0 deletions (content translated from Chinese)

```diff
@@ -112,6 +112,8 @@ OpenJudge provides **ready-to-use graders** and supports generating **scenario-specific evaluation
 ---
 ## News
 
+- **2026-02-12** - 📚 **Reference Hallucination Arena** - A benchmark for evaluating academic reference hallucination in large language models. 👉 [Documentation](./docs/validating_graders/ref_hallucination_arena.md)
+
 - **2025-12-26** - Released OpenJudge v0.2.0 on [PyPI](https://pypi.org/project/py-openjudge/) - **Major update!** This release extends our core capabilities by adding robust support for diverse evaluation scenarios on top of reward construction. By unifying reward and evaluation signals, OpenJudge v0.2.0 offers a more comprehensive approach to optimizing application performance. → [Migration guide](#迁移指南v01x--v020)
 
 - **2025-10-20** - [Auto-Rubric: Learning to Extract Generalizable Criteria for Reward Modeling](https://arxiv.org/abs/2510.17314) - We released a new paper on learning generalizable criteria for robust reward modeling.
```
cookbooks/ref_hallucination_arena/__main__.py

Lines changed: 88 additions & 0 deletions (new file)

```python
# -*- coding: utf-8 -*-
"""CLI entry point for Reference Hallucination Arena.

Usage:
    python -m cookbooks.ref_hallucination_arena --config config.yaml
    python -m cookbooks.ref_hallucination_arena --config config.yaml --save
    python -m cookbooks.ref_hallucination_arena --config config.yaml --fresh
"""

import asyncio
from pathlib import Path
from typing import Optional

import fire
from loguru import logger

from cookbooks.ref_hallucination_arena.pipeline import RefArenaPipeline
from cookbooks.ref_hallucination_arena.schema import RefArenaConfig, load_config


async def _run_evaluation(
    config: RefArenaConfig,
    save: bool = False,
    resume: bool = True,
) -> None:
    """Run the evaluation pipeline."""
    pipeline = RefArenaPipeline(config=config, resume=resume)
    result = await pipeline.evaluate()

    if save:
        pipeline.save_results(result)


def main(
    config: str,
    output_dir: Optional[str] = None,
    save: bool = False,
    fresh: bool = False,
) -> None:
    """Reference Hallucination Arena CLI.

    Evaluate LLM reference recommendation capabilities by verifying
    recommended papers against Crossref, PubMed, arXiv, and DBLP.

    Args:
        config: Path to YAML configuration file.
        output_dir: Output directory for results (overrides config).
        save: Whether to save results to file.
        fresh: Start fresh, ignoring any existing checkpoint.

    Examples:
        # Normal run (auto-resumes from checkpoint)
        python -m cookbooks.ref_hallucination_arena --config config.yaml --save

        # Start fresh
        python -m cookbooks.ref_hallucination_arena --config config.yaml --fresh --save
    """
    config_path = Path(config)
    if not config_path.exists():
        logger.error(f"Config file not found: {config}")
        return

    # Load config once and apply output_dir override
    loaded_config = load_config(str(config_path))
    if output_dir:
        loaded_config.output.output_dir = output_dir

    if fresh:
        logger.info("Starting fresh (ignoring checkpoint)")
        from cookbooks.ref_hallucination_arena.pipeline import CheckpointManager

        CheckpointManager(loaded_config.output.output_dir).clear()
    else:
        logger.info("Resume mode enabled")

    logger.info(f"Starting Reference Hallucination Arena with config: {config}")

    asyncio.run(
        _run_evaluation(
            loaded_config,
            save,
            resume=not fresh,
        )
    )


if __name__ == "__main__":
    fire.Fire(main)
```
cookbooks/ref_hallucination_arena/collectors/__init__.py

Lines changed: 9 additions & 0 deletions (new file)

```python
# -*- coding: utf-8 -*-
"""Data collectors for Reference Hallucination Arena."""

from cookbooks.ref_hallucination_arena.collectors.bib_extractor import BibExtractor
from cookbooks.ref_hallucination_arena.collectors.response_collector import (
    ResponseCollector,
)

__all__ = ["BibExtractor", "ResponseCollector"]
```
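The `BibExtractor` re-exported above tries fenced code blocks as its first extraction strategy. A minimal standalone sketch of that strategy, assuming a fence regex equivalent to the `_FENCE_PATTERN` in bib_extractor.py (the sample `response` text is invented for illustration):

```python
import re

# Equivalent to _FENCE_PATTERN: grab the body of bib/bibtex code fences
# (`{3} matches three literal backticks)
FENCE = re.compile(r"`{3}(?:bib(?:tex)?)\s*\n(.*?)`{3}", re.DOTALL | re.IGNORECASE)

tick = "`" * 3  # three backticks, built here to avoid a nested fence in this example
response = (
    "Here is a paper you may like:\n"
    + tick + "bibtex\n"
    + "@article{vaswani2017, title = {Attention Is All You Need}}\n"
    + tick + "\nHope this helps!"
)

blocks = FENCE.findall(response)
print(blocks[0].strip())  # @article{vaswani2017, title = {Attention Is All You Need}}
```

Fenced blocks are the highest-precision signal, which is why the extractor only falls back to scanning raw text when no fenced BibTeX parses successfully.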
cookbooks/ref_hallucination_arena/collectors/bib_extractor.py

Lines changed: 206 additions & 0 deletions (new file)

```python
# -*- coding: utf-8 -*-
"""Extract BibTeX references from free-text model responses."""

import re
from typing import List, Optional

from cookbooks.ref_hallucination_arena.schema import Reference


class BibExtractor:
    """Extract BibTeX entries from model responses.

    Strategies (tried in order):
    1. Extract content inside ```bib / ```bibtex code fences.
    2. Extract standalone @type{...} entries scattered in the text.
    3. Fallback: try to parse structured plain-text references.
    """

    # Matches ```bib or ```bibtex fenced code blocks
    _FENCE_PATTERN = re.compile(
        r"```(?:bib(?:tex)?)\s*\n(.*?)```",
        re.DOTALL | re.IGNORECASE,
    )

    # Matches the start of a BibTeX entry: @type{key,
    # The entry body is then delimited by brace-counting to handle nested braces
    _ENTRY_START_PATTERN = re.compile(
        r"@(\w+)\s*\{\s*([^,\s]*)\s*,",
        re.IGNORECASE,
    )

    def extract(self, response_text: str) -> List[Reference]:
        """Extract references from a model response.

        Args:
            response_text: Raw text response from the model.

        Returns:
            List of extracted Reference objects.
        """
        if not response_text:
            return []

        # Strategy 1: fenced code blocks
        fenced_content = self._extract_fenced(response_text)
        if fenced_content:
            refs = self._parse_bibtex(fenced_content)
            if refs:
                return refs

        # Strategy 2: standalone entries in text
        refs = self._parse_bibtex(response_text)
        if refs:
            return refs

        # Strategy 3: plain-text fallback (numbered references)
        return self._parse_plain_text(response_text)

    def _extract_fenced(self, text: str) -> str:
        """Extract content from ```bib/bibtex fenced blocks."""
        blocks = self._FENCE_PATTERN.findall(text)
        if blocks:
            return "\n\n".join(blocks)
        return ""

    def _parse_bibtex(self, text: str) -> List[Reference]:
        """Parse BibTeX entries using brace-counting for robustness."""
        refs = []

        for match in self._ENTRY_START_PATTERN.finditer(text):
            entry_type = match.group(1).lower()
            key = match.group(2).strip()

            # Find the matching closing brace via counting
            start = match.start()
            brace_start = text.index("{", start)
            fields_str = self._extract_braced_content(text, brace_start)
            if fields_str is None:
                continue

            ref = self._parse_fields(key, entry_type, fields_str)
            if ref:
                refs.append(ref)

        return refs

    def _extract_braced_content(self, text: str, open_pos: int) -> Optional[str]:
        """Extract content between matched braces starting at open_pos."""
        depth = 0
        for i in range(open_pos, len(text)):
            if text[i] == "{":
                depth += 1
            elif text[i] == "}":
                depth -= 1
                if depth == 0:
                    return text[open_pos + 1 : i]
        return None  # unmatched

    def _parse_fields(self, key: str, entry_type: str, fields_str: str) -> Optional[Reference]:
        """Parse individual fields from BibTeX entry body."""

        def extract_field(name: str) -> Optional[str]:
            # Match field = {value}, field = "value", or field = number
            # Try brace-delimited value first (non-greedy: stops at the first closing brace)
            brace_pattern = rf"{name}\s*=\s*\{{(.*?)\}}"
            m = re.search(brace_pattern, fields_str, re.IGNORECASE | re.DOTALL)
            if m:
                return m.group(1).strip()
            # Try quote-delimited value
            quote_pattern = rf'{name}\s*=\s*"(.*?)"'
            m = re.search(quote_pattern, fields_str, re.IGNORECASE | re.DOTALL)
            if m:
                return m.group(1).strip()
            # Try unquoted numeric value (e.g., year = 2023)
            num_pattern = rf"{name}\s*=\s*(\d+)"
            m = re.search(num_pattern, fields_str, re.IGNORECASE)
            if m:
                return m.group(1).strip()
            return None

        title = extract_field("title")
        if not title:
            return None

        # Extract arXiv ID
        arxiv_id = None
        journal = extract_field("journal") or extract_field("booktitle") or ""
        eprint = extract_field("eprint")
        if eprint:
            arxiv_id = eprint
        elif "arxiv" in journal.lower():
            arxiv_match = re.search(r"(\d{4}\.\d{4,5})", journal)
            if arxiv_match:
                arxiv_id = arxiv_match.group(1)

        # Extract PMID from note or url
        pmid = None
        note = extract_field("note") or ""
        url = extract_field("url") or ""
        pmid_match = re.search(r"(?:PMID|pmid)[:\s]*(\d+)", note + " " + url)
        if pmid_match:
            pmid = pmid_match.group(1)

        return Reference(
            key=key,
            title=title,
            authors=extract_field("author"),
            year=extract_field("year"),
            journal=journal,
            doi=extract_field("doi"),
            arxiv_id=arxiv_id,
            pmid=pmid,
            entry_type=entry_type,
        )

    def _parse_plain_text(self, text: str) -> List[Reference]:
        """Fallback: parse numbered plain-text references.

        Handles patterns like:
            1. Author et al. (2023). "Title". Journal.
            [1] Author et al., "Title", Journal, 2023.
        """
        refs = []

        # Pattern: numbered reference with quoted title
        patterns = [
            # "1. Authors (Year). Title. Journal."
            re.compile(
                r"(?:^|\n)\s*(?:\d+[\.\)]\s*|[\[\(]\d+[\]\)]\s*)"
                r"(.+?)\s*[\(\[]?(\d{4})[\)\]]?\s*[\.\,]\s*"
                r'["\u201c](.+?)["\u201d]',
                re.MULTILINE,
            ),
            # Simpler: "Title" (Year)
            re.compile(
                r'["\u201c](.+?)["\u201d]\s*[\(\[]?(\d{4})[\)\]]?',
            ),
        ]

        seen_titles = set()
        for pattern in patterns:
            for m in pattern.finditer(text):
                groups = m.groups()
                if len(groups) >= 3:
                    authors, year, title = groups[0], groups[1], groups[2]
                elif len(groups) >= 2:
                    title, year = groups[0], groups[1]
                    authors = None
                else:
                    continue

                title_lower = title.strip().lower()
                if title_lower in seen_titles or len(title_lower) < 10:
                    continue
                seen_titles.add(title_lower)

                refs.append(
                    Reference(
                        key=f"ref_{len(refs)+1}",
                        title=title.strip(),
                        authors=authors.strip() if authors else None,
                        year=year.strip(),
                    )
                )

        return refs
```
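The brace-counting used by `_extract_braced_content` can be exercised in isolation. This sketch (with `extract_braced` as a hypothetical free-function version of that method, and an invented sample entry) shows it correctly handling a nested-brace title that a non-greedy regex would truncate:

```python
def extract_braced(text: str, open_pos: int):
    """Return the content between the brace at open_pos and its matching close."""
    depth = 0
    for i in range(open_pos, len(text)):
        if text[i] == "{":
            depth += 1
        elif text[i] == "}":
            depth -= 1
            if depth == 0:
                return text[open_pos + 1 : i]
    return None  # unmatched brace


entry = "@article{kim2023, title = {The {BERT} Family}, year = 2023}"
body = extract_braced(entry, entry.index("{"))
print(body)  # kim2023, title = {The {BERT} Family}, year = 2023
```

A non-greedy pattern such as `@\w+\{(.*?)\}` would stop at the first `}` (after `{BERT`), which is why the entry body is delimited by counting rather than by regex.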
