feat: expand reference dataset with 25 new diagrams across 6 new domain categories #117
@dippatel1994
Thanks @dev-miro26 venue-specific guidelines + Before merge, please address:
Follow-ups (non-blocking): document order-dependent “balanced” sampling; consider a warning when
Happy to re-check after the curated/manifest behavior is tightened.
@dippatel1994
Thanks @dev-miro26, love the passion! Appreciate your contribution to PaperBanana.
@dippatel1994 |
…r of curated methodology diagrams from 13 to 38, updated version to 3.0.0, and expanded categories. Added multiple new images related to various research topics.
58c0b44 to 8721e72
dippatel1994 left a comment
CI passes, good dataset expansion. Two things to fix:
- Inconsistent ID format: Existing 13 entries use arxiv IDs (e.g., `2601.03570v1`). Issue #90 explicitly says "id is the arxiv ID." New entries use `pb_ref_42`, `pb_ref_24`, etc. These show up as "Paper ID" in the retriever prompt, where `pb_ref_42` is less meaningful than an arxiv ID. Please use arxiv IDs.
- Missing `source_paper` field: All 13 original entries include `"source_paper"`. None of the 25 new entries have it. Add for consistency and provenance tracking.
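Both points could be checked mechanically. The following is only a minimal sketch: it assumes each entry is a dict with `id` and `source_paper` keys and that IDs follow the new-style arxiv pattern (`YYMM.NNNNN` with an optional version suffix). The helper name `find_entry_problems` is hypothetical and not part of the repo.

```python
import re

# New-style arxiv identifier: YYMM, a dot, 4-5 digits, optional version suffix.
ARXIV_ID = re.compile(r"^\d{4}\.\d{4,5}(v\d+)?$")


def find_entry_problems(entries):
    """Return (entry id, problem) pairs for entries that break either rule.

    Hypothetical helper: assumes each entry is a dict loaded from index.json.
    """
    problems = []
    for entry in entries:
        entry_id = str(entry.get("id", ""))
        if not ARXIV_ID.match(entry_id):
            problems.append((entry_id, "id is not an arxiv ID"))
        if not entry.get("source_paper"):
            problems.append((entry_id, "missing source_paper"))
    return problems


# Illustrative entries only; the source_paper value is a placeholder.
sample = [
    {"id": "2601.03570v1", "source_paper": "placeholder title"},
    {"id": "pb_ref_42"},
]
for entry_id, problem in find_entry_problems(sample):
    print(f"{entry_id}: {problem}")
```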
Non-blocking: No tests added to validate the new entries load correctly. A lightweight test that loads real index.json and checks counts/image existence would prevent regressions.
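A rough sketch of what such a check might look like. The top-level keys `examples` and `metadata` are assumptions about the layout of `index.json`, as is resolving `image_path` relative to the index file's directory; `check_index` is a hypothetical helper name.

```python
import json
from pathlib import Path


def check_index(index_path):
    """Collect human-readable problems found in a reference index file."""
    index_path = Path(index_path)
    data = json.loads(index_path.read_text(encoding="utf-8"))
    examples = data["examples"]  # assumed key holding the entry list
    metadata = data["metadata"]  # assumed key holding total_examples
    errors = []

    # The declared total should match the number of entries actually present.
    if metadata["total_examples"] != len(examples):
        errors.append(
            f"total_examples is {metadata['total_examples']}, "
            f"found {len(examples)} entries"
        )

    # Every entry should point at an image file that exists on disk
    # (assumed to be relative to the directory containing the index).
    for example in examples:
        image = index_path.parent / example["image_path"]
        if not image.is_file():
            errors.append(f"missing image: {example['image_path']}")
    return errors
```

Inside a pytest module, a single assertion such as `assert check_index("data/reference_sets/index.json") == []` would then cover both the count and image-existence checks.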
dippatel1994 left a comment
All 3 points addressed: arxiv IDs used, source_paper added, 10 tests added. CI green. LGTM.
Could you please merge this PR?
Summary
Closes #90
The Retriever currently only has 13 reference diagrams across 4 categories (`agent_reasoning`, `vision_perception`, `generative_learning`, `science_applications`). Papers outside those domains get poor few-shot examples, which degrades Planner output.

This PR hand-picks 25 new reference diagrams from PaperBananaBench and adds 6 new domain categories, bringing the total to 38 examples across 10 categories.
New categories and entries
- `healthcare_medical`
- `robotics_control`
- `nlp_language`
- `multimodal_fusion`
- `systems_networking`
- `optimization_theory`

What changed
- `data/reference_sets/index.json`: 25 new `ReferenceExample` entries following the existing schema (`id`, `source_context`, `caption`, `image_path`, `category`, `aspect_ratio`, `structure_hints`). Metadata bumped: `version` 2.0.0 → 3.0.0, `total_examples` 13 → 38, `categories` 4 → 10.
- `data/reference_sets/images/`: 25 new diagram images extracted from PaperBananaBench.
- `prompts/diagram/retriever.txt`: Line 21 domain list extended with the 6 new domains so the VLM understands the expanded domain space during ranking.

Selection criteria
Each diagram was chosen because it:
Note on category naming
The existing `curated_expansion.json` on this branch uses slightly different category names for some overlapping concepts (e.g. `systems_architecture` vs `systems_networking`, `multimodal_learning` vs `multimodal_fusion`). These should be reconciled; happy to align in either direction based on reviewer preference.

Test plan
- pytest tests/test_pipeline/ tests/test_agents/ tests/test_reference/ tests/test_data/)
- `ReferenceStore` loads all 38 entries and `get_by_category` returns correct counts for each new category
- `image_path`
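The mismatch described under "Note on category naming" could also be surfaced automatically alongside this plan. A minimal sketch, assuming the category names from each file have already been collected into plain lists; `category_mismatches` is a hypothetical helper, and the sample values are just the four names quoted in the note, not the full contents of either file.

```python
def category_mismatches(index_categories, curated_categories):
    """Return the category names unique to each side, sorted for stable output."""
    index_set = set(index_categories)
    curated_set = set(curated_categories)
    return sorted(index_set - curated_set), sorted(curated_set - index_set)


# Illustrative input: only the names mentioned in the naming note.
only_in_index, only_in_curated = category_mismatches(
    ["systems_networking", "multimodal_fusion"],
    ["systems_architecture", "multimodal_learning"],
)
print(only_in_index)    # ['multimodal_fusion', 'systems_networking']
print(only_in_curated)  # ['multimodal_learning', 'systems_architecture']
```

Two empty lists would indicate the names have been reconciled, whichever direction is chosen.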