Commit cb77301

Unifying aviary.litqa, aviary.labbench, and aviary.lfrqa (#262)
1 parent bbada84 commit cb77301

File tree

25 files changed: +1494 −819 lines changed


.github/workflows/publish.yml

Lines changed: 7 additions & 7 deletions

```diff
@@ -45,18 +45,18 @@ jobs:
           path: dist
       - name: Clean up aviary.hotpotqa build # Work around https://github.com/hynek/build-and-inspect-python-package/issues/174
         run: rm -r ${{ steps.build-aviary-hotpotqa.outputs.dist }}
-      - id: build-aviary-litqa
+      - id: build-aviary-labbench
        uses: hynek/build-and-inspect-python-package@v2
        with:
-          path: packages/litqa
-          upload-name-suffix: -litqa
-      - name: Download built aviary.litqa artifact to dist/
+          path: packages/labbench
+          upload-name-suffix: -labbench
+      - name: Download built aviary.labbench artifact to dist/
        uses: actions/download-artifact@v7
        with:
-          name: ${{ steps.build-aviary-litqa.outputs.artifact-name }}
+          name: ${{ steps.build-aviary-labbench.outputs.artifact-name }}
          path: dist
-      - name: Clean up aviary.litqa build # Work around https://github.com/hynek/build-and-inspect-python-package/issues/174
-        run: rm -r ${{ steps.build-aviary-litqa.outputs.dist }}
+      - name: Clean up aviary.labbench build # Work around https://github.com/hynek/build-and-inspect-python-package/issues/174
+        run: rm -r ${{ steps.build-aviary-labbench.outputs.dist }}
       - id: build-aviary-lfrqa
        uses: hynek/build-and-inspect-python-package@v2
        with:
```

.github/workflows/tests.yml

Lines changed: 7 additions & 7 deletions

```diff
@@ -69,16 +69,16 @@ jobs:
       - name: Clean up aviary.hotpotqa build # Work around https://github.com/hynek/build-and-inspect-python-package/issues/174
        if: matrix.python-version == '3.11'
        run: rm -r ${{ steps.build-hotpotqa.outputs.dist }}
-      - name: Check aviary.litqa build
-        id: build-litqa
+      - name: Check aviary.labbench build
+        id: build-labbench
        if: matrix.python-version == '3.11'
        uses: hynek/build-and-inspect-python-package@v2
        with:
-          path: packages/litqa
-          upload-name-suffix: -litqa
-      - name: Clean up aviary.litqa build # Work around https://github.com/hynek/build-and-inspect-python-package/issues/174
+          path: packages/labbench
+          upload-name-suffix: -labbench
+      - name: Clean up aviary.labbench build # Work around https://github.com/hynek/build-and-inspect-python-package/issues/174
        if: matrix.python-version == '3.11'
-        run: rm -r ${{ steps.build-litqa.outputs.dist }}
+        run: rm -r ${{ steps.build-labbench.outputs.dist }}
       - name: Check aviary.lfrqa build
        id: build-lfrqa
        if: matrix.python-version == '3.11'
@@ -116,7 +116,7 @@ jobs:
        uses: actions/cache@v5
        with:
          path: ~/.cache/huggingface/datasets
-          key: ${{ runner.os }}-datasets-${{ hashFiles('packages/gsm8k') }}-${{ hashFiles('packages/hotpotqa') }}-${{ hashFiles('packages/litqa') }}-${{ hashFiles('packages/lfrqa') }}-${{ hashFiles('packages/notebook') }}
+          key: ${{ runner.os }}-datasets-${{ hashFiles('packages/gsm8k') }}-${{ hashFiles('packages/hotpotqa') }}-${{ hashFiles('packages/labbench') }}-${{ hashFiles('packages/lfrqa') }}-${{ hashFiles('packages/notebook') }}
          restore-keys: ${{ runner.os }}-datasets-
       - run: uv run pytest -n 16 --dist=loadfile # auto only launches 8 workers in CI, despite runners have 16 cores
        env:
```

.pre-commit-config.yaml

Lines changed: 2 additions & 2 deletions

```diff
@@ -37,7 +37,7 @@ repos:
         args:
           - --word-list=.secrets.allowlist
           - --exclude-files=.secrets.baseline$
-        exclude: tests/cassettes|litqa/tests/stub_data
+        exclude: tests/cassettes|labbench/tests/stub_data
   - repo: https://github.com/jumanjihouse/pre-commit-hooks
     rev: 3.0.0
     hooks:
@@ -48,7 +48,7 @@ repos:
       - id: codespell
         additional_dependencies: [".[toml]"]
         exclude_types: [jupyter]
-        exclude: '.*\.b64$|litqa/tests/stub_data'
+        exclude: '.*\.b64$|labbench/tests/stub_data'
   - repo: https://github.com/pappasam/toml-sort
     rev: v0.24.3
     hooks:
```
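pre-commit matches each hook's `exclude` value against candidate file paths as a Python regex (search semantics), so the updated alternation skips both base64 fixtures and the relocated stub data. A quick check of that behavior; the file paths here are illustrative, not real repository contents:

```python
import re

# The codespell hook's updated exclude pattern
EXCLUDE = re.compile(r".*\.b64$|labbench/tests/stub_data")

paths = [
    "packages/labbench/tests/stub_data/paper.pdf",  # skipped: stub data
    "docs/blob.b64",  # skipped: base64 fixture
    "packages/labbench/src/aviary/envs/labbench/env.py",  # still checked
]
skipped = [p for p in paths if EXCLUDE.search(p)]
print(skipped)  # the two skipped paths
```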

README.md

Lines changed: 3 additions & 2 deletions

````diff
@@ -58,7 +58,7 @@ pip install fhaviary
 To install aviary together with the incumbent environments:
 
 ```bash
-pip install 'fhaviary[gsm8k,hotpotqa,litqa,lfrqa,notebook]'
+pip install 'fhaviary[gsm8k,hotpotqa,labbench,lfrqa,notebook]'
 ```
 
 To run the tutorial notebooks:
@@ -424,9 +424,10 @@ Below we list some pre-existing environments implemented in Aviary:
 | ----------- | -------------------------------------------------------------- | -------------------- | ------------------------------------------------------- |
 | GSM8k | [`aviary.gsm8k`](https://pypi.org/project/aviary.gsm8k/) | `fhaviary[gsm8k]` | [`README.md`](packages/gsm8k/README.md#installation) |
 | HotPotQA | [`aviary.hotpotqa`](https://pypi.org/project/aviary.hotpotqa/) | `fhaviary[hotpotqa]` | [`README.md`](packages/hotpotqa/README.md#installation) |
-| LitQA | [`aviary.litqa`](https://pypi.org/project/aviary.litqa/) | `fhaviary[litqa]` | [`README.md`](packages/litqa/README.md#installation) |
+| LAB-Bench | [`aviary.labbench`](https://pypi.org/project/aviary.labbench/) | `fhaviary[labbench]` | [`README.md`](packages/labbench/README.md#installation) |
 | LFRQA | [`aviary.lfrqa`](https://pypi.org/project/aviary.lfrqa/) | `fhaviary[lfrqa]` | [`README.md`](packages/lfrqa/README.md#installation) |
 | Notebook | [`aviary.notebook`](https://pypi.org/project/aviary.notebook/) | `fhaviary[notebook]` | [`README.md`](packages/notebook/README.md#installation) |
+| LitQA | [`aviary.litqa`](https://pypi.org/project/aviary.litqa/) | Moved to `labbench` | Moved to `labbench` |
 
 ### Task Datasets
 
````
packages/labbench/README.md

Lines changed: 121 additions & 0 deletions (new file)

````markdown
# aviary.labbench

LAB-Bench environments implemented with aviary,
allowing agents to perform question answering on scientific tasks.

## Installation

To install the LAB-Bench environment, run:

```bash
pip install 'fhaviary[labbench]'
```

## Usage

In [`labbench/env.py`](src/aviary/envs/labbench/env.py), you will find:

- `GradablePaperQAEnvironment`: a PaperQA-backed environment
  that can grade answers given an evaluation function.
- `ImageQAEnvironment`: a `GradablePaperQAEnvironment`
  subclass for QA where image(s) are pre-added.

And in [`labbench/task.py`](src/aviary/envs/labbench/task.py), you will find:

- `TextQATaskDataset`: a task dataset designed to
  pull down FigQA, LitQA2, or TableQA from Hugging Face,
  and create one `GradablePaperQAEnvironment` per question.
- `ImageQATaskDataset`: a task dataset that pairs with `ImageQAEnvironment`
  for FigQA or TableQA.

Here is an example of how to use them:

```python
import os

from ldp.agent import SimpleAgent
from ldp.alg import Evaluator, EvaluatorConfig, MeanMetricsCallback
from paperqa import Settings

from aviary.env import TaskDataset


async def evaluate(folder_of_litqa_v2_papers: str | os.PathLike) -> None:
    settings = Settings(paper_directory=folder_of_litqa_v2_papers)
    dataset = TaskDataset.from_name("litqa2", settings=settings)
    metrics_callback = MeanMetricsCallback(eval_dataset=dataset)

    evaluator = Evaluator(
        config=EvaluatorConfig(batch_size=3),
        agent=SimpleAgent(),
        dataset=dataset,
        callbacks=[metrics_callback],
    )
    await evaluator.evaluate()
    print(metrics_callback.eval_means)
```

### Image Question-Answer

This is an environment/dataset for giving PaperQA a `Docs` object with
the image(s) for one LAB-Bench question.
It is designed as a comparison against zero-shotting the question to an LLM:
instead of a single prompt, the image is put through the PaperQA agent loop.

```python
from typing import cast

import litellm
import pytest
from ldp.agent import Agent
from ldp.alg import (
    Evaluator,
    EvaluatorConfig,
    MeanMetricsCallback,
    StoreTrajectoriesCallback,
)
from paperqa.settings import AgentSettings, IndexSettings

from aviary.envs.labbench import (
    ImageQAEnvironment,
    ImageQATaskDataset,
    LABBenchDatasets,
)


@pytest.mark.asyncio
async def test_image_qa(tmp_path) -> None:
    litellm.num_retries = 8  # Mitigate connection-related failures
    settings = ImageQAEnvironment.make_base_settings()
    settings.agent = AgentSettings(
        agent_type="ldp.agent.SimpleAgent",
        index=IndexSettings(paper_directory=tmp_path),
        # TODO: add image support for paper_search
        tool_names={"gather_evidence", "gen_answer", "complete", "reset"},
        agent_evidence_n=3,  # Bumped up to collect several perspectives
    )
    dataset = ImageQATaskDataset(dataset=LABBenchDatasets.TABLE_QA, settings=settings)
    t_cb = StoreTrajectoriesCallback()
    m_cb = MeanMetricsCallback(eval_dataset=dataset, track_tool_usage=True)
    evaluator = Evaluator(
        config=EvaluatorConfig(
            batch_size=256,  # Use a batch size greater than the FigQA and TableQA sizes
            max_rollout_steps=18,  # Match the aviary paper's PaperQA setting
        ),
        agent=cast(Agent, await settings.make_ldp_agent(settings.agent.agent_type)),
        dataset=dataset,
        callbacks=[t_cb, m_cb],
    )
    await evaluator.evaluate()
    print(m_cb.eval_means)
```

## References

[1] Skarlinski et al.
[Language agents achieve superhuman synthesis of scientific knowledge](https://arxiv.org/abs/2409.13740).
ArXiv:2409.13740, 2024.

[2] Laurent et al.
[LAB-Bench: Measuring Capabilities of Language Models for Biology Research](https://arxiv.org/abs/2407.10362).
ArXiv:2407.10362, 2024.
````
packages/labbench/pyproject.toml

Lines changed: 7 additions & 5 deletions

```diff
@@ -22,14 +22,14 @@ dependencies = [
   "fhaviary>=0.14", # For MultipleChoiceQuestion
   "fhlmi",
   "ldp>=0.25.2", # Pin for lmi migration
-  "paper-qa>=5.14.0", # Pin for lmi migration
+  "paper-qa[pymupdf]>=2025", # Pin for multimodal
   "pydantic~=2.0",
   "tenacity",
   "typing-extensions; python_version <= '3.12'", # For TypeVar default
 ]
-description = "LitQA environment implemented with aviary"
+description = "LAB-Bench environments implemented with aviary"
 dynamic = ["version"]
-name = "aviary.litqa"
+name = "aviary.labbench"
 readme = "README.md"
 requires-python = ">=3.11"
 
@@ -38,10 +38,12 @@ datasets = [
   "datasets>=2.15", # Lower pin for https://github.com/huggingface/datasets/pull/6404
 ]
 dev = [
-  "aviary.litqa[datasets]",
+  "aviary.labbench[datasets,typing]",
+  "pandas",
   "paper-qa>=5.29.1", # Pin for gen_answer's EmptyDocsError, with fix
   "tantivy>=0.25.0; python_version >= '3.14'", # For Python 3.14 support
 ]
+typing = ["pillow"]
 
 [tool.ruff]
 extend = "../../pyproject.toml"
@@ -51,4 +53,4 @@ where = ["src"]
 
 [tool.setuptools_scm]
 root = "../.."
-version_file = "src/aviary/envs/litqa/version.py"
+version_file = "src/aviary/envs/labbench/version.py"
```
packages/labbench/src/aviary/envs/labbench/__init__.py

Lines changed: 14 additions & 10 deletions

```diff
@@ -1,27 +1,31 @@
 from .env import (
     DEFAULT_REWARD_MAPPING,
     GradablePaperQAEnvironment,
+    ImageQAEnvironment,
     make_discounted_returns,
 )
 from .task import (
     DEFAULT_AVIARY_PAPER_HF_HUB_NAME,
     DEFAULT_LABBENCH_HF_HUB_NAME,
-    TASK_DATASET_NAME,
-    LitQATaskDataset,
-    LitQAv2TaskDataset,
-    LitQAv2TaskSplit,
-    read_litqa_v2_from_hub,
+    ImageQATaskDataset,
+    LABBenchDatasets,
+    PaperQATaskDataset,
+    TextQATaskDataset,
+    TextQATaskSplit,
+    read_ds_from_hub,
 )
 
 __all__ = [
     "DEFAULT_AVIARY_PAPER_HF_HUB_NAME",
     "DEFAULT_LABBENCH_HF_HUB_NAME",
     "DEFAULT_REWARD_MAPPING",
-    "TASK_DATASET_NAME",
     "GradablePaperQAEnvironment",
-    "LitQATaskDataset",
-    "LitQAv2TaskDataset",
-    "LitQAv2TaskSplit",
+    "ImageQAEnvironment",
+    "ImageQATaskDataset",
+    "LABBenchDatasets",
+    "PaperQATaskDataset",
+    "TextQATaskDataset",
+    "TextQATaskSplit",
     "make_discounted_returns",
-    "read_litqa_v2_from_hub",
+    "read_ds_from_hub",
 ]
```
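`make_discounted_returns` survives the rename unchanged in the export list above. As a reference for what that name conventionally computes, here is a minimal independent sketch of discounted returns; the package's actual signature may differ:

```python
def discounted_returns(rewards: list[float], discount: float = 1.0) -> list[float]:
    """Compute G_t = r_t + discount * G_{t+1} by iterating backwards,
    so each step's return folds in all discounted future rewards."""
    returns: list[float] = []
    g = 0.0
    for reward in reversed(rewards):
        g = reward + discount * g
        returns.append(g)
    return returns[::-1]  # Restore chronological order


print(discounted_returns([1.0, 2.0, 3.0], discount=0.5))  # [2.75, 3.5, 3.0]
```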
