Commit 5ad459d

JoaquinPolonuer, pre-commit-ci-lite[bot], and jamesbraza authored
Specific rag-qa-arena changes (#853)
Co-authored-by: pre-commit-ci-lite[bot] <117423508+pre-commit-ci-lite[bot]@users.noreply.github.com> Co-authored-by: James Braza <[email protected]>
1 parent 8618b87 commit 5ad459d

File tree

8 files changed: +570 -19 lines changed

.gitignore

Lines changed: 3 additions & 0 deletions
@@ -265,6 +265,9 @@ ENV/
 env.bak/
 venv.bak/

+# Data directories
+data/
+
 # Spyder project settings
 .spyderproject
 .spyproject

.pre-commit-config.yaml

Lines changed: 1 addition & 0 deletions
@@ -101,5 +101,6 @@ repos:
 - tiktoken>=0.4.0 # Match pyproject.toml
 - types-setuptools
 - types-PyYAML
+- typing-extensions # TODO: remove when Python>=3.13
 - sentence-transformers
 - pyzotero

docs/tutorials/running_on_lfrqa.md

Lines changed: 279 additions & 0 deletions
# Measuring PaperQA2 with LFRQA

## Overview

The **LFRQA dataset** was introduced in the paper [_RAG-QA Arena: Evaluating Domain Robustness for Long-form Retrieval Augmented Question Answering_](https://arxiv.org/pdf/2407.13998). It features **1,404 science questions** (alongside questions from other domains), each human-annotated with a long-form answer. This tutorial walks through setting up the dataset and benchmarking PaperQA2 against it.
## Download the Annotations

First, we need to obtain the annotated dataset from the official repository:

```bash
# Create a new directory for the dataset
mkdir -p data/rag-qa-benchmarking

# Get the annotated questions
curl https://raw.githubusercontent.com/awslabs/rag-qa-arena/refs/heads/main/data/annotations_science_with_citation.jsonl -o data/rag-qa-benchmarking/annotations_science_with_citation.jsonl
```
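Optionally, you can sanity-check the download with a quick script. This is a minimal sketch that just counts the records and prints the fields this tutorial relies on later (`qid`, `question`, `answer`, `gold_doc_ids`):

```python
import json

# Count the annotated questions and peek at the fields of the first record
annotations_path = "data/rag-qa-benchmarking/annotations_science_with_citation.jsonl"
with open(annotations_path, encoding="utf-8") as f:
    records = [json.loads(line) for line in f]

print(f"Loaded {len(records)} annotated questions")
print(sorted(records[0]))  # expect keys such as qid, question, answer, gold_doc_ids
```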
## Download the Robust-QA Documents

LFRQA is built upon **Robust-QA**, so we must download the relevant documents:

```bash
# Download the Lotte dataset, which includes the required documents
curl https://downloads.cs.stanford.edu/nlp/data/colbert/colbertv2/lotte.tar.gz --output lotte.tar.gz

# Extract the dataset
tar -xvzf lotte.tar.gz

# Move the science test collection to our dataset folder
cp lotte/science/test/collection.tsv ./data/rag-qa-benchmarking/science_test_collection.tsv

# Clean up unnecessary files
rm lotte.tar.gz
rm -rf lotte
```

For more details, refer to the original paper: [_RAG-QA Arena: Evaluating Domain Robustness for Long-form Retrieval Augmented Question Answering_](https://arxiv.org/pdf/2407.13998).
## Load the Data

We now load the documents into a pandas dataframe:

```python
import os
import pandas as pd

# Directory holding the benchmark data
rag_qa_benchmarking_dir = os.path.join("data", "rag-qa-benchmarking")

# Load the documents dataset
lfrqa_docs_df = pd.read_csv(
    os.path.join(rag_qa_benchmarking_dir, "science_test_collection.tsv"),
    sep="\t",
    names=["doc_id", "doc_text"],
)
```
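A quick look at the loaded frame confirms the TSV parsed correctly; each row pairs a `doc_id` with its `doc_text`:

```python
# Inspect the shape and the first few documents
print(lfrqa_docs_df.shape)
print(lfrqa_docs_df.head(3))
```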
## Select the Documents to Use
60+
61+
RobustQA consists on 1.7M documents, so building the whole index will take around 3 hours.
62+
63+
If you want to run a test, you can use a portion of the dataset and the questions that can be answered only on those documents.
64+
65+
```python
66+
proportion_to_use = 1 / 100
67+
amount_of_docs_to_use = int(len(lfrqa_docs_df) * proportion_to_use)
68+
print(f"Using {amount_of_docs_to_use} out of {len(lfrqa_docs_df)} documents")
69+
```
70+
## Prepare the Document Files

We now create the document directory and store each document as a separate text file, so that paperqa can build the index:

```python
partial_docs = lfrqa_docs_df.head(amount_of_docs_to_use)
lfrqa_directory = os.path.join(rag_qa_benchmarking_dir, "lfrqa")
os.makedirs(
    os.path.join(lfrqa_directory, "science_docs_for_paperqa", "files"), exist_ok=True
)

for i, row in partial_docs.iterrows():
    doc_id = row["doc_id"]
    doc_text = row["doc_text"]

    with open(
        os.path.join(
            lfrqa_directory, "science_docs_for_paperqa", "files", f"{doc_id}.txt"
        ),
        "w",
        encoding="utf-8",
    ) as f:
        f.write(doc_text)

    if i % int(len(partial_docs) * 0.05) == 0:
        progress = (i + 1) / len(partial_docs)
        print(f"Progress: {progress:.2%}")
```
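As a quick check, the number of files on disk should match the slice we selected; this minimal sketch reuses the variables defined above:

```python
import glob

# One .txt file should exist per selected document
written_files = glob.glob(
    os.path.join(lfrqa_directory, "science_docs_for_paperqa", "files", "*.txt")
)
print(f"Wrote {len(written_files)} files, expected {amount_of_docs_to_use}")
```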
## Create the Manifest File

The **manifest file** tracks document metadata for the dataset. We pre-fill its fields so that paperqa doesn't try to fetch metadata with LLM calls, which makes indexing faster:

```python
manifest = partial_docs.copy()
manifest["file_location"] = manifest["doc_id"].apply(lambda x: f"files/{x}.txt")
manifest["doi"] = ""
manifest["title"] = manifest["doc_id"]
manifest["key"] = manifest["doc_id"]
manifest["docname"] = manifest["doc_id"]
manifest["citation"] = "_"
manifest.drop(columns=["doc_id", "doc_text"], inplace=True)
manifest.to_csv(
    os.path.join(lfrqa_directory, "science_docs_for_paperqa", "manifest.csv"),
    index=False,
)
```
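If you want to see exactly what paperqa will read, the resulting manifest has one row per document with the columns filled in above:

```python
# Inspect the manifest paperqa consumes instead of inferring metadata itself
manifest_path = os.path.join(lfrqa_directory, "science_docs_for_paperqa", "manifest.csv")
print(pd.read_csv(manifest_path).columns.tolist())
# ['file_location', 'doi', 'title', 'key', 'docname', 'citation']
```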
## Filter and Save Questions

Finally, we load the questions and keep only those whose gold documents all fall within the selected subset:

```python
questions_df = pd.read_json(
    os.path.join(rag_qa_benchmarking_dir, "annotations_science_with_citation.jsonl"),
    lines=True,
)
partial_questions = questions_df[
    questions_df.gold_doc_ids.apply(
        lambda ids: all(id < amount_of_docs_to_use for id in ids)
    )
]
partial_questions.to_csv(
    os.path.join(lfrqa_directory, "questions.csv"),
    index=False,
)
```
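It is worth printing how many questions survive this filter, since a small document fraction keeps only a small share of the 1,404 questions:

```python
print(
    f"Kept {len(partial_questions)} of {len(questions_df)} questions "
    "answerable from the selected documents"
)
```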
## Install paperqa

From now on, we will be using the paperqa library, so we need to install it:

```bash
pip install paper-qa
```
## Index the Documents

Copy the following into a file and run it. Feel free to adjust the concurrency as you like.

You don't need any API keys to build this index, because we skip inferring citation metadata (the manifest already provides it), but you do need LLM API keys to answer questions.

Remember that this process is quick for small portions of the dataset, but it can take around 3 hours for the whole dataset.

```python
import os

from paperqa import Settings
from paperqa.agents import build_index
from paperqa.settings import AgentSettings, IndexSettings, ParsingSettings

settings = Settings(
    agent=AgentSettings(
        index=IndexSettings(
            name="lfrqa_science_index",
            paper_directory=os.path.join(
                "data", "rag-qa-benchmarking", "lfrqa", "science_docs_for_paperqa"
            ),
            index_directory=os.path.join(
                "data", "rag-qa-benchmarking", "lfrqa", "science_docs_for_paperqa_index"
            ),
            manifest_file="manifest.csv",
            concurrency=10_000,
            batch_size=10_000,
        )
    ),
    parsing=ParsingSettings(
        use_doc_details=False,
        defer_embedding=True,
    ),
)

build_index(settings=settings)
```

After this runs, the index is ready to use. Note that the index `name` here must match the one used in the benchmark script below.
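Before launching the full benchmark, you can reuse the same `settings` object for a one-off question as a smoke test. This is a minimal sketch: it does call an LLM, so the relevant API key (for example `OPENAI_API_KEY`) must be set, and the question text is only a placeholder:

```python
from paperqa import ask

response = ask(
    "Why do leaves change color in the fall?",  # placeholder question
    settings=settings,
)
print(response.session.answer)  # formatted answer text (attribute path for paper-qa >= 5)
```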
## Benchmark!

After you have built the index, you are ready to run the benchmark.

Copy the following into a file and run it; you will need the [`ldp`](https://github.com/Future-House/ldp) package installed.

```python
import asyncio
import json
import os

import pandas as pd
from ldp.agent import SimpleAgent
from ldp.alg.runners import Evaluator, EvaluatorConfig

from paperqa import Settings
from paperqa.agents.task import LFRQAQuestion, LFRQATaskDataset
from paperqa.settings import AgentSettings, IndexSettings

log_results_dir = os.path.join("data", "rag-qa-benchmarking", "results")
os.makedirs(log_results_dir, exist_ok=True)


async def log_evaluation_to_json(lfrqa_question_evaluation: dict) -> None:
    json_path = os.path.join(
        log_results_dir, f"{lfrqa_question_evaluation['qid']}.json"
    )
    with open(json_path, "w") as f:
        json.dump(lfrqa_question_evaluation, f, indent=2)


async def evaluate() -> None:
    settings = Settings(
        agent=AgentSettings(
            index=IndexSettings(
                name="lfrqa_science_index",
                paper_directory=os.path.join(
                    "data", "rag-qa-benchmarking", "lfrqa", "science_docs_for_paperqa"
                ),
                index_directory=os.path.join(
                    "data",
                    "rag-qa-benchmarking",
                    "lfrqa",
                    "science_docs_for_paperqa_index",
                ),
            )
        )
    )

    data: list[LFRQAQuestion] = [
        LFRQAQuestion(**row)
        for row in pd.read_csv(
            os.path.join("data", "rag-qa-benchmarking", "lfrqa", "questions.csv")
        )[["qid", "question", "answer", "gold_doc_ids"]].to_dict(orient="records")
    ]

    dataset = LFRQATaskDataset(
        data=data,
        settings=settings,
        evaluation_callback=log_evaluation_to_json,
    )

    evaluator = Evaluator(
        config=EvaluatorConfig(batch_size=3),
        agent=SimpleAgent(),
        dataset=dataset,
    )
    await evaluator.evaluate()


if __name__ == "__main__":
    asyncio.run(evaluate())
```
After running this, you can find the results in the `data/rag-qa-benchmarking/results` folder. Here is an example of how to read them:

```python
import glob
import json
import os

import pandas as pd

rag_qa_benchmarking_dir = os.path.join("data", "rag-qa-benchmarking")
json_files = glob.glob(os.path.join(rag_qa_benchmarking_dir, "results", "*.json"))

data = []
for file in json_files:
    with open(file) as f:
        json_data = json.load(f)
        json_data["qid"] = file.split("/")[-1].replace(".json", "")
        data.append(json_data)

df = pd.DataFrame(data).set_index("qid")
print(df["winner"].value_counts(normalize=True))
```

paperqa/agents/search.py

Lines changed: 3 additions & 3 deletions
@@ -7,6 +7,7 @@
 import os
 import pathlib
 import pickle
+import re
 import warnings
 import zlib
 from collections import Counter
@@ -400,9 +401,8 @@ async def get_saved_object(
         return None

     def clean_query(self, query: str) -> str:
-        for replace in ("*", "[", "]", ":", "(", ")", "{", "}", "~", '"'):
-            query = query.replace(replace, "")
-        return query
+        # SEE: https://regex101.com/r/DoLMoa/3
+        return re.sub(r'[*\[\]:(){}~^><+"\\]', "", query)

     async def query(
         self,
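The new regex strips the same special characters as the old loop, plus `^`, `>`, `<`, `+`, and `\`, which the loop missed. A standalone sketch of the behavior, with the method body copied outside its class:

```python
import re


def clean_query(query: str) -> str:
    # SEE: https://regex101.com/r/DoLMoa/3
    return re.sub(r'[*\[\]:(){}~^><+"\\]', "", query)


print(clean_query('what is "CRISPR-Cas9"? (gene editing)'))
# -> what is CRISPR-Cas9? gene editing
```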
