Commit 981c8f9

committed
upload initial files
1 parent 7af3976 commit 981c8f9


43 files changed: +1656 -1 lines changed

.gitignore

Lines changed: 168 additions & 0 deletions
@@ -0,0 +1,168 @@
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.nox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
*.py,cover
.hypothesis/
.pytest_cache/
cover/

# Translations
*.mo
*.pot

# Django stuff:
*.log
local_settings.py
db.sqlite3
db.sqlite3-journal

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
.pybuilder/
target/

# Jupyter Notebook
.ipynb_checkpoints

# IPython
profile_default/
ipython_config.py

# pyenv
# For a library or package, you might want to ignore these files since the code is
# intended to run in multiple environments; otherwise, check them in:
# .python-version

# pipenv
# According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
# However, in case of collaboration, if having platform-specific dependencies or dependencies
# having no cross-platform support, pipenv may install dependencies that don't work, or not
# install all needed dependencies.
#Pipfile.lock

# poetry
# Similar to Pipfile.lock, it is generally recommended to include poetry.lock in version control.
# This is especially recommended for binary packages to ensure reproducibility, and is more
# commonly ignored for libraries.
# https://python-poetry.org/docs/basic-usage/#commit-your-poetrylock-file-to-version-control
#poetry.lock

# pdm
# Similar to Pipfile.lock, it is generally recommended to include pdm.lock in version control.
#pdm.lock
# pdm stores project-wide configurations in .pdm.toml, but it is recommended to not include it
# in version control.
# https://pdm.fming.dev/#use-with-ide
.pdm.toml

# PEP 582; used by e.g. github.com/David-OConnor/pyflow and github.com/pdm-project/pdm
__pypackages__/

# Celery stuff
celerybeat-schedule
celerybeat.pid

# SageMath parsed files
*.sage.py

# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/

# Spyder project settings
.spyderproject
.spyproject

# Rope project settings
.ropeproject

# mkdocs documentation
/site

# mypy
.mypy_cache/
.dmypy.json
dmypy.json

# Pyre type checker
.pyre/

# pytype static type analyzer
.pytype/

# Cython debug symbols
cython_debug/

# PyCharm
# JetBrains specific template is maintained in a separate JetBrains.gitignore that can
# be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore
# and can be added to the global gitignore or merged into this file. For a more nuclear
# option (not recommended) you can uncomment the following to ignore the entire idea folder.
#.idea/

# Slurm
*.out
*.err

# User
*DONT_TRACK*
.vscode

README.md

Lines changed: 157 additions & 1 deletion
@@ -1 +1,157 @@
-# promptodile
+# :crocodile: Promptodile

# Overview
Promptagator demonstrated that Large Language Models (LLMs) with few-shot prompts can be used as task-specific query generators for fine-tuning domain-specialized dense retrieval models. However, the original Promptagator approach relied on proprietary, large-scale LLMs which users may not have access to or may be prohibited from using with sensitive data. In this work, we study the impact of open-source LLMs at accessible scales (<=14B parameters) as an alternative. Our results demonstrate that open-source LLMs as small as 3B parameters can serve as effective Promptagator-style query generators. We hope our work provides practitioners with reliable alternatives for synthetic data generation and insights for maximizing fine-tuning results in domain-specific applications.

# Citation
```bibtex
@inproceedings{gwon_2025_promptodile,
  title     = {Study on LLMs for Promptagator-Style Dense Retriever Training},
  author    = {Gwon, Daniel and Jedidi, Nour and Lin, Jimmy},
  booktitle = {Proceedings of the 34th ACM International Conference on Information and Knowledge Management},
  series    = {CIKM '25},
  year      = {2025},
  pages     = {XXX--XXX},
  publisher = {Association for Computing Machinery},
}
```

# Installation
For evaluation of our retrieval methods, we used [Pyserini](https://github.com/castorini/pyserini). Unfortunately, we found some issues when using Pyserini alongside the other required packages, so we recommend creating two separate virtual environments: one for query generation & retriever training, and another for indexing & evaluation.

## Query Generation and Retriever Training
Uses Python 3.12 and CUDA 12.8 (see the [vLLM documentation](https://docs.vllm.ai/en/stable/getting_started/quickstart.html#installation) for details).

```bash
$ # Create your environment.
$ conda create -n promptodile python=3.12 -y
$ conda activate promptodile
$
$ # First, install vLLM with the torch-backend flag.
$ uv pip install vllm --torch-backend=auto
$
$ # Then, install the rest of the packages.
$ uv pip install -r requirements.txt
```

## Index/Evaluation
To create a custom environment for Pyserini, see the [detailed installation instructions](https://github.com/castorini/pyserini/blob/master/docs/installation.md#pypi-installation-walkthrough).

Note that the optional dependency `faiss-cpu` is only intended if you plan to index your corpus on CPU; use `faiss-gpu` to index on GPU. The Linux installation instructions also target CPU; if you have a GPU, adjust the PyTorch index URL based on your CUDA version. We've found the faiss [installation instructions](https://github.com/facebookresearch/faiss/blob/main/INSTALL.md) more useful for installing `faiss-cpu` or `faiss-gpu`.

We use Python 3.11 to install Pyserini and CUDA 12.4 to install PyTorch & faiss-gpu. We run evaluation with Pyserini on Linux and downgrade numpy to 1.26.4.

```bash
$ # See the links above for Pyserini installation instructions.
$ conda activate pyserini
```

# Usage
Usage can be broken down into three steps:
1. Query Generation
2. Retriever Model Training
3. Index/Evaluation

There is considerable variation in configurations across the three steps. In addition to a shared configuration file, each step has its own configuration file:
1. qgen.json
2. train.json
3. eval.json
4. shared.json

Examples have been provided in `./configs/templates` to train [contriever](https://huggingface.co/facebook/contriever) and [e5](https://huggingface.co/intfloat/e5-base-v2) backbone models.

## Query Generation
Query generation is designed for offline batched inference using [vLLM](https://docs.vllm.ai/en/stable/). The package is designed around instruct models, so chat templates should be used for best performance.

```bash
$ conda activate promptodile
$ python -m promptodile.query_generation.generate qgen.json shared.json
```

## Retriever Training
```bash
$ conda activate promptodile
$ # Replace GPUS with the number of GPUs to train on.
$ accelerate launch --num_processes=GPUS -m promptodile.train train.json shared.json
```

## Index/Evaluation
```bash
$ conda activate pyserini
$
$ # When run as a script, this will automatically evaluate and output NDCG@10.
$ python -m promptodile.index index.json shared.json
```

# Data
For consistency, we attempt to follow the dataset formatting established by TREC (Text REtrieval Conference) as closely as possible.

## BEIR
Please visit [BEIR](https://huggingface.co/BeIR) for the relevant datasets.

You can use the utility functions in `promptodile/utils.py` to convert the corpus and queries to TREC format.
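
BEIR corpora ship as JSON lines with `_id`, `title`, and `text` fields, while the TREC-style format below uses `docid`, `title`, and `body`. The following is a minimal sketch of such a conversion; it is illustrative only and does not reflect the actual API of `promptodile/utils.py`:

```python
import json

def beir_corpus_to_trec(beir_path: str, out_path: str) -> None:
    """Illustrative sketch: convert a BEIR-style corpus.jsonl
    (_id/title/text) into the TREC-style fields (docid/title/body)
    described below. Not the actual promptodile/utils.py API."""
    with open(beir_path, encoding="utf-8") as src, \
         open(out_path, "w", encoding="utf-8") as dst:
        for line in src:
            rec = json.loads(line)
            dst.write(json.dumps({
                "docid": rec["_id"],
                "url": "",       # not used by the pipeline
                "title": rec.get("title", ""),
                "headings": "",  # not used by the pipeline
                "body": rec["text"],
            }) + "\n")
```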

## Input Files

### corpus.jsonl
Contains all of the documents found in your corpus. Each JSON line in the file can contain five fields, three of which are used:
1. `docid`
2. `url` (not used)
3. `title`
4. `headings` (not used)
5. `body`

[more details](https://trec-rag.github.io/annoucements/2025-rag25-corpus/#document-structure)
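
For illustration, a single line of corpus.jsonl might look like this (the id and text are hypothetical):

```json
{"docid": "d1", "url": "", "title": "Example title", "headings": "", "body": "Example document text."}
```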

### queries.jsonl
Contains the query text that maps to the topics found in `examples.txt`. At minimum, the query text for each example must be provided with the following fields:
1. `id` (mapping to a topic in `examples.txt`)
2. `narrative` (the query/topic's text)
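
For illustration, a single line of queries.jsonl might look like this (the id and text are hypothetical):

```json
{"id": "q1", "narrative": "what are the health benefits of green tea"}
```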

### qrels
This is a text file containing whitespace-delimited rows of topics (queries), documents, and relevance judgments. No header is included, but each entry in a row maps to:
1. `Topic`
2. `Iteration`
3. `Document#`
4. `Relevancy`

[more details](https://trec.nist.gov/data/qrels_eng/)
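
For illustration, a qrels file with two judgments might look like this (the topic and document ids are hypothetical; `Iteration` is conventionally 0, and `Relevancy` is the graded judgment):

```text
q1 0 d1 1
q1 0 d2 0
```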

### examples.txt
If provided, this file contains the few-shot examples to be used in the query generation prompt. It uses the same formatting as the qrels text file.

## Output Files

### syn_queries.jsonl
Generated output that uses the same formatting as corpus.jsonl, but adds a `queries` field to each line:
1. `docid`
2. `url` (not used)
3. `title`
4. `headings` (not used)
5. `body`
6. `queries` (generated)

The value of the `queries` field is a list containing each of the synthetic queries generated for the document.
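
For illustration, a single line of syn_queries.jsonl might look like this (the id, text, and queries are hypothetical):

```json
{"docid": "d1", "url": "", "title": "Example title", "headings": "", "body": "Example document text.", "queries": ["example generated query one", "example generated query two"]}
```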

### runs
A text file containing a ranked list of retrieved documents for a set of queries, generated after indexing to evaluate the fine-tuned model. Rows are whitespace-delimited, and the entries in each row correspond to the following fields (headers are not included in the file):
1. `Topic ID`
2. `Q0` (a fixed string)
3. `docid`
4. `Rank`
5. `Score`
6. `Run ID`

[more details](https://trec-rag.github.io/annoucements/2025-track-guidelines/#output-format-ranked-results)
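
For illustration, the first rows of a run file might look like this (the ids, scores, and run name are hypothetical; `Q0` is the literal string required by the TREC format):

```text
q1 Q0 d3 1 14.23 promptodile
q1 Q0 d1 2 12.87 promptodile
```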

# Disclosure
DISTRIBUTION STATEMENT A. Approved for public release. Distribution is unlimited.

This material is based upon work supported by the Department of the Air Force under Air Force Contract No. FA8702-15-D-0001 or FA8702-25-D-B002. Any opinions, findings, conclusions, or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the Department of the Air Force.

© 2025 Massachusetts Institute of Technology.

Subject to FAR 52.227-11 Patent Rights - Ownership by the contractor (May 2014).

The software/firmware is provided to you on an As-Is basis.

Delivered to the U.S. Government with Unlimited Rights, as defined in DFARS Part 252.227-7013 or 7014 (Feb 2014). Notwithstanding any copyright notice, U.S. Government rights in this work are defined by DFARS 252.227-7013 or DFARS 252.227-7014 as detailed above. Use of this work other than as specifically authorized by the U.S. Government may violate any copyrights that exist in this work.

beir/README.md

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1 @@
These files are legacy and incompatible with the codebase in its current state. They are kept to document the prompts used to generate synthetic queries for various BEIR datasets.

beir/prompt_templates/arguana.json

Lines changed: 5 additions & 0 deletions
@@ -0,0 +1,5 @@
{
    "system": "You are a high-quality synthetic data generator. Your task is to read a passage presenting an argument then generate a relevant counter argument. A counter argument is relevant if it considers and thoughtfully presents an opposing viewpoint to the claim(s) made in the argument. Use the following examples to guide you. Respond with only the counter argument.",
    "user": "Argument: {}",
    "assistant": "{}"
}
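
A template like the one above is presumably rendered into chat messages by filling the `{}` placeholders, with the few-shot examples appended as alternating user/assistant turns. The following is a minimal sketch of that idea; the function name and message layout are illustrative assumptions, not the package's actual API:

```python
def build_messages(template: dict, passage: str, examples=None) -> list:
    """Illustrative only: turn a prompt-template dict into a chat-message
    list, filling the {} placeholders. Few-shot examples are appended as
    alternating user/assistant turns before the final user message."""
    messages = [{"role": "system", "content": template["system"]}]
    for ex_passage, ex_target in (examples or []):
        messages.append({"role": "user", "content": template["user"].format(ex_passage)})
        messages.append({"role": "assistant", "content": template["assistant"].format(ex_target)})
    # The passage we actually want a synthetic query/counter argument for.
    messages.append({"role": "user", "content": template["user"].format(passage)})
    return messages

template = {
    "system": "You are a high-quality synthetic data generator.",
    "user": "Argument: {}",
    "assistant": "{}",
}
msgs = build_messages(
    template,
    "Tea is better than coffee.",
    examples=[("Cats make the best pets.", "Dogs are more loyal than cats.")],
)
```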
Lines changed: 5 additions & 0 deletions
@@ -0,0 +1,5 @@
{
    "system": "You are a high-quality synthetic data generator. Your task is to read a passage about an entity and generate a relevant query. A query is relevant if the passage contains all of the necessary information to answer the query. Use the following examples to guide you. Respond with only the query.",
    "user": "entity: {}",
    "assistant": "{}"
}

beir/prompt_templates/fever.json

Lines changed: 5 additions & 0 deletions
@@ -0,0 +1,5 @@
{
    "system": "You are a high-quality synthetic data generator. Your task is to read a passage containing factual information and generate a relevant claim. A claim is relevant if the passage contains all of the necessary evidence to support or refute the claim. Use the following examples to guide you. Respond with only the factual claim.",
    "user": "{}",
    "assistant": "{}"
}

beir/prompt_templates/fiqa.json

Lines changed: 5 additions & 0 deletions
@@ -0,0 +1,5 @@
{
    "system": "You are a high-quality synthetic data generator. Your task is to read a passage from the finance domain and generate a relevant query. A query is relevant if the article contains all of the necessary information to answer the query. Use the following examples to guide you. Respond with only the query.",
    "user": "{}",
    "assistant": "{}"
}
Lines changed: 5 additions & 0 deletions
@@ -0,0 +1,5 @@
{
    "system": "You are a high-quality synthetic data generator. Your task is to read a passage containing information and generate a relevant question. A question is relevant if the passage contains some or all of the necessary information to answer the question. Use the following examples to guide you. Respond with only the question.",
    "user": "Passage: {}",
    "assistant": "{}"
}
Lines changed: 5 additions & 0 deletions
@@ -0,0 +1,5 @@
{
    "system": "You are a high-quality synthetic data generator. Your task is to read an article and generate a relevant query. A query is relevant if the article contains all of the necessary information to answer the query. Use the following examples to guide you. Respond with only the query.",
    "user": "Article: {}",
    "assistant": "{}"
}

beir/prompt_templates/scidocs.json

Lines changed: 5 additions & 0 deletions
@@ -0,0 +1,5 @@
{
    "system": "You are a high-quality synthetic data generator. Your task is to read a scientific document and classify the document with a single-sentence summary. Use the following examples to guide you. Respond with only the single-sentence summary.",
    "user": "{}",
    "assistant": "{}"
}
