# :crocodile: Promptodile

# Overview
Promptagator demonstrated that Large Language Models (LLMs) with few-shot prompts can be used as task-specific query generators for fine-tuning domain-specialized dense retrieval models. However, the original Promptagator approach relied on proprietary, large-scale LLMs, which users may not have access to or may be prohibited from using with sensitive data. In this work, we study the impact of open-source LLMs at accessible scales (<=14B parameters) as an alternative. Our results demonstrate that open-source LLMs as small as 3B parameters can serve as effective Promptagator-style query generators. We hope our work provides practitioners with reliable alternatives for synthetic data generation and offers insights for maximizing fine-tuning results in domain-specific applications.

# Citation
```bibtex
@inproceedings{gwon_2025_promptodile,
  title = {Study on LLMs for Promptagator-Style Dense Retriever Training},
  author = {Gwon, Daniel and Jedidi, Nour and Lin, Jimmy},
  booktitle = {Proceedings of the 34th ACM International Conference on Information and Knowledge Management},
  series = {CIKM '25},
  year = {2025},
  pages = {XXX--XXX},
  publisher = {Association for Computing Machinery},
}
```

# Installation
For evaluation of our retrieval methods, we used [Pyserini](https://github.com/castorini/pyserini). Unfortunately, we ran into issues using Pyserini alongside the other required packages. We therefore recommend creating two separate virtual environments: one for query generation & retriever training, and one for indexing & evaluation.

## Query Generation and Retriever Training
We use Python 3.12 and CUDA 12.8 (see the [vLLM documentation](https://docs.vllm.ai/en/stable/getting_started/quickstart.html#installation) for details).

```bash
$ # Create your environment.
$ conda create -n promptodile python=3.12 -y
$ conda activate promptodile
$
$ # First, install vLLM with the --torch-backend flag.
$ uv pip install vllm --torch-backend=auto
$
$ # Then, install the rest of the packages.
$ uv pip install -r requirements.txt
```

## Index/Evaluation
To create a custom environment for Pyserini, see the detailed instructions in the Pyserini [installation walkthrough](https://github.com/castorini/pyserini/blob/master/docs/installation.md#pypi-installation-walkthrough).

Note that the optional dependency `faiss-cpu` is only intended for indexing your corpus on CPU; use `faiss-gpu` to index on GPU. The Linux installation instructions are likewise CPU-oriented; if you have a GPU, adjust the PyTorch index URL to match your CUDA version. We found the faiss [installation instructions](https://github.com/facebookresearch/faiss/blob/main/INSTALL.md) more helpful for installing `faiss-cpu` or `faiss-gpu`.

We install Pyserini with Python 3.11, and install PyTorch & `faiss-gpu` against CUDA 12.4. We run evaluation with Pyserini on Linux and downgrade numpy to 1.26.4.

```bash
$ # See links above for installation instructions for pyserini
$ conda activate pyserini
```
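
For reference, a minimal sketch of the environment setup we used is below. Treat it as a starting point rather than authoritative instructions: the conda channels and unpinned versions are assumptions, and the Pyserini and faiss guides linked above take precedence.

```bash
$ # Sketch of a Pyserini evaluation environment (see links above for the
$ # authoritative walkthroughs).
$ conda create -n pyserini python=3.11 -y
$ conda activate pyserini
$
$ # Install Pyserini (it also requires a JDK; see the walkthrough).
$ pip install pyserini
$
$ # faiss-gpu via conda; channels are assumptions, see faiss's INSTALL.md.
$ conda install -c pytorch -c nvidia faiss-gpu -y
$
$ # Downgrade numpy for compatibility.
$ pip install numpy==1.26.4
```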

# Usage
Usage can be broken down into three steps:
1. Query Generation
2. Retriever Model Training
3. Index/Evaluation

Configuration varies considerably across the three steps, so in addition to a shared configuration file, each step has its own configuration file:
1. qgen.json
2. train.json
3. index.json
4. shared.json

Examples have been provided in `./configs/templates` to train [contriever](https://huggingface.co/facebook/contriever) and [e5](https://huggingface.co/intfloat) backbone models.

## Query Generation
Query generation is designed for offline batched inference using [vLLM](https://docs.vllm.ai/en/stable/). The package is built around instruct models, so chat templates should be used for best performance.

```bash
$ conda activate promptodile
$ python -m promptodile.query_generation.generate qgen.json shared.json
```

## Retriever Training
```bash
$ conda activate promptodile
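$ # Replace GPUS with the number of GPUs to use.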
$ accelerate launch --num_processes=GPUS -m promptodile.train train.json shared.json
```

## Index/Evaluation
```bash
$ conda activate pyserini
$
$ # When run as a script, this will automatically evaluate and report NDCG@10.
$ python -m promptodile.index index.json shared.json
```

# Data
For consistency, we attempt to follow the dataset formatting established by TREC (Text REtrieval Conference) as closely as possible.

## BEIR
Please visit [BEIR](https://huggingface.co/BeIR) for relevant datasets.

You can use utility functions in `promptodile/utils.py` to convert the corpus and queries to TREC format.
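
For a quick one-off conversion, a `jq` one-liner can also work. The sketch below assumes your BEIR files use the usual `_id`, `title`, and `text` fields; it is an illustration, not the repository's utility (file names are placeholders):

```bash
$ # Corpus: BEIR {_id, title, text} -> {docid, title, body} (see corpus.jsonl below).
$ jq -c '{docid: ._id, title: .title, body: .text}' beir_corpus.jsonl > corpus.jsonl
$
$ # Queries: BEIR {_id, text} -> {id, narrative} (see queries.jsonl below).
$ jq -c '{id: ._id, narrative: .text}' beir_queries.jsonl > queries.jsonl
```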

## Input Files

### corpus.jsonl
Contains all of the documents in your corpus. Each JSON line may contain five possible fields, of which three are used:
1. `docid`
2. `url` (not used)
3. `title`
4. `headings` (not used)
5. `body`

[more details](https://trec-rag.github.io/annoucements/2025-rag25-corpus/#document-structure)
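
For illustration, a single line (with made-up values) might look like:

```json
{"docid": "doc0", "url": "https://example.com/doc0", "title": "Example Document", "headings": "Example Heading", "body": "The full text of the document goes here."}
```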

### queries.jsonl
Contains the query text that maps to the topics found in `examples.txt`. At minimum, the query text for each example must be provided with the following fields:
1. `id` (mapping to a topic in `examples.txt`)
2. `narrative` (the query/topic's text)
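
For illustration, a single line (with made-up values) might look like:

```json
{"id": "42", "narrative": "what daily habits improve long-term memory"}
```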

### qrels
This is a text file containing whitespace-delimited rows that tie topics (or queries) to documents with relevance judgments. No header row is included; each entry in a row maps to:
1. `Topic`
2. `Iteration`
3. `Document#`
4. `Relevancy`

[more details](https://trec.nist.gov/data/qrels_eng/)
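
For example, a row judging document `doc0` relevant to topic `42` (the `Iteration` column is conventionally 0) would look like:

```text
42 0 doc0 1
```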

### examples.txt
If provided, this file contains the few-shot examples to be used in the query generation prompt. It uses the same formatting as the qrels text file.

## Output Files

### syn_queries.jsonl
Generated by query generation; uses the same formatting as corpus.jsonl but adds a `queries` field to each line:
1. `docid`
2. `url` (not used)
3. `title`
4. `headings` (not used)
5. `body`
6. `queries` (generated)

The value of the `queries` field is a list containing each of the synthetic queries generated for the document.
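
For illustration, a single line (with made-up values) might look like:

```json
{"docid": "doc0", "title": "Example Document", "body": "The full text of the document goes here.", "queries": ["example synthetic query one", "example synthetic query two"]}
```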

### runs
A text file containing a ranked list of retrieved documents for a set of queries, generated after indexing to evaluate the fine-tuned model. Rows are whitespace-delimited, and the entries in each row correspond to the following headers (not included in the file):
1. `Topic ID`
2. `Q0` (a fixed string)
3. `docid`
4. `Rank`
5. `Score`
6. `Run ID`

[more details](https://trec-rag.github.io/annoucements/2025-track-guidelines/#output-format-ranked-results)
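
For example, a row ranking `doc0` first for topic `42` (the score and run ID are made up) would look like:

```text
42 Q0 doc0 1 14.8300 promptodile
```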

# Disclosure
DISTRIBUTION STATEMENT A. Approved for public release. Distribution is unlimited.

This material is based upon work supported by the Department of the Air Force under Air Force Contract No. FA8702-15-D-0001 or FA8702-25-D-B002. Any opinions, findings, conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the Department of the Air Force.

© 2025 Massachusetts Institute of Technology.

Subject to FAR 52.227-11 Patent Rights - Ownership by the contractor (May 2014)

The software/firmware is provided to you on an As-Is basis.

Delivered to the U.S. Government with Unlimited Rights, as defined in DFARS Part 252.227-7013 or 7014 (Feb 2014). Notwithstanding any copyright notice, U.S. Government rights in this work are defined by DFARS 252.227-7013 or DFARS 252.227-7014 as detailed above. Use of this work other than as specifically authorized by the U.S. Government may violate any copyrights that exist in this work.