# :crocodile: Promptodile

# Overview
Promptagator demonstrated that Large Language Models (LLMs) with few-shot prompts can be used as task-specific query generators for fine-tuning domain-specialized dense retrieval models. However, the original Promptagator approach relied on proprietary, large-scale LLMs, which users may not have access to or may be prohibited from using with sensitive data. In this work, we study the impact of open-source LLMs at accessible scales (<=14B parameters) as an alternative. Our results demonstrate that open-source LLMs as small as 3B parameters can serve as effective Promptagator-style query generators. We hope our work provides practitioners with reliable alternatives for synthetic data generation and offers insights for maximizing fine-tuning results in domain-specific applications.

# Citation
```bibtex
@inproceedings{gwon_2025_promptodile,
  title = {Study on LLMs for Promptagator-Style Dense Retriever Training},
  author = {Gwon, Daniel and Jedidi, Nour and Lin, Jimmy},
  booktitle = {Proceedings of the 34th ACM International Conference on Information and Knowledge Management},
  series = {CIKM '25},
  year = {2025},
  pages = {XXX--XXX},
  publisher = {Association for Computing Machinery},
}
```

# Installation
For evaluation of our retrieval methods, we used [Pyserini](https://github.com/castorini/pyserini). Unfortunately, we ran into issues using Pyserini alongside the other required packages. We therefore recommend creating two separate virtual environments: one for query generation & retriever training, and one for indexing & evaluation.

## Query Generation and Retriever Training
We use Python 3.12 and CUDA 12.8 (see the [vLLM documentation](https://docs.vllm.ai/en/stable/getting_started/quickstart.html#installation) for details).

```bash
$ # Create your environment.
$ conda create -n promptodile python=3.12 -y
$ conda activate promptodile
$
$ # First, install vLLM with the --torch-backend flag.
$ uv pip install vllm --torch-backend=auto
$
$ # Then, install the rest of the packages.
$ uv pip install -r requirements.txt
```

## Index/Evaluation
To create a custom environment for Pyserini, see the detailed instructions in the Pyserini [installation walkthrough](https://github.com/castorini/pyserini/blob/master/docs/installation.md#pypi-installation-walkthrough).

Note that the optional dependency `faiss-cpu` is only intended for indexing your corpus on CPU; use `faiss-gpu` to index on GPU. The Linux installation instructions are likewise CPU-oriented; if you have a GPU, adjust the PyTorch index URL to match your CUDA version. We found the faiss [installation instructions](https://github.com/facebookresearch/faiss/blob/main/INSTALL.md) more helpful for installing `faiss-cpu` or `faiss-gpu`.

We install Pyserini with Python 3.11, and install PyTorch & `faiss-gpu` against CUDA 12.4. We run evaluation with Pyserini on Linux and downgrade numpy to 1.26.4.

```bash
$ # See links above for installation instructions for pyserini
$ conda activate pyserini
```
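
For reference, a minimal sketch of the environment setup we used is below. Treat it as a starting point rather than authoritative instructions: the conda channels and unpinned versions are assumptions, and the Pyserini and faiss guides linked above take precedence.

```bash
$ # Sketch of a Pyserini evaluation environment (see links above for the
$ # authoritative walkthroughs).
$ conda create -n pyserini python=3.11 -y
$ conda activate pyserini
$
$ # Install Pyserini (it also requires a JDK; see the walkthrough).
$ pip install pyserini
$
$ # faiss-gpu via conda; channels are assumptions, see faiss's INSTALL.md.
$ conda install -c pytorch -c nvidia faiss-gpu -y
$
$ # Downgrade numpy for compatibility.
$ pip install numpy==1.26.4
```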

# Usage
Usage can be broken down into three steps:
1. Query Generation
2. Retriever Model Training
3. Index/Evaluation

Configuration varies considerably across the three steps, so in addition to a shared configuration file, each step has its own configuration file:
1. qgen.json
2. train.json
3. index.json
4. shared.json

Examples have been provided in `./configs/templates` to train [contriever](https://huggingface.co/facebook/contriever) and [e5](https://huggingface.co/intfloat) backbone models.

## Query Generation
Query generation is designed for offline batched inference using [vLLM](https://docs.vllm.ai/en/stable/). The package is built around instruct models, so chat templates should be used for best performance.

```bash
$ conda activate promptodile
$ python -m promptodile.query_generation.generate qgen.json shared.json
```

## Retriever Training
```bash
$ conda activate promptodile
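$ # Replace GPUS with the number of GPUs to use.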
$ accelerate launch --num_processes=GPUS -m promptodile.train train.json shared.json
```

## Index/Evaluation
```bash
$ conda activate pyserini
$
$ # When run as a script, this will automatically evaluate and report NDCG@10.
$ python -m promptodile.index index.json shared.json
```

# Data
For consistency, we attempt to follow the dataset formatting established by TREC (Text REtrieval Conference) as closely as possible.

## BEIR
Please visit [BEIR](https://huggingface.co/BeIR) for relevant datasets.

You can use utility functions in `promptodile/utils.py` to convert the corpus and queries to TREC format.
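
For a quick one-off conversion, a `jq` one-liner can also work. The sketch below assumes your BEIR files use the usual `_id`, `title`, and `text` fields; it is an illustration, not the repository's utility (file names are placeholders):

```bash
$ # Corpus: BEIR {_id, title, text} -> {docid, title, body} (see corpus.jsonl below).
$ jq -c '{docid: ._id, title: .title, body: .text}' beir_corpus.jsonl > corpus.jsonl
$
$ # Queries: BEIR {_id, text} -> {id, narrative} (see queries.jsonl below).
$ jq -c '{id: ._id, narrative: .text}' beir_queries.jsonl > queries.jsonl
```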

## Input Files

### corpus.jsonl
Contains all of the documents in your corpus. Each JSON line may contain five possible fields, of which three are used:
1. `docid`
2. `url` (not used)
3. `title`
4. `headings` (not used)
5. `body`

[more details](https://trec-rag.github.io/annoucements/2025-rag25-corpus/#document-structure)
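
For illustration, a single line (with made-up values) might look like:

```json
{"docid": "doc0", "url": "https://example.com/doc0", "title": "Example Document", "headings": "Example Heading", "body": "The full text of the document goes here."}
```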

### queries.jsonl
Contains the query text that maps to the topics found in `examples.txt`. At minimum, the query text for each example must be provided with the following fields:
1. `id` (mapping to a topic in `examples.txt`)
2. `narrative` (the query/topic's text)
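
For illustration, a single line (with made-up values) might look like:

```json
{"id": "42", "narrative": "what daily habits improve long-term memory"}
```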

### qrels
This is a text file containing whitespace-delimited rows that tie topics (or queries) to documents with relevance judgments. No header row is included; each entry in a row maps to:
1. `Topic`
2. `Iteration`
3. `Document#`
4. `Relevancy`

[more details](https://trec.nist.gov/data/qrels_eng/)
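
For example, a row judging document `doc0` relevant to topic `42` (the `Iteration` column is conventionally 0) would look like:

```text
42 0 doc0 1
```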

### examples.txt
If provided, this file contains the few-shot examples to be used in the query generation prompt. It uses the same formatting as the qrels text file.

## Output Files

### syn_queries.jsonl
Generated by query generation; uses the same formatting as corpus.jsonl but adds a `queries` field to each line:
1. `docid`
2. `url` (not used)
3. `title`
4. `headings` (not used)
5. `body`
6. `queries` (generated)

The value of the `queries` field is a list containing each of the synthetic queries generated for the document.
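
For illustration, a single line (with made-up values) might look like:

```json
{"docid": "doc0", "title": "Example Document", "body": "The full text of the document goes here.", "queries": ["example synthetic query one", "example synthetic query two"]}
```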

### runs
A text file containing a ranked list of retrieved documents for a set of queries, generated after indexing to evaluate the fine-tuned model. Rows are whitespace-delimited, and the entries in each row correspond to the following headers (not included in the file):
1. `Topic ID`
2. `Q0` (a fixed string)
3. `docid`
4. `Rank`
5. `Score`
6. `Run ID`

[more details](https://trec-rag.github.io/annoucements/2025-track-guidelines/#output-format-ranked-results)
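
For example, a row ranking `doc0` first for topic `42` (the score and run ID are made up) would look like:

```text
42 Q0 doc0 1 14.8300 promptodile
```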

# Disclosure
DISTRIBUTION STATEMENT A. Approved for public release. Distribution is unlimited.

This material is based upon work supported by the Department of the Air Force under Air Force Contract No. FA8702-15-D-0001 or FA8702-25-D-B002. Any opinions, findings, conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the Department of the Air Force.

© 2025 Massachusetts Institute of Technology.

Subject to FAR 52.227-11 Patent Rights - Ownership by the contractor (May 2014)

The software/firmware is provided to you on an As-Is basis.

Delivered to the U.S. Government with Unlimited Rights, as defined in DFARS Part 252.227-7013 or 7014 (Feb 2014). Notwithstanding any copyright notice, U.S. Government rights in this work are defined by DFARS 252.227-7013 or DFARS 252.227-7014 as detailed above. Use of this work other than as specifically authorized by the U.S. Government may violate any copyrights that exist in this work.