feat: add doc retriever #39

Merged
leomaurodesenv merged 3 commits into main from feat/add-doc-retriever
Jul 31, 2025
Conversation

@leomaurodesenv
Owner

feat(experiments): Add document retriever experiments

This pull request introduces a new experiment script to evaluate and compare different document retrieval models. This complements the existing document reader experiments by allowing us to benchmark the first crucial stage of a modern Question Answering pipeline: finding relevant documents from a large corpus.

The new script leverages the Haystack framework to test various retrieval algorithms like BM25, TF-IDF, and Dense Passage Retriever (DPR) against our datasets.

Key Changes:

  • New Document Retriever Experiment:

    • Adds experiments/doc_retriever.py, a new script dedicated to running document retrieval evaluations.
    • It builds a Haystack DocumentSearchPipeline to measure the performance of different retrievers.
  • Supported Retriever Models:

    • A retriever_switch function has been implemented to easily select between:
      • BM25Retriever
      • TfidfRetriever
      • DensePassageRetriever (DPR)
  • Dynamic Configuration with Argparse:

    • The script integrates argparse to allow for flexible configuration from the command line. Users can now specify:
      • --model: The retriever model to use (e.g., BM25, DPR).
      • --dataset: The dataset for evaluation (e.g., QASports, SQuAD).
      • --sport: The specific sport for the QASports dataset.
      • --num_k: The number of top documents to retrieve.
  • Refactoring:

    • The dataset_switch function in experiments/module.py was refactored for clarity and to support both the document reader and the new document retriever experiments.
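The argparse integration described above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: only the flag names (`--model`, `--dataset`, `--sport`, `--num_k`) come from the description; the choices, defaults, and `build_parser` helper are assumptions.

```python
# Hypothetical sketch of the CLI wiring in experiments/doc_retriever.py.
# Flag names are from the PR; choices and defaults are assumptions.
import argparse

def build_parser():
    parser = argparse.ArgumentParser(
        description="Run document retriever experiments"
    )
    parser.add_argument("--model", default="BM25",
                        choices=["BM25", "TFIDF", "DPR"],
                        help="Retriever model to evaluate")
    parser.add_argument("--dataset", default="QASports",
                        choices=["QASports", "SQuAD"],
                        help="Dataset for evaluation")
    parser.add_argument("--sport", default="all",
                        help="Sport subset of the QASports dataset")
    parser.add_argument("--num_k", type=int, default=3,
                        help="Number of top documents to retrieve")
    return parser

# Example: parse the flags used in the BM25 run below.
args = build_parser().parse_args(["--model", "DPR", "--num_k", "5"])
print(args.model, args.dataset, args.num_k)  # DPR QASports 5
```

The parsed values would then be handed to `dataset_switch` and `retriever_switch` to build the evaluation pipeline.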

How to Run the New Experiment:

The README.md has been updated with instructions. You can run the experiment as follows:

# See all available options for the document retriever experiment
$ uv run -m experiments.doc_retriever --help

# Example: Run with the BM25 model on the QASports basketball dataset, retrieving the top 5 documents
$ uv run -m experiments.doc_retriever --model BM25 --dataset QASports --sport BASKETBALL --num_k 5

Local Tests

➜  qasports-dataset-scripts git:(feat/add-doc-retriever) ✗ uv run -m experiments.doc_retriever --num_k 3 --dataset SQuAD --model DPR  
Dataset: Dataset.SQuAD // Sport: all
Model: DocRetriever.DPR // Top-K: 3
## SQuaD Dataset ##
Dataset({
    features: ['id', 'title', 'context', 'question', 'answers'],
    num_rows: 10570
})
Updating BM25 representation...: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100/100 [00:00<00:00, 26588.30 docs/s]
Documents Processed: 10000 docs [00:27, 360.40 docs/s]                                                                                                                                           
Retriever: <haystack.nodes.retriever.dense.DensePassageRetriever object at 0x744ab9e85090>                                                                                                       
{'Retriever': {'recall_multi_hit': 0.6666666666666666, 'recall_single_hit': 0.6666666666666666, 'precision': 0.2777777777777778, 'map': 0.8611111111111112, 'mrr': 0.6666666666666666, 'ndcg': 0.7347941325294953}}

➜  qasports-dataset-scripts git:(feat/add-doc-retriever) ✗ uv run -m experiments.doc_retriever --num_k 3 --dataset SQuAD --model TFIDF
Dataset: Dataset.SQuAD // Sport: all
Model: DocRetriever.TFIDF // Top-K: 3
## SQuaD Dataset ##
Dataset({
    features: ['id', 'title', 'context', 'question', 'answers'],
    num_rows: 10570
})
Updating BM25 representation...: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100/100 [00:00<00:00, 25522.11 docs/s]
Retriever: <haystack.nodes.retriever.sparse.TfidfRetriever object at 0x7be429d9c5e0>
{'Retriever': {'recall_multi_hit': 0.5555555555555556, 'recall_single_hit': 0.5555555555555556, 'precision': 0.18518518518518517, 'map': 0.42592592592592593, 'mrr': 0.42592592592592593, 'ndcg': 0.45899219484127307}}

➜  qasports-dataset-scripts git:(feat/add-doc-retriever) ✗ uv run -m experiments.doc_retriever --num_k 3 --dataset SQuAD --model BM25 
Dataset: Dataset.SQuAD // Sport: all
Model: DocRetriever.BM25 // Top-K: 3
## SQuaD Dataset ##
Dataset({
    features: ['id', 'title', 'context', 'question', 'answers'],
    num_rows: 10570
})
Updating BM25 representation...: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100/100 [00:00<00:00, 25452.42 docs/s]
Retriever: <haystack.nodes.retriever.sparse.BM25Retriever object at 0x7f06f26828c0>
{'Retriever': {'recall_multi_hit': 0.8148148148148148, 'recall_single_hit': 0.8888888888888888, 'precision': 0.35185185185185186, 'map': 0.7037037037037037, 'mrr': 0.6481481481481481, 'ndcg': 0.7141124071708137}}
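For reference, the `recall_single_hit` and `mrr` figures in the logs above follow the standard ranked-retrieval definitions. The following is an illustrative computation only (the PR itself relies on Haystack's built-in evaluation, not this code); the document IDs are made up:

```python
# Illustrative definitions of two of the metrics reported above.
def recall_single_hit(ranked_ids, relevant_ids, k):
    # 1.0 if any relevant document appears in the top-k results, else 0.0
    return 1.0 if any(doc in relevant_ids for doc in ranked_ids[:k]) else 0.0

def reciprocal_rank(ranked_ids, relevant_ids, k):
    # 1/rank of the first relevant document in the top-k, else 0.0
    for rank, doc in enumerate(ranked_ids[:k], start=1):
        if doc in relevant_ids:
            return 1.0 / rank
    return 0.0

# Two toy queries: (ranked results, set of relevant doc IDs)
queries = [(["d1", "d2", "d3"], {"d2"}),
           (["d4", "d5", "d6"], {"d9"})]

recall = sum(recall_single_hit(r, rel, 3) for r, rel in queries) / len(queries)
mrr = sum(reciprocal_rank(r, rel, 3) for r, rel in queries) / len(queries)
print(recall, mrr)  # 0.5 0.25
```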

@leomaurodesenv leomaurodesenv self-assigned this Jul 31, 2025
@leomaurodesenv leomaurodesenv added the enhancement New feature or request label Jul 31, 2025
@leomaurodesenv leomaurodesenv merged commit 27a6fc4 into main Jul 31, 2025
1 check passed
@leomaurodesenv leomaurodesenv deleted the feat/add-doc-retriever branch July 31, 2025 01:03
@leomaurodesenv
Owner Author

Supported by the discussion deepset-ai/haystack#1305

leomaurodesenv added a commit that referenced this pull request Aug 1, 2025
* feat(experiments): add doc retriever

* feat(experiments): refactoring dataset switch

* feat(experiments): add argparsers to doc retriever