- Dataset Generator
- Vector Search Doctor
- Embedding Model Evaluator
- Approximate Search Evaluator
The Dataset Generator is a flexible command-line tool for generating relevance datasets for search evaluation. It can retrieve documents from a search engine, generate synthetic queries, and score the relevance of document-query pairs using LLMs.
The Vector Search Doctor helps diagnose and optimize vector search performance by evaluating both embedding models and search configurations. It consists of two sub-tools that work together to identify bottlenecks and improve retrieval quality in your vector search pipeline.
The Embedding Model Evaluator sub-tool extends the MTEB benchmarking tool to test the performance of a HuggingFace embedding model on both Retrieval and Reranking tasks, using custom datasets.
The Approximate Search Evaluator sub-tool provides a flexible way to deploy RRE and extract metrics for testing your search engine collection, given a template.
- uv: a fast Python package installer and resolver. To install uv, follow the instructions in the official uv documentation.
- Python 3.10: the version is pinned and used throughout the project; see the .python-version file.
First, create a virtual environment with uv, based on the pyproject.toml file. To do so, run:
# install dependencies (for users)
uv sync
# install development dependencies as well (e.g., mypy and ruff)
uv sync --group dev
# remove all cached packages
uv cache clean
Before running the command below, you need a running search engine instance (Solr, OpenSearch, Elasticsearch, or Vespa).
For a detailed description of how to fill in your configuration file (e.g., Config), see the Dataset Generator README.
Execute the main script via CLI, pointing to your DAGE configuration file:
uv run dataset_generator --config <path-to-config-yaml>
By default, the CLI points to the file inside the examples/configs/ directory.
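As a rough illustration of how a `--config` option with a default value behaves, here is a minimal sketch using Python's standard `argparse` module. It is not the project's actual CLI code, which may be built differently; the `build_parser` helper and the `my_config.yaml` file name are hypothetical placeholders.

```python
import argparse
from pathlib import Path


def build_parser(default_config: Path) -> argparse.ArgumentParser:
    """Build a parser with a single --config option (hypothetical sketch)."""
    parser = argparse.ArgumentParser(prog="dataset_generator")
    parser.add_argument(
        "--config",
        type=Path,
        default=default_config,
        help="Path to the YAML configuration file.",
    )
    return parser


# "my_config.yaml" is a placeholder name, not a file shipped with the project.
parser = build_parser(Path("examples/configs") / "my_config.yaml")

# No flag given: the default path is used.
default_args = parser.parse_args([])

# Flag given: the explicit path takes precedence over the default.
custom_args = parser.parse_args(["--config", "other/config.yaml"])

print(default_args.config)
print(custom_args.config)
```

With this pattern, omitting `--config` falls back to the default location, while passing it overrides the default for that run.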
To learn about all available CLI parameters, run:
uv run dataset_generator --help
For a detailed description of how to fill in the configuration file (e.g., Config), see the README.
Execute the main script via CLI, pointing to your configuration file:
uv run embedding_model_evaluator --config <path-to-config-yaml>
By default, the CLI points to the file inside the examples/configs/ directory.
For a detailed description of how to fill in the configuration file (e.g., Config), see the README.
uv run approximate_search_evaluator --config <path-to-config-yaml>
By default, the CLI points to the file inside the examples/configs/ directory.
Run the tests with pytest as follows:
uv run pytest
The script will then:
- Fetch documents from the specified search engine.
- Generate or load queries.
- Score the relevance for each (document, query) pair.
- Save the output to the destination (specified in the config file).
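The four steps above can be sketched as a small pipeline. This is only an illustration under stated assumptions, not the tool's implementation: the `Rating` class, the `build_dataset` function, the JSON output format, and the stub callables standing in for the search engine and the LLM are all hypothetical.

```python
import json
import tempfile
from dataclasses import asdict, dataclass
from pathlib import Path
from typing import Callable


@dataclass
class Rating:
    doc_id: str
    query: str
    score: int


def build_dataset(
    fetch_documents: Callable[[], dict[str, str]],
    generate_queries: Callable[[str], list[str]],
    score_relevance: Callable[[str, str], int],
    output_path: Path,
) -> list[Rating]:
    # 1. Fetch documents (doc_id -> text) from the search engine.
    documents = fetch_documents()

    # 2. Generate queries from each document, keeping unique ones in order.
    queries: list[str] = []
    for text in documents.values():
        for query in generate_queries(text):
            if query not in queries:
                queries.append(query)

    # 3. Score the relevance of every (document, query) pair.
    ratings = [
        Rating(doc_id, query, score_relevance(text, query))
        for doc_id, text in documents.items()
        for query in queries
    ]

    # 4. Save the output to the configured destination (JSON is an assumption).
    output_path.write_text(json.dumps([asdict(r) for r in ratings], indent=2))
    return ratings


# Stubs standing in for a real search engine and a real LLM.
def fake_fetch() -> dict[str, str]:
    return {
        "d1": "uv is a fast Python package installer",
        "d2": "ruff is a Python linter",
    }


def fake_queries(text: str) -> list[str]:
    # Use the first two words of the document as a "synthetic" query.
    return [" ".join(text.split()[:2])]


def fake_score(text: str, query: str) -> int:
    # Binary relevance: 1 if the query string occurs in the document.
    return 1 if query in text else 0


with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / "dataset.json"
    ratings = build_dataset(fake_fetch, fake_queries, fake_score, out)
    saved = json.loads(out.read_text())

print(len(ratings), "ratings written")
```

Passing the fetch, generate, and score steps as callables keeps the sketch independent of any particular search engine or LLM provider; in the real tool these choices are driven by the configuration file.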
This project uses Ruff for linting and formatting, and mypy for type checking:
# Check for issues
uv run ruff check .
# Auto-fix fixable issues
uv run ruff check --fix .
# Format code (if enabled)
uv run ruff format .
# Run type checking
uv run mypy .
Config Files
- ruff.toml: Ruff linting rules and settings.
- mypy.ini: Mypy type checking rules and settings.