
In the era of Large Language Models and vector search, this repository offers tools and processes to answer these three questions: 1) How can I build a dataset with rated queries to measure the quality of my search engine? 2) Is the embedding model the problem? 3) Is my implementation of approximate nearest neighbour search the problem?

SeaseLtd/llm-search-quality-evaluation

LLM Search Quality Evaluation

Overview

  • Dataset Generator
  • Vector Search Doctor
    • Embedding Model Evaluator
    • Approximate Search Evaluator

Dataset Generator

The Dataset Generator is a flexible command-line tool for building relevance datasets for search evaluation. It can retrieve documents from a search engine, generate synthetic queries, and score the relevance of document-query pairs using LLMs.

Vector Search Doctor

This tool helps diagnose and optimize vector search performance by evaluating both embedding models and search configurations. It consists of two sub-tools that work together to identify bottlenecks and improve retrieval quality in your vector search pipeline.

Embedding Model Evaluator

This sub-tool extends the MTEB benchmarking tool to test the performance of a HuggingFace embedding model on both Retrieval and Reranking tasks, using custom datasets.
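As a toy illustration of what a Retrieval-task evaluation measures (this is not the MTEB harness itself), the sketch below ranks hand-made document vectors by cosine similarity to a query and computes recall@k. All vectors, document IDs, and relevance sets here are invented for the example; a real run would use embeddings produced by the HuggingFace model under test.

```python
# Toy illustration of a retrieval metric, NOT the actual MTEB harness.
import math

def cosine(a, b):
    # cosine similarity between two dense vectors
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def recall_at_k(query_vec, doc_vecs, relevant_ids, k):
    # rank documents by similarity to the query, then check how many
    # of the truly relevant ones appear in the top k
    ranked = sorted(doc_vecs, key=lambda d: cosine(query_vec, d[1]), reverse=True)
    top_ids = {doc_id for doc_id, _ in ranked[:k]}
    return len(top_ids & relevant_ids) / len(relevant_ids)

# made-up 2-d "embeddings" standing in for model output
docs = [("d1", [1.0, 0.1]), ("d2", [0.0, 1.0]), ("d3", [0.9, 0.2])]
print(recall_at_k([1.0, 0.0], docs, {"d1", "d3"}, k=2))  # -> 1.0
```

A low score here with a good ANN configuration points at the embedding model itself, which is exactly the bottleneck this sub-tool is meant to isolate.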

Approximate Search Evaluator

This sub-tool provides a flexible way to deploy RRE and extract metrics to test your search engine collection, given a template.
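A common way to diagnose approximate nearest-neighbour search is to compare the ANN result list against exact brute-force results and report their overlap (recall). The sketch below is a self-contained illustration of that idea with made-up vectors and a hand-picked "approximate" result set; it is not the RRE tooling this sub-tool deploys.

```python
# Toy sketch of an ANN diagnostic: overlap between approximate and exact
# k-nearest-neighbour results. Everything below is invented for illustration.
import math

def l2(a, b):
    # Euclidean distance between two vectors
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def exact_knn(query, index, k):
    # brute-force ground truth: sort every document by distance to the query
    ranked = sorted(index.items(), key=lambda kv: l2(query, kv[1]))
    return [doc_id for doc_id, _ in ranked[:k]]

def ann_recall(approx_ids, exact_ids):
    # fraction of the true top-k that the approximate search recovered
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)

index = {"a": [0.0, 0.0], "b": [1.0, 0.0], "c": [0.0, 1.0], "d": [5.0, 5.0]}
exact = exact_knn([0.1, 0.1], index, k=3)  # brute-force top 3
approx = ["a", "b", "d"]                   # pretend the ANN index missed "c"
print(ann_recall(approx, exact))
```

If the embedding model scores well but this recall is poor, the approximate index configuration (not the model) is the likely culprit.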

Quickstart: tools installation

  • uv: a fast Python package installer and resolver. To install uv, follow the instructions here.
  • Python 3.10: the version is pinned for the project; see the .python-version file.

First, create a virtual environment and install the dependencies declared in pyproject.toml using uv:

# install dependencies (for users)
uv sync

# install development dependencies as well (e.g., mypy and ruff)
uv sync --group dev

# remove all cached packages
uv cache clean

Running Dataset Generator

Before running the command below, you need a running search engine instance (Solr, OpenSearch, Elasticsearch, or Vespa).

For a detailed description of how to fill in your configuration file, see the Dataset Generator README.

Execute the main script via CLI, pointing to your DAGE configuration file:

uv run dataset_generator --config <path-to-config-yaml>

By default, the CLI points to the example file inside the examples/configs/ directory.

To know more about all the possible CLI parameters, execute:

uv run dataset_generator --help

Running Embedding Model Evaluator

For a detailed description of how to fill in the configuration file, see the README.

Execute the main script via CLI, pointing to your configuration file:

uv run embedding_model_evaluator --config <path-to-config-yaml>

By default, the CLI points to the example file inside the examples/configs/ directory.

Running Approximate Search Evaluator

For a detailed description of how to fill in the configuration file, see the README.

uv run approximate_search_evaluator --config <path-to-config-yaml>

By default, the CLI points to the example file inside the examples/configs/ directory.

Running tests

1. Unit Tests

Execute pytest command as follows:

uv run pytest

When executed, the Dataset Generator script will:

  1. Fetch documents from the specified search engine.
  2. Generate or load queries.
  3. Score the relevance for each (document, query) pair.
  4. Save the output to the destination (specified in the config file).
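The four steps above can be sketched with stand-in components. The real tool talks to a search engine and an LLM; fetch_documents, generate_queries, and score_relevance below are invented stubs used only to show the shape of the pipeline.

```python
# Minimal sketch of the dataset-generation flow with stand-in components.
def fetch_documents():                       # step 1: stub for the search engine
    return [{"id": "d1", "text": "vector search with HNSW"}]

def generate_queries(doc):                   # step 2: stub for the query LLM
    return [f"what is {doc['text'].split()[-1]}?"]

def score_relevance(doc, query):             # step 3: stub for the LLM judge
    return 3 if doc["text"].split()[-1] in query else 0

def build_dataset():
    rows = []
    for doc in fetch_documents():            # step 1
        for query in generate_queries(doc):  # step 2
            rows.append({"doc_id": doc["id"],
                         "query": query,
                         "rating": score_relevance(doc, query)})  # step 3
    return rows                              # step 4: caller persists the rows

print(build_dataset())
```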

Code Quality Tools

This project uses:

  • Ruff for linting.
  • Mypy for static type checking.

Linting with Ruff

# Check for issues
uv run ruff check .

# Auto-fix fixable issues
uv run ruff check --fix .

# Format code (if enabled)
uv run ruff format .

Type Checking with Mypy

# Run type checking
uv run mypy .

Config Files

  • ruff.toml: Ruff linting rules and settings.
  • mypy.ini: Mypy type checking rules and settings.
