Hi there, welcome to our LLM search quality evaluation tutorial!
This tutorial guides you through a complete workflow, from generating a relevance-labeled dataset to comparing the performance of exact and approximate vector search.
There are two parts: the Dataset Generator and the Vector Search Doctor, which together cover the overall workflow. This tutorial answers the following questions:
- How to generate a relevance-labeled dataset for search quality evaluation?
- How to run and evaluate an embedding model with exact vector search?
- How to run an approximate vector search (ANN) and compare its performance against the exact vector search?
Before we begin, ensure you have the following tools installed and configured:
- Docker Desktop: we use a dockerized Solr instance as the search engine.
- Install with the Docker Desktop install guide
- Verify with docker --version. Docker Desktop includes docker compose.
- uv: Python package installer. This is used to set up the project's virtual environment.
- Install with the uv install guide
- Java and Maven: needed to run our third tool, the Approximate Search Evaluator, so make sure both are installed on your machine.
- Git LFS: we use it to store large datasets in GitHub
- Install with the Git LFS install guide
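A quick sanity check that everything is installed and on your PATH (a sketch; the exact version numbers do not matter, only that each command runs):
# verify the prerequisites; each command should print a version string
docker --version
docker compose version
uv --version
java -version
mvn -version
git lfs version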
Now that you have the prerequisites installed, let's set up the projects and run the search quality evaluation.
What we do next:
- Set up LLM Search Quality Evaluation & Tutorial Projects
- Run Solr and Index Documents
- Run Dataset Generator for relevance dataset creation
- Run Embedding Model Evaluator for exact vector search performance
- Run Approximate Search Evaluator for ANN search performance
There are two repos that we need to clone.
First, clone the llm-search-quality-evaluation-tutorial repo:
git clone git@github.com:SeaseLtd/llm-search-quality-evaluation-tutorial.git
Second, clone the llm-search-quality-evaluation repo:
git clone git@github.com:SeaseLtd/llm-search-quality-evaluation.git
cd llm-search-quality-evaluation
We use uv to create a virtual environment in the project and install all the required packages:
uv sync
In this tutorial we use Solr, running a dockerized instance locally. The llm-search-quality-evaluation-tutorial repo contains a docker-services folder; we will be using its solr-init sub-folder, where you can find the large dataset, the Dockerfile, and solr_init.py.
In addition to running the Solr instance, we need to index some documents for the actual search quality evaluation.
We will use the large dataset (dataset.json).
The dataset comes from BBC News and contains around 100k documents. As it is large, we use Git LFS to store it in GitHub.
Run the following commands to get the actual content of the files instead of references:
from the llm-search-quality-evaluation-tutorial repo:
llm-search-quality-evaluation-tutorial$ git lfs install
llm-search-quality-evaluation-tutorial$ git lfs pull
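To confirm that the real files were fetched rather than LFS pointer stubs, you can optionally list the LFS-tracked files:
# each tracked file is listed with its object id; an asterisk marks fully downloaded content
llm-search-quality-evaluation-tutorial$ git lfs ls-files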
Next, run Solr (it can be reached at http://localhost:8983/solr) and index the large dataset, from the llm-search-quality-evaluation-tutorial repo:
llm-search-quality-evaluation-tutorial$ cd docker-services
docker-services$ docker compose -f docker-compose.solr.yml up --build
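Once the containers are up and indexing has finished, you can optionally confirm that Solr is responding via its core admin API (a generic check that does not require knowing the collection name):
# lists the cores hosted by the local Solr instance
curl "http://localhost:8983/solr/admin/cores?action=STATUS"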
Now we get to the core of the project; make sure you are in the llm-search-quality-evaluation repo.
The Dataset Generator is a CLI tool to generate a relevance dataset for search evaluation. It retrieves documents from the search engine, generates synthetic queries, and scores the relevance of document-query pairs using LLMs.
Before running, we need to set up a configuration file.
- See dataset_generator_config.yaml for an example
- For detailed configuration info, see the README
Before running the dataset generator, we need to either set up an LLM configuration
(e.g. provide an LLM API key by adding a .env file with OPENAI_API_KEY)
or, to save time, use our tmp datastore where the LLM-generated queries and ratings are already stored.
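If you go the LLM route, the .env file can be as simple as the single line below (a sketch; the repo root is an assumed location, so check the README for where the file is expected):
# assumed location: llm-search-quality-evaluation repo root
echo "OPENAI_API_KEY=<your-api-key>" > .env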
If you go with the tmp datastore instead, copy it from the llm-search-quality-evaluation-tutorial repo to the llm-search-quality-evaluation repo.
Either copy and paste it manually or run the command below, substituting $absPath with the path to the directory that contains the tutorial repo:
cp $absPath/llm-search-quality-evaluation-tutorial/datastore.json ./resources/tmp
To run the dataset generator:
uv run dataset_generator --config examples/configs/dataset_generator/dataset_generator_config.yaml
This produces a relevance dataset file under the resources dir, which will be used in the next modules.
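To confirm the output was written, you can list the resources directory (the exact filename depends on your configuration):
# show everything generated under resources
ls -R resources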
To know more about all the possible CLI parameters:
uv run dataset_generator --help
The Embedding Model Evaluator is an MTEB benchmarking extension designed to evaluate embedding models on a custom dataset, with a focus on retrieval and reranking tasks. It assesses model quality by using an exact vector search to establish a "ground truth" for retrieval performance on the custom dataset.
Before running, we need to set up a configuration file.
- See embedding_model_evaluator_config.yaml for an example.
- For detailed configuration info, see the README.
To run the embedding model evaluator:
uv run embedding_model_evaluator --config examples/configs/vector_search_doctor/embedding_model_evaluator/embedding_model_evaluator_config.yaml
This outputs the task evaluation results and the relevance dataset embeddings (document and query embeddings).
The embeddings are saved to the resources/embeddings dir and will be used in the next module.
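You can take a quick look at what was produced (documents_embeddings.jsonl is the file we copy in the next step; being JSONL, it stores one JSON object per line):
# list the generated embedding files and peek at the first few bytes
ls -lh resources/embeddings
head -c 300 resources/embeddings/documents_embeddings.jsonl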
We will copy these embeddings to the llm-search-quality-evaluation-tutorial repo so that
we can re-index the dataset in Solr with the document embeddings.
Either copy and paste it manually or run the command below, substituting $absPath with the path to the directory that contains the tutorial repo:
cp ./resources/embeddings/documents_embeddings.jsonl $absPath/llm-search-quality-evaluation-tutorial/embeddings
Then, we can re-index the documents with their embeddings into Solr from the llm-search-quality-evaluation-tutorial repo:
llm-search-quality-evaluation-tutorial$ cd docker-services
docker-services$ docker compose -f docker-compose.solr.yml run --rm -e FORCE_REINDEX=true solr-init
The Approximate Search Evaluator tests ANN (approximate nearest neighbour) vector search against the collection in the search engine, now enriched with embeddings.
Next, we need the relevance dataset in RRE format for this module. We generate it by running the Dataset Generator with output_format=RRE, so update the dataset generator config file accordingly.
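One quick way to flip that setting from the command line (a sketch; it assumes output_format is a top-level key in the YAML, so adjust if your config nests it differently):
# rewrite the output_format line in place, keeping a .bak backup
sed -i.bak 's/^output_format:.*/output_format: RRE/' examples/configs/dataset_generator/dataset_generator_config.yaml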
Then re-run the dataset generator:
uv run dataset_generator --config examples/configs/dataset_generator/dataset_generator_config.yaml
Once we have the relevance dataset file (ratings.json), we set up a configuration file for the Approximate Search Evaluator.
- See approximate_search_evaluator_config.yaml for an example.
- For detailed configuration info, see the README.
Note: Sease's RRE Maven archetype must be made available locally so that Maven can find it. This is done by following Step 1 of the guide in the RRE Wiki.
Now we can run the approximate search evaluator:
uv run approximate_search_evaluator --config examples/configs/vector_search_doctor/approximate_search_evaluator/approximate_search_evaluator_config.yaml
This outputs the approximate vector search evaluation results, which can be compared to the previous exact vector search results.
