# Benchmarking embedding models with BEIR Datasets and Llama Stack

## Purpose
This script compares retrieval accuracy across embedding models using standardized information retrieval benchmarks from the [BEIR](https://github.com/beir-cellar/beir) framework.

## Setup
The examples use Ollama to serve the model, which can easily be swapped for an inference provider of your choice.

Initialize a virtual environment:
```bash
uv venv .venv --python 3.12 --seed
source .venv/bin/activate
```

Install the script's dependencies:
```bash
uv pip install -r requirements.txt
```

Prepare your environment by running:
```bash
# run.yaml is based on the starter template: https://github.com/meta-llama/llama-stack/tree/main/llama_stack/templates/starter
# The build installs all of the dependencies for the starter template
llama stack build --template starter --image-type venv
```

## Running Instructions

### Basic Usage
To run the script with default settings:

```bash
# Update OLLAMA_INFERENCE_MODEL to your preferred model, or use the equivalent setting for your inference provider
ENABLE_OLLAMA=ollama ENABLE_MILVUS=milvus OLLAMA_INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct" uv run python beir_benchmarks.py
```

## Supported Embedding Models

Default supported embedding models:
- `granite-embedding-30m`: IBM Granite 30M parameter embedding model
- `granite-embedding-125m`: IBM Granite 125M parameter embedding model

It is possible to add more embedding models using the [Llama Stack Python Client](https://github.com/llamastack/llama-stack-client-python).

### Adding additional embedding models
Below is an example of how to add more embedding models to the models list.
```bash
# First, run the Llama Stack server via the run file
ENABLE_OLLAMA=ollama ENABLE_MILVUS=milvus OLLAMA_INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct" uv run llama stack run run.yaml
```
```bash
# Add the all-MiniLM-L6-v2 model via the llama-stack-client CLI
llama-stack-client models register all-MiniLM-L6-v2 --provider-id sentence-transformers --provider-model-id all-minilm:latest --metadata '{"embedding_dimension": 384}' --model-type embedding
```
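
Alternatively, the registration can be done programmatically with the [Llama Stack Python Client](https://github.com/llamastack/llama-stack-client-python). Below is a minimal sketch mirroring the CLI call above; it assumes the server is running locally on the default port 8321.
```python
# Minimal sketch: register an embedding model via the Llama Stack Python client.
# Assumes the server started above is reachable at http://localhost:8321.
from llama_stack_client import LlamaStackClient

client = LlamaStackClient(base_url="http://localhost:8321")

# Same provider, provider model id, and metadata as the CLI example above.
client.models.register(
    model_id="all-MiniLM-L6-v2",
    provider_id="sentence-transformers",
    provider_model_id="all-minilm:latest",
    metadata={"embedding_dimension": 384},
    model_type="embedding",
)
```
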
> [!NOTE]
> Shut down the Llama Stack server before running the benchmark.

### Command-Line Options

#### `--dataset-names`
**Description:** Specifies which BEIR datasets to use for benchmarking.

- **Type:** List of strings
- **Default:** `["scifact"]`
- **Options:** Any dataset from the [available BEIR Datasets](https://github.com/beir-cellar/beir?tab=readme-ov-file#beers-available-datasets)
- **Note:** When using custom datasets (via `--custom-datasets-urls`), this flag provides the names for those datasets

**Example:**
```bash
# Single dataset
--dataset-names scifact

# Multiple datasets
--dataset-names scifact scidocs nq
```

#### `--embedding-models`
**Description:** Specifies which embedding models to benchmark against each other.

- **Type:** List of strings
- **Default:** `["granite-embedding-30m", "granite-embedding-125m"]`
- **Requirement:** Embedding models must be defined in the `run.yaml` file
- **Purpose:** Compare performance across different embedding models

**Example:**
```bash
# Default models
--embedding-models granite-embedding-30m granite-embedding-125m

# Custom model selection
--embedding-models all-MiniLM-L6-v2 granite-embedding-125m
```

#### `--custom-datasets-urls`
**Description:** Provides URLs for custom BEIR-compatible datasets instead of using the pre-made BEIR datasets.

- **Type:** List of strings
- **Default:** `[]` (empty - uses standard BEIR datasets)
- **Requirement:** Must be used together with the `--dataset-names` flag
- **Format:** URLs pointing to BEIR-compatible dataset archives

**Example:**
```bash
# Using custom datasets
--dataset-names my-custom-dataset --custom-datasets-urls https://example.com/my-dataset.zip
```

#### `--batch-size`
**Description:** Controls the number of documents processed in each batch when inserting documents into the vector database.

- **Type:** Integer
- **Default:** `150`
- **Purpose:** Manages memory usage and processing efficiency when inserting large document collections
- **Note:** Larger batch sizes may be faster but use more memory; smaller batch sizes use less memory but may be slower

**Example:**
```bash
# Using a smaller batch size for memory-constrained environments
--batch-size 50

# Using a larger batch size for faster processing
--batch-size 300
```
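
To make the batching behavior concrete, here is a rough sketch of inserting documents in fixed-size batches; the `insert_documents` callback is hypothetical, standing in for the script's actual vector-database insertion call.
```python
# Rough sketch of batched insertion. `insert_documents` is a hypothetical
# stand-in for the script's actual vector-database call.
from typing import Callable

def insert_in_batches(
    documents: list[dict],
    insert_documents: Callable[[list[dict]], None],
    batch_size: int = 150,  # matches the script's default --batch-size
) -> None:
    # Process fixed-size slices so memory usage stays bounded
    # regardless of the corpus size.
    for start in range(0, len(documents), batch_size):
        insert_documents(documents[start : start + batch_size])
```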

> [!NOTE]
> Your custom dataset must adhere to the following file structure and document standards. Below is a snippet of the file structure and example documents.

```text
dataset-name.zip/
├── qrels/
│   └── test.tsv       # Relevance judgments mapping query IDs to document IDs with relevance scores
├── corpus.jsonl       # Document collection with document IDs, titles, and text content
└── queries.jsonl      # Test queries with query IDs and question text for retrieval evaluation
```

**test.tsv**

| query-id | corpus-id | score |
|----------|-----------|-------|
| 0        | 0         | 1     |
| 1        | 1         | 1     |

**corpus.jsonl**
```json
{"_id": "0", "title": "Hook Lighthouse is located in Wexford, Ireland.", "metadata": {}}
{"_id": "1", "title": "The Captain of the Pequod is Captain Ahab.", "metadata": {}}
```

**queries.jsonl**
```json
{"_id": "0", "text": "Hook Lighthouse location", "metadata": {}}
{"_id": "1", "text": "Captain of the Pequod", "metadata": {}}
```
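
As a sanity check of the format, the following sketch builds a minimal archive matching the structure above; the two example records are taken from the snippets, while the paths and archive name are illustrative.
```python
# Sketch: write a minimal BEIR-compatible dataset and zip it into
# dataset-name.zip with the directory layout shown above.
import json
import shutil
from pathlib import Path

root = Path("dataset-name")
(root / "qrels").mkdir(parents=True, exist_ok=True)

corpus = [
    {"_id": "0", "title": "Hook Lighthouse is located in Wexford, Ireland.", "metadata": {}},
    {"_id": "1", "title": "The Captain of the Pequod is Captain Ahab.", "metadata": {}},
]
queries = [
    {"_id": "0", "text": "Hook Lighthouse location", "metadata": {}},
    {"_id": "1", "text": "Captain of the Pequod", "metadata": {}},
]

with (root / "corpus.jsonl").open("w") as f:
    f.writelines(json.dumps(doc) + "\n" for doc in corpus)
with (root / "queries.jsonl").open("w") as f:
    f.writelines(json.dumps(query) + "\n" for query in queries)

# qrels/test.tsv: tab-separated header, then one relevance judgment per row.
with (root / "qrels" / "test.tsv").open("w") as f:
    f.write("query-id\tcorpus-id\tscore\n")
    f.write("0\t0\t1\n")
    f.write("1\t1\t1\n")

# Produces dataset-name.zip containing the dataset-name/ directory.
shutil.make_archive("dataset-name", "zip", root_dir=".", base_dir=str(root))
```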

### Usage Examples

**Basic benchmarking with default settings:**
```bash
ENABLE_OLLAMA=ollama ENABLE_MILVUS=milvus OLLAMA_INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct" uv run python beir_benchmarks.py
```

**Basic benchmarking with a larger batch size:**
```bash
ENABLE_OLLAMA=ollama ENABLE_MILVUS=milvus OLLAMA_INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct" uv run python beir_benchmarks.py --batch-size 300
```

**Benchmark multiple datasets:**
```bash
ENABLE_OLLAMA=ollama ENABLE_MILVUS=milvus OLLAMA_INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct" uv run python beir_benchmarks.py \
    --dataset-names scifact scidocs
```

**Compare specific embedding models:**
```bash
ENABLE_OLLAMA=ollama ENABLE_MILVUS=milvus OLLAMA_INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct" uv run python beir_benchmarks.py \
    --embedding-models granite-embedding-30m all-MiniLM-L6-v2
```

**Use custom datasets:**
```bash
ENABLE_OLLAMA=ollama ENABLE_MILVUS=milvus OLLAMA_INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct" uv run python beir_benchmarks.py \
    --dataset-names my-dataset \
    --custom-datasets-urls https://example.com/my-beir-dataset.zip
```

### Sample Output
Below is sample output for the following datasets:
* scifact
* fiqa
* arguana

> [!NOTE]
> Benchmarking with these datasets will take a considerable amount of time; fiqa and arguana are much larger and take longer to ingest.

```text
Scoring
All results in <path-to>/rag/benchmarks/embedding-models-with-beir/results

scifact map_cut_10
    granite-embedding-125m : 0.6879
    granite-embedding-30m : 0.6578
    p_value : 0.0150
    p_value<0.05 so this result is statistically significant
    You can conclude that granite-embedding-125m generation is better on data of this sort

scifact map_cut_5
    granite-embedding-125m : 0.6767
    granite-embedding-30m : 0.6481
    p_value : 0.0294
    p_value<0.05 so this result is statistically significant
    You can conclude that granite-embedding-125m generation is better on data of this sort

scifact ndcg_cut_10
    granite-embedding-125m : 0.7350
    granite-embedding-30m : 0.7018
    p_value : 0.0026
    p_value<0.05 so this result is statistically significant
    You can conclude that granite-embedding-125m generation is better on data of this sort

scifact ndcg_cut_5
    granite-embedding-125m : 0.7119
    granite-embedding-30m : 0.6833
    p_value : 0.0256
    p_value<0.05 so this result is statistically significant
    You can conclude that granite-embedding-125m generation is better on data of this sort

fiqa map_cut_10
    granite-embedding-125m : 0.3581
    granite-embedding-30m : 0.2829
    p_value : 0.0002
    p_value<0.05 so this result is statistically significant
    You can conclude that granite-embedding-125m generation is better on data of this sort

fiqa map_cut_5
    granite-embedding-125m : 0.3395
    granite-embedding-30m : 0.2664
    p_value : 0.0002
    p_value<0.05 so this result is statistically significant
    You can conclude that granite-embedding-125m generation is better on data of this sort

fiqa ndcg_cut_10
    granite-embedding-125m : 0.4411
    granite-embedding-30m : 0.3599
    p_value : 0.0002
    p_value<0.05 so this result is statistically significant
    You can conclude that granite-embedding-125m generation is better on data of this sort

fiqa ndcg_cut_5
    granite-embedding-125m : 0.4176
    granite-embedding-30m : 0.3355
    p_value : 0.0002
    p_value<0.05 so this result is statistically significant
    You can conclude that granite-embedding-125m generation is better on data of this sort

arguana map_cut_10
    granite-embedding-125m : 0.2927
    granite-embedding-30m : 0.2821
    p_value : 0.0104
    p_value<0.05 so this result is statistically significant
    You can conclude that granite-embedding-125m generation is better on data of this sort

arguana map_cut_5
    granite-embedding-125m : 0.2707
    granite-embedding-30m : 0.2594
    p_value : 0.0216
    p_value<0.05 so this result is statistically significant
    You can conclude that granite-embedding-125m generation is better on data of this sort

arguana ndcg_cut_10
    granite-embedding-125m : 0.4251
    granite-embedding-30m : 0.4124
    p_value : 0.0044
    p_value<0.05 so this result is statistically significant
    You can conclude that granite-embedding-125m generation is better on data of this sort

arguana ndcg_cut_5
    granite-embedding-125m : 0.3718
    granite-embedding-30m : 0.3582
    p_value : 0.0292
    p_value<0.05 so this result is statistically significant
    You can conclude that granite-embedding-125m generation is better on data of this sort
```
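
Each comparison above reports a p-value computed over per-query metric scores for the two models. The script's exact statistical test is not shown here, but the sketch below illustrates how a paired test over per-query scores produces a p-value of this kind; the score lists are hypothetical.
```python
# Illustrative only: a paired significance test over per-query metric scores.
# The benchmark script's actual test may differ.
from scipy import stats

# Hypothetical per-query ndcg_cut_10 scores for two models on the same queries.
scores_125m = [0.81, 0.74, 0.69, 0.92, 0.55]
scores_30m = [0.78, 0.70, 0.66, 0.90, 0.51]

# Paired t-test: are the per-query differences significantly non-zero?
result = stats.ttest_rel(scores_125m, scores_30m)
print(f"p_value : {result.pvalue:.4f}")  # < 0.05 => statistically significant
```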