# Benchmarking embedding models with BEIR datasets and Llama Stack

## Purpose
The purpose of this script is to compare retrieval accuracy between embedding models using standardized information retrieval benchmarks from the [BEIR](https://github.com/beir-cellar/beir) framework.

Set up a virtual environment:
``` bash
uv venv .venv --python 3.12 --seed
source .venv/bin/activate
```

Install the script's dependencies:
``` bash
uv pip install -r requirements.txt
```

Prepare your environment by running:
``` bash
llama stack build --template ollama --image-type venv
```

### About the run.yaml file
* The run.yaml file uses inline Milvus as its vector database.
* There are two default embedding models: `ibm-granite/granite-embedding-125m-english` and `ibm-granite/granite-embedding-30m-english`.

To add your own embedding models, update the `models` section of the `run.yaml` file.
``` yaml
# Example adding <example-model> embedding model with sentence-transformers as its provider
models:
- metadata: {}
  model_id: ${env.INFERENCE_MODEL}
  provider_id: ollama
  model_type: llm
- metadata:
    embedding_dimension: 768
  model_id: granite-embedding-125m
  provider_id: sentence-transformers
  provider_model_id: ibm-granite/granite-embedding-125m-english
  model_type: embedding
- metadata:
    embedding_dimension: <int>
  model_id: <example-model>
  provider_id: sentence-transformers
  provider_model_id: sentence-transformers/<example-model>
  model_type: embedding
```

## Running Instructions

### Basic Usage
To run the script with default settings:

```bash
# Update INFERENCE_MODEL to your preferred model served by Ollama
INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct" uv run python benchmark_beir_embedding_models.py
```

### Command-Line Options

#### `--dataset-names`
**Description:** Specifies which BEIR datasets to use for benchmarking.

- **Type:** List of strings
- **Default:** `["scifact"]`
- **Options:** Any dataset from the [available BEIR datasets](https://github.com/beir-cellar/beir?tab=readme-ov-file#beers-available-datasets)
- **Note:** When using custom datasets (via `--custom-datasets-urls`), this flag provides names for those datasets

**Example:**
```bash
# Single dataset
--dataset-names scifact

# Multiple datasets
--dataset-names scifact scidocs nq
```

#### `--embedding-models`
**Description:** Specifies which embedding models to benchmark against each other.

- **Type:** List of strings
- **Default:** `["granite-embedding-30m", "granite-embedding-125m"]`
- **Requirement:** Embedding models must be defined in the `run.yaml` file
- **Purpose:** Compare performance across different embedding models

**Example:**
```bash
# Default models
--embedding-models granite-embedding-30m granite-embedding-125m

# Custom model selection
--embedding-models all-MiniLM-L6-v2 granite-embedding-125m
```

#### `--custom-datasets-urls`
**Description:** Provides URLs for custom BEIR-compatible datasets instead of the pre-made BEIR datasets.

- **Type:** List of strings
- **Default:** `[]` (empty; the standard BEIR datasets are used)
- **Requirement:** Must be used together with the `--dataset-names` flag
- **Format:** URLs pointing to BEIR-compatible dataset archives

**Example:**
```bash
# Using custom datasets
--dataset-names my-custom-dataset --custom-datasets-urls https://example.com/my-dataset.zip
```

#### `--batch-size`
**Description:** Controls the number of documents processed in each batch when inserting documents into the vector database.

- **Type:** Integer
- **Default:** `150`
- **Purpose:** Manages memory usage and processing efficiency when inserting large document collections
- **Note:** Larger batch sizes are typically faster but use more memory; smaller batch sizes use less memory but take longer

**Example:**
```bash
# Using a smaller batch size for memory-constrained environments
--batch-size 50

# Using a larger batch size for faster processing
--batch-size 300
```
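The batching described above amounts to slicing the corpus into fixed-size chunks before insertion. A minimal sketch of that pattern (the `batched` and `ingest` helpers and the `insert_documents` callback are illustrative, not the script's actual API):

```python
def batched(items, batch_size=150):
    """Yield successive fixed-size slices of a list."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def ingest(corpus, insert_documents, batch_size=150):
    """Push the corpus into the vector database one batch at a time.

    insert_documents stands in for whatever client call performs the
    actual insertion; only one batch is held in flight at a time.
    """
    for batch in batched(corpus, batch_size):
        insert_documents(batch)
```

With `--batch-size 50`, a 10,000-document corpus would be inserted as 200 calls of 50 documents each rather than one large request.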

> [!NOTE]
> Your custom dataset must adhere to the following file structure and document format. Below is a snippet of the file structure and example documents.

``` text
dataset-name.zip/
├── qrels/
│   └── test.tsv      # Relevance judgments mapping query IDs to document IDs with relevance scores
├── corpus.jsonl      # Document collection with document IDs, titles, and text content
└── queries.jsonl     # Test queries with query IDs and question text for retrieval evaluation
```

**test.tsv**

| query-id | corpus-id | score |
|----------|-----------|-------|
| 0        | 0         | 1     |
| 1        | 1         | 1     |

**corpus.jsonl**
``` json
{"_id": "0", "title": "Hook Lighthouse is located in Wexford, Ireland.", "metadata": {}}
{"_id": "1", "title": "The Captain of the Pequod is Captain Ahab.", "metadata": {}}
```

**queries.jsonl**
``` json
{"_id": "0", "text": "Hook Lighthouse location", "metadata": {}}
{"_id": "1", "text": "Captain of the Pequod", "metadata": {}}
```
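A dataset in this layout can be assembled with nothing but the Python standard library. A sketch, assuming the file contents shown in the examples above (the `write_beir_dataset` helper is illustrative, not part of the benchmark script):

```python
import json
import os
import zipfile

def write_beir_dataset(root, corpus, queries, qrels):
    """Write corpus.jsonl, queries.jsonl, and qrels/test.tsv under root,
    then zip the directory into <root>.zip."""
    os.makedirs(os.path.join(root, "qrels"), exist_ok=True)
    with open(os.path.join(root, "corpus.jsonl"), "w") as f:
        for doc in corpus:
            f.write(json.dumps(doc) + "\n")
    with open(os.path.join(root, "queries.jsonl"), "w") as f:
        for query in queries:
            f.write(json.dumps(query) + "\n")
    with open(os.path.join(root, "qrels", "test.tsv"), "w") as f:
        f.write("query-id\tcorpus-id\tscore\n")
        for query_id, corpus_id, score in qrels:
            f.write(f"{query_id}\t{corpus_id}\t{score}\n")
    archive = root + ".zip"
    with zipfile.ZipFile(archive, "w") as zf:
        # Store paths relative to root's parent so the zip unpacks
        # into a single dataset-name/ directory.
        for dirpath, _, filenames in os.walk(root):
            for name in filenames:
                path = os.path.join(dirpath, name)
                zf.write(path, os.path.relpath(path, os.path.dirname(root)))
    return archive
```

The resulting archive can then be hosted and passed to the script via `--custom-datasets-urls`, with its base name given to `--dataset-names`.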

### Usage Examples

**Basic benchmarking with default settings:**
```bash
INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct" uv run python benchmark_beir_embedding_models.py
```

**Basic benchmarking with a larger batch size:**
```bash
INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct" uv run python benchmark_beir_embedding_models.py --batch-size 300
```

**Benchmark multiple datasets:**
```bash
INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct" uv run python benchmark_beir_embedding_models.py \
  --dataset-names scifact scidocs
```

**Compare specific embedding models:**
```bash
INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct" uv run python benchmark_beir_embedding_models.py \
  --embedding-models granite-embedding-30m all-MiniLM-L6-v2
```

**Use custom datasets:**
```bash
INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct" uv run python benchmark_beir_embedding_models.py \
  --dataset-names my-dataset \
  --custom-datasets-urls https://example.com/my-beir-dataset.zip
```

### Sample Output
Below is sample output for the following datasets:
* scifact
* fiqa
* arguana

> [!NOTE]
> Benchmarking with these datasets takes a considerable amount of time, since fiqa and arguana are much larger than scifact and take longer to ingest.

```
Scoring
All results in <path-to>/rag/benchmarks/embedding-models-with-beir/results

scifact map_cut_10
  granite-embedding-125m : 0.6879
  granite-embedding-30m : 0.6578
  p_value : 0.0150
  p_value<0.05 so this result is statistically significant
  You can conclude that granite-embedding-125m generation is better on data of this sort


scifact map_cut_5
  granite-embedding-125m : 0.6767
  granite-embedding-30m : 0.6481
  p_value : 0.0294
  p_value<0.05 so this result is statistically significant
  You can conclude that granite-embedding-125m generation is better on data of this sort


scifact ndcg_cut_10
  granite-embedding-125m : 0.7350
  granite-embedding-30m : 0.7018
  p_value : 0.0026
  p_value<0.05 so this result is statistically significant
  You can conclude that granite-embedding-125m generation is better on data of this sort


scifact ndcg_cut_5
  granite-embedding-125m : 0.7119
  granite-embedding-30m : 0.6833
  p_value : 0.0256
  p_value<0.05 so this result is statistically significant
  You can conclude that granite-embedding-125m generation is better on data of this sort


fiqa map_cut_10
  granite-embedding-125m : 0.3581
  granite-embedding-30m : 0.2829
  p_value : 0.0002
  p_value<0.05 so this result is statistically significant
  You can conclude that granite-embedding-125m generation is better on data of this sort


fiqa map_cut_5
  granite-embedding-125m : 0.3395
  granite-embedding-30m : 0.2664
  p_value : 0.0002
  p_value<0.05 so this result is statistically significant
  You can conclude that granite-embedding-125m generation is better on data of this sort


fiqa ndcg_cut_10
  granite-embedding-125m : 0.4411
  granite-embedding-30m : 0.3599
  p_value : 0.0002
  p_value<0.05 so this result is statistically significant
  You can conclude that granite-embedding-125m generation is better on data of this sort


fiqa ndcg_cut_5
  granite-embedding-125m : 0.4176
  granite-embedding-30m : 0.3355
  p_value : 0.0002
  p_value<0.05 so this result is statistically significant
  You can conclude that granite-embedding-125m generation is better on data of this sort


arguana map_cut_10
  granite-embedding-125m : 0.2927
  granite-embedding-30m : 0.2821
  p_value : 0.0104
  p_value<0.05 so this result is statistically significant
  You can conclude that granite-embedding-125m generation is better on data of this sort


arguana map_cut_5
  granite-embedding-125m : 0.2707
  granite-embedding-30m : 0.2594
  p_value : 0.0216
  p_value<0.05 so this result is statistically significant
  You can conclude that granite-embedding-125m generation is better on data of this sort


arguana ndcg_cut_10
  granite-embedding-125m : 0.4251
  granite-embedding-30m : 0.4124
  p_value : 0.0044
  p_value<0.05 so this result is statistically significant
  You can conclude that granite-embedding-125m generation is better on data of this sort


arguana ndcg_cut_5
  granite-embedding-125m : 0.3718
  granite-embedding-30m : 0.3582
  p_value : 0.0292
  p_value<0.05 so this result is statistically significant
  You can conclude that granite-embedding-125m generation is better on data of this sort
```
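The script's exact statistical test is not shown in this README. One common way to obtain a p_value like those above from paired per-query metric scores is a two-sided sign-flip permutation test, sketched here with the standard library as an illustration (this is an assumption about the approach, not the script's implementation):

```python
import random

def paired_permutation_p_value(scores_a, scores_b, n_permutations=10000, seed=0):
    """Two-sided sign-flip permutation test on paired per-query scores.

    Estimates the probability of observing a mean difference at least as
    large as the actual one if the two models were interchangeable: each
    per-query difference is randomly sign-flipped and the mean recomputed.
    """
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs) / len(diffs))
    hits = 0
    for _ in range(n_permutations):
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(sum(flipped) / len(flipped)) >= observed:
            hits += 1
    return hits / n_permutations
```

A small p_value (below 0.05, as in the output above) means the per-query gap between the two models is unlikely to be noise, which is why the script flags those comparisons as statistically significant.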