# Benchmarking Information Retrieval with and without Llama Stack

## Purpose
This script benchmarks retrieval accuracy with and without Llama Stack using BEIR (or BEIR-like) datasets.
If everything is working as intended, it will show no difference between the two.
Conversely, if there is a defect in either Llama Stack or the alternative implementation, this benchmark should expose it.

## Setup Instructions
Ollama is required to run this example with the provided [run.yaml](run.yaml) file.

Set up a virtual environment:
``` bash
uv venv .venv --python 3.12 --seed
source .venv/bin/activate
```

Install the script's dependencies:
``` bash
uv pip install -r requirements.txt
```

Prepare your environment by running:
``` bash
llama stack build --template ollama --image-type venv
```

### About the run.yaml file
* The run.yaml file uses inline Milvus as its vector database.
* There are three default embedding models: `ibm-granite/granite-embedding-125m-english`, `ibm-granite/granite-embedding-30m-english`, and `all-MiniLM-L6-v2`.

To add your own embedding models, update the `models` section of the `run.yaml` file.
``` yaml
# Example: adding the <example-model> embedding model with sentence-transformers as its provider
models:
- metadata: {}
  model_id: ${env.INFERENCE_MODEL}
  provider_id: ollama
  model_type: llm
- metadata:
    embedding_dimension: 768
  model_id: granite-embedding-125m
  provider_id: sentence-transformers
  provider_model_id: ibm-granite/granite-embedding-125m-english
  model_type: embedding
- metadata:
    embedding_dimension: <int>
  model_id: <example-model>
  provider_id: sentence-transformers
  provider_model_id: sentence-transformers/<example-model>
  model_type: embedding
```
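
A quick way to confirm that the embedding models in `run.yaml` are picked up is to list the registered models. Below is a minimal sketch, assuming a Llama Stack server is running at the default `http://localhost:8321` and the `llama-stack-client` package is installed; adjust the base URL to match your setup.

``` python
# Minimal sketch: list the registered models to confirm the embedding models
# from run.yaml are available. Assumes a Llama Stack server at the default
# port 8321; adjust base_url if your deployment differs.
from llama_stack_client import LlamaStackClient

client = LlamaStackClient(base_url="http://localhost:8321")
for model in client.models.list():
    print(model.identifier, model.model_type)
```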

## Running Instructions

### Basic Usage
To run the script with default settings:

```bash
# Update INFERENCE_MODEL to your preferred model served by Ollama
INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct" uv run python benchmark_beir_ls_vs_no_ls.py
```

### Command-Line Options

#### `--dataset-names`
**Description:** Specifies which BEIR datasets to use for benchmarking.

- **Type:** List of strings
- **Default:** `["scifact"]`
- **Options:** Any dataset from the [available BEIR datasets](https://github.com/beir-cellar/beir?tab=readme-ov-file#beers-available-datasets)
- **Note:** When using custom datasets (via `--custom-datasets-urls`), this flag provides names for those datasets

**Example:**
```bash
# Single dataset
--dataset-names scifact

# Multiple datasets
--dataset-names scifact scidocs nq
```
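
For standard names, the `beir` package typically downloads the dataset archive from BEIR's public hosting and unzips it locally. The sketch below illustrates that flow; the URL pattern and the `datasets` output directory are assumptions for illustration, not necessarily what this script does internally.

``` python
# Sketch: fetch a standard BEIR dataset by name with the beir package.
# The download URL below follows BEIR's public hosting convention; the
# benchmark script may resolve and cache datasets differently.
from beir import util

dataset_name = "scifact"
url = f"https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/{dataset_name}.zip"
data_path = util.download_and_unzip(url, "datasets")  # returns the extracted folder path
print(f"Dataset extracted to: {data_path}")
```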

#### `--custom-datasets-urls`
**Description:** Provides URLs for custom BEIR-compatible datasets instead of using the pre-made BEIR datasets.

- **Type:** List of strings
- **Default:** `[]` (empty - uses standard BEIR datasets)
- **Requirement:** Must be used together with the `--dataset-names` flag
- **Format:** URLs pointing to BEIR-compatible dataset archives

**Example:**
```bash
# Using custom datasets
--dataset-names my-custom-dataset --custom-datasets-urls https://example.com/my-dataset.zip
```

#### `--batch-size`
**Description:** Controls the number of documents processed in each batch when inserting documents into the vector database.

- **Type:** Integer
- **Default:** `150`
- **Purpose:** Manages memory usage and processing efficiency when inserting large document collections
- **Note:** Larger batch sizes may be faster but use more memory; smaller batch sizes use less memory but may be slower

**Example:**
```bash
# Using smaller batch size for memory-constrained environments
--batch-size 50

# Using larger batch size for faster processing
--batch-size 300
```
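
Conceptually, batching just splits the corpus into fixed-size chunks before each insert call. The sketch below is illustrative only; `insert_documents` is a hypothetical stand-in for whichever vector-database insertion the script actually performs.

``` python
# Sketch: insert documents in fixed-size batches to bound memory usage.
def insert_documents(batch: list[dict]) -> None:
    """Hypothetical stand-in for the real vector-database insert call."""
    print(f"Inserting {len(batch)} documents")

def insert_in_batches(documents: list[dict], batch_size: int = 150) -> None:
    # Walk the corpus in steps of batch_size so each insert stays small.
    for start in range(0, len(documents), batch_size):
        insert_documents(documents[start:start + batch_size])

insert_in_batches([{"_id": str(i), "text": f"doc {i}"} for i in range(400)], batch_size=150)
```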

> [!NOTE]
> Your custom dataset must adhere to the following file structure and document standards. Below is a snippet of the file structure and example documents.

``` text
dataset-name.zip/
├── qrels/
│   └── test.tsv      # Relevance judgments mapping query IDs to document IDs with relevance scores
├── corpus.jsonl      # Document collection with document IDs, titles, and text content
└── queries.jsonl     # Test queries with query IDs and question text for retrieval evaluation
```

**test.tsv**

| query-id | corpus-id | score |
|----------|-----------|-------|
| 0        | 0         | 1     |
| 1        | 1         | 1     |

**corpus.jsonl**
``` json
{"_id": "0", "title": "Hook Lighthouse is located in Wexford, Ireland.", "metadata": {}}
{"_id": "1", "title": "The Captain of the Pequod is Captain Ahab.", "metadata": {}}
```

**queries.jsonl**
``` json
{"_id": "0", "text": "Hook Lighthouse location", "metadata": {}}
{"_id": "1", "text": "Captain of the Pequod", "metadata": {}}
```
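
A dataset laid out this way can be loaded with BEIR's standard loader, which is a convenient way to verify a custom archive before benchmarking. A short sketch, assuming the archive has already been extracted to a local folder (the path below is illustrative):

``` python
# Sketch: verify that a custom BEIR-style dataset loads correctly.
# Assumes the archive has been extracted to ./datasets/my-custom-dataset.
from beir.datasets.data_loader import GenericDataLoader

corpus, queries, qrels = GenericDataLoader(
    data_folder="datasets/my-custom-dataset"
).load(split="test")

print(f"{len(corpus)} documents, {len(queries)} queries, {len(qrels)} judged queries")
```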

### Usage Examples

**Basic benchmarking with default settings:**
```bash
INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct" uv run python benchmark_beir_ls_vs_no_ls.py
```

**Basic benchmarking with larger batch size:**
```bash
INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct" uv run python benchmark_beir_ls_vs_no_ls.py --batch-size 300
```

**Benchmark multiple datasets:**
```bash
INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct" uv run python benchmark_beir_ls_vs_no_ls.py \
  --dataset-names scifact scidocs
```

**Use custom datasets:**
```bash
INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct" uv run python benchmark_beir_ls_vs_no_ls.py \
  --dataset-names my-dataset \
  --custom-datasets-urls https://example.com/my-beir-dataset.zip
```

### Sample Output
Below is sample output for the following datasets:
* scifact
* fiqa
* arguana

> [!NOTE]
> Benchmarking with these datasets will take a considerable amount of time, since fiqa and arguana are much larger and take longer to ingest.

```
scifact map_cut_10
  LlamaStackRAGRetriever : 0.6879
  MilvusRetriever : 0.6879
  p_value : 1.0000
  p_value>=0.05 so this result is NOT statistically significant.
  You can conclude that there is not enough data to tell which is higher.
  Note that this data includes 300 questions which typically produces a margin of error of around +/-5.8%.
  So the two are probably roughly within that margin of error or so.

scifact ndcg_cut_10
  LlamaStackRAGRetriever : 0.7350
  MilvusRetriever : 0.7350
  p_value : 1.0000
  p_value>=0.05 so this result is NOT statistically significant.
  You can conclude that there is not enough data to tell which is higher.
  Note that this data includes 300 questions which typically produces a margin of error of around +/-5.8%.
  So the two are probably roughly within that margin of error or so.

scifact time
  LlamaStackRAGRetriever : 0.0225
  MilvusRetriever : 0.0173
  p_value : 0.0002
  p_value<0.05 so this result is statistically significant
  You can conclude that LlamaStackRAGRetriever generation has a higher score on data of this sort.

fiqa map_cut_10
  LlamaStackRAGRetriever : 0.3581
  MilvusRetriever : 0.3581
  p_value : 1.0000
  p_value>=0.05 so this result is NOT statistically significant.
  You can conclude that there is not enough data to tell which is higher.
  Note that this data includes 648 questions which typically produces a margin of error of around +/-3.9%.
  So the two are probably roughly within that margin of error or so.

fiqa ndcg_cut_10
  LlamaStackRAGRetriever : 0.4411
  MilvusRetriever : 0.4411
  p_value : 1.0000
  p_value>=0.05 so this result is NOT statistically significant.
  You can conclude that there is not enough data to tell which is higher.
  Note that this data includes 648 questions which typically produces a margin of error of around +/-3.9%.
  So the two are probably roughly within that margin of error or so.

fiqa time
  LlamaStackRAGRetriever : 0.0332
  MilvusRetriever : 0.0303
  p_value : 0.0002
  p_value<0.05 so this result is statistically significant
  You can conclude that LlamaStackRAGRetriever generation has a higher score on data of this sort.

/Users/bmurdock/beir/beir-venv-310/lib/python3.10/site-packages/scipy/stats/_resampling.py:1492: RuntimeWarning: overflow encountered in scalar power
  n_max = factorial(n_obs_sample)**n_samples
arguana map_cut_10
  LlamaStackRAGRetriever : 0.2927
  MilvusRetriever : 0.2927
  p_value : 1.0000
  p_value>=0.05 so this result is NOT statistically significant.
  You can conclude that there is not enough data to tell which is higher.
  Note that this data includes 1406 questions which typically produces a margin of error of around +/-2.7%.
  So the two are probably roughly within that margin of error or so.

arguana ndcg_cut_10
  LlamaStackRAGRetriever : 0.4251
  MilvusRetriever : 0.4251
  p_value : 1.0000
  p_value>=0.05 so this result is NOT statistically significant.
  You can conclude that there is not enough data to tell which is higher.
  Note that this data includes 1406 questions which typically produces a margin of error of around +/-2.7%.
  So the two are probably roughly within that margin of error or so.

arguana time
  LlamaStackRAGRetriever : 0.0303
  MilvusRetriever : 0.0239
  p_value : 0.0002
  p_value<0.05 so this result is statistically significant
  You can conclude that LlamaStackRAGRetriever generation has a higher score on data of this sort.

No significant difference was detected. This is expected because LlamaStackRAGRetriever and MilvusRetriever are intended to do the same thing. This result is consistent with everything working as intended.
```
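
The p-values above appear to come from a paired significance test over per-query scores; the `RuntimeWarning` in the arguana run is emitted by SciPy's resampling machinery. The sketch below shows how such a paired permutation test can be computed with `scipy.stats.permutation_test`. The synthetic scores, the mean-difference statistic, and the resample count are illustrative assumptions, not the script's exact configuration.

``` python
# Sketch: paired permutation test on per-query metric scores from the two
# retrievers. The data, statistic, and n_resamples are illustrative; the
# benchmark script's exact configuration may differ.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
ls_scores = rng.uniform(0.5, 0.9, size=300)            # per-query scores with Llama Stack
milvus_scores = ls_scores + rng.normal(0, 0.01, 300)   # same queries without Llama Stack

def mean_difference(x, y, axis=-1):
    return np.mean(x, axis=axis) - np.mean(y, axis=axis)

result = stats.permutation_test(
    (ls_scores, milvus_scores),
    mean_difference,
    permutation_type="samples",  # paired: both retrievers answer the same queries
    vectorized=True,
    n_resamples=9999,
)
print(f"p_value: {result.pvalue:.4f}")
```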