# Benchmarking embedding models on BEIR datasets with Llama Stack

## Purpose

The purpose of this script is to compare retrieval accuracy between embedding models using standardized information retrieval benchmarks from the [BEIR](https://github.com/beir-cellar/beir) framework.

Set up a virtual environment:

``` bash
uv venv .venv --python 3.12 --seed
source .venv/bin/activate
```

Install the script's dependencies:

``` bash
uv pip install -r requirements.txt
```

Prepare your environment by running:

``` bash
llama stack build --template ollama --image-type venv
```

### About the run.yaml file

* The run.yaml file uses inline Milvus as its vector database.
* There are two default embedding models: `ibm-granite/granite-embedding-125m-english` and `ibm-granite/granite-embedding-30m-english`.

To add your own embedding models, update the `models` section of the `run.yaml` file.

``` yaml
# Example adding <example-model> embedding model with sentence-transformers as its provider
models:
- metadata: {}
  model_id: ${env.INFERENCE_MODEL}
  provider_id: ollama
  model_type: llm
- metadata:
    embedding_dimension: 768
  model_id: granite-embedding-125m
  provider_id: sentence-transformers
  provider_model_id: ibm-granite/granite-embedding-125m-english
  model_type: embedding
- metadata:
    embedding_dimension: <int>
  model_id: <example-model>
  provider_id: sentence-transformers
  provider_model_id: sentence-transformers/<example-model>
  model_type: embedding
```

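If you run the stack as a server, you can sanity-check that your `run.yaml` edits took effect by listing the registered models. The snippet below is a minimal sketch using the `llama-stack-client` Python package, not part of the benchmark script; it assumes the server is reachable at `localhost:8321` (adjust to your setup).

```python
# Minimal sketch (assumes a running Llama Stack server at localhost:8321):
# list registered models to confirm the run.yaml edits took effect.
from llama_stack_client import LlamaStackClient

client = LlamaStackClient(base_url="http://localhost:8321")
for model in client.models.list():
    print(model.identifier, model.model_type)
```
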
## Running Instructions

### Basic Usage

To run the script with default settings:

```bash
# Update INFERENCE_MODEL to your preferred model served by Ollama
INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct" uv run python benchmark_beir_embedding_models.py
```

### Command-Line Options

#### `--dataset-names`
**Description:** Specifies which BEIR datasets to use for benchmarking.

- **Type:** List of strings
- **Default:** `["scifact"]`
- **Options:** Any dataset from the [available BEIR Datasets](https://github.com/beir-cellar/beir?tab=readme-ov-file#beers-available-datasets)
- **Note:** When using custom datasets (via `--custom-datasets-urls`), this flag provides names for those datasets

**Example:**
```bash
# Single dataset
--dataset-names scifact

# Multiple datasets
--dataset-names scifact scidocs nq
```

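For reference, each standard BEIR dataset name resolves to a downloadable archive. The sketch below shows one way to fetch a dataset by name with the `beir` package; it is illustrative only, and the benchmark script may manage downloads differently.

```python
# Illustrative sketch: fetch a BEIR dataset by name with the beir package.
# The benchmark script may handle downloads internally.
from beir import util

dataset = "scifact"
url = f"https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/{dataset}.zip"
data_path = util.download_and_unzip(url, "datasets")  # downloads and extracts under ./datasets
print(f"{dataset} unpacked at {data_path}")
```
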
#### `--embedding-models`
**Description:** Specifies which embedding models to benchmark against each other.

- **Type:** List of strings
- **Default:** `["granite-embedding-30m", "granite-embedding-125m"]`
- **Requirement:** Embedding models must be defined in the `run.yaml` file
- **Purpose:** Compare performance across different embedding models

**Example:**
```bash
# Default models
--embedding-models granite-embedding-30m granite-embedding-125m

# Custom model selection
--embedding-models all-MiniLM-L6-v2 granite-embedding-125m
```

#### `--custom-datasets-urls`
**Description:** Provides URLs for custom BEIR-compatible datasets instead of using the pre-made BEIR datasets.

- **Type:** List of strings
- **Default:** `[]` (empty - uses standard BEIR datasets)
- **Requirement:** Must be used together with the `--dataset-names` flag
- **Format:** URLs pointing to BEIR-compatible dataset archives

**Example:**
```bash
# Using custom datasets
--dataset-names my-custom-dataset --custom-datasets-urls https://example.com/my-dataset.zip
```

#### `--batch-size`
**Description:** Controls the number of documents processed in each batch when injecting documents into the vector database.

- **Type:** Integer
- **Default:** `150`
- **Purpose:** Manages memory usage and processing efficiency when inserting large document collections
- **Note:** Larger batches are generally faster but use more memory; smaller batches reduce memory usage at the cost of speed

**Example:**
```bash
# Using smaller batch size for memory-constrained environments
--batch-size 50

# Using larger batch size for faster processing
--batch-size 300
```

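The sketch below illustrates the batching idea behind `--batch-size`: the corpus is split into fixed-size slices that are ingested one at a time, bounding peak memory. The ingestion call is a hypothetical stand-in for the script's actual insert step.

```python
# Illustrative sketch of --batch-size: split the corpus into fixed-size
# slices and ingest them one batch at a time to bound memory usage.
def batched(items, batch_size):
    """Yield successive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

documents = [f"doc-{i}" for i in range(1000)]  # stand-in corpus
for batch in batched(documents, batch_size=150):
    # insert_batch(batch) would be the script's real ingestion call
    print(f"inserting {len(batch)} documents")
```
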
> [!NOTE]
> Your custom dataset must adhere to the following file structure and document standards. Below is a snippet of the file structure and example documents.

``` text
dataset-name.zip/
├── qrels/
│   └── test.tsv      # Relevance judgments mapping query IDs to document IDs with relevance scores
├── corpus.jsonl      # Document collection with document IDs, titles, and text content
└── queries.jsonl     # Test queries with query IDs and question text for retrieval evaluation
```

**test.tsv**

| query-id | corpus-id | score |
|----------|-----------|-------|
| 0        | 0         | 1     |
| 1        | 1         | 1     |

**corpus.jsonl**
``` json
{"_id": "0", "title": "Hook Lighthouse is located in Wexford, Ireland.", "metadata": {}}
{"_id": "1", "title": "The Captain of the Pequod is Captain Ahab.", "metadata": {}}
```

**queries.jsonl**
``` json
{"_id": "0", "text": "Hook Lighthouse location", "metadata": {}}
{"_id": "1", "text": "Captain of the Pequod", "metadata": {}}
```

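As a worked example, the snippet below packages the documents shown above into an archive with exactly that layout, using only the Python standard library. The `dataset-name` path is a placeholder; substitute your own dataset name.

```python
# Package the example documents above into a BEIR-compatible archive
# (dataset-name.zip containing qrels/test.tsv, corpus.jsonl, queries.jsonl).
import json
import os
import zipfile

corpus = [
    {"_id": "0", "title": "Hook Lighthouse is located in Wexford, Ireland.", "metadata": {}},
    {"_id": "1", "title": "The Captain of the Pequod is Captain Ahab.", "metadata": {}},
]
queries = [
    {"_id": "0", "text": "Hook Lighthouse location", "metadata": {}},
    {"_id": "1", "text": "Captain of the Pequod", "metadata": {}},
]
qrels = [("0", "0", 1), ("1", "1", 1)]  # (query-id, corpus-id, score)

os.makedirs("dataset-name/qrels", exist_ok=True)
with open("dataset-name/corpus.jsonl", "w") as f:
    f.writelines(json.dumps(doc) + "\n" for doc in corpus)
with open("dataset-name/queries.jsonl", "w") as f:
    f.writelines(json.dumps(query) + "\n" for query in queries)
with open("dataset-name/qrels/test.tsv", "w") as f:
    f.write("query-id\tcorpus-id\tscore\n")
    f.writelines(f"{q}\t{c}\t{s}\n" for q, c, s in qrels)

with zipfile.ZipFile("dataset-name.zip", "w") as zf:
    for root, _, files in os.walk("dataset-name"):
        for name in files:
            zf.write(os.path.join(root, name))
```
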
### Usage Examples

**Basic benchmarking with default settings:**
```bash
INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct" uv run python benchmark_beir_embedding_models.py
```

**Basic benchmarking with larger batch size:**
```bash
INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct" uv run python benchmark_beir_embedding_models.py --batch-size 300
```

**Benchmark multiple datasets:**
```bash
INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct" uv run python benchmark_beir_embedding_models.py \
  --dataset-names scifact scidocs
```

**Compare specific embedding models:**
```bash
INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct" uv run python benchmark_beir_embedding_models.py \
  --embedding-models granite-embedding-30m all-MiniLM-L6-v2
```

**Use custom datasets:**
```bash
INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct" uv run python benchmark_beir_embedding_models.py \
  --dataset-names my-dataset \
  --custom-datasets-urls https://example.com/my-beir-dataset.zip
```

### Sample Output
Below is sample output for the following datasets:
* scifact
* fiqa
* arguana

> [!NOTE]
> Benchmarking with these datasets will take a considerable amount of time, given that fiqa and arguana are much larger and take longer to ingest.

```
Scoring
All results in <path-to>/rag/benchmarks/embedding-models-with-beir/results

scifact map_cut_10
granite-embedding-125m : 0.6879
granite-embedding-30m : 0.6578
p_value : 0.0150
p_value<0.05 so this result is statistically significant
You can conclude that granite-embedding-125m generation is better on data of this sort

scifact map_cut_5
granite-embedding-125m : 0.6767
granite-embedding-30m : 0.6481
p_value : 0.0294
p_value<0.05 so this result is statistically significant
You can conclude that granite-embedding-125m generation is better on data of this sort

scifact ndcg_cut_10
granite-embedding-125m : 0.7350
granite-embedding-30m : 0.7018
p_value : 0.0026
p_value<0.05 so this result is statistically significant
You can conclude that granite-embedding-125m generation is better on data of this sort

scifact ndcg_cut_5
granite-embedding-125m : 0.7119
granite-embedding-30m : 0.6833
p_value : 0.0256
p_value<0.05 so this result is statistically significant
You can conclude that granite-embedding-125m generation is better on data of this sort

fiqa map_cut_10
granite-embedding-125m : 0.3581
granite-embedding-30m : 0.2829
p_value : 0.0002
p_value<0.05 so this result is statistically significant
You can conclude that granite-embedding-125m generation is better on data of this sort

fiqa map_cut_5
granite-embedding-125m : 0.3395
granite-embedding-30m : 0.2664
p_value : 0.0002
p_value<0.05 so this result is statistically significant
You can conclude that granite-embedding-125m generation is better on data of this sort

fiqa ndcg_cut_10
granite-embedding-125m : 0.4411
granite-embedding-30m : 0.3599
p_value : 0.0002
p_value<0.05 so this result is statistically significant
You can conclude that granite-embedding-125m generation is better on data of this sort

fiqa ndcg_cut_5
granite-embedding-125m : 0.4176
granite-embedding-30m : 0.3355
p_value : 0.0002
p_value<0.05 so this result is statistically significant
You can conclude that granite-embedding-125m generation is better on data of this sort

arguana map_cut_10
granite-embedding-125m : 0.2927
granite-embedding-30m : 0.2821
p_value : 0.0104
p_value<0.05 so this result is statistically significant
You can conclude that granite-embedding-125m generation is better on data of this sort

arguana map_cut_5
granite-embedding-125m : 0.2707
granite-embedding-30m : 0.2594
p_value : 0.0216
p_value<0.05 so this result is statistically significant
You can conclude that granite-embedding-125m generation is better on data of this sort

arguana ndcg_cut_10
granite-embedding-125m : 0.4251
granite-embedding-30m : 0.4124
p_value : 0.0044
p_value<0.05 so this result is statistically significant
You can conclude that granite-embedding-125m generation is better on data of this sort

arguana ndcg_cut_5
granite-embedding-125m : 0.3718
granite-embedding-30m : 0.3582
p_value : 0.0292
p_value<0.05 so this result is statistically significant
You can conclude that granite-embedding-125m generation is better on data of this sort
```
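
The reported `p_value` compares the two models' per-query scores for a given metric. One common way to obtain such a value is a paired significance test over per-query results; the sketch below uses scipy's paired t-test on made-up scores, and the script's exact statistical method may differ.

```python
# Hedged sketch of a paired significance test over per-query metric scores.
# The scores here are made up; the benchmark's exact test may differ.
from scipy import stats

scores_125m = [0.81, 0.70, 0.95, 0.62, 0.77]  # hypothetical per-query ndcg_cut_10
scores_30m = [0.74, 0.66, 0.93, 0.55, 0.71]   # same queries, second model

t_stat, p_value = stats.ttest_rel(scores_125m, scores_30m)
verdict = "statistically significant" if p_value < 0.05 else "not statistically significant"
print(f"p_value : {p_value:.4f} ({verdict})")
```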
