Commit 1464393 (parent 0dcee50)

docs: add documentation for new modular benchmark script

2 files changed: +383 -0

benchmarks/beir-benchmarks/README.md (90 additions, 0 deletions)
# Benchmarking Llama Stack with the BEIR Framework

## Purpose
The purpose of this script is to provide a variety of benchmarks users can run with Llama Stack using standardized information retrieval benchmarks from the [BEIR](https://github.com/beir-cellar/beir) framework.

## Available Benchmarks
Currently there is only one benchmark available:

1. [Benchmarking embedding models with BEIR Datasets and Llama Stack](benchmarking_embedding_models.md)

## Prerequisites
* [Python](https://www.python.org/downloads/) v3.12 or later
* [uv](https://github.com/astral-sh/uv?tab=readme-ov-file#installation) installed
* [ollama](https://ollama.com/) set up on your system and running the `meta-llama/Llama-3.2-3B-Instruct` model

> [!NOTE]
> Ollama can be replaced with an [inference provider](https://llama-stack.readthedocs.io/en/latest/providers/inference/index.html) of your choice.

## Installation

Initialize a virtual environment:
```bash
uv venv .venv --python 3.12 --seed
source .venv/bin/activate
```

Install the required dependencies:

```bash
uv pip install -r requirements.txt
```

Prepare your environment by running:
```bash
# The run.yaml file is based on the starter template: https://github.com/meta-llama/llama-stack/tree/main/llama_stack/templates/starter
# We run a build here to install all of the dependencies for the starter template
llama stack build --template starter --image-type venv
```

## Quick Start

1. **Run a basic benchmark**:
```bash
# Runs the embedding models benchmark by default
ENABLE_OLLAMA=ollama ENABLE_MILVUS=milvus OLLAMA_INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct" uv run python beir_benchmarks.py --dataset-names scifact --embedding-models granite-embedding-125m
```

2. **View results**: Results will be saved in the `results/` directory with detailed evaluation metrics.

## File Structure

```
beir-benchmarks/
├── README.md                         # This file
├── beir_benchmarks.py                # Main benchmarking script for multiple benchmarks
├── benchmarking_embedding_models.md  # Detailed documentation and guide
├── requirements.txt                  # Python dependencies
└── run.yaml                          # Llama Stack configuration
```

## Usage Examples

### Basic Usage
```bash
# Run benchmark with default settings
ENABLE_OLLAMA=ollama ENABLE_MILVUS=milvus OLLAMA_INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct" uv run python beir_benchmarks.py

# Specify custom dataset and model
ENABLE_OLLAMA=ollama ENABLE_MILVUS=milvus OLLAMA_INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct" uv run python beir_benchmarks.py --dataset-names scifact --embedding-models granite-embedding-125m

# Run with custom batch size
ENABLE_OLLAMA=ollama ENABLE_MILVUS=milvus OLLAMA_INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct" uv run python beir_benchmarks.py --batch-size 100
```

### Advanced Configuration
For advanced configuration options and detailed setup instructions, see [benchmarking_embedding_models.md](benchmarking_embedding_models.md).

## Results

Benchmark results are automatically saved in the `results/` directory in TREC evaluation format (see the parsing sketch after the list below). Each result file contains:
- Query-document relevance scores
- Ranking information for retrieval evaluation
- Timestamp and model information in the filename
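
To inspect these files programmatically, here is a minimal parsing sketch assuming the standard six-column TREC run format (`query_id Q0 doc_id rank score run_tag`); the exact filenames and any extra columns produced by `beir_benchmarks.py` may differ.

```python
from collections import defaultdict
from pathlib import Path


def load_trec_run(path: str) -> dict[str, dict[str, float]]:
    """Parse a TREC-format run file into {query_id: {doc_id: score}}."""
    run: dict[str, dict[str, float]] = defaultdict(dict)
    for line in Path(path).read_text().splitlines():
        if not line.strip():
            continue
        query_id, _q0, doc_id, _rank, score, _run_tag = line.split()
        run[query_id][doc_id] = float(score)
    return run


if __name__ == "__main__":
    # Hypothetical filename; check the results/ directory for the actual names.
    run = load_trec_run("results/scifact-granite-embedding-125m.txt")
    print(f"{len(run)} queries scored")
```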

## Support

For detailed technical documentation, refer to:
- [benchmarking_embedding_models.md](benchmarking_embedding_models.md) - Comprehensive guide for the embedding models benchmark
- [BEIR Documentation](https://github.com/beir-cellar/beir) - Official BEIR framework docs
- [Llama Stack Documentation](https://llama-stack.readthedocs.io/) - Llama Stack API reference

benchmarks/beir-benchmarks/benchmarking_embedding_models.md (293 additions, 0 deletions)
# Benchmarking embedding models with BEIR Datasets and Llama Stack

## Purpose
The purpose of this script is to compare retrieval accuracy between embedding models using standardized information retrieval benchmarks from the [BEIR](https://github.com/beir-cellar/beir) framework.

## Setup
For these examples we use Ollama to serve the model; it can easily be swapped for an inference provider of your choice.

Initialize a virtual environment:
```bash
uv venv .venv --python 3.12 --seed
source .venv/bin/activate
```

Install the script's dependencies:
```bash
uv pip install -r requirements.txt
```

Prepare your environment by running:
```bash
# The run.yaml file is based on the starter template: https://github.com/meta-llama/llama-stack/tree/main/llama_stack/templates/starter
# We run a build here to install all of the dependencies for the starter template
llama stack build --template starter --image-type venv
```

## Running Instructions

### Basic Usage
To run the script with default settings:

```bash
# Update OLLAMA_INFERENCE_MODEL to your preferred model, or set the equivalent variable for other inference providers
ENABLE_OLLAMA=ollama ENABLE_MILVUS=milvus OLLAMA_INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct" uv run python beir_benchmarks.py
```

## Supported Embedding Models

Default supported embedding models:
- `granite-embedding-30m`: IBM Granite 30M parameter embedding model
- `granite-embedding-125m`: IBM Granite 125M parameter embedding model

It is possible to add more embedding models using the [Llama Stack Python Client](https://github.com/llamastack/llama-stack-client-python).

### Adding additional embedding models
Below is an example of how you can add more embedding models to the models list; a Python client sketch follows the CLI example.
```bash
# First run the llama stack server via the run file
ENABLE_OLLAMA=ollama ENABLE_MILVUS=milvus OLLAMA_INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct" uv run llama stack run run.yaml
```
```bash
# Adding the all-MiniLM-L6-v2 model via the llama-stack-client
llama-stack-client models register all-MiniLM-L6-v2 --provider-id sentence-transformers --provider-model-id all-minilm:latest --metadata '{"embedding_dimension": 384}' --model-type embedding
```
> [!NOTE]
> Shut down the Llama Stack server before running the benchmark.
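
If you prefer to register the model from Python while the server is still running, a minimal sketch using the `llama-stack-client` library could look like the following; the base URL (port 8321 is assumed here) must match however you started the server.

```python
from llama_stack_client import LlamaStackClient

# Assumes the Llama Stack server started above is reachable at this URL.
client = LlamaStackClient(base_url="http://localhost:8321")

# Mirrors the `llama-stack-client models register` CLI call shown above.
client.models.register(
    model_id="all-MiniLM-L6-v2",
    provider_id="sentence-transformers",
    provider_model_id="all-minilm:latest",
    metadata={"embedding_dimension": 384},
    model_type="embedding",
)

# List the registered models to confirm the new entry appears.
print([model.identifier for model in client.models.list()])
```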

### Command-Line Options

#### `--dataset-names`
**Description:** Specifies which BEIR datasets to use for benchmarking.

- **Type:** List of strings
- **Default:** `["scifact"]`
- **Options:** Any dataset from the [available BEIR Datasets](https://github.com/beir-cellar/beir?tab=readme-ov-file#beers-available-datasets)
- **Note:** When using custom datasets (via `--custom-datasets-urls`), this flag provides names for those datasets

**Example:**
```bash
# Single dataset
--dataset-names scifact

# Multiple datasets
--dataset-names scifact scidocs nq
```

#### `--embedding-models`
**Description:** Specifies which embedding models to benchmark against each other.

- **Type:** List of strings
- **Default:** `["granite-embedding-30m", "granite-embedding-125m"]`
- **Requirement:** Embedding models must be defined in the `run.yaml` file
- **Purpose:** Compare performance across different embedding models

**Example:**
```bash
# Default models
--embedding-models granite-embedding-30m granite-embedding-125m

# Custom model selection
--embedding-models all-MiniLM-L6-v2 granite-embedding-125m
```

#### `--custom-datasets-urls`
**Description:** Provides URLs for custom BEIR-compatible datasets instead of using the pre-made BEIR datasets.

- **Type:** List of strings
- **Default:** `[]` (empty - uses standard BEIR datasets)
- **Requirement:** Must be used together with the `--dataset-names` flag
- **Format:** URLs pointing to BEIR-compatible dataset archives

**Example:**
```bash
# Using custom datasets
--dataset-names my-custom-dataset --custom-datasets-urls https://example.com/my-dataset.zip
```

#### `--batch-size`
**Description:** Controls the number of documents processed in each batch when injecting documents into the vector database.

- **Type:** Integer
- **Default:** `150`
- **Purpose:** Manages memory usage and processing efficiency when inserting large document collections
- **Note:** Larger batch sizes may be faster but use more memory; smaller batch sizes use less memory but may be slower (see the batching sketch after the example below)

**Example:**
```bash
# Using smaller batch size for memory-constrained environments
--batch-size 50

# Using larger batch size for faster processing
--batch-size 300
```
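
For intuition, this is a minimal sketch of the kind of batching loop the flag controls; `insert_batch` is a hypothetical stand-in for however `beir_benchmarks.py` actually writes chunks into the vector database.

```python
from typing import Callable, Iterator, Sequence


def batched(items: Sequence, batch_size: int = 150) -> Iterator[Sequence]:
    """Yield successive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def ingest(documents: Sequence[dict],
           insert_batch: Callable[[Sequence[dict]], None],
           batch_size: int = 150) -> None:
    """Insert documents in batches; `insert_batch` stands in for the real vector DB call."""
    for batch in batched(documents, batch_size):
        insert_batch(batch)


if __name__ == "__main__":
    docs = [{"_id": str(i), "text": f"doc {i}"} for i in range(400)]
    ingest(docs, insert_batch=lambda b: print(f"inserted {len(b)} documents"), batch_size=150)
```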

> [!NOTE]
> When using `--custom-datasets-urls`, your custom dataset must adhere to the following file structure and document standards. Below is a snippet of the file structure and example documents, followed by a short loading sketch.

```text
dataset-name.zip/
├── qrels/
│   └── test.tsv     # Relevance judgments mapping query IDs to document IDs with relevance scores
├── corpus.jsonl     # Document collection with document IDs, titles, and text content
└── queries.jsonl    # Test queries with query IDs and question text for retrieval evaluation
```

**test.tsv**

| query-id | corpus-id | score |
|----------|-----------|-------|
| 0        | 0         | 1     |
| 1        | 1         | 1     |

**corpus.jsonl**
```json
{"_id": "0", "title": "Hook Lighthouse is located in Wexford, Ireland.", "metadata": {}}
{"_id": "1", "title": "The Captain of the Pequod is Captain Ahab.", "metadata": {}}
```

**queries.jsonl**
```json
{"_id": "0", "text": "Hook Lighthouse location", "metadata": {}}
{"_id": "1", "text": "Captain of the Pequod", "metadata": {}}
```
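
To sanity-check that an archive matches this layout, it can be downloaded and loaded with BEIR's own `GenericDataLoader`; the URL below is a placeholder for your own dataset archive.

```python
from beir import util
from beir.datasets.data_loader import GenericDataLoader

# Placeholder URL: point this at your own BEIR-compatible archive.
url = "https://example.com/my-beir-dataset.zip"
data_path = util.download_and_unzip(url, "datasets")

# corpus: {doc_id: {"title": ..., "text": ...}}, queries: {query_id: text},
# qrels: {query_id: {doc_id: relevance}}
corpus, queries, qrels = GenericDataLoader(data_folder=data_path).load(split="test")
print(f"{len(corpus)} documents, {len(queries)} queries, {len(qrels)} judged queries")
```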

### Usage Examples

**Basic benchmarking with default settings:**
```bash
ENABLE_OLLAMA=ollama ENABLE_MILVUS=milvus OLLAMA_INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct" uv run python beir_benchmarks.py
```

**Basic benchmarking with larger batch size:**
```bash
ENABLE_OLLAMA=ollama ENABLE_MILVUS=milvus OLLAMA_INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct" uv run python beir_benchmarks.py --batch-size 300
```

**Benchmark multiple datasets:**
```bash
ENABLE_OLLAMA=ollama ENABLE_MILVUS=milvus OLLAMA_INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct" uv run python beir_benchmarks.py \
  --dataset-names scifact scidocs
```

**Compare specific embedding models:**
```bash
ENABLE_OLLAMA=ollama ENABLE_MILVUS=milvus OLLAMA_INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct" uv run python beir_benchmarks.py \
  --embedding-models granite-embedding-30m all-MiniLM-L6-v2
```

**Use custom datasets:**
```bash
ENABLE_OLLAMA=ollama ENABLE_MILVUS=milvus OLLAMA_INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct" uv run python beir_benchmarks.py \
  --dataset-names my-dataset \
  --custom-datasets-urls https://example.com/my-beir-dataset.zip
```

### Sample Output
Below is sample output for the following datasets:
* scifact
* fiqa
* arguana

> [!NOTE]
> Benchmarking with these datasets will take a considerable amount of time, since fiqa and arguana are much larger than scifact and take longer to ingest.

```
Scoring
All results in <path-to>/rag/benchmarks/embedding-models-with-beir/results

scifact map_cut_10
granite-embedding-125m : 0.6879
granite-embedding-30m : 0.6578
p_value : 0.0150
p_value<0.05 so this result is statistically significant
You can conclude that granite-embedding-125m generation is better on data of this sort


scifact map_cut_5
granite-embedding-125m : 0.6767
granite-embedding-30m : 0.6481
p_value : 0.0294
p_value<0.05 so this result is statistically significant
You can conclude that granite-embedding-125m generation is better on data of this sort


scifact ndcg_cut_10
granite-embedding-125m : 0.7350
granite-embedding-30m : 0.7018
p_value : 0.0026
p_value<0.05 so this result is statistically significant
You can conclude that granite-embedding-125m generation is better on data of this sort


scifact ndcg_cut_5
granite-embedding-125m : 0.7119
granite-embedding-30m : 0.6833
p_value : 0.0256
p_value<0.05 so this result is statistically significant
You can conclude that granite-embedding-125m generation is better on data of this sort


fiqa map_cut_10
granite-embedding-125m : 0.3581
granite-embedding-30m : 0.2829
p_value : 0.0002
p_value<0.05 so this result is statistically significant
You can conclude that granite-embedding-125m generation is better on data of this sort


fiqa map_cut_5
granite-embedding-125m : 0.3395
granite-embedding-30m : 0.2664
p_value : 0.0002
p_value<0.05 so this result is statistically significant
You can conclude that granite-embedding-125m generation is better on data of this sort


fiqa ndcg_cut_10
granite-embedding-125m : 0.4411
granite-embedding-30m : 0.3599
p_value : 0.0002
p_value<0.05 so this result is statistically significant
You can conclude that granite-embedding-125m generation is better on data of this sort


fiqa ndcg_cut_5
granite-embedding-125m : 0.4176
granite-embedding-30m : 0.3355
p_value : 0.0002
p_value<0.05 so this result is statistically significant
You can conclude that granite-embedding-125m generation is better on data of this sort


arguana map_cut_10
granite-embedding-125m : 0.2927
granite-embedding-30m : 0.2821
p_value : 0.0104
p_value<0.05 so this result is statistically significant
You can conclude that granite-embedding-125m generation is better on data of this sort


arguana map_cut_5
granite-embedding-125m : 0.2707
granite-embedding-30m : 0.2594
p_value : 0.0216
p_value<0.05 so this result is statistically significant
You can conclude that granite-embedding-125m generation is better on data of this sort


arguana ndcg_cut_10
granite-embedding-125m : 0.4251
granite-embedding-30m : 0.4124
p_value : 0.0044
p_value<0.05 so this result is statistically significant
You can conclude that granite-embedding-125m generation is better on data of this sort


arguana ndcg_cut_5
granite-embedding-125m : 0.3718
granite-embedding-30m : 0.3582
p_value : 0.0292
p_value<0.05 so this result is statistically significant
You can conclude that granite-embedding-125m generation is better on data of this sort
```
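
The p-values above come from a paired comparison of per-query metric values between the two models. As a rough illustration (not necessarily the exact test `beir_benchmarks.py` uses), such a paired test over per-query scores might look like the following, using SciPy's Wilcoxon signed-rank test.

```python
from scipy.stats import wilcoxon

# Per-query metric values (e.g. ndcg_cut_10) for two models, aligned by query id.
# The values below are made-up placeholders purely for illustration.
scores_125m = [0.81, 0.74, 0.69, 0.90, 0.55, 0.77, 0.62, 0.88]
scores_30m = [0.78, 0.70, 0.71, 0.85, 0.50, 0.72, 0.60, 0.84]

statistic, p_value = wilcoxon(scores_125m, scores_30m)
print(f"p_value : {p_value:.4f}")
if p_value < 0.05:
    print("p_value<0.05 so this result is statistically significant")
```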
