# Benchmarking Information Retrieval with and without Llama Stack

## Purpose
The purpose of this script is to benchmark retrieval accuracy with and without Llama Stack, using BEIR (or BEIR-like) datasets.
If everything is working as intended, it will show no difference between running with and without Llama Stack.
In contrast, if there is a defect in either Llama Stack or the alternative implementation, this benchmark should be able to reveal it.
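
At a high level, the comparison amounts to loading a BEIR dataset, running the same queries through two retrievers (one that goes through Llama Stack, one that queries the vector database directly), and scoring both result sets against the same relevance judgments. The sketch below illustrates that flow with BEIR's loader and evaluator; it is a simplified outline under assumptions, not the actual `benchmark_beir_ls_vs_no_ls.py`, which also runs per-question significance tests and reports pytrec_eval-style metric names such as `map_cut_10`. The helpers `load_beir_dataset` and `compare` are hypothetical.

```python
# Illustrative outline of the comparison; NOT the actual benchmark script.
# `load_beir_dataset` and `compare` are hypothetical helpers, and each result
# set is assumed to be in BEIR's usual {query_id: {doc_id: score}} format.
from beir import util
from beir.datasets.data_loader import GenericDataLoader
from beir.retrieval.evaluation import EvaluateRetrieval


def load_beir_dataset(name: str, out_dir: str = "datasets"):
    """Download a standard BEIR dataset and return (corpus, queries, qrels)."""
    url = f"https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/{name}.zip"
    data_path = util.download_and_unzip(url, out_dir)
    return GenericDataLoader(data_folder=data_path).load(split="test")


def compare(qrels, results_with_llama_stack, results_without_llama_stack, k=10):
    """Score both result sets against the same relevance judgments."""
    for name, results in (("LlamaStackRAGRetriever", results_with_llama_stack),
                          ("MilvusRetriever", results_without_llama_stack)):
        ndcg, _map, recall, precision = EvaluateRetrieval.evaluate(qrels, results, [k])
        print(name, ndcg[f"NDCG@{k}"], _map[f"MAP@{k}"])
```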

## Setup Instructions
Ollama is required to run this example with the provided [run.yaml](run.yaml) file.

Set up a virtual environment:
``` bash
uv venv .venv --python 3.12 --seed
source .venv/bin/activate
```

Install the script's dependencies:
``` bash
uv pip install -r requirements.txt
```

Prepare your environment by running:
``` bash
llama stack build --template ollama --image-type venv
```

### About the run.yaml file
* The run.yaml file uses inline Milvus as its vector database.
* There are three default embedding models: `ibm-granite/granite-embedding-125m-english`, `ibm-granite/granite-embedding-30m-english`, and `all-MiniLM-L6-v2`.

To add your own embedding models, update the `models` section of the `run.yaml` file.
``` yaml
# Example adding <example-model> embedding model with sentence-transformers as its provider
models:
- metadata: {}
  model_id: ${env.INFERENCE_MODEL}
  provider_id: ollama
  model_type: llm
- metadata:
    embedding_dimension: 768
  model_id: granite-embedding-125m
  provider_id: sentence-transformers
  provider_model_id: ibm-granite/granite-embedding-125m-english
  model_type: embedding
- metadata:
    embedding_dimension: <int>
  model_id: <example-model>
  provider_id: sentence-transformers
  provider_model_id: sentence-transformers/<example-model>
  model_type: embedding
```
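
The `embedding_dimension` value must match the output size of the model you are adding. If you are unsure of the right number, one way to look it up (assuming `sentence-transformers` is installed in your environment) is:

```python
# Quick way to look up the embedding dimension for a sentence-transformers model.
# Replace the model name with the one you are adding to run.yaml.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("ibm-granite/granite-embedding-125m-english")
print(model.get_sentence_embedding_dimension())  # 768 for this model, per the run.yaml above
```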

## Running Instructions

### Basic Usage
To run the script with default settings:

```bash
# Update INFERENCE_MODEL to your preferred model served by Ollama
INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct" uv run python benchmark_beir_ls_vs_no_ls.py
```

### Command-Line Options

#### `--dataset-names`
**Description:** Specifies which BEIR datasets to use for benchmarking.

- **Type:** List of strings
- **Default:** `["scifact"]`
- **Options:** Any dataset from the [available BEIR Datasets](https://github.com/beir-cellar/beir?tab=readme-ov-file#beers-available-datasets)
- **Note:** When using custom datasets (via `--custom-datasets-urls`), this flag provides the names for those datasets

**Example:**
```bash
# Single dataset
--dataset-names scifact

# Multiple datasets
--dataset-names scifact scidocs nq
```

#### `--custom-datasets-urls`
**Description:** Provides URLs for custom BEIR-compatible datasets instead of using the pre-made BEIR datasets.

- **Type:** List of strings
- **Default:** `[]` (empty: the standard BEIR datasets are used)
- **Requirement:** Must be used together with the `--dataset-names` flag
- **Format:** URLs pointing to BEIR-compatible dataset archives

**Example:**
```bash
# Using custom datasets
--dataset-names my-custom-dataset --custom-datasets-urls https://example.com/my-dataset.zip
```

#### `--batch-size`
**Description:** Controls the number of documents processed in each batch when injecting documents into the vector database.

- **Type:** Integer
- **Default:** `150`
- **Purpose:** Manages memory usage and processing efficiency when inserting large document collections
- **Note:** Larger batch sizes may be faster but use more memory; smaller batch sizes use less memory but may be slower (see the batching sketch after the example below)

**Example:**
```bash
# Using smaller batch size for memory-constrained environments
--batch-size 50

# Using larger batch size for faster processing
--batch-size 300
```
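
To make the batching behavior concrete, the following sketch shows the general pattern of splitting a corpus into fixed-size batches before insertion. It is illustrative only; `batched`, `insert_all`, and `insert_batch` are hypothetical helpers, not functions from the benchmark script, and `insert_batch` stands in for whatever call actually writes documents into the vector database.

```python
from typing import Iterable, Iterator, List


def batched(items: Iterable[dict], batch_size: int) -> Iterator[List[dict]]:
    """Yield successive fixed-size batches from an iterable of documents."""
    batch: List[dict] = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # final, possibly smaller, batch
        yield batch


def insert_batch(batch: List[dict]) -> None:
    """Placeholder for the actual vector-database insertion call."""
    print(f"inserting {len(batch)} documents")


def insert_all(corpus: Iterable[dict], batch_size: int = 150) -> None:
    """Insert a document collection batch by batch to bound memory usage."""
    for batch in batched(corpus, batch_size):
        insert_batch(batch)  # one insertion call per `--batch-size` documents
```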

> [!NOTE]
> Your custom dataset must adhere to the following file structure and document standards. Below is a snippet of the file structure and example documents; a sketch for packaging such a dataset follows the examples.

``` text
dataset-name.zip/
├── qrels/
│   └── test.tsv      # Relevance judgments mapping query IDs to document IDs with relevance scores
├── corpus.jsonl      # Document collection with document IDs, titles, and text content
└── queries.jsonl     # Test queries with query IDs and question text for retrieval evaluation
```

**test.tsv**

| query-id | corpus-id | score |
|----------|-----------|-------|
| 0        | 0         | 1     |
| 1        | 1         | 1     |

**corpus.jsonl**
``` json
{"_id": "0", "title": "Hook Lighthouse is located in Wexford, Ireland.", "metadata": {}}
{"_id": "1", "title": "The Captain of the Pequod is Captain Ahab.", "metadata": {}}
```

**queries.jsonl**
``` json
{"_id": "0", "text": "Hook Lighthouse location", "metadata": {}}
{"_id": "1", "text": "Captain of the Pequod", "metadata": {}}
```
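
As a convenience, the snippet below shows one way to generate and package a tiny dataset in this layout using only the Python standard library. It is a hedged example: the file contents mirror the samples above, while the directory and archive names (`dataset-name`, `dataset-name.zip`) are arbitrary placeholders rather than anything required by the script.

```python
import csv
import json
import zipfile
from pathlib import Path

root = Path("dataset-name")              # arbitrary working directory name
(root / "qrels").mkdir(parents=True, exist_ok=True)

corpus = [
    {"_id": "0", "title": "Hook Lighthouse is located in Wexford, Ireland.", "metadata": {}},
    {"_id": "1", "title": "The Captain of the Pequod is Captain Ahab.", "metadata": {}},
]
queries = [
    {"_id": "0", "text": "Hook Lighthouse location", "metadata": {}},
    {"_id": "1", "text": "Captain of the Pequod", "metadata": {}},
]
qrels = [("0", "0", 1), ("1", "1", 1)]   # (query-id, corpus-id, score)

# Write the JSONL files, one JSON object per line.
with open(root / "corpus.jsonl", "w") as f:
    f.writelines(json.dumps(doc) + "\n" for doc in corpus)
with open(root / "queries.jsonl", "w") as f:
    f.writelines(json.dumps(q) + "\n" for q in queries)

# Write the tab-separated relevance judgments with a header row.
with open(root / "qrels" / "test.tsv", "w", newline="") as f:
    writer = csv.writer(f, delimiter="\t")
    writer.writerow(["query-id", "corpus-id", "score"])
    writer.writerows(qrels)

# Package everything into a zip archive that can be served for --custom-datasets-urls.
with zipfile.ZipFile("dataset-name.zip", "w") as zf:
    for path in root.rglob("*"):
        if path.is_file():
            zf.write(path, arcname=str(path.relative_to(root)))
```
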
### Usage Examples

**Basic benchmarking with default settings:**
```bash
INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct" uv run python benchmark_beir_ls_vs_no_ls.py
```

**Basic benchmarking with larger batch size:**
```bash
INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct" uv run python benchmark_beir_ls_vs_no_ls.py --batch-size 300
```

**Benchmark multiple datasets:**
```bash
INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct" uv run python benchmark_beir_ls_vs_no_ls.py \
  --dataset-names scifact scidocs
```

**Use custom datasets:**
```bash
INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct" uv run python benchmark_beir_ls_vs_no_ls.py \
  --dataset-names my-dataset \
  --custom-datasets-urls https://example.com/my-beir-dataset.zip
```

### Sample Output
Below is sample output for the following datasets:
* scifact
* fiqa
* arguana

> [!NOTE]
> Benchmarking with these datasets will take a considerable amount of time, since fiqa and arguana are much larger and take longer to ingest.

```
scifact map_cut_10
LlamaStackRAGRetriever : 0.6879
MilvusRetriever : 0.6879
p_value : 1.0000
p_value>=0.05 so this result is NOT statistically significant.
You can conclude that there is not enough data to tell which is higher.
Note that this data includes 300 questions which typically produces a margin of error of around +/-5.8%.
So the two are probably roughly within that margin of error or so.


scifact ndcg_cut_10
LlamaStackRAGRetriever : 0.7350
MilvusRetriever : 0.7350
p_value : 1.0000
p_value>=0.05 so this result is NOT statistically significant.
You can conclude that there is not enough data to tell which is higher.
Note that this data includes 300 questions which typically produces a margin of error of around +/-5.8%.
So the two are probably roughly within that margin of error or so.


scifact time
LlamaStackRAGRetriever : 0.0225
MilvusRetriever : 0.0173
p_value : 0.0002
p_value<0.05 so this result is statistically significant
You can conclude that LlamaStackRAGRetriever generation has a higher score on data of this sort.


fiqa map_cut_10
LlamaStackRAGRetriever : 0.3581
MilvusRetriever : 0.3581
p_value : 1.0000
p_value>=0.05 so this result is NOT statistically significant.
You can conclude that there is not enough data to tell which is higher.
Note that this data includes 648 questions which typically produces a margin of error of around +/-3.9%.
So the two are probably roughly within that margin of error or so.


fiqa ndcg_cut_10
LlamaStackRAGRetriever : 0.4411
MilvusRetriever : 0.4411
p_value : 1.0000
p_value>=0.05 so this result is NOT statistically significant.
You can conclude that there is not enough data to tell which is higher.
Note that this data includes 648 questions which typically produces a margin of error of around +/-3.9%.
So the two are probably roughly within that margin of error or so.


fiqa time
LlamaStackRAGRetriever : 0.0332
MilvusRetriever : 0.0303
p_value : 0.0002
p_value<0.05 so this result is statistically significant
You can conclude that LlamaStackRAGRetriever generation has a higher score on data of this sort.


/Users/bmurdock/beir/beir-venv-310/lib/python3.10/site-packages/scipy/stats/_resampling.py:1492: RuntimeWarning: overflow encountered in scalar power
n_max = factorial(n_obs_sample)**n_samples
arguana map_cut_10
LlamaStackRAGRetriever : 0.2927
MilvusRetriever : 0.2927
p_value : 1.0000
p_value>=0.05 so this result is NOT statistically significant.
You can conclude that there is not enough data to tell which is higher.
Note that this data includes 1406 questions which typically produces a margin of error of around +/-2.7%.
So the two are probably roughly within that margin of error or so.


arguana ndcg_cut_10
LlamaStackRAGRetriever : 0.4251
MilvusRetriever : 0.4251
p_value : 1.0000
p_value>=0.05 so this result is NOT statistically significant.
You can conclude that there is not enough data to tell which is higher.
Note that this data includes 1406 questions which typically produces a margin of error of around +/-2.7%.
So the two are probably roughly within that margin of error or so.


arguana time
LlamaStackRAGRetriever : 0.0303
MilvusRetriever : 0.0239
p_value : 0.0002
p_value<0.05 so this result is statistically significant
You can conclude that LlamaStackRAGRetriever generation has a higher score on data of this sort.

No significant difference was detected. This is expected because LlamaStackRAGRetriever and MilvusRetriever are intended to do the same thing. This result is consistent with everything working as intended.
```
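
For context on the statistics in this output: the quoted margins of error are consistent with the common 1/sqrt(n) rule of thumb for n questions, and the p-values appear to come from a paired significance test over per-question scores (the scipy `_resampling.py` warning above is emitted by its permutation-test machinery). The snippet below reproduces the margin-of-error figures and shows a paired permutation test of the kind likely used; it is a hedged sketch, not the benchmark's exact code, and `scores_a`/`scores_b` are synthetic stand-ins for per-question metric values.

```python
import math

import numpy as np
from scipy.stats import permutation_test

# Margin-of-error rule of thumb: roughly 1/sqrt(n) for n questions.
for n in (300, 648, 1406):
    print(n, f"+/-{100 / math.sqrt(n):.1f}%")    # ~5.8%, ~3.9%, ~2.7%, matching the output above

# Paired permutation test over per-question scores (synthetic illustration).
rng = np.random.default_rng(0)
scores_a = rng.uniform(0, 1, size=300)           # e.g. per-question ndcg_cut_10 for one retriever
scores_b = scores_a + rng.normal(0, 0.01, 300)   # the other retriever's scores on the same questions


def mean_difference(x, y):
    return np.mean(x) - np.mean(y)


res = permutation_test((scores_a, scores_b), mean_difference,
                       permutation_type="samples", n_resamples=9999)
print("p_value :", res.pvalue)
```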
