# Benchmarking Llama Stack with the BEIR Framework

## Purpose
This script provides a variety of benchmarks that users can run with Llama Stack, built on standardized information retrieval benchmarks from the [BEIR](https://github.com/beir-cellar/beir) framework.

## Available Benchmarks
Currently there is only one benchmark available:
1. [Benchmarking embedding models with BEIR Datasets and Llama Stack](benchmarking_embedding_models.md)

## Prerequisites
* [Python](https://www.python.org/downloads/) v3.12 or later
* [uv](https://github.com/astral-sh/uv?tab=readme-ov-file#installation) installed
* [ollama](https://ollama.com/) set up on your system and running the `meta-llama/Llama-3.2-3B-Instruct` model

> [!NOTE]
> Ollama can be replaced with an [inference provider](https://llama-stack.readthedocs.io/en/latest/providers/inference/index.html) of your choice

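If the model is not already running, one way to satisfy the Ollama prerequisite is sketched below. This assumes the `llama3.2:3b` tag in the Ollama registry corresponds to `meta-llama/Llama-3.2-3B-Instruct`; confirm against your local `ollama list` output.

```bash
# Pull (if needed) and load the model, keeping it resident for an hour
# NOTE: the llama3.2:3b tag is an assumption; verify it matches the model you want
ollama run llama3.2:3b --keepalive 60m
```
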
## Installation

Initialize a virtual environment:
```bash
uv venv .venv --python 3.12 --seed
source .venv/bin/activate
```

Install the required dependencies:

```bash
uv pip install -r requirements.txt
```

Prepare your environment by running:
```bash
# run.yaml is based on the starter template:
# https://github.com/meta-llama/llama-stack/tree/main/llama_stack/templates/starter
# This build installs all of the starter template's dependencies
llama stack build --template starter --image-type venv
```
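
Depending on how `beir_benchmarks.py` connects to Llama Stack, you may also want a standalone server running. A minimal sketch, assuming you want to serve the provided `run.yaml` rather than rely on an embedded library client:

```bash
# Start a Llama Stack server from the bundled configuration (sketch; skip this
# step if the benchmark script runs the stack as a library client)
llama stack run run.yaml --image-type venv
```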

## Quick Start

1. **Run a basic benchmark**:
```bash
# Runs the embedding models benchmark by default
ENABLE_OLLAMA=ollama ENABLE_MILVUS=milvus OLLAMA_INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct" uv run python beir_benchmarks.py --dataset-names scifact --embedding-models granite-embedding-125m
```

2. **View results**: Results will be saved in the `results/` directory with detailed evaluation metrics.
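
The three environment variables select the Ollama inference provider and the Milvus vector store from the starter template. If you prefer, export them once per shell session instead of prefixing every command:

```bash
# Same values used throughout this README; after exporting these once,
# the inline prefixes in the examples below can be dropped
export ENABLE_OLLAMA=ollama
export ENABLE_MILVUS=milvus
export OLLAMA_INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct"
```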

## File Structure

```
beir-benchmarks/
├── README.md                        # This file
├── beir_benchmarks.py               # Main benchmarking script for multiple benchmarks
├── benchmarking_embedding_models.md # Detailed documentation and guide
├── requirements.txt                 # Python dependencies
└── run.yaml                         # Llama Stack configuration
```

## Usage Examples

### Basic Usage
```bash
# Run the benchmark with default settings
ENABLE_OLLAMA=ollama ENABLE_MILVUS=milvus OLLAMA_INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct" uv run python beir_benchmarks.py

# Specify a custom dataset and model
ENABLE_OLLAMA=ollama ENABLE_MILVUS=milvus OLLAMA_INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct" uv run python beir_benchmarks.py --dataset-names scifact --embedding-models granite-embedding-125m

# Run with a custom batch size
ENABLE_OLLAMA=ollama ENABLE_MILVUS=milvus OLLAMA_INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct" uv run python beir_benchmarks.py --batch-size 100
```
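
To see the full set of supported flags and their defaults, the script can presumably be invoked with `--help` (an assumption that holds if it uses a standard argument parser such as argparse):

```bash
uv run python beir_benchmarks.py --help
```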

### Advanced Configuration
For advanced configuration options and detailed setup instructions, see [benchmarking_embedding_models.md](benchmarking_embedding_models.md).

## Results

Benchmark results are automatically saved in the `results/` directory in TREC evaluation format. Each result file contains:
- Query-document relevance scores
- Ranking information for retrieval evaluation
- Timestamp and model information in the filename
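
For reference, the standard TREC run format stores one retrieved document per line as `<query_id> Q0 <doc_id> <rank> <score> <run_name>`. The values below are illustrative, not taken from an actual run:

```
1 Q0 doc123 1 0.9876 granite-embedding-125m
1 Q0 doc456 2 0.9541 granite-embedding-125m
2 Q0 doc789 1 0.9102 granite-embedding-125m
```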

## Support

For detailed technical documentation, refer to:
- [benchmarking_embedding_models.md](benchmarking_embedding_models.md) - Comprehensive guide to the embedding models benchmark
- [BEIR Documentation](https://github.com/beir-cellar/beir) - Official BEIR framework docs
- [Llama Stack Documentation](https://llama-stack.readthedocs.io/) - Llama Stack API reference