Vector DB Benchmark for Music Semantic Search

This project is part of the Vector Database Benchmarking video: https://youtu.be/X0PwwfcGSHU

This repository benchmarks multiple vector databases for music semantic search, using a shared dataset and query set. It provides both a CLI benchmarking tool and a web UI for side-by-side DB comparison.

Features

Benchmarks ingest time, query latency, recall, and hit rate for top-k search
Supports Qdrant, Milvus, Weaviate, Pinecone, TopK, and SQLite (local or cloud, depending on the database)
Flexible embedding: Use sentence-transformers (default) or OpenAI embeddings
Heuristic relevance: Weak label matching on tags/genres decides which results count as relevant for the recall and hit-rate metrics
Rich CLI: Many flags for DB selection, concurrency, top-k sweep, teardown, etc.
Modern UI: FastAPI backend + static frontend for live DB comparison
Automated result plots: Generates summary charts and per-k metrics tables
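The recall and hit-rate metrics listed above can be sketched as follows. This is a minimal illustration, assuming a result counts as relevant when it shares at least one tag with the query's expected labels. The repository's exact matching rule is not shown here, and the function names and sample data are hypothetical.

```python
from typing import Iterable, Sequence, Set


def normalize(labels: Iterable[str]) -> Set[str]:
    """Lowercase and strip labels so that 'Rock ' and 'rock' compare equal."""
    return {label.strip().lower() for label in labels}


def is_relevant(item_tags: Iterable[str], expected: Iterable[str]) -> bool:
    """Weak label (assumed rule): relevant if the item shares any tag with the expected labels."""
    return bool(normalize(item_tags) & normalize(expected))


def hit_rate_at_k(ranked_tags: Sequence[Iterable[str]], expected: Iterable[str], k: int) -> float:
    """1.0 if any of the top-k results is relevant, otherwise 0.0."""
    return 1.0 if any(is_relevant(tags, expected) for tags in ranked_tags[:k]) else 0.0


def recall_at_k(ranked_tags: Sequence[Iterable[str]], expected: Iterable[str],
                k: int, total_relevant: int) -> float:
    """Share of all relevant items in the corpus that show up in the top k."""
    if total_relevant <= 0:
        return 0.0
    found = sum(1 for tags in ranked_tags[:k] if is_relevant(tags, expected))
    return found / total_relevant


# Tags of the results returned for one query, best match first (made-up data).
ranked = [["rock", "indie"], ["jazz"], ["Sad", "acoustic"]]
expected_labels = ["sad", "melancholy"]

print(hit_rate_at_k(ranked, expected_labels, k=2))
print(hit_rate_at_k(ranked, expected_labels, k=3))
print(recall_at_k(ranked, expected_labels, k=3, total_relevant=2))
```

Per query, the hit rate is either 0 or 1. Averaging it over the whole query set gives the fraction of queries with at least one relevant result in the top k.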

Supported Databases

Qdrant (local/cloud)
Milvus (local)
Weaviate (local/cloud)
Pinecone (local/cloud)
TopK (cloud)
SQLite

Dataset

Download the Muse Musical Sentiment dataset from Kaggle and save the CSV as data/muse.csv.

You can test with data/sample_data.csv for a dry run.

Quick Start

Install dependencies

python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt

Configure environment

Copy .env.sample to .env and fill in DB URLs/API keys as needed

Start local DBs (optional)

docker compose -f scripts/docker-compose.yml up -d

Generate embeddings

python embeddings/embed.py --csv data/muse.csv --out data/embeddings.parquet
# For OpenAI: add --use_openai [--model text-embedding-3-large]

Run the benchmark

python benchmark.py --csv data/muse.csv --embeddings data/embeddings.parquet --dbs qdrant milvus weaviate pinecone topk sqlite --topk 10 --repetitions 5
# See all CLI flags with: python benchmark.py --help

View results

Summary and per-k plots: results/
Metrics: results/metrics.json

CLI Usage

python benchmark.py --csv data/muse.csv --embeddings data/embeddings.parquet --dbs qdrant milvus weaviate pinecone topk sqlite --topk 10 --repetitions 5 [--teardown_after_benchmark]

Key flags:

--dbs: List of DBs to benchmark (qdrant, milvus, weaviate, pinecone, topk, sqlite)
--topk: Top-k for search (default: 10)
--topk_sweep: List of k values to sweep (e.g. 5 10 50)
--repetitions: Number of repetitions per query
--concurrency: Number of concurrent query workers
--teardown_after_benchmark: Delete DB/index after run
--query_model: Embedding model for queries
--queries: Path to YAML file with queries/expected labels
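For orientation, the flags above could be declared with argparse roughly as follows. This is only a sketch: apart from the documented --topk default of 10, the types and defaults here are assumptions, and benchmark.py may define them differently (python benchmark.py --help is authoritative).

```python
import argparse

SUPPORTED_DBS = ["qdrant", "milvus", "weaviate", "pinecone", "topk", "sqlite"]


def build_parser() -> argparse.ArgumentParser:
    """Illustrative parser mirroring the documented flags; not the repository's code."""
    parser = argparse.ArgumentParser(description="Vector DB benchmark (sketch)")
    parser.add_argument("--csv", required=True, help="Path to the dataset CSV")
    parser.add_argument("--embeddings", required=True, help="Path to the embeddings parquet file")
    parser.add_argument("--dbs", nargs="+", choices=SUPPORTED_DBS, default=SUPPORTED_DBS,
                        help="DBs to benchmark")
    parser.add_argument("--topk", type=int, default=10, help="Top-k for search")
    parser.add_argument("--topk_sweep", type=int, nargs="*", default=[],
                        help="k values to sweep, e.g. 5 10 50")
    parser.add_argument("--repetitions", type=int, default=5,  # assumed default
                        help="Repetitions per query")
    parser.add_argument("--concurrency", type=int, default=1,  # assumed default
                        help="Concurrent query workers")
    parser.add_argument("--teardown_after_benchmark", action="store_true",
                        help="Delete DB/index after run")
    parser.add_argument("--query_model", default=None, help="Embedding model for queries")
    parser.add_argument("--queries", default="queries.yaml",  # assumed default
                        help="YAML file with queries/expected labels")
    return parser


# Parse an explicit argument list instead of sys.argv so the example is self-contained.
args = build_parser().parse_args([
    "--csv", "data/muse.csv",
    "--embeddings", "data/embeddings.parquet",
    "--dbs", "qdrant", "sqlite",
    "--topk_sweep", "5", "10", "50",
])
print(args.dbs, args.topk, args.topk_sweep)
```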

Results:

Plots and tables in results/ (per-k and summary)
All metrics in results/metrics.json
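To make the query-latency numbers easier to interpret, here is a minimal sketch of how repetitions and concurrent workers can be combined into latency statistics. It assumes a thread pool and nearest-rank percentiles. The function names and the fake_search stand-in are hypothetical, and the benchmark's actual timing code may differ.

```python
import math
import statistics
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Sequence


def time_query(search: Callable[[str], object], query: str) -> float:
    """Return the wall-clock latency of a single search call in milliseconds."""
    start = time.perf_counter()
    search(query)
    return (time.perf_counter() - start) * 1000.0


def benchmark_latency(search: Callable[[str], object], queries: Sequence[str],
                      repetitions: int = 5, concurrency: int = 1) -> Dict[str, float]:
    """Run every query `repetitions` times across `concurrency` worker threads."""
    jobs = [q for q in queries for _ in range(repetitions)]
    if not jobs:
        raise ValueError("need at least one query and one repetition")
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = sorted(pool.map(lambda q: time_query(search, q), jobs))
    # Nearest-rank 95th percentile on the sorted latencies.
    p95_index = math.ceil(0.95 * len(latencies)) - 1
    return {
        "count": len(latencies),
        "mean_ms": statistics.fmean(latencies),
        "p50_ms": statistics.median(latencies),
        "p95_ms": latencies[p95_index],
    }


def fake_search(query: str) -> list:
    """Stand-in for a real DB client call; sleeps briefly to simulate work."""
    time.sleep(0.001)
    return []


stats = benchmark_latency(fake_search, ["sad piano", "upbeat summer pop"],
                          repetitions=3, concurrency=2)
print(stats)
```

With concurrency above 1, each latency includes any time spent waiting on the client or server under load, so compare runs only at the same concurrency level.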

Embedding Generation

By default, embeddings are generated with sentence-transformers/all-MiniLM-L6-v2. To use OpenAI embeddings instead:

python embeddings/embed.py --csv data/muse.csv --out data/embeddings.parquet --use_openai --model text-embedding-3-large

UI: Music Semantic Search – Multi-DB Compare

The ui/ folder provides a FastAPI backend and static frontend for live, side-by-side DB search and latency comparison.

UI Features

Compare Qdrant, Milvus, Weaviate, Pinecone, TopK, and SQLite in parallel
Per-DB query latency in ms
Simple, modern UI (HTML/JS/CSS)
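The parallel comparison with per-DB latency can be pictured as a fan-out over async search functions. The sketch below uses asyncio.gather with fake backends. The names timed_search, compare, and the backend functions are hypothetical, and the actual server code in ui/ may be organized differently.

```python
import asyncio
import time
from typing import Any, Awaitable, Callable, Dict, List

SearchFn = Callable[[str, int], Awaitable[List[str]]]


async def timed_search(name: str, search: SearchFn, query: str, k: int) -> Dict[str, Any]:
    """Run one backend's search and record its latency; failures are reported, not raised."""
    start = time.perf_counter()
    try:
        hits, error = await search(query, k), None
    except Exception as exc:  # keep one failing DB from breaking the whole comparison
        hits, error = [], str(exc)
    latency_ms = (time.perf_counter() - start) * 1000.0
    return {"db": name, "latency_ms": latency_ms, "hits": hits, "error": error}


async def compare(backends: Dict[str, SearchFn], query: str, k: int = 10) -> List[Dict[str, Any]]:
    """Query all backends concurrently; results come back in the order of `backends`."""
    tasks = [timed_search(name, fn, query, k) for name, fn in backends.items()]
    return await asyncio.gather(*tasks)


async def fake_fast(query: str, k: int) -> List[str]:
    """Stand-in for a quick backend."""
    await asyncio.sleep(0.001)
    return [f"track-{i}" for i in range(k)]


async def fake_broken(query: str, k: int) -> List[str]:
    """Stand-in for a backend that is unreachable."""
    raise ConnectionError("backend unavailable")


FAKE_BACKENDS: Dict[str, SearchFn] = {"fast_db": fake_fast, "broken_db": fake_broken}

if __name__ == "__main__":
    for row in asyncio.run(compare(FAKE_BACKENDS, "rainy day jazz", k=3)):
        print(row)
```

Because the searches run concurrently, the total response time is close to that of the slowest backend rather than the sum of all of them.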

UI Quick Start

Install dependencies

pip install -r requirements.txt

Configure

Create .env in repo root with DB endpoints and API keys

Run the server

uvicorn backend.server:app --reload --port 8000

Open the app

Go to http://localhost:8000

Project Structure

benchmark.py – Main benchmarking script (CLI)
embeddings/embed.py – Embedding generation (sentence-transformers or OpenAI)
databases/ – DB client wrappers (Qdrant, Milvus, Weaviate, Pinecone, TopK, SQLite)
plot_benchmarks.py – Plots and summary tables
results/ – Output metrics and plots
ui/ – Web UI (FastAPI backend + static frontend)
requirements.txt – Python dependencies

Troubleshooting

If Docker ports conflict, edit scripts/docker-compose.yml
If you see dimension mismatch errors, check that the embedding model's output dimension matches the dimension of the DB index
For OpenAI, set OPENAI_API_KEY in your environment
For TopK, set the API key in .env
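A quick way to catch dimension mismatches before ingesting is to compare vector lengths against the index dimension. The sketch below is a hypothetical helper, not part of the repository. The dimensions listed are the default output sizes of the two models named in this README (384 for all-MiniLM-L6-v2, 3072 for text-embedding-3-large).

```python
from typing import Dict, Sequence

# Default output dimensions of the embedding models mentioned in this README.
KNOWN_DIMS: Dict[str, int] = {
    "sentence-transformers/all-MiniLM-L6-v2": 384,
    "text-embedding-3-large": 3072,
}


def check_dimensions(vectors: Sequence[Sequence[float]], index_dim: int) -> None:
    """Raise ValueError on the first vector whose length differs from the index dimension."""
    for position, vector in enumerate(vectors):
        if len(vector) != index_dim:
            raise ValueError(
                f"vector {position} has dimension {len(vector)}, "
                f"but the index expects {index_dim}"
            )


# Vectors produced by MiniLM pass against a 384-dimensional index ...
minilm_dim = KNOWN_DIMS["sentence-transformers/all-MiniLM-L6-v2"]
check_dimensions([[0.0] * minilm_dim], index_dim=minilm_dim)

# ... but fail against an index sized for text-embedding-3-large.
try:
    check_dimensions([[0.0] * minilm_dim], index_dim=KNOWN_DIMS["text-embedding-3-large"])
except ValueError as err:
    print(err)
```

If the check fails after switching embedding models, regenerate the embeddings and recreate the index (for example with --teardown_after_benchmark on the previous run) so both sides use the same dimension.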
