A lightweight, pure-Python vector database built from scratch.
This project explores the mechanics of vector similarity search by implementing a custom indexer based on the Vamana graph algorithm (DiskANN). It is designed for educational purposes and lightweight use cases, including semantic search and Retrieval-Augmented Generation (RAG).
Note: This project is a work in progress. APIs and features are subject to change.
- Vamana Graph Indexing: Utilizes the algorithm behind DiskANN.
- Index Auto-Tuning: Adaptively tunes the alpha parameter with a custom PI controller to stabilize the average graph degree, adapting to different dataset structures and improving recall without sacrificing latency.
- Built-in Reranking: Supports MMR (Maximal Marginal Relevance) reranking out of the box, yielding diverse yet relevant context for RAG applications.
- C-Level Speed: By leveraging Numba JIT compilation, TrovaDB achieves indexing and search performance comparable to C while maintaining a readable, hackable Python codebase.
- Persistence: The full database is stored in a single SQLite file, ensuring portability and crash safety.
- Data Science Ready SDK: A lightweight Python client designed with native NumPy support and a simple interface.
- Familiar Stack: Powered by FastAPI, SQLAlchemy and Alembic.
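
To make the role of alpha in the graph index more concrete, here is a minimal NumPy sketch of the pruning rule described in the DiskANN paper (RobustPrune). It is not taken from TrovaDB's code: the function name, its arguments, and the use of Euclidean distance are assumptions made for illustration only.

```python
import numpy as np


def robust_prune(point_id, candidate_ids, vectors, alpha, max_degree):
    """Pick up to max_degree out-neighbours for point_id (RobustPrune-style)."""
    p = vectors[point_id]

    def dist_to_p(c):
        return float(np.linalg.norm(vectors[c] - p))

    # Candidates ordered from nearest to farthest, excluding the point itself.
    remaining = sorted((c for c in set(candidate_ids) if c != point_id),
                       key=lambda c: (dist_to_p(c), c))
    neighbours = []
    while remaining and len(neighbours) < max_degree:
        closest = remaining.pop(0)
        neighbours.append(closest)
        # Keep only candidates that `closest` does not already cover:
        # c is dropped when alpha * d(closest, c) <= d(p, c).
        remaining = [
            c for c in remaining
            if alpha * float(np.linalg.norm(vectors[c] - vectors[closest])) > dist_to_p(c)
        ]
    return neighbours


# Toy data (hypothetical): point 0 at the origin, points 1 and 2 in the
# same direction, point 3 in a different direction.
vectors = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [0.0, 3.0]])
sparse = robust_prune(0, [1, 2, 3], vectors, alpha=1.0, max_degree=3)
dense = robust_prune(0, [1, 2, 3], vectors, alpha=2.5, max_degree=3)
print(sparse, dense)  # [1, 3] [1, 2, 3]
```

In this toy case a larger alpha keeps more edges per node, so the average degree rises with alpha. That relationship is what makes alpha a natural knob for a degree-stabilizing tuner.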
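
The MMR reranking mentioned above can be illustrated with a short, self-contained sketch based on Carbonell and Goldstein's formulation. It does not show TrovaDB's own API: `mmr_rerank`, `lambda_`, and the choice of cosine similarity are assumptions for the sake of the example.

```python
import numpy as np


def mmr_rerank(query, candidates, k, lambda_=0.5):
    """Return indices of k candidates chosen greedily by Maximal Marginal Relevance."""
    q = np.asarray(query, dtype=float)
    c = np.asarray(candidates, dtype=float)
    # Normalise so that dot products are cosine similarities.
    q = q / np.linalg.norm(q)
    c = c / np.linalg.norm(c, axis=1, keepdims=True)
    relevance = c @ q    # similarity of each candidate to the query
    pairwise = c @ c.T   # similarity between every pair of candidates
    selected = []
    remaining = list(range(len(c)))
    while remaining and len(selected) < k:
        if selected:
            # Highest similarity of each remaining candidate to anything already picked.
            redundancy = pairwise[np.ix_(remaining, selected)].max(axis=1)
        else:
            redundancy = np.zeros(len(remaining))
        scores = lambda_ * relevance[remaining] - (1.0 - lambda_) * redundancy
        best = remaining[int(np.argmax(scores))]
        selected.append(best)
        remaining.remove(best)
    return selected


# Hypothetical data: candidates 0 and 1 are near-duplicates, candidate 2 is
# less relevant to the query but different from both.
candidates = [[1.0, 0.1], [1.0, 0.12], [0.6, -0.8]]
by_relevance = mmr_rerank([1.0, 0.0], candidates, k=2, lambda_=1.0)
diversified = mmr_rerank([1.0, 0.0], candidates, k=2, lambda_=0.5)
print(by_relevance, diversified)  # [0, 1] [0, 2]
```

With `lambda_=1.0` the ranking is by relevance alone, so both near-duplicates are returned. Lowering it penalizes redundancy, and the second slot goes to the dissimilar candidate instead.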
You can install TrovaDB directly from GitHub using pip.

    pip install git+https://github.com/AlexHaborets/trovadb.git

If you want to run the database server locally, install it with the [server] extra:

    pip install "trovadb[server] @ git+https://github.com/AlexHaborets/trovadb.git"

Once installed with the [server] extra, you can easily start the database server:

    trovadb-server

(Runs on localhost:8000 by default)
If you prefer not to install dependencies locally, you can clone the repository and run it instantly via Docker:

    docker compose up --build

The client is designed to be as intuitive as possible.

    from trovadb.client import Client

    with Client() as client:
        # Create a collection
        collection = client.get_or_create_collection("demo", dimension=3, metric="cosine")

        # Upsert vectors (combines insert & update operations in one)
        collection.upsert(
            ids=["1", "2", "3", "4", "5"],
            vectors=[
                [0.1, 0.2, 0.3],
                [0.9, 0.8, 0.7],
                [0.2, 0.4, 0.4],
                [0.1, 0.8, 0.2],
                [0.5, 0.3, 0.6]
            ]
        )

        q = [0.1, 0.2, 0.3]
        # Search for three nearest neighbors of q
        results = collection.search(query=q, k=3)
        print(results)

        # Delete specified vectors
        collection.delete(ids=["1", "3"])

        # Delete entire collection
        client.delete_collection("demo")

Check out the examples folder in the root of the repository for detailed usage:
- Tutorial Notebook: Interactive guide using Pandas and HuggingFace models.
- Large Dataset Benchmark: A stress test loading 50,000+ DBpedia articles for RAG.
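
As a sanity check for the quick-start snippet above, the exact cosine nearest neighbors of `q` can be computed by brute force with NumPy alone. This sketch does not call TrovaDB and makes no claim about the format of `results`; the helper `cosine_top_k` is a hypothetical name used only here.

```python
import numpy as np

ids = ["1", "2", "3", "4", "5"]
vectors = np.array([
    [0.1, 0.2, 0.3],
    [0.9, 0.8, 0.7],
    [0.2, 0.4, 0.4],
    [0.1, 0.8, 0.2],
    [0.5, 0.3, 0.6],
])
q = np.array([0.1, 0.2, 0.3])


def cosine_top_k(query, matrix, k):
    """Return (row indices, similarities) of the k rows most similar to query."""
    sims = (matrix @ query) / (np.linalg.norm(matrix, axis=1) * np.linalg.norm(query))
    order = np.argsort(-sims)[:k]  # descending similarity
    return order, sims[order]


order, sims = cosine_top_k(q, vectors, k=3)
top_ids = [ids[i] for i in order]
print(top_ids)  # ['1', '3', '5']
```

On a dataset this small, an approximate index would be expected to return the same three ids, which makes the brute-force result a convenient reference when experimenting.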
The name is inspired by the Italian phrase "Cerca Trova" ("Seek and you shall find"), a cryptic clue left by Vasari and believed to indicate that a lost Da Vinci work is hidden beneath his fresco in Florence.
- DiskANN: Subramanya, S. J., et al. (2019). DiskANN: Fast Accurate Billion-point Nearest Neighbor Search on a Single Node. Advances in Neural Information Processing Systems (NeurIPS).
- FreshDiskANN: Singh, A., et al. (2021). FreshDiskANN: A Fast and Accurate Graph-Based ANN Index for Streaming Similarity Search. arXiv preprint arXiv:2105.09613.
- MMR: Carbonell, J., & Goldstein, J. (1998). The Use of MMR, Diversity-Based Reranking for Reordering Documents and Producing Summaries. SIGIR '98.
- Vamana Visualization: sushrut141/vamana - A helpful repo demonstrating the core algorithm.
This project is licensed under the MIT License. See the LICENSE file for details.