Skip to content

Efficient, scalable, and lightweight vector database

License

Notifications You must be signed in to change notification settings

torinriley/VecStream

Repository files navigation

VecStream

A lightweight, efficient vector database with similarity search capabilities, designed for machine learning and AI applications.

Tests Benchmarks PyPI version Python versions License Downloads GitHub issues

Features

  • Fast similarity search using optimized indexing
  • HNSW indexing for significantly improved search performance
  • Vector collections/namespaces for organizing different types of embeddings
  • Metadata filtering for fine-grained search control
  • Efficient binary storage format for vectors and metadata
  • Automatic text embedding with sentence-transformers
  • Rich command-line interface with beautiful output
  • Cross-platform support (Windows, macOS, Linux)
  • Customizable storage locations
  • Metadata support for enhanced document management
  • Built-in text similarity search

Installation

pip install vecstream

Quick Start

Using the CLI

# Add a document
vecstream add "Machine learning is transforming technology" doc1

# Search for similar documents
vecstream search "AI and machine learning" --k 3

# Search with metadata filtering
vecstream search "cloud computing" --filter '{"category": "ai", "year": 2023}'

# Get document by ID
vecstream get doc1

# View database information
vecstream info

# Create and use a collection
vecstream create_collection research
vecstream add "Neural networks research" doc2 --collection research

# Use custom storage location
vecstream add "Custom storage test" doc3 --db-path "./my_vectors"

# Remove a document
vecstream remove doc1

Using the Python API

from vecstream.collections import CollectionManager
from vecstream.binary_store import BinaryVectorStore

# Using collections for different vector types
manager = CollectionManager("./vector_db")
research_collection = manager.create_collection("research")
products_collection = manager.create_collection("products")

# Add vectors with metadata to collections
research_collection.add_vector(
    id="paper1",
    vector=[1.0, 0.0, 0.0],
    metadata={"topic": "AI", "year": 2023, "author": "Smith"}
)

# Search with metadata filtering
results = research_collection.search_similar(
    query=[1.0, 0.0, 0.0],
    k=5,
    filter_metadata={"year": 2023, "topic": "AI"}
)

# Basic binary store usage (compatible with earlier versions)
store = BinaryVectorStore("./vector_db")

# Add vectors with metadata
store.add_vector(
    id="doc1",
    vector=[1.0, 0.0, 0.0],
    metadata={"text": "Example document", "tags": ["test"]}
)

# Search similar vectors
results = store.search_similar([1.0, 0.0, 0.0], k=5)

# Get vector with metadata
vector, metadata = store.get_vector_with_metadata("doc1")

Storage Locations

By default, VecStream stores its data in:

  • Windows: %APPDATA%/VecStream/store/
  • macOS/Linux: ~/.vecstream/store/

You can specify a custom storage location using the --db-path option in CLI commands or by passing the path to CollectionManager or BinaryVectorStore.

Storage Format

VecStream uses an efficient binary storage format:

  • Vectors: NumPy .npy format for fast access
  • Metadata: JSON format for flexibility
  • Automatic compression and optimization
  • Collections organized in subdirectories

CLI Features

The command-line interface provides:

  • Vector Management: Add, get, update and remove vectors with add, get, and remove commands
  • Similarity Search: Fast vector search with search command with adjustable k-nearest neighbors
  • HNSW Indexing: Significantly faster search performance for large datasets (up to 100x faster)
  • Collections: Organize vectors by type with collection create, collection list, and other commands
  • Metadata Filtering: Filter search results with --filter '{"key": "value"}' syntax
  • Nested Filters: Support for dot notation in filters like --filter '{"details.color": "red"}'
  • Beautiful UI: Rich, colored output and progress indicators for long operations
  • Database Stats: View detailed database information with info command
  • Custom Storage: Specify storage locations with --db-path option

Python API

The Python API offers:

  • HNSW Indexing: Fast approximate nearest-neighbor search with customizable parameters:
    from vecstream.hnsw_index import HNSWIndex
    index = HNSWIndex(dim=128, M=16, ef_construction=200)
  • Collections: Organize vectors with the CollectionManager:
    from vecstream.collections import CollectionManager
    manager = CollectionManager("./vector_db", use_hnsw=True)
    collection = manager.create_collection("images")
  • Metadata Filtering: Fine-grained search control:
    results = collection.search_similar(query, filter_metadata={"category": "electronics"})
  • Nested Filtering: Access nested properties with dot notation:
    results = collection.search_similar(query, filter_metadata={"details.color": "black"})
  • Binary Storage: Efficient serialization for large datasets:
    from vecstream.binary_store import BinaryVectorStore
    store = BinaryVectorStore("./vector_db")
  • Vector Operations: Direct access to similarity calculations, normalization, and more
  • Type Safety: Strong typing and error handling with descriptive exceptions

Requirements

  • Python 3.8 or higher
  • NumPy
  • SciPy
  • sentence-transformers
  • Rich (for CLI)
  • Click (for CLI)

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Version History

  • 0.3.0 (2024-03-XX)

    • Added HNSW indexing for faster similarity search
    • Added collections/namespaces for organizing vectors
    • Added metadata filtering for search results
    • Improved CLI with collection management commands
    • Performance optimizations
  • 0.2.0 (2024-03-XX)

    • Added binary vector store
    • Improved persistent storage
    • Enhanced CLI functionality
    • Added metadata support
  • 0.1.0 (2024-03-XX)

    • Initial release
    • Basic vector storage and search functionality
    • CLI interface
    • Client-server architecture

Documentation

Document Description Link
API Reference Complete reference of VecStream's classes, methods, and CLI commands API Reference
Advanced Usage Detailed examples and best practices for using VecStream Advanced Usage

Key Features

Feature Description Documentation
HNSW Indexing Fast approximate nearest neighbor search for large datasets API Reference, Usage Examples
Collections Organize vectors with metadata for better organization API Reference, Usage Examples
Metadata Filtering Filter search results using metadata properties API Reference, Usage Examples
Binary Storage Efficient storage format for large vector datasets API Reference, Usage Examples

About

Efficient, scalable, and lightweight vector database

Topics

Resources

License

Stars

Watchers

Forks

Packages

No packages published

Contributors 2

  •  
  •  

Languages