Zeta DB: Multimodal Vector Database

Zeta DB is a Python-based multimodal vector database that stores and searches images and text by their semantic embeddings. It uses CLIP models to generate embeddings and LanceDB for storage and similarity search.

Features

Multimodal Embeddings: Generates and stores embeddings for both images and text using CLIP models.
Efficient Similarity Search: Utilizes LanceDB for fast and scalable nearest neighbor search on embedding vectors.
Persistent Storage: The LanceDB index is saved to disk, so it persists across runs.
Configurable: Easily configure model names, storage paths, and search parameters via src/config.py.
Duplicate Prevention: Basic mechanism to avoid re-indexing already processed image paths.
Normalization Option: Supports normalization of vectors for cosine similarity searches.
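
As a minimal sketch of why normalization matters for cosine similarity: once vectors are scaled to unit length, their dot product equals their cosine similarity. The helper names `normalize` and `dot` below are illustrative assumptions, not the functions in src/utils.py, and the example uses plain Python lists rather than real CLIP embeddings.

```python
import math


def normalize(vector):
    """Scale a vector to unit length (L2 norm of 1)."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        # A zero vector has no direction; return a copy unchanged.
        return list(vector)
    return [x / norm for x in vector]


def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


# Two toy 2-dimensional "embeddings" (hypothetical values).
a = normalize([3.0, 4.0])
b = normalize([4.0, 3.0])

# For unit vectors, the dot product is the cosine similarity.
similarity = dot(a, b)
print(similarity)
```

Normalizing once at indexing time means each search only needs dot products, which is the usual reason a vector store offers this option.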

Project Structure

Zeta-DB/
├── data/                     # Default directory for storing the index
│   └── index/                # Stores the LanceDB index file (vector_db.index)
├── src/                      # Source code
│   ├── __init__.py
│   ├── config.py             # Project configuration
│   ├── embedding.py          # CLIP model loading and embedding generation
│   ├── index.py              # LanceDB index management
│   └── utils.py              # Utility functions (e.g., vector normalization)
├── zeta.py                   # Main entry point
├── .gitignore
├── README.md
└── requirements.txt          # Python dependencies

Setup

Clone the repository:

git clone https://github.com/logxdx/Zeta-DB.git
cd Zeta-DB

Create a virtual environment (recommended):

python -m venv venv
# On Windows
venv\Scripts\activate
# On macOS/Linux
source venv/bin/activate

Install dependencies:
```
pip install -r requirements.txt
```
Configuration (Optional): Review and modify src/config.py if you need to change the default paths, the CLIP model, or other settings. The default configuration uses jinaai/jina-clip-v2 with full-size embeddings (1024 dimensions). It uses CUDA automatically when available and falls back to the CPU otherwise.
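
For orientation, here is a hedged sketch of what settings like these might look like in code. The model name, embedding size, and index directory come from this README; the `Settings` class, its field names, and `pick_device` are hypothetical and are not taken from src/config.py. The device check assumes PyTorch and falls back to the CPU if it is not installed.

```python
from dataclasses import dataclass


def pick_device():
    """Return "cuda" when PyTorch reports a usable GPU, otherwise "cpu"."""
    try:
        import torch  # optional in this sketch; without it we use the CPU
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


@dataclass(frozen=True)
class Settings:
    # Values are the README defaults; the field names are hypothetical.
    model_name: str = "jinaai/jina-clip-v2"
    embedding_dim: int = 1024
    index_dir: str = "data/index"
    device: str = "cpu"


settings = Settings(device=pick_device())
print(settings)
```

Check src/config.py for the actual variable names before changing anything.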

Core Components

The src directory contains the core logic for Zeta DB, organized into the following modules:

config.py: Centralized configuration for all system parameters, including model names, file paths, and embedding dimensions. This allows for easy customization and management of the database settings.
embedding.py: Implements the CLIPEmbedder class, which is responsible for loading a pre-trained CLIP model and generating vector embeddings for both images and text. It's designed to be easily extensible to support other embedding models.
index.py: Contains the LanceDBIndex class, which manages the LanceDB vector index. This class handles the creation, storage, and retrieval of vectors and provides a simple interface for similarity searches.
utils.py: A collection of utility functions used throughout the project. Currently, it includes a function for normalizing vectors, a preprocessing step needed for cosine similarity searches.
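
To illustrate the role the index plays, the sketch below is a toy in-memory stand-in: it skips paths that were already added and ranks stored vectors by cosine similarity with a brute-force scan. It is not LanceDB and not the project's LanceDBIndex class; `InMemoryIndex`, its methods, and the sample vectors are all hypothetical.

```python
import math


def _cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


class InMemoryIndex:
    """Toy stand-in for a vector index: brute-force cosine search over a dict."""

    def __init__(self):
        self._vectors = {}  # maps image path -> embedding vector

    def add(self, path, vector):
        """Store a vector; return False if the path was already indexed."""
        if path in self._vectors:
            return False
        self._vectors[path] = list(vector)
        return True

    def search(self, query, k=3):
        """Return up to k (path, similarity) pairs, best match first."""
        scored = [(path, _cosine(query, vec)) for path, vec in self._vectors.items()]
        scored.sort(key=lambda item: item[1], reverse=True)
        return scored[:k]


index = InMemoryIndex()
index.add("cat.jpg", [1.0, 0.0, 0.0])
index.add("dog.jpg", [0.0, 1.0, 0.0])
added_again = index.add("cat.jpg", [1.0, 0.0, 0.0])  # duplicate path is skipped

results = index.search([0.9, 0.1, 0.0], k=1)
print(added_again, results)
```

A real vector store replaces the linear scan with a persistent, indexed search, but the add-then-query shape is the same.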

Getting Started

To use Zeta DB, you need Python and the dependencies listed in requirements.txt. Once set up, run the zeta.py script to start the application with the Gradio interface.

Example Usage

python zeta.py
