diff --git a/.github/ignore-notebooks.txt b/.github/ignore-notebooks.txt index 55052688..61ba17de 100644 --- a/.github/ignore-notebooks.txt +++ b/.github/ignore-notebooks.txt @@ -7,4 +7,6 @@ 02_semantic_cache_optimization spring_ai_redis_rag.ipynb 00_litellm_proxy_redis.ipynb -04_redisvl_benchmarking_basics.ipynb \ No newline at end of file +04_redisvl_benchmarking_basics.ipynb +06_hnsw_to_svs_vamana_migration.ipynb +07_flat_to_svs_vamana_migration.ipynb \ No newline at end of file diff --git a/README.md b/README.md index a01de17f..6425baf0 100644 --- a/README.md +++ b/README.md @@ -69,6 +69,8 @@ Need quickstarts to begin your Redis AI journey? | 🔢 **Data Type Support** - Shows how to convert a float32 index to float16 or integer dataypes | [![Open In GitHub](https://img.shields.io/badge/View-GitHub-green)](python-recipes/vector-search/03_dtype_support.ipynb) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/redis-developer/redis-ai-resources/blob/main/python-recipes/vector-search/03_dtype_support.ipynb) | | 📊 **Benchmarking Basics** - Overview of search benchmarking basics with RedisVL and Python multiprocessing | [![Open In GitHub](https://img.shields.io/badge/View-GitHub-green)](python-recipes/vector-search/04_redisvl_benchmarking_basics.ipynb) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/redis-developer/redis-ai-resources/blob/main/python-recipes/vector-search/04_redisvl_benchmarking_basics.ipynb) | | 📊 **Multi Vector Search** - Overview of multi vector queries with RedisVL | [![Open In GitHub](https://img.shields.io/badge/View-GitHub-green)](python-recipes/vector-search/05_multivector_search.ipynb) | [![Open In 
Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/redis-developer/redis-ai-resources/blob/main/python-recipes/vector-search/05_multivector_search.ipynb) | +| 🗜️ **HNSW to SVS-VAMANA Migration** - Shows how to migrate HNSW indices to SVS-VAMANA and demonstrates 50-75% memory savings | [![Open In GitHub](https://img.shields.io/badge/View-GitHub-green)](python-recipes/vector-search/06_hnsw_to_svs_vamana_migration.ipynb) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/redis-developer/redis-ai-resources/blob/main/python-recipes/vector-search/06_hnsw_to_svs_vamana_migration.ipynb) | +| 🗜️ **FLAT to SVS-VAMANA Migration** - Shows how to migrate FLAT indices to SVS-VAMANA and demonstrates significant memory reduction | [![Open In GitHub](https://img.shields.io/badge/View-GitHub-green)](python-recipes/vector-search/07_flat_to_svs_vamana_migration.ipynb) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/redis-developer/redis-ai-resources/blob/main/python-recipes/vector-search/07_flat_to_svs_vamana_migration.ipynb) | ### Retrieval Augmented Generation (RAG) diff --git a/python-recipes/vector-search/06_hnsw_to_svs_vamana_migration.ipynb b/python-recipes/vector-search/06_hnsw_to_svs_vamana_migration.ipynb new file mode 100644 index 00000000..dbe20a7a --- /dev/null +++ b/python-recipes/vector-search/06_hnsw_to_svs_vamana_migration.ipynb @@ -0,0 +1,1314 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "![Redis](https://redis.io/wp-content/uploads/2024/04/Logotype.svg?auto=webp&quality=85,75&width=120)\n", + "# Migrating from HNSW to SVS-VAMANA\n", + "\n", + "## Let's Begin!\n", + "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/redis-developer/redis-ai-resources/blob/main/python-recipes/vector-search/06_hnsw_to_svs_vamana_migration.ipynb)\n", + "\n", + "This notebook demonstrates how to migrate existing HNSW vector indices to SVS-VAMANA for improved memory efficiency 
while maintaining search quality.\n", + "\n", + "## What You'll Learn\n", + "\n", + "- How to assess your current HNSW index for migration\n", + "- Step-by-step migration from HNSW to SVS-VAMANA\n", + "- Memory usage comparison and cost analysis\n", + "- Search quality validation between HNSW and SVS-VAMANA\n", + "- Performance benchmarking and recall comparison\n", + "- Migration decision framework for production systems\n", + "\n", + "## Prerequisites\n", + "\n", + "- Redis Stack 8.2.0+ with RediSearch 2.8.10+\n", + "- Existing HNSW index with substantial data (1000+ documents recommended)\n", + "- High-dimensional vectors (768+ dimensions for best compression benefits)\n", + "\n", + "## HNSW vs SVS-VAMANA\n", + "\n", + "**HNSW (Hierarchical Navigable Small World):**\n", + "- Excellent search quality and recall\n", + "- Fast query performance\n", + "- Higher memory usage (stores full-precision vectors)\n", + "- Good for applications prioritizing search quality\n", + "\n", + "**SVS-VAMANA:**\n", + "- Competitive search quality with compression\n", + "- Significant memory savings (50-75% reduction)\n", + "- Built-in vector compression (LeanVec, quantization)\n", + "- Ideal for large-scale deployments with cost constraints" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 📦 Installation & Setup\n", + "\n", + "This notebook requires **sentence-transformers** for generating embeddings and **Redis Stack** running in Docker.\n", + "\n", + "**Requirements:**\n", + "- Redis Stack 8.2.0+ with RediSearch 2.8.10+\n", + "- sentence-transformers (for generating embeddings)\n", + "- numpy (for vector operations)\n", + "- redisvl (should be available in your environment)\n", + "\n", + "**🐳 Docker Setup (Required):**\n", + "\n", + "Before running this notebook, make sure Redis Stack is running in Docker:\n", + "\n", + "```bash\n", + "# Start Redis Stack with Docker\n", + "docker run -d --name redis-stack -p 6379:6379 -p 8001:8001 
redis/redis-stack:latest\n", + "```\n", + "\n", + "Or if you prefer using docker-compose, create a `docker-compose.yml` file:\n", + "\n", + "```yaml\n", + "version: '3.8'\n", + "services:\n", + "  redis:\n", + "    image: redis/redis-stack:latest\n", + "    ports:\n", + "      - \"6379:6379\"\n", + "      - \"8001:8001\"\n", + "```\n", + "\n", + "Then run: `docker-compose up -d`\n", + "\n", + "**📚 Python Dependencies Installation:**\n", + "\n", + "Install the required Python packages:\n", + "\n", + "```bash\n", + "# Install core dependencies\n", + "pip install redisvl numpy sentence-transformers\n", + "\n", + "# Or pin minimum versions (quote each specifier so the shell does not treat >= as a redirect)\n", + "pip install \"redisvl>=0.2.0\" \"numpy>=1.21.0\" \"sentence-transformers>=2.2.0\"\n", + "```\n", + "\n", + "**For Google Colab users, run this cell:**\n", + "\n", + "```python\n", + "!pip install redisvl sentence-transformers numpy\n", + "```\n", + "\n", + "**For Conda users:**\n", + "\n", + "```bash\n", + "conda install numpy\n", + "pip install redisvl sentence-transformers\n", + "```" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [], + "source": [ + "# # Install dependencies if needed\n", + "# import sys\n", + "# import subprocess\n", + "\n", + "# def install_if_missing(package):\n", + "#     try:\n", + "#         __import__(package)\n", + "#     except ImportError:\n", + "#         print(f\"Installing {package}...\")\n", + "#         subprocess.check_call([sys.executable, \"-m\", \"pip\", \"install\", package])\n", + "\n", + "# # Check and install required packages\n", + "# install_if_missing(\"sentence-transformers\")\n", + "# install_if_missing(\"redisvl\")\n", + "\n", + "# print(\"✅ All dependencies are ready!\")" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "📚 Libraries imported successfully!\n" + ] + } + ], + "source": [ + "# Import required libraries\n", + "import os\n", + "import 
json\n", + "import numpy as np\n", + "import time\n", + "from typing import List, Dict, Any\n", + "\n", + "# Redis and RedisVL imports\n", + "import redis\n", + "from redisvl.index import SearchIndex\n", + "from redisvl.query import VectorQuery\n", + "from redisvl.redis.utils import array_to_buffer, buffer_to_array\n", + "from redisvl.utils import CompressionAdvisor\n", + "from redisvl.redis.connection import supports_svs\n", + "\n", + "# Configuration\n", + "REDIS_URL = \"redis://localhost:6379\"\n", + "\n", + "print(\"📚 Libraries imported successfully!\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Step 1: Verify Redis and SVS Support\n", + "\n", + "First, let's ensure Redis Stack is running and supports SVS-VAMANA." + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "✅ Redis connection successful\n", + "📊 Redis version: 8.2.2\n", + "✅ SVS-VAMANA supported\n" + ] + } + ], + "source": [ + "# Test Redis connection and SVS support\n", + "try:\n", + "    client = redis.Redis.from_url(REDIS_URL)\n", + "    client.ping()\n", + "    print(\"✅ Redis connection successful\")\n", + " \n", + "    # Check Redis version\n", + "    redis_info = client.info()\n", + "    redis_version = redis_info['redis_version']\n", + "    print(f\"📊 Redis version: {redis_version}\")\n", + " \n", + "    # Check SVS support\n", + "    if supports_svs(client):\n", + "        print(\"✅ SVS-VAMANA supported\")\n", + "    else:\n", + "        print(\"❌ SVS-VAMANA not supported\")\n", + "        print(\"Please ensure you're using Redis Stack 8.2.0+ with RediSearch 2.8.10+\")\n", + " \n", + "except Exception as e:\n", + "    print(f\"❌ Redis connection failed: {e}\")\n", + "    print(\"Please ensure Redis Stack is running on localhost:6379\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Step 2: Load Sample Data\n", + "\n", + "We'll use the movie dataset to 
demonstrate the migration process." + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "📽️ Loaded 20 movie records\n", + "Sample movie: Explosive Pursuit\n", + "Genres available: {'comedy', 'action'}\n", + "\n", + "🔧 Configuration:\n", + "Vector dimensions: 1024\n", + "Dataset size: 20 movie documents\n" + ] + } + ], + "source": [ + "# Load the movies dataset\n", + "with open('resources/movies.json', 'r') as f:\n", + "    movies_data = json.load(f)\n", + "\n", + "print(\n", + "    f\"📽️ Loaded {len(movies_data)} movie records\",\n", + "    f\"Sample movie: {movies_data[0]['title']}\",\n", + "    f\"Genres available: {set(movie['genre'] for movie in movies_data)}\",\n", + "    sep=\"\\n\"\n", + ")\n", + "\n", + "# Configuration for demonstration \n", + "dims = 1024  # sentence-transformers/all-roberta-large-v1 - 1024 dims\n", + "num_docs = len(movies_data)  # Use actual dataset size\n", + "\n", + "print(\n", + "    f\"\\n🔧 Configuration:\",\n", + "    f\"Vector dimensions: {dims}\",\n", + "    f\"Dataset size: {num_docs} movie documents\",\n", + "    sep=\"\\n\"\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Step 3: Create HNSW Index\n", + "\n", + "First, we'll create an HNSW index with typical production settings." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Creating HNSW index with optimized settings...\n", + "✅ Created HNSW index: hnsw_demo_index\n", + "\n", + "🔧 HNSW Configuration:\n", + "M (connections per node): 16\n", + "EF Construction: 200\n", + "EF Runtime: 10\n", + "Distance metric: cosine\n", + "Data type: float32\n" + ] + } + ], + "source": [ + "# Create HNSW schema with production-like settings\n", + "hnsw_schema = {\n", + "    \"index\": {\n", + "        \"name\": \"hnsw_demo_index\",\n", + "        \"prefix\": \"demo:hnsw:\",\n", + "    },\n", + "    \"fields\": [\n", + "        {\"name\": \"movie_id\", \"type\": \"tag\"},\n", + "        {\"name\": \"title\", \"type\": \"text\"},\n", + "        {\"name\": \"genre\", \"type\": \"tag\"},\n", + "        {\"name\": \"rating\", \"type\": \"numeric\"},\n", + "        {\"name\": \"description\", \"type\": \"text\"},\n", + "        {\n", + "            \"name\": \"embedding\",\n", + "            \"type\": \"vector\",\n", + "            \"attrs\": {\n", + "                \"dims\": dims,\n", + "                \"algorithm\": \"hnsw\",\n", + "                \"datatype\": \"float32\",\n", + "                \"distance_metric\": \"cosine\",\n", + "                \"m\": 16,  # Number of bi-directional links for each node\n", + "                \"ef_construction\": 200,  # Size of dynamic candidate list\n", + "                \"ef_runtime\": 10  # Size of dynamic candidate list during search\n", + "            }\n", + "        }\n", + "    ]\n", + "}\n", + "\n", + "print(\"Creating HNSW index with optimized settings...\")\n", + "hnsw_index = SearchIndex.from_dict(hnsw_schema, redis_url=REDIS_URL)\n", + "hnsw_index.create(overwrite=True)\n", + "print(f\"✅ Created HNSW index: {hnsw_index.name}\")\n", + "\n", + "# Display HNSW configuration\n", + "print(\n", + "    \"\\n🔧 HNSW Configuration:\",\n", + "    f\"M (connections per node): 16\",\n", + "    f\"EF Construction: 200\",\n", + "    f\"EF Runtime: 10\",\n", + "    f\"Distance metric: cosine\",\n", + "    f\"Data type: float32\",\n", + "    sep=\"\\n\"\n", + ")" + ] + }, + { + "cell_type": 
"markdown", + "metadata": {}, + "source": [ + "## Step 4: Generate Embeddings and Load HNSW Index\n", + "\n", + "Generate embeddings for movie descriptions and populate the HNSW index." + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "🔄 Generating embeddings for movie descriptions...\n", + "14:40:35 sentence_transformers.SentenceTransformer INFO Use pytorch device_name: mps\n", + "14:40:35 sentence_transformers.SentenceTransformer INFO Load pretrained SentenceTransformer: all-MiniLM-L6-v2\n" + ] + }, + { + "data": { + "application/vnd.jupyter.widget-view+json": { + "model_id": "dfa2af21d4904b58845f57a9786706e3", + "version_major": 2, + "version_minor": 0 + }, + "text/plain": [ + "Batches: 0%| | 0/1 [00:00 0:\n", + "    for i in range(0, len(svs_data), batch_size):\n", + "        batch = svs_data[i:i+batch_size]\n", + "        svs_index.load(batch)\n", + "        print(f\" Migrated {min(i+batch_size, len(svs_data))}/{len(svs_data)} documents\")\n", + "\n", + "    # Wait for indexing to complete\n", + "    print(\"⏳ Waiting for SVS-VAMANA indexing to complete...\")\n", + "    time.sleep(5)\n", + "\n", + "    svs_info = svs_index.info()\n", + "    print(\n", + "        f\"\\n✅ Migration complete! SVS index has {svs_info['num_docs']} documents\",\n", + "        f\"Index size: {svs_info.get('vector_index_sz_mb', 'N/A')} MB\",\n", + "        sep=\"\\n\"\n", + "    )\n", + "else:\n", + "    print(\"⚠️ No data to migrate. Make sure the HNSW index was populated first.\")\n", + "    print(\" Run the previous cells to load data into the HNSW index.\")\n", + "    svs_info = svs_index.info()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Step 8: Compare Memory Usage\n", + "\n", + "Analyze the memory savings achieved through the HNSW to SVS-VAMANA migration." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 13, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "📊 Memory Usage Comparison\n", + "========================================\n", + "Original HNSW index: 4.23 MB\n", + "SVS-VAMANA index: 1.02 MB\n", + "\n", + "💰 Memory savings: 75.9%\n", + "Absolute reduction: 3.21 MB\n", + "\n", + "💵 Cost Impact Analysis:\n", + "Monthly cost reduction: $0.23\n", + "Annual cost reduction: $2.71\n" + ] + } + ], + "source": [ + "# Helper function to extract memory info\n", + "def get_memory_mb(index_info):\n", + "    \"\"\"Extract memory usage in MB from index info\"\"\"\n", + "    memory = index_info.get('vector_index_sz_mb', 0)\n", + "    if isinstance(memory, str):\n", + "        try:\n", + "            return float(memory)\n", + "        except ValueError:\n", + "            return 0.0\n", + "    return float(memory)\n", + "\n", + "# Get memory usage\n", + "hnsw_memory = get_memory_mb(hnsw_info)\n", + "svs_memory = get_memory_mb(svs_info)\n", + "\n", + "print(\n", + "    \"📊 Memory Usage Comparison\",\n", + "    \"=\" * 40,\n", + "    f\"Original HNSW index: {hnsw_memory:.2f} MB\",\n", + "    f\"SVS-VAMANA index: {svs_memory:.2f} MB\",\n", + "    \"\",\n", + "    sep=\"\\n\"\n", + ")\n", + "\n", + "if hnsw_memory > 0:\n", + "    if svs_memory > 0:\n", + "        savings = ((hnsw_memory - svs_memory) / hnsw_memory) * 100\n", + "        print(\n", + "            f\"💰 Memory savings: {savings:.1f}%\",\n", + "            f\"Absolute reduction: {hnsw_memory - svs_memory:.2f} MB\",\n", + "            sep=\"\\n\"\n", + "        )\n", + "    else:\n", + "        print(\"⏳ SVS index still indexing - memory comparison pending\")\n", + " \n", + "    # Cost analysis\n", + "    print(\"\\n💵 Cost Impact Analysis:\")\n", + "    cost_per_gb_hour = 0.10  # Example cloud pricing\n", + "    hours_per_month = 24 * 30\n", + " \n", + "    hnsw_monthly_cost = (hnsw_memory / 1024) * cost_per_gb_hour * hours_per_month\n", + "    if svs_memory > 0:\n", + "        svs_monthly_cost = (svs_memory / 1024) * cost_per_gb_hour * 
hours_per_month\n", + "        monthly_savings = hnsw_monthly_cost - svs_monthly_cost\n", + "        print(\n", + "            f\"Monthly cost reduction: ${monthly_savings:.2f}\",\n", + "            f\"Annual cost reduction: ${monthly_savings * 12:.2f}\",\n", + "            sep=\"\\n\"\n", + "        )\n", + "    else:\n", + "        print(\n", + "            f\"Current monthly cost: ${hnsw_monthly_cost:.2f}\",\n", + "            \"Projected savings: Available after indexing completes\",\n", + "            sep=\"\\n\"\n", + "        )\n", + "else:\n", + "    print(\"⚠️ Memory information not available\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Step 9: Validate Search Quality\n", + "\n", + "Compare search quality between HNSW and SVS-VAMANA to ensure the migration maintains acceptable recall." + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "🔍 Generating test queries for quality validation...\n", + "Generated 10 test queries\n" + ] + } + ], + "source": [ + "# Generate test queries\n", + "print(\"🔍 Generating test queries for quality validation...\")\n", + "\n", + "np.random.seed(123)  # For reproducible test queries\n", + "num_test_queries = 10\n", + "test_queries = []\n", + "\n", + "for i in range(num_test_queries):\n", + "    # Create test query vectors\n", + "    query_vec = np.random.random(dims).astype(np.float32)\n", + "    query_vec = query_vec / np.linalg.norm(query_vec)  # Normalize\n", + "    test_queries.append(query_vec)\n", + "\n", + "print(f\"Generated {len(test_queries)} test queries\")" + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "🔍 Testing HNSW search quality...\n", + "HNSW search completed in 0.007 seconds\n" + ] + } + ], + "source": [ + "# Test HNSW search quality\n", + "print(\"🔍 Testing HNSW search quality...\")\n", + "\n", + "hnsw_results = []\n", + "hnsw_start = time.time()\n", + "\n", 
+ "for query_vec in test_queries:\n", + "    query = VectorQuery(\n", + "        vector=query_vec,\n", + "        vector_field_name=\"embedding\",\n", + "        return_fields=[\"movie_id\", \"title\", \"genre\"],\n", + "        dtype=\"float32\",\n", + "        num_results=10\n", + "    )\n", + "    results = hnsw_index.query(query)\n", + "    hnsw_results.append([doc[\"movie_id\"] for doc in results])\n", + "\n", + "hnsw_time = time.time() - hnsw_start\n", + "print(f\"HNSW search completed in {hnsw_time:.3f} seconds\")" + ] + }, + { + "cell_type": "code", + "execution_count": 16, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "🔍 Testing SVS-VAMANA search quality...\n", + "SVS-VAMANA search completed in 0.006 seconds\n" + ] + } + ], + "source": [ + "# Test SVS-VAMANA search quality\n", + "print(\"🔍 Testing SVS-VAMANA search quality...\")\n", + "\n", + "svs_results = []\n", + "svs_start = time.time()\n", + "\n", + "for i, query_vec in enumerate(test_queries):\n", + "    # Adjust query vector for SVS index (handle dimensionality reduction)\n", + "    if target_dims < dims:\n", + "        svs_query_vec = query_vec[:target_dims]\n", + "    else:\n", + "        svs_query_vec = query_vec\n", + " \n", + "    if target_dtype == 'float16':\n", + "        svs_query_vec = svs_query_vec.astype(np.float16)\n", + " \n", + "    query = VectorQuery(\n", + "        vector=svs_query_vec,\n", + "        vector_field_name=\"embedding\",\n", + "        return_fields=[\"movie_id\", \"title\", \"genre\"],\n", + "        dtype=target_dtype,\n", + "        num_results=10\n", + "    )\n", + "    results = svs_index.query(query)\n", + "    svs_results.append([doc[\"movie_id\"] for doc in results])\n", + "\n", + "svs_time = time.time() - svs_start\n", + "print(f\"SVS-VAMANA search completed in {svs_time:.3f} seconds\")" + ] + }, + { + "cell_type": "code", + "execution_count": 17, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "📊 Search Quality Comparison\n", + "========================================\n", + 
"Recall@5: 1.000 (100.0%)\n", + "Recall@10: 0.990 (99.0%)\n", + "\n", + "⏱️ Performance Comparison:\n", + "HNSW query time: 0.007s (0.7ms per query)\n", + "SVS-VAMANA query time: 0.006s (0.6ms per query)\n", + "Speed difference: +10.4%\n", + "\n", + "🎯 Quality Assessment: 🟢 Excellent - Minimal quality loss\n" + ] + } + ], + "source": [ + "# Calculate recall and performance metrics\n", + "def calculate_recall(reference_results, test_results, k=10):\n", + "    \"\"\"Calculate recall@k between two result sets\"\"\"\n", + "    if not reference_results or not test_results:\n", + "        return 0.0\n", + " \n", + "    total_recall = 0.0\n", + "    for ref, test in zip(reference_results, test_results):\n", + "        ref_set = set(ref[:k])\n", + "        test_set = set(test[:k])\n", + "        if len(ref_set) > 0:\n", + "            recall = len(ref_set.intersection(test_set)) / len(ref_set)\n", + "            total_recall += recall\n", + " \n", + "    return total_recall / len(reference_results)\n", + "\n", + "# Calculate metrics\n", + "recall_at_5 = calculate_recall(hnsw_results, svs_results, k=5)\n", + "recall_at_10 = calculate_recall(hnsw_results, svs_results, k=10)\n", + "\n", + "print(\n", + "    \"📊 Search Quality Comparison\",\n", + "    \"=\" * 40,\n", + "    f\"Recall@5: {recall_at_5:.3f} ({recall_at_5*100:.1f}%)\",\n", + "    f\"Recall@10: {recall_at_10:.3f} ({recall_at_10*100:.1f}%)\",\n", + "    \"\",\n", + "    \"⏱️ Performance Comparison:\",\n", + "    f\"HNSW query time: {hnsw_time:.3f}s ({hnsw_time/num_test_queries*1000:.1f}ms per query)\",\n", + "    f\"SVS-VAMANA query time: {svs_time:.3f}s ({svs_time/num_test_queries*1000:.1f}ms per query)\",\n", + "    f\"Speed difference: {((hnsw_time - svs_time) / hnsw_time * 100):+.1f}%\",\n", + "    sep=\"\\n\"\n", + ")\n", + "\n", + "# Quality assessment\n", + "if recall_at_10 >= 0.95:\n", + "    quality_assessment = \"🟢 Excellent - Minimal quality loss\"\n", + "elif recall_at_10 >= 0.90:\n", + "    quality_assessment = \"🟡 Good - Acceptable quality for most applications\"\n", + "elif 
recall_at_10 >= 0.80:\n", + "    quality_assessment = \"🟠 Fair - Consider if quality requirements are flexible\"\n", + "else:\n", + "    quality_assessment = \"🔴 Poor - Migration not recommended\"\n", + "\n", + "print(f\"\\n🎯 Quality Assessment: {quality_assessment}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Step 10: Migration Decision Framework\n", + "\n", + "Based on the analysis, determine if migration is recommended." + ] + }, + { + "cell_type": "code", + "execution_count": 18, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "🤔 Migration Decision Analysis\n", + "========================================\n", + "\n", + "📊 Criteria Evaluation:\n", + "Memory savings: 75.9% ✅ (threshold: 20%)\n", + "Search quality: 0.990 ✅ (threshold: 0.85)\n", + "\n", + "🎯 Migration Recommendation: 🟢 RECOMMENDED\n", + "💭 Reasoning: Migration provides significant memory savings while maintaining good search quality.\n" + ] + } + ], + "source": [ + "# Migration decision logic\n", + "memory_savings_threshold = 20  # Minimum 20% memory savings\n", + "recall_threshold = 0.85  # Minimum 85% recall@10\n", + "\n", + "memory_savings_pct = ((hnsw_memory - svs_memory) / hnsw_memory * 100) if hnsw_memory > 0 and svs_memory > 0 else 0\n", + "meets_memory_threshold = memory_savings_pct >= memory_savings_threshold\n", + "meets_quality_threshold = recall_at_10 >= recall_threshold\n", + "\n", + "print(\n", + "    \"🤔 Migration Decision Analysis\",\n", + "    \"=\" * 40,\n", + "    \"\",\n", + "    \"📊 Criteria Evaluation:\",\n", + "    f\"Memory savings: {memory_savings_pct:.1f}% {'✅' if meets_memory_threshold else '❌'} (threshold: {memory_savings_threshold}%)\",\n", + "    f\"Search quality: {recall_at_10:.3f} {'✅' if meets_quality_threshold else '❌'} (threshold: {recall_threshold})\",\n", + "    \"\",\n", + "    sep=\"\\n\"\n", + ")\n", + "\n", + "if meets_memory_threshold and 
meets_quality_threshold:\n", + "    recommendation = \"🟢 RECOMMENDED\"\n", + "    reasoning = \"Migration provides significant memory savings while maintaining good search quality.\"\n", + "elif meets_memory_threshold and not meets_quality_threshold:\n", + "    recommendation = \"🟡 CONDITIONAL\"\n", + "    reasoning = \"Good memory savings but reduced search quality. Consider if your application can tolerate lower recall.\"\n", + "elif not meets_memory_threshold and meets_quality_threshold:\n", + "    recommendation = \"🟠 LIMITED BENEFIT\"\n", + "    reasoning = \"Search quality is maintained but memory savings are minimal. Migration may not be worth the effort.\"\n", + "else:\n", + "    recommendation = \"🔴 NOT RECOMMENDED\"\n", + "    reasoning = \"Insufficient memory savings and/or poor search quality. Consider alternative optimization strategies.\"\n", + "\n", + "print(\n", + "    f\"🎯 Migration Recommendation: {recommendation}\",\n", + "    f\"💭 Reasoning: {reasoning}\",\n", + "    sep=\"\\n\"\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Step 11: Production Migration Checklist\n", + "\n", + "If migration is recommended, follow this checklist for production deployment." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 19, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "📋 HNSW to SVS-VAMANA Migration Checklist\n", + "==================================================\n", + "\n", + "PRE-MIGRATION:\n", + "□ Backup existing HNSW index data\n", + "□ Test migration on staging environment\n", + "□ Validate search quality with real queries\n", + "□ Measure baseline HNSW performance metrics\n", + "□ Plan rollback strategy\n", + "□ Document current HNSW parameters (M, EF_construction, EF_runtime)\n", + "\n", + "MIGRATION:\n", + "□ Create SVS-VAMANA index with tested configuration\n", + "□ Migrate data in batches during low-traffic periods\n", + "□ Monitor memory usage and indexing progress\n", + "□ Validate data integrity after migration\n", + "□ Test search functionality thoroughly\n", + "□ Compare recall metrics with baseline\n", + "\n", + "POST-MIGRATION:\n", + "□ Monitor search performance and quality\n", + "□ Track memory usage and cost savings\n", + "□ Update application configuration\n", + "□ Document new SVS-VAMANA settings\n", + "□ Clean up old HNSW index after validation period\n", + "□ Update monitoring and alerting thresholds\n", + "\n", + "💡 HNSW-SPECIFIC TIPS:\n", + "• HNSW indices are more complex to rebuild than FLAT\n", + "• Consider the impact on applications using EF_runtime tuning\n", + "• SVS-VAMANA may have different optimal query parameters\n", + "• Test with your specific HNSW configuration (M, EF values)\n", + "• Monitor for 48-72 hours before removing HNSW index\n", + "• Keep compression settings documented for future reference\n" + ] + } + ], + "source": [ + "print(\n", + "    \"📋 HNSW to SVS-VAMANA Migration Checklist\",\n", + "    \"=\" * 50,\n", + "    \"\\nPRE-MIGRATION:\",\n", + "    \"□ Backup existing HNSW index data\",\n", + "    \"□ Test migration on staging environment\",\n", + "    \"□ Validate search 
quality with real queries\",\n", + "    \"□ Measure baseline HNSW performance metrics\",\n", + "    \"□ Plan rollback strategy\",\n", + "    \"□ Document current HNSW parameters (M, EF_construction, EF_runtime)\",\n", + "    \"\\nMIGRATION:\",\n", + "    \"□ Create SVS-VAMANA index with tested configuration\",\n", + "    \"□ Migrate data in batches during low-traffic periods\",\n", + "    \"□ Monitor memory usage and indexing progress\",\n", + "    \"□ Validate data integrity after migration\",\n", + "    \"□ Test search functionality thoroughly\",\n", + "    \"□ Compare recall metrics with baseline\",\n", + "    \"\\nPOST-MIGRATION:\",\n", + "    \"□ Monitor search performance and quality\",\n", + "    \"□ Track memory usage and cost savings\",\n", + "    \"□ Update application configuration\",\n", + "    \"□ Document new SVS-VAMANA settings\",\n", + "    \"□ Clean up old HNSW index after validation period\",\n", + "    \"□ Update monitoring and alerting thresholds\",\n", + "    \"\\n💡 HNSW-SPECIFIC TIPS:\",\n", + "    \"• HNSW indices are more complex to rebuild than FLAT\",\n", + "    \"• Consider the impact on applications using EF_runtime tuning\",\n", + "    \"• SVS-VAMANA may have different optimal query parameters\",\n", + "    \"• Test with your specific HNSW configuration (M, EF values)\",\n", + "    \"• Monitor for 48-72 hours before removing HNSW index\",\n", + "    \"• Keep compression settings documented for future reference\",\n", + "    sep=\"\\n\"\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Step 12: Cleanup\n", + "\n", + "Clean up the demonstration indices." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 20, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "🧹 Cleaning up demonstration indices...\n", + "✅ Deleted HNSW demonstration index\n", + "✅ Deleted SVS-VAMANA demonstration index\n", + "\n", + "🎉 HNSW to SVS-VAMANA migration demonstration complete!\n", + "\n", + "Next steps:\n", + "1. Apply learnings to your production HNSW indices\n", + "2. Test with your actual query patterns and data\n", + "3. Monitor performance in your environment\n", + "4. Consider gradual rollout strategy\n", + "5. Evaluate impact on applications using HNSW-specific features\n" + ] + } + ], + "source": [ + "print(\"🧹 Cleaning up demonstration indices...\")\n", + "\n", + "# Clean up HNSW index\n", + "try:\n", + "    hnsw_index.delete(drop=True)\n", + "    print(\"✅ Deleted HNSW demonstration index\")\n", + "except Exception as e:\n", + "    print(f\"⚠️ Failed to delete HNSW index: {e}\")\n", + "\n", + "# Clean up SVS index\n", + "try:\n", + "    svs_index.delete(drop=True)\n", + "    print(\"✅ Deleted SVS-VAMANA demonstration index\")\n", + "except Exception as e:\n", + "    print(f\"⚠️ Failed to delete SVS index: {e}\")\n", + "\n", + "print(\n", + "    \"\\n🎉 HNSW to SVS-VAMANA migration demonstration complete!\",\n", + "    \"\\nNext steps:\",\n", + "    \"1. Apply learnings to your production HNSW indices\",\n", + "    \"2. Test with your actual query patterns and data\",\n", + "    \"3. Monitor performance in your environment\",\n", + "    \"4. Consider gradual rollout strategy\",\n", + "    \"5. 
Evaluate impact on applications using HNSW-specific features\",\n", + "    sep=\"\\n\"\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.12.6" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} diff --git a/python-recipes/vector-search/07_flat_to_svs_vamana_migration.ipynb b/python-recipes/vector-search/07_flat_to_svs_vamana_migration.ipynb new file mode 100644 index 00000000..e52879c9 --- /dev/null +++ b/python-recipes/vector-search/07_flat_to_svs_vamana_migration.ipynb @@ -0,0 +1,1192 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "![Redis](https://redis.io/wp-content/uploads/2024/04/Logotype.svg?auto=webp&quality=85,75&width=120)\n", + "# Migrating from FLAT to SVS-VAMANA\n", + "\n", + "## Let's Begin!\n", + "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/redis-developer/redis-ai-resources/blob/main/python-recipes/vector-search/07_flat_to_svs_vamana_migration.ipynb)\n", + "\n", + "This notebook demonstrates how to migrate existing FLAT vector indices to SVS-VAMANA for improved memory efficiency and cost savings.\n", + "\n", + "## What You'll Learn\n", + "\n", + "- How to assess your current FLAT index for migration\n", + "- Step-by-step migration from FLAT to SVS-VAMANA\n", + "- Memory usage comparison and cost analysis\n", + "- Search quality validation\n", + "- Performance benchmarking\n", + "- Migration decision framework\n", + "\n", + "## Prerequisites\n", + "\n", + "- Redis Stack 8.2.0+ with RediSearch 2.8.10+\n", + "- Existing vector index with substantial data (1000+ documents recommended)\n", + "- Vector embeddings (768 
dimensions using sentence-transformers/all-mpnet-base-v2)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 📦 Installation & Setup\n", + "\n", + "This notebook requires **sentence-transformers** for generating embeddings and **Redis Stack** running in Docker.\n", + "\n", + "**Requirements:**\n", + "- Redis Stack 8.2.0+ with RediSearch 2.8.10+\n", + "- sentence-transformers (for generating embeddings)\n", + "- numpy (for vector operations)\n", + "- redisvl (should be available in your environment)\n", + "\n", + "**🐳 Docker Setup (Required):**\n", + "\n", + "Before running this notebook, make sure Redis Stack is running in Docker:\n", + "\n", + "```bash\n", + "# Start Redis Stack with Docker\n", + "docker run -d --name redis-stack -p 6379:6379 -p 8001:8001 redis/redis-stack:latest\n", + "```\n", + "\n", + "Or if you prefer using docker-compose, create a `docker-compose.yml` file:\n", + "\n", + "```yaml\n", + "version: '3.8'\n", + "services:\n", + " redis:\n", + " image: redis/redis-stack:latest\n", + " ports:\n", + " - \"6379:6379\"\n", + " - \"8001:8001\"\n", + "```\n", + "\n", + "Then run: `docker-compose up -d`\n", + "\n", + "**📚 Python Dependencies Installation:**\n", + "\n", + "Install the required Python packages:\n", + "\n", + "```bash\n", + "# Install core dependencies\n", + "pip install redisvl numpy sentence-transformers\n", + "\n", + "# Or pin minimum versions (quoted so the shell does not treat > as a redirect)\n", + "pip install 'redisvl>=0.2.0' 'numpy>=1.21.0' 'sentence-transformers>=2.2.0'\n", + "```\n", + "\n", + "**For Google Colab users, run this cell:**\n", + "\n", + "```python\n", + "!pip install redisvl sentence-transformers numpy\n", + "```\n", + "\n", + "**For Conda users:**\n", + "\n", + "```bash\n", + "conda install numpy\n", + "pip install redisvl sentence-transformers\n", + "```" + ] + }, + { + "cell_type": "code", + "execution_count": 35, + "metadata": {}, + "outputs": [], + "source": [ + "# Setup redis-vl environment\n", + "import 
os\n", + "import sys\n", + "import subprocess\n", + "# Required imports from redis-vl\n", + "import numpy as np\n", + "import time\n", + "from redisvl.index import SearchIndex\n", + "from redisvl.query import VectorQuery\n", + "from redisvl.redis.utils import array_to_buffer, buffer_to_array\n", + "from redisvl.utils import CompressionAdvisor\n", + "from redisvl.redis.connection import supports_svs\n", + "import redis\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Step 1: Verify SVS-VAMANA Support\n", + "\n", + "First, let's ensure your Redis environment supports SVS-VAMANA." + ] + }, + { + "cell_type": "code", + "execution_count": 36, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "โœ… Redis connection successful\n", + "โœ… SVS-VAMANA supported\n", + " Ready for migration!\n" + ] + } + ], + "source": [ + "# Check Redis connection and SVS support\n", + "REDIS_URL = \"redis://localhost:6379\"\n", + "\n", + "try:\n", + " client = redis.Redis.from_url(REDIS_URL)\n", + " client.ping()\n", + " print(\"โœ… Redis connection successful\")\n", + " \n", + " if supports_svs(client):\n", + " print(\"โœ… SVS-VAMANA supported\")\n", + " print(\" Ready for migration!\")\n", + " else:\n", + " print(\"โŒ SVS-VAMANA not supported\")\n", + " print(\" Requires Redis >= 8.2.0 with RediSearch >= 2.8.10\")\n", + " print(\" Please upgrade Redis Stack before proceeding\")\n", + " \n", + "except Exception as e:\n", + " print(f\"โŒ Redis connection failed: {e}\")\n", + " print(\" Please ensure Redis is running and accessible\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Step 2: Assess Your Current Index\n", + "\n", + "For this demonstration, we'll create a sample FLAT index. In practice, you would analyze your existing index." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 37, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "📥 Loading sample movie data...\n", + "Loaded 20 movie records\n", + "Sample movie: Explosive Pursuit - A daring cop chases a notorious criminal across the city in a high-stakes game of cat and mouse.\n" + ] + } + ], + "source": [ + "# Load sample data bundled with redis-ai-resources (local file, no download)\n", + "print(\"📥 Loading sample movie data...\")\n", + "import os\n", + "import json\n", + "\n", + "# Load the movies dataset\n", + "url = \"resources/movies.json\"\n", + "with open(\"resources/movies.json\", \"r\") as f:\n", + " movies_data = json.load(f)\n", + "\n", + "print(f\"Loaded {len(movies_data)} movie records\")\n", + "print(f\"Sample movie: {movies_data[0]['title']} - {movies_data[0]['description']}\")" + ] + }, + { + "cell_type": "code", + "execution_count": 38, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "📊 Migration Assessment\n", + "Vector dimensions: 768 (sentence-transformers/all-mpnet-base-v2)\n", + "Dataset size: 20 movie documents\n", + "Data includes: title, genre, rating, description\n" + ] + } + ], + "source": [ + "# Configuration for demonstration \n", + "dims = 768 # sentence-transformers/all-mpnet-base-v2 - 768 dims\n", + "\n", + "num_docs = len(movies_data) # Use actual dataset size\n", + "\n", + "print(\n", + " \"📊 Migration Assessment\",\n", + " f\"Vector dimensions: {dims} (sentence-transformers/all-mpnet-base-v2)\",\n", + " f\"Dataset size: {num_docs} movie documents\",\n", + " \"Data includes: title, genre, rating, description\",\n", + " sep=\"\\n\"\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "---\n", + "Next, let's configure a sample FLAT index. Note the `algorithm`, `dims`, and `datatype` values in the embedding field's attrs." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 39, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Creating sample FLAT index...\n", + "โœ… Created FLAT index: migration_demo_flat\n" + ] + } + ], + "source": [ + "flat_schema = {\n", + " \"index\": {\n", + " \"name\": \"migration_demo_flat\",\n", + " \"prefix\": \"demo:flat:\",\n", + " },\n", + " \"fields\": [\n", + " {\"name\": \"movie_id\", \"type\": \"tag\"},\n", + " {\"name\": \"title\", \"type\": \"text\"},\n", + " {\"name\": \"genre\", \"type\": \"tag\"},\n", + " {\"name\": \"rating\", \"type\": \"numeric\"},\n", + " {\"name\": \"description\", \"type\": \"text\"},\n", + " {\n", + " \"name\": \"embedding\",\n", + " \"type\": \"vector\",\n", + " \"attrs\": {\n", + " \"dims\": dims,\n", + " \"algorithm\": \"flat\",\n", + " \"datatype\": \"float32\",\n", + " \"distance_metric\": \"cosine\"\n", + " }\n", + " }\n", + " ]\n", + "}\n", + "\n", + "# Create and populate FLAT index\n", + "print(\"Creating sample FLAT index...\")\n", + "flat_index = SearchIndex.from_dict(flat_schema, redis_url=REDIS_URL)\n", + "flat_index.create(overwrite=True)\n", + "print(f\"โœ… Created FLAT index: {flat_index.name}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "---\n", + "Generate embeddings for movie descriptions\n" + ] + }, + { + "cell_type": "code", + "execution_count": 40, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "๐Ÿ”„ Generating embeddings for movie descriptions...\n", + "๐Ÿ“ฆ Loading sentence transformer model...\n", + "14:45:27 sentence_transformers.SentenceTransformer INFO Use pytorch device_name: mps\n", + "14:45:27 sentence_transformers.SentenceTransformer INFO Load pretrained SentenceTransformer: sentence-transformers/all-mpnet-base-v2\n", + "โœ… Loaded embedding model with 768 dimensions\n" + ] + }, + { + "data": { + "application/vnd.jupyter.widget-view+json": { + "model_id": 
"0e06f2f860ec443e802a3fbf3961487c", + "version_major": 2, + "version_minor": 0 + }, + "text/plain": [ + "Batches: 0%| | 0/1 [00:00 0:\n", + " for i in range(0, len(svs_data), batch_size):\n", + " batch = svs_data[i:i+batch_size]\n", + " svs_index.load(batch)\n", + " print(f\" Migrated {min(i+batch_size, len(svs_data))}/{len(svs_data)} documents\")\n", + "\n", + " # Wait for indexing to complete\n", + " print(\"Waiting for indexing to complete...\")\n", + " time.sleep(5)\n", + "\n", + " svs_info = svs_index.info()\n", + " print(f\"\\nโœ… Migration complete! SVS index has {svs_info['num_docs']} documents\")\n", + "else:\n", + " print(\"โš ๏ธ No data to migrate. Make sure the FLAT index was populated first.\")\n", + " print(\" Run the previous cells to load data into the FLAT index.\")\n", + " svs_info = svs_index.info()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Step 6: Compare Memory Usage\n", + "\n", + "Let's analyze the memory savings achieved through compression. This is just an example on the small sample data. Use a larger dataset before deciding." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 47, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "๐Ÿ“Š Memory Usage Comparison\n", + "========================================\n", + "Original FLAT index: 3.02 MB\n", + "SVS-VAMANA index: 3.02 MB\n", + "\n", + "๐Ÿ’ฐ Memory savings: -0.0%\n", + "Absolute reduction: -0.00 MB\n", + "\n", + "๐Ÿ’ต Cost Impact Analysis:\n", + "Monthly cost reduction: $-0.00\n", + "Annual cost reduction: $-0.00\n" + ] + } + ], + "source": [ + "# Helper function to extract memory info\n", + "def get_memory_mb(index_info):\n", + " \"\"\"Extract memory usage in MB from index info\"\"\"\n", + " memory = index_info.get('vector_index_sz_mb', 0)\n", + " if isinstance(memory, str):\n", + " try:\n", + " return float(memory)\n", + " except ValueError:\n", + " return 0.0\n", + " return float(memory)\n", + "\n", + "# Get memory usage\n", + "flat_memory = get_memory_mb(flat_info)\n", + "svs_memory = get_memory_mb(svs_info)\n", + "\n", + "print(\n", + " \"๐Ÿ“Š Memory Usage Comparison\",\n", + " \"=\" * 40,\n", + " f\"Original FLAT index: {flat_memory:.2f} MB\",\n", + " f\"SVS-VAMANA index: {svs_memory:.2f} MB\",\n", + " \"\",\n", + " sep=\"\\n\"\n", + ")\n", + "\n", + "if flat_memory > 0:\n", + " if svs_memory > 0:\n", + " savings = ((flat_memory - svs_memory) / flat_memory) * 100\n", + " print(\n", + " f\"๐Ÿ’ฐ Memory savings: {savings:.1f}%\",\n", + " f\"Absolute reduction: {flat_memory - svs_memory:.2f} MB\",\n", + " sep=\"\\n\"\n", + " )\n", + " else:\n", + " print(\"โณ SVS index still indexing - memory comparison pending\")\n", + " \n", + " # Cost analysis\n", + " print(\"\\n๐Ÿ’ต Cost Impact Analysis:\")\n", + " cost_per_gb_hour = 0.10 # Example cloud pricing\n", + " hours_per_month = 24 * 30\n", + " \n", + " flat_monthly_cost = (flat_memory / 1024) * cost_per_gb_hour * hours_per_month\n", + " if svs_memory > 0:\n", + " svs_monthly_cost = (svs_memory / 1024) * cost_per_gb_hour * 
hours_per_month\n", + " monthly_savings = flat_monthly_cost - svs_monthly_cost\n", + " print(\n", + " f\"Monthly cost reduction: ${monthly_savings:.2f}\",\n", + " f\"Annual cost reduction: ${monthly_savings * 12:.2f}\",\n", + " sep=\"\\n\"\n", + " )\n", + " else:\n", + " print(\n", + " f\"Current monthly cost: ${flat_monthly_cost:.2f}\",\n", + " \"Projected savings: Available after indexing completes\",\n", + " sep=\"\\n\"\n", + " )\n", + "else:\n", + " print(\"โš ๏ธ Memory information not available\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Step 7: Validate Search Quality\n", + "\n", + "Test that the compressed index maintains good search quality." + ] + }, + { + "cell_type": "code", + "execution_count": 48, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "๐Ÿ” Validating search quality...\n", + "Generated 5 test queries\n", + "\n", + "Testing original FLAT index...\n", + "FLAT search time: 0.012s (0.002s per query)\n", + "\n", + "Testing SVS-VAMANA index...\n", + "SVS search time: 0.017s (0.003s per query)\n", + "\n", + "๐Ÿ“ˆ Average recall@10: 1.000 (100.0%)\n", + "โœ… Excellent search quality maintained\n" + ] + } + ], + "source": [ + "print(\"๐Ÿ” Validating search quality...\")\n", + "\n", + "# Create test queries\n", + "num_test_queries = 5\n", + "test_queries = []\n", + "\n", + "for i in range(num_test_queries):\n", + " # Generate normalized test vector\n", + " query_vec = np.random.random(dims).astype(np.float32)\n", + " query_vec = query_vec / np.linalg.norm(query_vec)\n", + " test_queries.append(query_vec)\n", + "\n", + "print(f\"Generated {num_test_queries} test queries\")\n", + "\n", + "# Test FLAT index (ground truth)\n", + "print(\"\\nTesting original FLAT index...\")\n", + "flat_results = []\n", + "flat_start = time.time()\n", + "\n", + "for query_vec in test_queries:\n", + " query = VectorQuery(\n", + " vector=query_vec,\n", + " 
vector_field_name=\"embedding\",\n", + " return_fields=[\"movie_id\", \"title\", \"genre\"],\n", + " dtype=\"float32\",\n", + " num_results=10\n", + " )\n", + " results = flat_index.query(query)\n", + " flat_results.append([doc[\"movie_id\"] for doc in results])\n", + "\n", + "flat_time = time.time() - flat_start\n", + "print(f\"FLAT search time: {flat_time:.3f}s ({flat_time/num_test_queries:.3f}s per query)\")\n", + "\n", + "# Test SVS-VAMANA index\n", + "print(\"\\nTesting SVS-VAMANA index...\")\n", + "svs_results = []\n", + "svs_start = time.time()\n", + "\n", + "for i, query_vec in enumerate(test_queries):\n", + " # Adjust query vector for SVS index (handle dimensionality reduction)\n", + " if target_dims < dims:\n", + " svs_query_vec = query_vec[:target_dims]\n", + " else:\n", + " svs_query_vec = query_vec\n", + " \n", + " if target_dtype == 'float16':\n", + " svs_query_vec = svs_query_vec.astype(np.float16)\n", + " \n", + " query = VectorQuery(\n", + " vector=svs_query_vec,\n", + " vector_field_name=\"embedding\",\n", + " return_fields=[\"movie_id\", \"title\", \"genre\"],\n", + " dtype=target_dtype,\n", + " num_results=10\n", + " )\n", + " \n", + " try:\n", + " results = svs_index.query(query)\n", + " svs_results.append([doc[\"movie_id\"] for doc in results])\n", + " except Exception as e:\n", + " print(f\"Query {i+1} failed: {e}\")\n", + " svs_results.append([])\n", + "\n", + "svs_time = time.time() - svs_start\n", + "print(f\"SVS search time: {svs_time:.3f}s ({svs_time/num_test_queries:.3f}s per query)\")\n", + "\n", + "# Calculate recall if we have results\n", + "if svs_results and any(svs_results):\n", + " recalls = []\n", + " for flat_res, svs_res in zip(flat_results, svs_results):\n", + " if flat_res and svs_res:\n", + " intersection = set(flat_res).intersection(set(svs_res))\n", + " recall = len(intersection) / len(flat_res)\n", + " recalls.append(recall)\n", + " \n", + " if recalls:\n", + " avg_recall = np.mean(recalls)\n", + " print(f\"\\n๐Ÿ“ˆ 
Average recall@10: {avg_recall:.3f} ({avg_recall*100:.1f}%)\")\n", + " \n", + " if avg_recall >= 0.9:\n", + " print(\"✅ Excellent search quality maintained\")\n", + " elif avg_recall >= 0.8:\n", + " print(\"✅ Good search quality maintained\")\n", + " else:\n", + " print(\"⚠️ Search quality may be impacted - consider adjusting compression\")\n", + "else:\n", + " print(\"⚠️ SVS index may still be indexing - search quality test pending\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Step 8: Migration Decision Framework\n", + "\n", + "Based on the results, let's determine if migration is recommended." + ] + }, + { + "cell_type": "code", + "execution_count": 49, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "🎯 Migration Analysis & Recommendation\n", + "==================================================\n", + "Dataset: 20 documents, 768-dimensional vectors\n", + "Compression: LVQ4\n", + "Datatype: float32 → float32\n", + "\n", + "Memory savings: -0.0% (Modest)\n", + "Search quality: 100.0% recall (Acceptable)\n", + "Performance: 1.4x vs original (Acceptable)\n", + "\n", + "🏆 RECOMMENDATION:\n", + "❌ MIGRATION NOT RECOMMENDED\n", + " • Insufficient benefits for current dataset\n", + " • Consider larger dataset or different compression\n", + " • SVS-VAMANA works best with high-dimensional data\n" + ] + } + ], + "source": [ + "print(\"🎯 Migration Analysis & Recommendation\")\n", + "print(\"=\" * 50)\n", + "\n", + "# Fallback configuration if not defined (for CI/CD compatibility)\n", + "if 'selected_config' not in locals():\n", + " from redisvl.utils import CompressionAdvisor\n", + " selected_config = CompressionAdvisor.recommend(dims=dims, priority=\"memory\")\n", + "\n", + "# Summarize configuration\n", + "print(f\"Dataset: {num_docs} documents, {dims}-dimensional vectors\")\n", + "print(f\"Compression: {selected_config.get('compression', 'None')}\")\n", + 
"print(f\"Datatype: float32 → {selected_config['datatype']}\")\n", + "if 'reduce' in selected_config:\n", + " reduction = ((dims - selected_config['reduce']) / dims) * 100\n", + " print(f\"Dimensions: {dims} → {selected_config['reduce']} ({reduction:.1f}% reduction)\")\n", + "print()\n", + "\n", + "# Decision criteria\n", + "memory_savings_significant = False\n", + "search_quality_acceptable = True\n", + "performance_acceptable = True\n", + "\n", + "if flat_memory > 0 and svs_memory > 0:\n", + " savings_pct = ((flat_memory - svs_memory) / flat_memory) * 100\n", + " memory_savings_significant = savings_pct > 25 # 25%+ savings considered significant\n", + " print(f\"Memory savings: {savings_pct:.1f}% ({'Significant' if memory_savings_significant else 'Modest'})\")\n", + "else:\n", + " print(\"Memory savings: Pending (SVS index still indexing)\")\n", + "\n", + "if 'recalls' in locals() and recalls:\n", + " avg_recall = np.mean(recalls)\n", + " search_quality_acceptable = avg_recall >= 0.8 # 80%+ recall considered acceptable\n", + " # avg_recall is a fraction in [0, 1]; scale to a percentage for display\n", + " print(f\"Search quality: {avg_recall * 100:.1f}% recall ({'Acceptable' if search_quality_acceptable else 'Needs improvement'})\")\n", + "else:\n", + " print(\"Search quality: Pending validation\")\n", + "\n", + "if 'flat_time' in locals() and 'svs_time' in locals():\n", + " performance_ratio = svs_time / flat_time if flat_time > 0 else 1\n", + " performance_acceptable = performance_ratio <= 2.0 # Allow up to 2x slower\n", + " print(f\"Performance: {performance_ratio:.1f}x vs original ({'Acceptable' if performance_acceptable else 'Slower than expected'})\")\n", + "else:\n", + " print(\"Performance: Pending comparison\")\n", + "\n", + "\n", + "# Final recommendation\n", + "print(\"\\n🏆 RECOMMENDATION:\")\n", + "if memory_savings_significant and search_quality_acceptable and performance_acceptable:\n", + " print(\"✅ MIGRATE TO SVS-VAMANA\")\n", + " print(\" • Significant memory 
savings achieved\")\n", + " print(\" • Search quality 
maintained\")\n", + " print(\" โ€ข Performance impact acceptable\")\n", + " print(\" โ€ข Cost reduction benefits clear\")\n", + "elif memory_savings_significant and search_quality_acceptable:\n", + " print(\"โš ๏ธ CONSIDER MIGRATION WITH MONITORING\")\n", + " print(\" โ€ข Good memory savings and search quality\")\n", + " print(\" โ€ข Monitor performance in production\")\n", + " print(\" โ€ข Consider gradual rollout\")\n", + "elif memory_savings_significant:\n", + " print(\"โš ๏ธ MIGRATION NEEDS TUNING\")\n", + " print(\" โ€ข Memory savings achieved\")\n", + " print(\" โ€ข Search quality or performance needs improvement\")\n", + " print(\" โ€ข Try different compression settings\")\n", + "else:\n", + " print(\"โŒ MIGRATION NOT RECOMMENDED\")\n", + " print(\" โ€ข Insufficient benefits for current dataset\")\n", + " print(\" โ€ข Consider larger dataset or different compression\")\n", + " print(\" โ€ข SVS-VAMANA works best with high-dimensional data\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Step 9: Production Migration Checklist\n", + "\n", + "If migration is recommended, follow this checklist for production deployment." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 50, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "๐Ÿ“‹ Production Migration Checklist\n", + "========================================\n", + "\n", + "PRE-MIGRATION:\n", + "โ–ก Backup existing index data\n", + "โ–ก Test migration on staging environment\n", + "โ–ก Validate search quality with real queries\n", + "โ–ก Measure baseline performance metrics\n", + "โ–ก Plan rollback strategy\n", + "\n", + "MIGRATION:\n", + "โ–ก Create SVS-VAMANA index with tested configuration\n", + "โ–ก Migrate data in batches during low-traffic periods\n", + "โ–ก Monitor memory usage and indexing progress\n", + "โ–ก Validate data integrity after migration\n", + "โ–ก Test search functionality thoroughly\n", + "\n", + "POST-MIGRATION:\n", + "โ–ก Monitor search performance and quality\n", + "โ–ก Track memory usage and cost savings\n", + "โ–ก Update application configuration\n", + "โ–ก Document new index settings\n", + "โ–ก Clean up old index after validation period\n", + "\n", + "๐Ÿ’ก TIPS:\n", + "โ€ข Start with a subset of data for initial validation\n", + "โ€ข Use blue-green deployment for zero-downtime migration\n", + "โ€ข Monitor for 24-48 hours before removing old index\n", + "โ€ข Keep compression settings documented for future reference\n" + ] + } + ], + "source": [ + "print(\n", + " \"๐Ÿ“‹ Production Migration Checklist\",\n", + " \"=\" * 40,\n", + " \"\\nPRE-MIGRATION:\",\n", + " \"โ–ก Backup existing index data\",\n", + " \"โ–ก Test migration on staging environment\",\n", + " \"โ–ก Validate search quality with real queries\",\n", + " \"โ–ก Measure baseline performance metrics\",\n", + " \"โ–ก Plan rollback strategy\",\n", + " \"\\nMIGRATION:\",\n", + " \"โ–ก Create SVS-VAMANA index with tested configuration\",\n", + " \"โ–ก Migrate data in batches during low-traffic periods\",\n", + " \"โ–ก Monitor memory usage and indexing progress\",\n", + " \"โ–ก Validate data integrity 
after migration\",\n", + " \"โ–ก Test search functionality thoroughly\",\n", + " \"\\nPOST-MIGRATION:\",\n", + " \"โ–ก Monitor search performance and quality\",\n", + " \"โ–ก Track memory usage and cost savings\",\n", + " \"โ–ก Update application configuration\",\n", + " \"โ–ก Document new index settings\",\n", + " \"โ–ก Clean up old index after validation period\",\n", + " \"\\n๐Ÿ’ก TIPS:\",\n", + " \"โ€ข Start with a subset of data for initial validation\",\n", + " \"โ€ข Use blue-green deployment for zero-downtime migration\",\n", + " \"โ€ข Monitor for 24-48 hours before removing old index\",\n", + " \"โ€ข Keep compression settings documented for future reference\",\n", + " sep=\"\\n\"\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Step 10: Cleanup\n", + "\n", + "Clean up the demonstration indices." + ] + }, + { + "cell_type": "code", + "execution_count": 51, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "๐Ÿงน Cleaning up demonstration indices...\n", + "โœ… Deleted FLAT demonstration index\n", + "โœ… Deleted SVS-VAMANA demonstration index\n", + "\n", + "๐ŸŽ‰ Migration demonstration complete!\n", + "\n", + "Next steps:\n", + "1. Apply learnings to your production data\n", + "2. Test with your actual query patterns\n", + "3. Monitor performance in your environment\n", + "4. 
Consider gradual rollout strategy\n" + ] + } + ], + "source": [ + "print(\"๐Ÿงน Cleaning up demonstration indices...\")\n", + "\n", + "# Clean up FLAT index\n", + "try:\n", + " flat_index.delete(drop=True)\n", + " print(\"โœ… Deleted FLAT demonstration index\")\n", + "except Exception as e:\n", + " print(f\"โš ๏ธ Failed to delete FLAT index: {e}\")\n", + "\n", + "# Clean up SVS index\n", + "try:\n", + " svs_index.delete(drop=True)\n", + " print(\"โœ… Deleted SVS-VAMANA demonstration index\")\n", + "except Exception as e:\n", + " print(f\"โš ๏ธ Failed to delete SVS index: {e}\")\n", + "\n", + "print(\n", + " \"\\n๐ŸŽ‰ Migration demonstration complete!\",\n", + " \"\\nNext steps:\",\n", + " \"1. Apply learnings to your production data\",\n", + " \"2. Test with your actual query patterns\",\n", + " \"3. Monitor performance in your environment\",\n", + " \"4. Consider gradual rollout strategy\",\n", + " sep=\"\\n\"\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.12.6" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} diff --git a/python-recipes/vector-search/07_vector_algorithm_benchmark.ipynb b/python-recipes/vector-search/07_vector_algorithm_benchmark.ipynb new file mode 100644 index 00000000..9acb9c81 --- /dev/null +++ b/python-recipes/vector-search/07_vector_algorithm_benchmark.ipynb @@ -0,0 +1,959 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "![Redis](https://redis.io/wp-content/uploads/2024/04/Logotype.svg?auto=webp&quality=85,75&width=120)\n", + "# Vector Algorithm Benchmark: 
FLAT vs HNSW vs SVS-VAMANA\n", + "\n", + "## Let's Begin!\n", + "\"Open\n", + "\n", + "This notebook benchmarks FLAT, HNSW, and SVS-VAMANA vector search algorithms using **real data from Hugging Face** across different embedding dimensions.\n", + "\n", + "## What You'll Learn\n", + "\n", + "- **Memory usage comparison** across algorithms and dimensions\n", + "- **Index creation performance** with real text data\n", + "- **Query performance** and latency analysis\n", + "- **Search quality** with recall metrics on real embeddings\n", + "- **Algorithm selection guidance** based on your requirements\n", + "\n", + "## Benchmark Configuration\n", + "\n", + "- **Dataset**: SQuAD (Stanford Question Answering Dataset) from Hugging Face\n", + "- **Algorithms**: FLAT, HNSW, SVS-VAMANA\n", + "- **Dimensions**: 384, 768, 1536 (native sentence-transformer embeddings)\n", + "- **Dataset Size**: 1,000 documents per dimension\n", + "- **Query Set**: 50 real questions per configuration\n", + "- **Focus**: Real-world performance with actual text embeddings\n", + "\n", + "## Prerequisites\n", + "\n", + "- Redis Stack 8.2.0+ with RediSearch 2.8.10+\n", + "- At least 4GB RAM for comfortable benchmarking\n", + "- Internet connection for downloading SQuAD dataset\n", + "- ~30-45 minutes runtime for complete benchmark" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## ๐Ÿ“ฆ Installation & Setup\n", + "\n", + "**๐Ÿณ Docker Setup (Required):**\n", + "\n", + "Before running this notebook, make sure Redis Stack is running:\n", + "\n", + "```bash\n", + "# Start Redis Stack with Docker\n", + "docker run -d --name redis-stack -p 6379:6379 -p 8001:8001 redis/redis-stack:latest\n", + "```" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Install dependencies if needed\n", + "import sys\n", + "import subprocess\n", + "\n", + "def install_if_missing(package):\n", + " try:\n", + " 
__import__(package.split('[')[0].replace('-', '_')) # Strip [extras]; import names use underscores, not hyphens\n", + " except ImportError:\n", + " print(f\"Installing {package}...\")\n", + " subprocess.check_call([sys.executable, \"-m\", \"pip\", \"install\", package])\n", + "\n", + "# Check and install required packages\n", + "install_if_missing(\"redisvl\")\n", + "install_if_missing(\"matplotlib\")\n", + "install_if_missing(\"seaborn\")\n", + "install_if_missing(\"pandas\")\n", + "install_if_missing(\"datasets\")\n", + "install_if_missing(\"sentence-transformers\")\n", + "\n", + "print(\"✅ All dependencies are ready!\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Import required libraries\n", + "import os\n", + "import json\n", + "import time\n", + "import psutil\n", + "import numpy as np\n", + "import pandas as pd\n", + "import matplotlib.pyplot as plt\n", + "import seaborn as sns\n", + "from typing import Dict, List, Tuple, Any\n", + "from dataclasses import dataclass\n", + "from collections import defaultdict\n", + "\n", + "# Redis and RedisVL imports\n", + "import redis\n", + "from redisvl.index import SearchIndex\n", + "from redisvl.query import VectorQuery\n", + "from redisvl.redis.utils import array_to_buffer, buffer_to_array\n", + "from redisvl.utils import CompressionAdvisor\n", + "from redisvl.redis.connection import supports_svs\n", + "\n", + "# Configuration\n", + "REDIS_URL = \"redis://localhost:6379\"\n", + "np.random.seed(42) # For reproducible results\n", + "\n", + "# Set up plotting style\n", + "plt.style.use('default')\n", + "sns.set_palette(\"husl\")\n", + "\n", + "print(\"📚 Libraries imported successfully!\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Benchmark configuration\n", + "@dataclass\n", + "class BenchmarkConfig:\n", + " dimensions: List[int]\n", + " algorithms: List[str]\n", + " docs_per_dimension: int\n", + " query_count: 
int\n", + " \n", + "# Initialize benchmark configuration\n", + "config = BenchmarkConfig(\n", + " dimensions=[384, 768, 1536],\n", + " algorithms=['flat', 'hnsw', 'svs-vamana'],\n", + " docs_per_dimension=1000,\n", + " query_count=50\n", + ")\n", + "\n", + "print(\n", + " \"๐Ÿ”ง Benchmark Configuration:\",\n", + " f\"Dimensions: {config.dimensions}\",\n", + " f\"Algorithms: {config.algorithms}\",\n", + " f\"Documents per dimension: {config.docs_per_dimension:,}\",\n", + " f\"Test queries: {config.query_count}\",\n", + " f\"Total documents: {len(config.dimensions) * config.docs_per_dimension:,}\",\n", + " f\"Dataset: SQuAD from Hugging Face\",\n", + " sep=\"\\n\"\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Step 1: Verify Redis and SVS Support" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Test Redis connection and capabilities\n", + "try:\n", + " client = redis.Redis.from_url(REDIS_URL)\n", + " client.ping()\n", + " \n", + " redis_info = client.info()\n", + " redis_version = redis_info['redis_version']\n", + " \n", + " svs_supported = supports_svs(client)\n", + " \n", + " print(\n", + " \"โœ… Redis connection successful\",\n", + " f\"๐Ÿ“Š Redis version: {redis_version}\",\n", + " f\"๐Ÿ”ง SVS-VAMANA supported: {'โœ… Yes' if svs_supported else 'โŒ No'}\",\n", + " sep=\"\\n\"\n", + " )\n", + " \n", + " if not svs_supported:\n", + " print(\"โš ๏ธ SVS-VAMANA not supported. 
Benchmark will skip SVS tests.\")\n", + " config.algorithms = ['flat', 'hnsw'] # Remove SVS from tests\n", + " \n", + "except Exception as e:\n", + " print(f\"โŒ Redis connection failed: {e}\")\n", + " print(\"Please ensure Redis Stack is running on localhost:6379\")\n", + " raise" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Step 2: Load Real Dataset from Hugging Face\n", + "\n", + "Load the SQuAD dataset and generate real embeddings using sentence-transformers." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "def load_squad_dataset(num_docs: int) -> List[Dict[str, Any]]:\n", + " \"\"\"Load SQuAD dataset from Hugging Face\"\"\"\n", + " try:\n", + " from datasets import load_dataset\n", + " \n", + " print(\"๐Ÿ“ฅ Loading SQuAD dataset from Hugging Face...\")\n", + " \n", + " # Load SQuAD dataset\n", + " dataset = load_dataset(\"squad\", split=\"train\")\n", + " \n", + " # Take a subset for our benchmark\n", + " dataset = dataset.select(range(min(num_docs, len(dataset))))\n", + " \n", + " # Convert to our format\n", + " documents = []\n", + " for i, item in enumerate(dataset):\n", + " # Combine question and context for richer text\n", + " text = f\"{item['question']} {item['context']}\"\n", + " \n", + " documents.append({\n", + " 'doc_id': f'squad_{i:06d}',\n", + " 'title': item['title'],\n", + " 'question': item['question'],\n", + " 'context': item['context'][:500], # Truncate long contexts\n", + " 'text': text,\n", + " 'category': 'qa', # All are Q&A documents\n", + " 'score': 1.0\n", + " })\n", + " \n", + " print(f\"โœ… Loaded {len(documents)} documents from SQuAD\")\n", + " return documents\n", + " \n", + " except ImportError:\n", + " print(\"โš ๏ธ datasets library not available, falling back to local data\")\n", + " return load_local_fallback_data(num_docs)\n", + " except Exception as e:\n", + " print(f\"โš ๏ธ Failed to load SQuAD dataset: {e}\")\n", + " 
print(\"Falling back to local data...\")\n", + " return load_local_fallback_data(num_docs)\n", + "\n", + "def load_local_fallback_data(num_docs: int) -> List[Dict[str, Any]]:\n", + " \"\"\"Fall back to local movie dataset if SQuAD is not available\"\"\"\n", + " try:\n", + " import json\n", + " with open('resources/movies.json', 'r') as f:\n", + " movies = json.load(f)\n", + " \n", + " # Expand the small movie dataset by duplicating with variations\n", + " documents = []\n", + " for i in range(num_docs):\n", + " movie = movies[i % len(movies)]\n", + " documents.append({\n", + " 'doc_id': f'movie_{i:06d}',\n", + " 'title': f\"{movie['title']} (Variant {i // len(movies) + 1})\",\n", + " 'question': f\"What is {movie['title']} about?\",\n", + " 'context': movie['description'],\n", + " 'text': f\"What is {movie['title']} about? {movie['description']}\",\n", + " 'category': movie['genre'],\n", + " 'score': movie['rating']\n", + " })\n", + " \n", + " print(f\"✅ Using local movie dataset: {len(documents)} documents\")\n", + " return documents\n", + " \n", + " except Exception as e:\n", + " print(f\"❌ Failed to load local data: {e}\")\n", + " raise" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "def generate_embeddings_for_texts(texts: List[str], dimensions: int) -> np.ndarray:\n", + " \"\"\"Generate embeddings for texts using sentence-transformers\"\"\"\n", + " try:\n", + " from sentence_transformers import SentenceTransformer\n", + " \n", + " # Choose model based on target dimensions\n", + " if dimensions == 384:\n", + " model_name = 'all-MiniLM-L6-v2'\n", + " elif dimensions == 768:\n", + " model_name = 'all-mpnet-base-v2'\n", + " elif dimensions == 1536:\n", + " # For 1536D, use gtr-t5-xl; its native output is 768D, so the vectors are padded up to 1536D below\n", + " model_name = 'sentence-transformers/gtr-t5-xl'\n", + " else:\n", + " model_name = 'all-MiniLM-L6-v2' # Default\n", + " \n", + " print(f\"🤖 Generating {dimensions}D 
embeddings using {model_name}...\")\n", + " \n", + " model = SentenceTransformer(model_name)\n", + " embeddings = model.encode(texts, convert_to_numpy=True, show_progress_bar=True)\n", + " \n", + " # Handle dimension adjustment\n", + " current_dims = embeddings.shape[1]\n", + " if current_dims < dimensions:\n", + " # Pad with small random values (better than zeros)\n", + " padding_size = dimensions - current_dims\n", + " padding = np.random.normal(0, 0.01, (embeddings.shape[0], padding_size))\n", + " embeddings = np.concatenate([embeddings, padding], axis=1)\n", + " elif current_dims > dimensions:\n", + " # Truncate\n", + " embeddings = embeddings[:, :dimensions]\n", + " \n", + " # Normalize embeddings\n", + " norms = np.linalg.norm(embeddings, axis=1, keepdims=True)\n", + " embeddings = embeddings / norms\n", + " \n", + " print(f\"✅ Generated embeddings: {embeddings.shape}\")\n", + " return embeddings.astype(np.float32)\n", + " \n", + " except ImportError:\n", + " print(f\"⚠️ sentence-transformers not available, using synthetic embeddings\")\n", + " return generate_synthetic_embeddings(len(texts), dimensions)\n", + " except Exception as e:\n", + " print(f\"⚠️ Error generating embeddings: {e}\")\n", + " print(\"Falling back to synthetic embeddings...\")\n", + " return generate_synthetic_embeddings(len(texts), dimensions)\n", + "\n", + "def generate_synthetic_embeddings(num_docs: int, dimensions: int) -> np.ndarray:\n", + " \"\"\"Generate synthetic embeddings as fallback\"\"\"\n", + " print(f\"🔄 Generating {num_docs} synthetic {dimensions}D embeddings...\")\n", + " \n", + " # Create base random vectors\n", + " embeddings = np.random.normal(0, 1, (num_docs, dimensions)).astype(np.float32)\n", + " \n", + " # Add some clustering structure\n", + " cluster_size = num_docs // 3\n", + " embeddings[:cluster_size, :min(50, dimensions)] += 0.5\n", + " embeddings[cluster_size:2*cluster_size, min(50, dimensions):min(100, dimensions)] += 0.5\n", + " \n", + " # 
Normalize vectors\n", + " norms = np.linalg.norm(embeddings, axis=1, keepdims=True)\n", + " embeddings = embeddings / norms\n", + " \n", + " return embeddings\n", + "\n", + "# Load real dataset and generate embeddings\n", + "print(\"🔄 Loading real dataset and generating embeddings...\")\n", + "\n", + "# Load the base dataset once\n", + "raw_documents = load_squad_dataset(config.docs_per_dimension)\n", + "texts = [doc['text'] for doc in raw_documents]\n", + "\n", + "# Generate separate query texts (use questions from SQuAD)\n", + "query_texts = [doc['question'] for doc in raw_documents[:config.query_count]]\n", + "\n", + "benchmark_data = {}\n", + "query_data = {}\n", + "\n", + "for dim in config.dimensions:\n", + " print(f\"\\n📊 Processing {dim}D embeddings...\")\n", + " \n", + " # Generate embeddings for documents\n", + " embeddings = generate_embeddings_for_texts(texts, dim)\n", + " \n", + " # Generate embeddings for queries\n", + " query_embeddings = generate_embeddings_for_texts(query_texts, dim)\n", + " \n", + " # Combine documents with embeddings\n", + " documents = []\n", + " for i, (doc, embedding) in enumerate(zip(raw_documents, embeddings)):\n", + " documents.append({\n", + " **doc,\n", + " 'embedding': array_to_buffer(embedding, dtype='float32')\n", + " })\n", + " \n", + " benchmark_data[dim] = documents\n", + " query_data[dim] = query_embeddings\n", + "\n", + "print(\n", + " f\"\\n✅ Generated benchmark data:\",\n", + " f\"Total documents: {sum(len(docs) for docs in benchmark_data.values()):,}\",\n", + " f\"Total queries: {sum(len(queries) for queries in query_data.values()):,}\",\n", + " f\"Dataset source: {'SQuAD (Hugging Face)' if 'squad_' in raw_documents[0]['doc_id'] else 'Local movies'}\",\n", + " sep=\"\\n\"\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Step 3: Index Creation Benchmark\n", + "\n", + "Measure index creation time and memory usage for each algorithm and dimension." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "def create_index_schema(algorithm: str, dimensions: int, prefix: str) -> Dict[str, Any]:\n", + " \"\"\"Create index schema for the specified algorithm\"\"\"\n", + " \n", + " base_schema = {\n", + " \"index\": {\n", + " \"name\": f\"benchmark_{algorithm}_{dimensions}d\",\n", + " \"prefix\": prefix,\n", + " },\n", + " \"fields\": [\n", + " {\"name\": \"doc_id\", \"type\": \"tag\"},\n", + " {\"name\": \"title\", \"type\": \"text\"},\n", + " {\"name\": \"category\", \"type\": \"tag\"},\n", + " {\"name\": \"score\", \"type\": \"numeric\"},\n", + " {\n", + " \"name\": \"embedding\",\n", + " \"type\": \"vector\",\n", + " \"attrs\": {\n", + " \"dims\": dimensions,\n", + " \"distance_metric\": \"cosine\",\n", + " \"datatype\": \"float32\"\n", + " }\n", + " }\n", + " ]\n", + " }\n", + " \n", + " # Algorithm-specific configurations\n", + " vector_field = base_schema[\"fields\"][-1][\"attrs\"]\n", + " \n", + " if algorithm == 'flat':\n", + " vector_field[\"algorithm\"] = \"flat\"\n", + " \n", + " elif algorithm == 'hnsw':\n", + " vector_field.update({\n", + " \"algorithm\": \"hnsw\",\n", + " \"m\": 16,\n", + " \"ef_construction\": 200,\n", + " \"ef_runtime\": 10\n", + " })\n", + " \n", + " elif algorithm == 'svs-vamana':\n", + " # Get compression recommendation\n", + " compression_config = CompressionAdvisor.recommend(dims=dimensions, priority=\"memory\")\n", + " \n", + " vector_field.update({\n", + " \"algorithm\": \"svs-vamana\",\n", + " \"datatype\": compression_config.get('datatype', 'float32')\n", + " })\n", + " \n", + " # Handle dimensionality reduction for high dimensions\n", + " if 'reduce' in compression_config:\n", + " vector_field[\"dims\"] = compression_config['reduce']\n", + " \n", + " return base_schema\n", + "\n", + "def benchmark_index_creation(algorithm: str, dimensions: int, documents: List[Dict]) -> Tuple[SearchIndex, float, float]:\n", + " 
\"\"\"Benchmark index creation and return index, build time, and memory usage\"\"\"\n", + " \n", + " prefix = f\"bench:{algorithm}:{dimensions}d:\"\n", + " \n", + " # Clean up any existing index\n", + " try:\n", + " client.execute_command('FT.DROPINDEX', f'benchmark_{algorithm}_{dimensions}d')\n", + " except:\n", + " pass\n", + " \n", + " # Create schema and index\n", + " schema = create_index_schema(algorithm, dimensions, prefix)\n", + " \n", + " start_time = time.time()\n", + " \n", + " # Create index\n", + " index = SearchIndex.from_dict(schema, redis_url=REDIS_URL)\n", + " index.create(overwrite=True)\n", + " \n", + " # Load data in batches\n", + " batch_size = 100\n", + " for i in range(0, len(documents), batch_size):\n", + " batch = documents[i:i+batch_size]\n", + " index.load(batch)\n", + " \n", + " # Wait for indexing to complete\n", + " if algorithm == 'hnsw':\n", + " time.sleep(3) # HNSW needs more time for graph construction\n", + " else:\n", + " time.sleep(1)\n", + " \n", + " build_time = time.time() - start_time\n", + " \n", + " # Get index info for memory usage\n", + " try:\n", + " index_info = index.info()\n", + " index_size_mb = float(index_info.get('vector_index_sz_mb', 0))\n", + " except:\n", + " index_size_mb = 0.0\n", + " \n", + " return index, build_time, index_size_mb\n", + "\n", + "# Run index creation benchmarks\n", + "print(\"๐Ÿ—๏ธ Running index creation benchmarks...\")\n", + "\n", + "creation_results = {}\n", + "indices = {}\n", + "\n", + "for dim in config.dimensions:\n", + " print(f\"\\n๐Ÿ“Š Benchmarking {dim}D embeddings:\")\n", + " \n", + " for algorithm in config.algorithms:\n", + " print(f\" Creating {algorithm.upper()} index...\")\n", + " \n", + " try:\n", + " index, build_time, index_size_mb = benchmark_index_creation(\n", + " algorithm, dim, benchmark_data[dim]\n", + " )\n", + " \n", + " creation_results[f\"{algorithm}_{dim}\"] = {\n", + " 'algorithm': algorithm,\n", + " 'dimensions': dim,\n", + " 'build_time_sec': 
build_time,\n", + " 'index_size_mb': index_size_mb,\n", + " 'num_docs': len(benchmark_data[dim])\n", + " }\n", + " \n", + " indices[f\"{algorithm}_{dim}\"] = index\n", + " \n", + " print(\n", + " f\" ✅ {algorithm.upper()}: {build_time:.2f}s, {index_size_mb:.2f}MB\"\n", + " )\n", + " \n", + " except Exception as e:\n", + " print(f\" ❌ {algorithm.upper()} failed: {e}\")\n", + " creation_results[f\"{algorithm}_{dim}\"] = None\n", + "\n", + "print(\"\\n✅ Index creation benchmarks complete!\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Step 4: Query Performance Benchmark\n", + "\n", + "Measure query latency and search quality for each algorithm." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "def calculate_recall(retrieved_ids: List[str], ground_truth_ids: List[str], k: int) -> float:\n", + " \"\"\"Calculate recall@k between retrieved and ground truth results\"\"\"\n", + " if not ground_truth_ids or not retrieved_ids:\n", + " return 0.0\n", + " \n", + " retrieved_set = set(retrieved_ids[:k])\n", + " ground_truth_set = set(ground_truth_ids[:k])\n", + " \n", + " if len(ground_truth_set) == 0:\n", + " return 0.0\n", + " \n", + " intersection = len(retrieved_set.intersection(ground_truth_set))\n", + " return intersection / len(ground_truth_set)\n", + "\n", + "def benchmark_query_performance(index: SearchIndex, query_vectors: np.ndarray, \n", + " algorithm: str, dimensions: int) -> Dict[str, float]:\n", + " \"\"\"Benchmark query performance and quality\"\"\"\n", + " \n", + " latencies = []\n", + " all_results = []\n", + " \n", + " # Get ground truth from FLAT index (if available)\n", + " ground_truth_results = []\n", + " flat_index_key = f\"flat_{dimensions}\"\n", + " \n", + " if flat_index_key in indices and algorithm != 'flat':\n", + " flat_index = indices[flat_index_key]\n", + " for query_vec in query_vectors:\n", + " query = VectorQuery(\n", + " 
vector=query_vec,\n", + " vector_field_name=\"embedding\",\n", + " return_fields=[\"doc_id\"],\n", + " dtype=\"float32\",\n", + " num_results=10\n", + " )\n", + " results = flat_index.query(query)\n", + " ground_truth_results.append([doc[\"doc_id\"] for doc in results])\n", + " \n", + " # Benchmark the target algorithm\n", + " for i, query_vec in enumerate(query_vectors):\n", + " # Adjust query vector for SVS if needed\n", + " if algorithm == 'svs-vamana':\n", + " compression_config = CompressionAdvisor.recommend(dims=dimensions, priority=\"memory\")\n", + " \n", + " if 'reduce' in compression_config:\n", + " target_dims = compression_config['reduce']\n", + " if target_dims < dimensions:\n", + " query_vec = query_vec[:target_dims]\n", + " \n", + " if compression_config.get('datatype') == 'float16':\n", + " query_vec = query_vec.astype(np.float16)\n", + " dtype = 'float16'\n", + " else:\n", + " dtype = 'float32'\n", + " else:\n", + " dtype = 'float32'\n", + " \n", + " # Execute query with timing\n", + " start_time = time.time()\n", + " \n", + " query = VectorQuery(\n", + " vector=query_vec,\n", + " vector_field_name=\"embedding\",\n", + " return_fields=[\"doc_id\", \"title\", \"category\"],\n", + " dtype=dtype,\n", + " num_results=10\n", + " )\n", + " \n", + " results = index.query(query)\n", + " latency = time.time() - start_time\n", + " \n", + " latencies.append(latency * 1000) # Convert to milliseconds\n", + " all_results.append([doc[\"doc_id\"] for doc in results])\n", + " \n", + " # Calculate metrics\n", + " avg_latency = np.mean(latencies)\n", + " \n", + " # Calculate recall if we have ground truth\n", + " if ground_truth_results and algorithm != 'flat':\n", + " recall_5_scores = []\n", + " recall_10_scores = []\n", + " \n", + " for retrieved, ground_truth in zip(all_results, ground_truth_results):\n", + " recall_5_scores.append(calculate_recall(retrieved, ground_truth, 5))\n", + " recall_10_scores.append(calculate_recall(retrieved, ground_truth, 10))\n", + " 
\n", + " recall_at_5 = np.mean(recall_5_scores)\n", + " recall_at_10 = np.mean(recall_10_scores)\n", + " else:\n", + " # FLAT is the ground truth (perfect recall); without a FLAT baseline, recall is reported as 0.0\n", + " recall_at_5 = 1.0 if algorithm == 'flat' else 0.0\n", + " recall_at_10 = 1.0 if algorithm == 'flat' else 0.0\n", + " \n", + " return {\n", + " 'avg_query_time_ms': avg_latency,\n", + " 'recall_at_5': recall_at_5,\n", + " 'recall_at_10': recall_at_10,\n", + " 'num_queries': len(query_vectors)\n", + " }\n", + "\n", + "# Run query performance benchmarks\n", + "print(\"🔍 Running query performance benchmarks...\")\n", + "\n", + "query_results = {}\n", + "\n", + "for dim in config.dimensions:\n", + " print(f\"\\n📊 Benchmarking {dim}D queries:\")\n", + " \n", + " for algorithm in config.algorithms:\n", + " index_key = f\"{algorithm}_{dim}\"\n", + " \n", + " if index_key in indices:\n", + " print(f\" Testing {algorithm.upper()} queries...\")\n", + " \n", + " try:\n", + " performance = benchmark_query_performance(\n", + " indices[index_key], \n", + " query_data[dim], \n", + " algorithm, \n", + " dim\n", + " )\n", + " \n", + " query_results[index_key] = performance\n", + " \n", + " print(\n", + " f\" ✅ {algorithm.upper()}: {performance['avg_query_time_ms']:.2f}ms avg, \"\n", + " f\"R@5: {performance['recall_at_5']:.3f}, R@10: {performance['recall_at_10']:.3f}\"\n", + " )\n", + " \n", + " except Exception as e:\n", + " print(f\" ❌ {algorithm.upper()} query failed: {e}\")\n", + " query_results[index_key] = None\n", + " else:\n", + " print(f\" ⏭️ Skipping {algorithm.upper()} (index creation failed)\")\n", + "\n", + "print(\"\\n✅ Query performance benchmarks complete!\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Step 5: Results Analysis and Visualization\n", + "\n", + "Analyze and visualize the benchmark results with real data." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Combine results into comprehensive dataset\n", + "def create_results_dataframe() -> pd.DataFrame:\n", + " \"\"\"Combine all benchmark results into a pandas DataFrame\"\"\"\n", + " \n", + " results = []\n", + " \n", + " for dim in config.dimensions:\n", + " for algorithm in config.algorithms:\n", + " key = f\"{algorithm}_{dim}\"\n", + " \n", + " if key in creation_results and creation_results[key] is not None:\n", + " creation_data = creation_results[key]\n", + " # A failed query run is stored as None, so fall back to an empty dict\n", + " query_data_item = query_results.get(key) or {}\n", + " \n", + " result = {\n", + " 'algorithm': algorithm,\n", + " 'dimensions': dim,\n", + " 'num_docs': creation_data['num_docs'],\n", + " 'build_time_sec': creation_data['build_time_sec'],\n", + " 'index_size_mb': creation_data['index_size_mb'],\n", + " 'avg_query_time_ms': query_data_item.get('avg_query_time_ms', 0),\n", + " 'recall_at_5': query_data_item.get('recall_at_5', 0),\n", + " 'recall_at_10': query_data_item.get('recall_at_10', 0)\n", + " }\n", + " \n", + " results.append(result)\n", + " \n", + " return pd.DataFrame(results)\n", + "\n", + "# Create results DataFrame\n", + "df_results = create_results_dataframe()\n", + "\n", + "print(\"📊 Real Data Benchmark Results Summary:\")\n", + "print(df_results.to_string(index=False, float_format='%.3f'))\n", + "\n", + "# Display key insights\n", + "if not df_results.empty:\n", + " print(f\"\\n🎯 Key Insights from Real Data:\")\n", + " \n", + " # Memory efficiency\n", + " best_memory = df_results.loc[df_results['index_size_mb'].idxmin()]\n", + " print(f\"🏆 Most memory efficient: {best_memory['algorithm'].upper()} at {best_memory['dimensions']}D ({best_memory['index_size_mb']:.2f}MB)\")\n", + " \n", + " # Query speed\n", + " best_speed = df_results.loc[df_results['avg_query_time_ms'].idxmin()]\n", + " print(f\"⚡ Fastest queries: {best_speed['algorithm'].upper()} at 
{best_speed['dimensions']}D ({best_speed['avg_query_time_ms']:.2f}ms)\")\n", + " \n", + " # Search quality\n", + " best_quality = df_results.loc[df_results['recall_at_10'].idxmax()]\n", + " print(f\"🎯 Best search quality: {best_quality['algorithm'].upper()} at {best_quality['dimensions']}D (R@10: {best_quality['recall_at_10']:.3f})\")\n", + " \n", + " # Dataset info\n", + " dataset_source = 'SQuAD (Hugging Face)' if 'squad_' in raw_documents[0]['doc_id'] else 'Local movies'\n", + " print(f\"\\n📚 Dataset: {dataset_source}\")\n", + " print(f\"📊 Total documents tested: {df_results['num_docs'].iloc[0]:,}\")\n", + " print(f\"🔍 Total queries per dimension: {config.query_count}\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Create visualizations for real data results\n", + "def create_real_data_visualizations(df: pd.DataFrame):\n", + " \"\"\"Create visualizations for real data benchmark results\"\"\"\n", + " \n", + " if df.empty:\n", + " print(\"⚠️ No results to visualize\")\n", + " return\n", + " \n", + " # Set up the plotting area\n", + " fig, axes = plt.subplots(2, 2, figsize=(15, 10))\n", + " fig.suptitle('Real Data Vector Algorithm Benchmark Results', fontsize=16, fontweight='bold')\n", + " \n", + " # 1. Memory Usage Comparison\n", + " ax1 = axes[0, 0]\n", + " pivot_memory = df.pivot(index='dimensions', columns='algorithm', values='index_size_mb')\n", + " pivot_memory.plot(kind='bar', ax=ax1, width=0.8)\n", + " ax1.set_title('Index Size by Algorithm (Real Data)')\n", + " ax1.set_xlabel('Dimensions')\n", + " ax1.set_ylabel('Index Size (MB)')\n", + " ax1.legend(title='Algorithm')\n", + " ax1.tick_params(axis='x', rotation=0)\n", + " \n", + " # 2. 
Query Performance\n", + " ax2 = axes[0, 1]\n", + " pivot_query = df.pivot(index='dimensions', columns='algorithm', values='avg_query_time_ms')\n", + " pivot_query.plot(kind='bar', ax=ax2, width=0.8)\n", + " ax2.set_title('Average Query Time (Real Embeddings)')\n", + " ax2.set_xlabel('Dimensions')\n", + " ax2.set_ylabel('Query Time (ms)')\n", + " ax2.legend(title='Algorithm')\n", + " ax2.tick_params(axis='x', rotation=0)\n", + " \n", + " # 3. Search Quality\n", + " ax3 = axes[1, 0]\n", + " pivot_recall = df.pivot(index='dimensions', columns='algorithm', values='recall_at_10')\n", + " pivot_recall.plot(kind='bar', ax=ax3, width=0.8)\n", + " ax3.set_title('Search Quality (Recall@10)')\n", + " ax3.set_xlabel('Dimensions')\n", + " ax3.set_ylabel('Recall@10')\n", + " ax3.legend(title='Algorithm')\n", + " ax3.tick_params(axis='x', rotation=0)\n", + " ax3.set_ylim(0, 1.1)\n", + " \n", + " # 4. Memory Efficiency\n", + " ax4 = axes[1, 1]\n", + " df['docs_per_mb'] = df['num_docs'] / df['index_size_mb']\n", + " pivot_efficiency = df.pivot(index='dimensions', columns='algorithm', values='docs_per_mb')\n", + " pivot_efficiency.plot(kind='bar', ax=ax4, width=0.8)\n", + " ax4.set_title('Memory Efficiency (Real Data)')\n", + " ax4.set_xlabel('Dimensions')\n", + " ax4.set_ylabel('Documents per MB')\n", + " ax4.legend(title='Algorithm')\n", + " ax4.tick_params(axis='x', rotation=0)\n", + " \n", + " plt.tight_layout()\n", + " plt.show()\n", + "\n", + "# Create visualizations\n", + "create_real_data_visualizations(df_results)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Step 6: Real Data Insights and Recommendations\n", + "\n", + "Generate insights based on real data performance." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Generate real data specific recommendations\n", + "if not df_results.empty:\n", + " dataset_source = 'SQuAD (Hugging Face)' if 'squad_' in raw_documents[0]['doc_id'] else 'Local movies'\n", + " \n", + " print(\n", + " f\"🎯 Real Data Benchmark Insights\",\n", + " f\"Dataset: {dataset_source}\",\n", + " f\"Documents: {df_results['num_docs'].iloc[0]:,} per dimension\",\n", + " f\"Embedding Models: sentence-transformers\",\n", + " \"=\" * 50,\n", + " sep=\"\\n\"\n", + " )\n", + " \n", + " for dim in config.dimensions:\n", + " dim_data = df_results[df_results['dimensions'] == dim]\n", + " \n", + " if not dim_data.empty:\n", + " print(f\"\\n📊 {dim}D Embeddings Analysis:\")\n", + " \n", + " for _, row in dim_data.iterrows():\n", + " algo = row['algorithm'].upper()\n", + " print(\n", + " f\" {algo}:\",\n", + " f\" Index: {row['index_size_mb']:.2f}MB\",\n", + " f\" Query: {row['avg_query_time_ms']:.2f}ms\",\n", + " f\" Recall@10: {row['recall_at_10']:.3f}\",\n", + " f\" Efficiency: {row['docs_per_mb']:.1f} docs/MB\",\n", + " sep=\"\\n\"\n", + " )\n", + " \n", + " print(\n", + " f\"\\n💡 Key Takeaways with Real Data:\",\n", + " \"• Real embeddings show different performance characteristics than synthetic\",\n", + " \"• Sentence-transformer models provide realistic vector distributions\",\n", + " \"• SQuAD Q&A pairs offer diverse semantic content for testing\",\n", + " \"• Results are more representative of production workloads\",\n", + " \"• Consider testing with your specific embedding models and data\",\n", + " sep=\"\\n\"\n", + " )\n", + "else:\n", + " print(\"⚠️ No results available for analysis\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Step 7: Cleanup\n", + "\n", + "Clean up benchmark indices to free memory." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Clean up all benchmark indices\n", + "print(\"🧹 Cleaning up benchmark indices...\")\n", + "\n", + "cleanup_count = 0\n", + "for index_key, index in indices.items():\n", + " try:\n", + " index.delete(drop=True)\n", + " cleanup_count += 1\n", + " print(f\" ✅ Deleted {index_key}\")\n", + " except Exception as e:\n", + " print(f\" ⚠️ Failed to delete {index_key}: {e}\")\n", + "\n", + "dataset_source = 'SQuAD (Hugging Face)' if 'squad_' in raw_documents[0]['doc_id'] else 'Local movies'\n", + "\n", + "print(\n", + " f\"\\n🎉 Real Data Benchmark Complete!\",\n", + " f\"Dataset: {dataset_source}\",\n", + " f\"Cleaned up {cleanup_count} indices\",\n", + " f\"\\nNext steps:\",\n", + " \"1. Review the real data performance characteristics above\",\n", + " \"2. Compare with synthetic data results if available\",\n", + " \"3. Test with your specific embedding models and datasets\",\n", + " \"4. Scale up with larger datasets for production insights\",\n", + " \"5. Consider the impact of real text diversity on algorithm performance\",\n", + " sep=\"\\n\"\n", + ")" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.8.5" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +}