RuVector Distance Operators - Quick Reference

🚀 Zero-Copy Operators (Use These!)

All operators use SIMD-optimized zero-copy access automatically.

SQL Operators

-- L2 (Euclidean) Distance
SELECT * FROM items ORDER BY embedding <-> '[1,2,3]' LIMIT 10;

-- Negative inner product (ascending order returns the largest inner products first)
SELECT * FROM items ORDER BY embedding <#> '[1,2,3]' LIMIT 10;

-- Cosine Distance (Semantic similarity)
SELECT * FROM items ORDER BY embedding <=> '[1,2,3]' LIMIT 10;

-- L1 (Manhattan) Distance
SELECT * FROM items ORDER BY embedding <+> '[1,2,3]' LIMIT 10;

Function Forms

-- When you need the distance value explicitly
SELECT
    id,
    ruvector_l2_distance(embedding, '[1,2,3]') as l2_dist,
    ruvector_ip_distance(embedding, '[1,2,3]') as ip_dist,
    ruvector_cosine_distance(embedding, '[1,2,3]') as cos_dist,
    ruvector_l1_distance(embedding, '[1,2,3]') as l1_dist
FROM items;

📊 Operator Comparison

 Operator | Math Formula      | Range   | Best For
----------+-------------------+---------+------------------------------
 <->      | √(Σ(aᵢ-bᵢ)²)      | [0, ∞)  | General similarity, geometry
 <#>      | -Σ(aᵢ×bᵢ)         | (-∞, ∞) | MIPS, recommendations
 <=>      | 1-(a·b)/(‖a‖‖b‖)  | [0, 2]  | Text, semantic search
 <+>      | Σ|aᵢ-bᵢ|          | [0, ∞)  | Sparse vectors, L1 norm

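The formulas in the table can be cross-checked outside the database. What follows is a minimal pure-Python sketch, not RuVector's implementation: the function names are made up for illustration and simply mirror the table rows.

```python
import math

def l2_distance(a, b):
    """<-> : square root of the summed squared differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def neg_inner_product(a, b):
    """<#> : negated dot product, so smaller means more similar."""
    return -sum(x * y for x, y in zip(a, b))

def cosine_distance(a, b):
    """<=> : 1 minus cosine similarity (undefined for zero vectors)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def l1_distance(a, b):
    """<+> : sum of absolute differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

a, b = [1.0, 2.0, 3.0], [4.0, 5.0, 6.0]
print(round(l2_distance(a, b), 6))      # 5.196152
print(neg_inner_product(a, b))          # -32.0
print(round(cosine_distance(a, b), 6))  # 0.025368
print(l1_distance(a, b))                # 9.0
```

Values computed this way are handy as reference numbers when checking what the SQL operators return for the same inputs.
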
💡 Common Patterns

Nearest Neighbors

-- Find 10 nearest neighbors
SELECT id, content, embedding <-> $query AS dist
FROM documents
ORDER BY embedding <-> $query
LIMIT 10;
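
When the query vector comes from application code, it has to reach the `$query` placeholder in the bracketed text form used throughout this reference (`'[1,2,3]'`). The sketch below shows one way to build that literal in Python. `to_vector_literal` is a hypothetical helper, and the commented-out driver call assumes a DB-API client such as psycopg; neither is part of RuVector.

```python
import math

def to_vector_literal(values):
    """Format numbers as a bracketed text literal, e.g. '[1.0,2.0,3.0]'."""
    floats = [float(v) for v in values]
    if not all(math.isfinite(f) for f in floats):
        raise ValueError("vector components must be finite numbers")
    return "[" + ",".join(repr(f) for f in floats) + "]"

query_literal = to_vector_literal([1, 2, 3])
print(query_literal)  # [1.0,2.0,3.0]

# Hypothetical use with a DB-API cursor (not executed here):
# cur.execute(
#     "SELECT id, content, embedding <-> %s::ruvector AS dist "
#     "FROM documents ORDER BY embedding <-> %s::ruvector LIMIT 10",
#     (query_literal, query_literal),
# )
```

Binding the literal as a parameter, rather than concatenating it into the SQL string, keeps the query safe from injection.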

Filtered Search

-- Search within a category
SELECT * FROM products
WHERE category = 'electronics'
ORDER BY embedding <=> $query
LIMIT 20;

Distance Threshold

-- Find all items within distance 0.5
SELECT * FROM items
WHERE embedding <-> $query < 0.5;

Batch Distances

-- Compare one vector against many
SELECT id, embedding <-> '[1,2,3]' AS distance
FROM items
WHERE id IN (1, 2, 3, 4, 5);

🏗️ Index Creation

-- HNSW index (best for most cases)
CREATE INDEX ON items USING hnsw (embedding ruvector_l2_ops)
WITH (m = 16, ef_construction = 64);

-- IVFFlat index (good for large datasets)
CREATE INDEX ON items USING ivfflat (embedding ruvector_cosine_ops)
WITH (lists = 100);
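
The `lists = 100` value above is only an example. pgvector's documentation suggests starting at rows / 1000 for tables up to one million rows and at the square root of the row count above that. Whether the same starting point suits RuVector's IVFFlat is an assumption, not something this reference states, so treat the sketch below as a first guess to tune against measured recall. `suggested_ivfflat_lists` is a hypothetical helper name.

```python
import math

def suggested_ivfflat_lists(row_count):
    """Starting value for `lists`, following pgvector's rule of thumb."""
    if row_count <= 1_000_000:
        return max(1, row_count // 1000)
    return math.isqrt(row_count)

print(suggested_ivfflat_lists(100_000))    # 100
print(suggested_ivfflat_lists(4_000_000))  # 2000
```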

⚡ Performance Tips

  1. Use the ruvector type, not arrays: ruvector columns enable zero-copy access
  2. Create indexes: Essential for large datasets
  3. Normalize for cosine: If you use cosine often, pre-normalize vectors to unit length; on unit vectors <=> and <#> produce the same ordering
  4. Check SIMD: Run SELECT ruvector_simd_info() to verify acceleration
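
Tip 3 rests on a small identity: when ‖a‖ = ‖b‖ = 1, cosine distance reduces to 1 - a·b, which is 1 plus the negated inner product. The following plain-Python sketch illustrates that; `normalize`, `dot`, and `cosine_distance` are illustrative helpers, not RuVector functions.

```python
import math

def normalize(v):
    """Scale v to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine_distance(a, b):
    return 1.0 - dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

a, b = [1.0, 2.0, 3.0], [4.0, 5.0, 6.0]
ua, ub = normalize(a), normalize(b)

# Normalizing does not change the cosine distance ...
print(math.isclose(cosine_distance(a, b), cosine_distance(ua, ub)))  # True
# ... and on unit vectors it equals 1 plus the negated inner product.
print(math.isclose(cosine_distance(ua, ub), 1.0 + (-dot(ua, ub))))   # True
```

Because the two quantities differ only by a constant on unit vectors, sorting by either one yields the same neighbors.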

🔄 Migration from pgvector

RuVector operators are drop-in compatible with pgvector:

-- pgvector syntax works unchanged
SELECT * FROM items ORDER BY embedding <-> '[1,2,3]' LIMIT 10;

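-- Just change the column type from 'vector' to 'ruvector' (384 is an example dimension).
-- If PostgreSQL reports that the column cannot be cast automatically, add a USING clause.
ALTER TABLE items ALTER COLUMN embedding TYPE ruvector(384);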

📏 Dimension Support

  • Maximum: 16,000 dimensions
  • Recommended: 128-2048 for most use cases
  • Performance: Best when the dimension is a multiple of 16 (AVX-512) or 8 (AVX2), the number of 32-bit floats that fit in one SIMD register
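
If an embedding model produces an awkward dimension, appending zeros up to the next multiple is one option, since zero components add nothing to any of the four distances. This is a sketch under that assumption only: `pad_to_multiple` is a hypothetical helper, the padded column needs the larger declared dimension, and whether the extra storage pays off depends on the build, so measure before adopting it.

```python
import math

def pad_to_multiple(v, multiple=16):
    """Append zeros so len(v) becomes the next multiple of `multiple`."""
    remainder = len(v) % multiple
    if remainder == 0:
        return list(v)
    return list(v) + [0.0] * (multiple - remainder)

def l2(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

a = [0.5] * 100
b = [0.25] * 100
pa, pb = pad_to_multiple(a), pad_to_multiple(b)

print(len(pa))                 # 112
print(l2(a, b) == l2(pa, pb))  # True
```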

🐛 Debugging

-- Check SIMD support
SELECT ruvector_simd_info();

-- Verify vector dimensions
SELECT array_length(embedding::float4[], 1) FROM items LIMIT 1;

-- Test distance calculation
SELECT '[1,2,3]'::ruvector <-> '[4,5,6]'::ruvector;
-- Should return: 5.196152 (≈√27)

🎯 Choosing the Right Metric

 Your Data                      | Recommended Operator
--------------------------------+----------------------
 Text embeddings (BERT, OpenAI) | <=> (cosine)
 Image features (ResNet, CLIP)  | <-> (L2)
 Recommender systems            | <#> (inner product)
 Document vectors (TF-IDF)      | <=> (cosine)
 Sparse features                | <+> (L1)
 General floating-point         | <-> (L2)

✅ Validation

-- Test basic functionality
CREATE TEMP TABLE test_vectors (id int, v ruvector(3));
INSERT INTO test_vectors VALUES (1, '[1,2,3]'), (2, '[4,5,6]');

-- Should return one row of distances (a.id < b.id keeps each pair once)
SELECT a.v <-> b.v AS l2,
       a.v <#> b.v AS ip,
       a.v <=> b.v AS cosine,
       a.v <+> b.v AS l1
FROM test_vectors a
JOIN test_vectors b ON a.id < b.id;

Expected output (values rounded; exact formatting may differ):

   l2    |   ip    |  cosine  |  l1
---------+---------+----------+------
 5.19615 | -32.000 | 0.025368 | 9.00

📚 Further Reading