This document describes the HNSW (Hierarchical Navigable Small World) index implementation as a PostgreSQL Access Method for the RuVector extension.
HNSW is a graph-based algorithm for approximate nearest neighbor (ANN) search in high-dimensional spaces. It provides:
- Logarithmic search complexity: O(log N) average case
- High recall: >95% recall achievable with proper parameters
- Incremental updates: Supports efficient insertions and deletions
- Multi-layer graph structure: Hierarchical organization for fast traversal
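The multi-layer structure comes from giving each inserted node a random top layer drawn from an exponentially decaying distribution, as described in the HNSW paper. The following is a minimal sketch of that rule, not RuVector's actual code: `random_level` is a hypothetical name, and the uniform sample `u` is passed in so that the example needs no random-number crate.

```rust
/// Picks the top layer for a new node: floor(-ln(u) * mL), with mL = 1 / ln(m).
/// `u` must be a uniform random sample in (0, 1]; `m` is the index parameter.
/// Hypothetical helper for illustration only.
fn random_level(u: f64, m: u32) -> usize {
    assert!(u > 0.0 && u <= 1.0, "u must lie in (0, 1]");
    assert!(m >= 2, "m must be at least 2");
    let level_mult = 1.0 / f64::from(m).ln();
    (-u.ln() * level_mult).floor() as usize
}
```

Under this rule a node reaches layer 1 or higher only when `u <= 1/m`, so with m = 16 roughly 15 out of 16 nodes live on layer 0 alone.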
The HNSW index stores data in PostgreSQL pages for durability and memory management:
Page 0 (Metadata):
├─ Magic number: 0x484E5357 ("HNSW")
├─ Version: 1
├─ Dimensions: Vector dimensionality
├─ Parameters: m, m0, ef_construction
├─ Entry point: Block number of top-level node
├─ Max layer: Highest layer in the graph
└─ Metric: Distance metric (L2/Cosine/IP)
Page 1+ (Node Pages):
├─ Node Header:
│ ├─ Page type: HNSW_PAGE_NODE
│ ├─ Max layer: Highest layer for this node
│ └─ Item pointer: TID of heap tuple
├─ Vector data: [f32; dimensions]
├─ Layer 0 neighbors: [BlockNumber; m0]
└─ Layer 1+ neighbors: [[BlockNumber; m]; max_layer]
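The layout above can be mirrored in a few Rust types. This is a sketch under stated assumptions rather than the extension's real definitions: the struct name, field names, and field widths are hypothetical, and only the magic number, the version, and the neighbor array shapes come from the description above. `BlockNumber` is a 32-bit unsigned integer in PostgreSQL.

```rust
/// PostgreSQL block numbers are 32-bit unsigned integers.
type BlockNumber = u32;

/// Magic number from the metadata page; its big-endian bytes spell "HNSW".
const HNSW_MAGIC: u32 = 0x484E_5357;

#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq)]
enum Metric {
    L2,
    Cosine,
    InnerProduct,
}

/// Hypothetical in-memory view of page 0. Field names and widths are assumptions.
#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq)]
struct HnswMetaPage {
    magic: u32,
    version: u32,
    dimensions: u32,
    m: u16,
    m0: u16,
    ef_construction: u32,
    entry_point: BlockNumber,
    max_layer: u16,
    metric: Metric,
}

/// Accepts only pages carrying the documented magic number and version.
fn is_valid_meta(meta: &HnswMetaPage) -> bool {
    meta.magic == HNSW_MAGIC && meta.version == 1
}

/// Bytes needed for one node's vector and neighbor slots (node header excluded)
/// when the node's highest layer is `node_max_layer`.
fn node_payload_bytes(dimensions: usize, m: usize, m0: usize, node_max_layer: usize) -> usize {
    let vector = dimensions * std::mem::size_of::<f32>();
    let layer0 = m0 * std::mem::size_of::<BlockNumber>();
    let upper_layers = node_max_layer * m * std::mem::size_of::<BlockNumber>();
    vector + layer0 + upper_layers
}
```

For example, a 128-dimensional layer-0 node with m0 = 32 (an assumed value) needs 128 * 4 + 32 * 4 = 640 bytes of payload, which matters because a node has to fit within a PostgreSQL page.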
The implementation provides all required PostgreSQL index AM callbacks:
- ambuild - Builds index from table data
- ambuildempty - Creates empty index structure
- aminsert - Inserts a single vector
- ambulkdelete - Bulk deletion support
- amvacuumcleanup - Vacuum cleanup operations
- amcostestimate - Query cost estimation
- amgettuple - Sequential tuple retrieval
- amgetbitmap - Bitmap scan support
- amcanreturn - Index-only scan capability
- amoptions - Index option parsing
-- Basic index creation (L2 distance, default parameters)
CREATE INDEX ON items USING hnsw (embedding hnsw_l2_ops);
-- With custom parameters
CREATE INDEX ON items USING hnsw (embedding hnsw_l2_ops)
WITH (m = 32, ef_construction = 128);
-- Cosine distance
CREATE INDEX ON items USING hnsw (embedding hnsw_cosine_ops);
-- Inner product
CREATE INDEX ON items USING hnsw (embedding hnsw_ip_ops);

-- Find 10 nearest neighbors using L2 distance
SELECT id, embedding <-> ARRAY[0.1, 0.2, 0.3]::real[] AS distance
FROM items
ORDER BY embedding <-> ARRAY[0.1, 0.2, 0.3]::real[]
LIMIT 10;
-- Find 10 nearest neighbors using cosine distance
SELECT id, embedding <=> ARRAY[0.1, 0.2, 0.3]::real[] AS distance
FROM items
ORDER BY embedding <=> ARRAY[0.1, 0.2, 0.3]::real[]
LIMIT 10;
-- Find vectors with largest inner product
SELECT id, embedding <#> ARRAY[0.1, 0.2, 0.3]::real[] AS neg_ip
FROM items
ORDER BY embedding <#> ARRAY[0.1, 0.2, 0.3]::real[]
LIMIT 10;

| Parameter | Type | Default | Range | Description |
|---|---|---|---|---|
| m | integer | 16 | 2-128 | Maximum connections per layer |
| ef_construction | integer | 64 | 4-1000 | Size of dynamic candidate list during build |
| metric | string | 'l2' | l2/cosine/ip | Distance metric |
Parameter Tuning Guidelines:
- m: Higher values improve recall but increase memory usage
  - Low (8-16): Fast build, lower memory, good for small datasets
  - Medium (16-32): Balanced performance
  - High (32-64): Better recall, slower build, more memory
- ef_construction: Higher values improve index quality but slow down build
  - Low (32-64): Fast build, may sacrifice recall
  - Medium (64-128): Balanced
  - High (128-500): Best quality, slow build
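The documented defaults and ranges for the build parameters can be captured in a small validation routine. This is an illustrative sketch only: `HnswOptions` and `validate_options` are hypothetical names, and how the extension's option parsing actually reports errors is not specified here.

```rust
/// Build-time options with the documented defaults (m = 16, ef_construction = 64).
/// Hypothetical type for illustration.
#[derive(Debug, Clone, PartialEq)]
struct HnswOptions {
    m: u32,
    ef_construction: u32,
}

impl Default for HnswOptions {
    fn default() -> Self {
        HnswOptions { m: 16, ef_construction: 64 }
    }
}

/// Checks the options against the documented ranges: m in 2-128,
/// ef_construction in 4-1000.
fn validate_options(opts: &HnswOptions) -> Result<(), String> {
    if !(2..=128).contains(&opts.m) {
        return Err(format!("m must be between 2 and 128, got {}", opts.m));
    }
    if !(4..=1000).contains(&opts.ef_construction) {
        return Err(format!(
            "ef_construction must be between 4 and 1000, got {}",
            opts.ef_construction
        ));
    }
    Ok(())
}
```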
| Parameter | Type | Default | Description |
|---|---|---|---|
| ruvector.ef_search | integer | 40 | Size of dynamic candidate list during search |
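To make "size of dynamic candidate list" concrete, here is a minimal sketch of the best-first search that HNSW runs within one layer, following the algorithm in the HNSW paper. It is not RuVector's code: `Scored`, `l2`, and `search_layer` are hypothetical names, the graph is a plain adjacency list held in memory, and the `ef` argument plays the role of `ruvector.ef_search`.

```rust
use std::cmp::{Ordering, Reverse};
use std::collections::{BinaryHeap, HashSet};

/// A node id with its distance to the query, ordered by distance for heap use.
#[derive(Debug, Clone, Copy)]
struct Scored {
    dist: f32,
    id: usize,
}

impl Ord for Scored {
    fn cmp(&self, other: &Self) -> Ordering {
        self.dist.total_cmp(&other.dist).then(self.id.cmp(&other.id))
    }
}
impl PartialOrd for Scored {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        Some(self.cmp(other))
    }
}
impl PartialEq for Scored {
    fn eq(&self, other: &Self) -> bool {
        self.cmp(other) == Ordering::Equal
    }
}
impl Eq for Scored {}

fn l2(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum::<f32>().sqrt()
}

/// Best-first search over one layer. `neighbors[i]` lists the ids adjacent to
/// node `i`. Returns at most `ef` (distance, id) pairs, nearest first.
fn search_layer(
    vectors: &[Vec<f32>],
    neighbors: &[Vec<usize>],
    query: &[f32],
    entry: usize,
    ef: usize,
) -> Vec<(f32, usize)> {
    assert!(ef >= 1, "ef must be at least 1");
    let start = Scored { dist: l2(&vectors[entry], query), id: entry };
    let mut visited = HashSet::from([entry]);
    // Min-heap of nodes still to expand (Reverse flips the max-heap).
    let mut candidates = BinaryHeap::from([Reverse(start)]);
    // Max-heap holding the best `ef` results so far; the worst is on top.
    let mut results = BinaryHeap::from([start]);

    while let Some(Reverse(current)) = candidates.pop() {
        let worst = results.peek().map_or(f32::INFINITY, |s| s.dist);
        // Stop when the nearest unexpanded node is farther than the worst result.
        if current.dist > worst {
            break;
        }
        for &n in &neighbors[current.id] {
            if !visited.insert(n) {
                continue; // already seen
            }
            let scored = Scored { dist: l2(&vectors[n], query), id: n };
            let worst = results.peek().map_or(f32::INFINITY, |s| s.dist);
            if results.len() < ef || scored.dist < worst {
                candidates.push(Reverse(scored));
                results.push(scored);
                if results.len() > ef {
                    results.pop(); // drop the current worst to stay within ef
                }
            }
        }
    }
    let mut out: Vec<(f32, usize)> = results.into_iter().map(|s| (s.dist, s.id)).collect();
    out.sort_by(|a, b| a.0.total_cmp(&b.0));
    out
}
```

A larger `ef` keeps more candidates alive, so the search explores more of the graph and is less likely to stop in a local minimum, at the price of more distance computations per query.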
Setting ef_search:
-- Global setting (postgresql.conf or ALTER SYSTEM)
ALTER SYSTEM SET ruvector.ef_search = 100;
-- Session setting (per-connection)
SET ruvector.ef_search = 100;
-- Query with increased recall
SET LOCAL ruvector.ef_search = 200;
SELECT ... ORDER BY embedding <-> query LIMIT 10;

- Operator: <->
- Formula: √(Σ(a[i] - b[i])²)
- Use case: General-purpose distance
- Range: [0, ∞)
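The formula above translates directly into code. This is a plain scalar sketch with a hypothetical function name. It is not the extension's `l2_distance_arr`, which may be implemented differently.

```rust
/// Euclidean distance: square root of the sum of squared component differences.
fn l2_distance(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len(), "vectors must have the same dimension");
    a.iter()
        .zip(b)
        .map(|(x, y)| (x - y).powi(2))
        .sum::<f32>()
        .sqrt()
}
```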
CREATE INDEX ON items USING hnsw (embedding hnsw_l2_ops);
SELECT * FROM items ORDER BY embedding <-> query_vector LIMIT 10;

- Operator: <=>
- Formula: 1 - (a·b)/(||a||·||b||)
- Use case: Direction similarity (text embeddings)
- Range: [0, 2]
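A scalar sketch of the cosine distance formula shows where the [0, 2] range comes from: identical directions give 0, orthogonal vectors give 1, and opposite directions give 2. The function name is hypothetical, and the handling of zero-length vectors is left as plain floating-point behavior, which may differ from what the extension does.

```rust
/// Cosine distance: 1 minus the cosine of the angle between `a` and `b`.
/// A zero-length vector makes the denominator 0 and the result NaN.
fn cosine_distance(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len(), "vectors must have the same dimension");
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|y| y * y).sum::<f32>().sqrt();
    1.0 - dot / (norm_a * norm_b)
}
```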
CREATE INDEX ON items USING hnsw (embedding hnsw_cosine_ops);
SELECT * FROM items ORDER BY embedding <=> query_vector LIMIT 10;

- Operator: <#>
- Formula: -Σ(a[i] * b[i])
- Use case: Maximum similarity (normalized vectors)
- Range: (-∞, ∞)
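The inner product is negated so that an ascending sort returns the most similar vectors first. The sketch below illustrates that point with hypothetical helper names. It does not reproduce the extension's `neg_inner_product_arr`.

```rust
/// Negative inner product: smaller values mean a larger inner product.
fn neg_inner_product(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len(), "vectors must have the same dimension");
    -a.iter().zip(b).map(|(x, y)| x * y).sum::<f32>()
}

/// Returns item indices sorted ascending by negative inner product with `query`,
/// which puts the item with the largest inner product first.
fn rank_by_inner_product(query: &[f32], items: &[Vec<f32>]) -> Vec<usize> {
    let mut order: Vec<usize> = (0..items.len()).collect();
    order.sort_by(|&i, &j| {
        neg_inner_product(&items[i], query).total_cmp(&neg_inner_product(&items[j], query))
    });
    order
}
```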
CREATE INDEX ON items USING hnsw (embedding hnsw_ip_ops);
SELECT * FROM items ORDER BY embedding <#> query_vector LIMIT 10;

- Build time complexity: O(N log N) with high probability
- Space Complexity: O(N * M * L) where L is average layer count
- Typical Build Rate: 1000-10000 vectors/sec (depends on dimensions)
- Search time complexity: O(ef_search * log N)
- Typical Query Time:
  - <1ms for 100K vectors (128D)
  - <5ms for 1M vectors (128D)
  - <10ms for 10M vectors (128D)
Memory per vector ≈ dimensions * 4 bytes + m * 8 bytes * average_layers
Average layers ≈ log₂(N) / log₂(m)
Example (1M vectors, 128D, m=16):
- Average layers: log₂(1,000,000) / log₂(16) ≈ 19.9 / 4 ≈ 5
- Vector data: 1M * 128 * 4 = 512 MB
- Graph edges: 1M * 16 * 8 * 5 = 640 MB
- Total: ~1.15 GB
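The estimate can be scripted for other dataset sizes. The sketch below applies the two formulas exactly as written, without rounding the layer count. The type and function names are hypothetical, and the result is a rough planning figure, not a measurement of the real index.

```rust
/// Result of the rough memory formula, in bytes.
#[derive(Debug)]
struct MemoryEstimate {
    average_layers: f64,
    vector_bytes: f64,
    edge_bytes: f64,
}

impl MemoryEstimate {
    fn total_bytes(&self) -> f64 {
        self.vector_bytes + self.edge_bytes
    }
}

/// Applies: memory per vector ≈ dimensions * 4 + m * 8 * average_layers,
/// with average_layers ≈ log2(N) / log2(m).
fn estimate_memory(n: u64, dimensions: u32, m: u32) -> MemoryEstimate {
    let n = n as f64;
    let average_layers = n.log2() / f64::from(m).log2();
    MemoryEstimate {
        average_layers,
        vector_bytes: n * f64::from(dimensions) * 4.0,
        edge_bytes: n * f64::from(m) * 8.0 * average_layers,
    }
}
```

For 1M vectors with m = 16 the unrounded layer count is about 4.98, so the edge term comes out slightly below the figure obtained with a whole number of layers.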
For L2 (Euclidean) distance on real[] vectors.
CREATE OPERATOR CLASS hnsw_l2_ops
FOR TYPE real[] USING hnsw
FAMILY hnsw_l2_ops AS
OPERATOR 1 <-> (real[], real[]) FOR ORDER BY float_ops,
FUNCTION 1 l2_distance_arr(real[], real[]);

For cosine distance on real[] vectors.
CREATE OPERATOR CLASS hnsw_cosine_ops
FOR TYPE real[] USING hnsw
FAMILY hnsw_cosine_ops AS
OPERATOR 1 <=> (real[], real[]) FOR ORDER BY float_ops,
FUNCTION 1 cosine_distance_arr(real[], real[]);

For inner product on real[] vectors.
CREATE OPERATOR CLASS hnsw_ip_ops
FOR TYPE real[] USING hnsw
FAMILY hnsw_ip_ops AS
OPERATOR 1 <#> (real[], real[]) FOR ORDER BY float_ops,
FUNCTION 1 neg_inner_product_arr(real[], real[]);

-- View memory usage
SELECT ruvector_memory_stats();
-- Check index size
SELECT pg_size_pretty(pg_relation_size('items_embedding_idx'));
-- View index definition
SELECT indexdef FROM pg_indexes WHERE indexname = 'items_embedding_idx';

-- Perform maintenance (optimize connections, rebuild degraded nodes)
SELECT ruvector_index_maintenance('items_embedding_idx');
-- Vacuum to reclaim space after deletes
VACUUM items;
-- Rebuild index if heavily modified
REINDEX INDEX items_embedding_idx;

-- Analyze query execution
EXPLAIN (ANALYZE, BUFFERS)
SELECT id, embedding <-> query AS distance
FROM items
ORDER BY embedding <-> query
LIMIT 10;

- Build indexes on stable data when possible
- Use higher ef_construction for better quality
- Consider raising maintenance_work_mem for large builds: SET maintenance_work_mem = '2GB'; CREATE INDEX ...;
- Adjust ef_search based on recall requirements
- Use prepared statements for repeated queries
- Consider query result caching for common queries
- Normalize vectors for cosine similarity
- Batch inserts when possible
- Schedule index maintenance during low-traffic periods
- Track index size growth
- Monitor query performance metrics
- Set up alerts for memory usage
- Single column only: Multi-column indexes not supported
- No parallel scans: Query parallelism not yet implemented
- No index-only scans: Must access heap tuples
- Array type only: Custom vector type support coming soon
- Requires PostgreSQL 14+
- Requires pgrx 0.12+
Problem: Out of memory during index build
Solution: Increase maintenance_work_mem or reduce ef_construction
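SET maintenance_work_mem = '4GB';

Problem: Queries are slower than expected
Solution: Lower ef_search (a smaller candidate list means fewer distance computations, at some cost in recall), or rebuild the index with a higher m so that a smaller ef_search reaches the same recall
SET ruvector.ef_search = 20;

Problem: Not finding correct nearest neighbors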
Solution: Increase ef_search or rebuild with higher ef_construction
REINDEX INDEX items_embedding_idx;

| Feature | HNSW | IVFFlat | Brute Force |
|---|---|---|---|
| Search Time | O(log N) | O(√N) | O(N) |
| Build Time | O(N log N) | O(N) | O(1) |
| Memory | High | Medium | Low |
| Recall | >95% | >90% | 100% |
| Updates | Good | Poor | Excellent |
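The recall figures in the table are usually measured by comparing the ids returned through the index with the ids from an exact (brute-force) scan of the same query. A minimal sketch of that comparison follows. `recall_at_k` is a hypothetical helper, and the two id lists are assumed to be collected by the caller, for example by running the same query with and without the index.

```rust
use std::collections::HashSet;

/// Fraction of the exact top-k ids that also appear in the approximate top-k.
/// Returns 1.0 when there is no ground truth to miss.
fn recall_at_k(approx: &[usize], exact: &[usize], k: usize) -> f64 {
    let truth: HashSet<usize> = exact.iter().take(k).copied().collect();
    if truth.is_empty() {
        return 1.0;
    }
    let hits = approx
        .iter()
        .take(k)
        .filter(|&id| truth.contains(id))
        .count();
    hits as f64 / truth.len() as f64
}
```

Averaging this value over a sample of representative queries gives a number that can guide the choice of ef_search.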
- Parallel index scans
- Custom vector type support
- Index-only scans
- Dynamic parameter tuning
- Graph compression
- Multi-column indexes
- Distributed HNSW
- Malkov, Y. A., & Yashunin, D. A. (2018). "Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs." IEEE Transactions on Pattern Analysis and Machine Intelligence.
- PostgreSQL Index Access Method documentation: https://www.postgresql.org/docs/current/indexam.html
- pgrx documentation: https://github.com/pgcentralfoundation/pgrx
MIT License - See LICENSE file for details.