HNSW Index Implementation

Overview

This document describes the HNSW (Hierarchical Navigable Small World) index implementation as a PostgreSQL Access Method for the RuVector extension.

What is HNSW?

HNSW is a graph-based algorithm for approximate nearest neighbor (ANN) search in high-dimensional spaces. It provides:

Logarithmic search complexity: O(log N) average case
High recall: >95% recall achievable with proper parameters
Incremental updates: Supports efficient insertions and deletions
Multi-layer graph structure: Hierarchical organization for fast traversal

Architecture

Page-Based Storage

The HNSW index stores data in PostgreSQL pages for durability and memory management:

Page 0 (Metadata):
├─ Magic number: 0x484E5357 ("HNSW")
├─ Version: 1
├─ Dimensions: Vector dimensionality
├─ Parameters: m, m0, ef_construction
├─ Entry point: Block number of top-level node
├─ Max layer: Highest layer in the graph
└─ Metric: Distance metric (L2/Cosine/IP)

Page 1+ (Node Pages):
├─ Node Header:
│  ├─ Page type: HNSW_PAGE_NODE
│  ├─ Max layer: Highest layer for this node
│  └─ Item pointer: TID of heap tuple
├─ Vector data: [f32; dimensions]
├─ Layer 0 neighbors: [BlockNumber; m0]
└─ Layer 1+ neighbors: [[BlockNumber; m]; max_layer]

Access Method Callbacks

The implementation provides all required PostgreSQL index AM callbacks:

ambuild - Builds index from table data
ambuildempty - Creates empty index structure
aminsert - Inserts a single vector
ambulkdelete - Bulk deletion support
amvacuumcleanup - Vacuum cleanup operations
amcostestimate - Query cost estimation
amgettuple - Sequential tuple retrieval
amgetbitmap - Bitmap scan support
amcanreturn - Index-only scan capability
amoptions - Index option parsing

Usage

Creating an HNSW Index

-- Basic index creation (L2 distance, default parameters)
CREATE INDEX ON items USING hnsw (embedding hnsw_l2_ops);

-- With custom parameters
CREATE INDEX ON items USING hnsw (embedding hnsw_l2_ops)
    WITH (m = 32, ef_construction = 128);

-- Cosine distance
CREATE INDEX ON items USING hnsw (embedding hnsw_cosine_ops);

-- Inner product
CREATE INDEX ON items USING hnsw (embedding hnsw_ip_ops);

Querying

-- Find 10 nearest neighbors using L2 distance
SELECT id, embedding <-> ARRAY[0.1, 0.2, 0.3]::real[] AS distance
FROM items
ORDER BY embedding <-> ARRAY[0.1, 0.2, 0.3]::real[]
LIMIT 10;

-- Find 10 nearest neighbors using cosine distance
SELECT id, embedding <=> ARRAY[0.1, 0.2, 0.3]::real[] AS distance
FROM items
ORDER BY embedding <=> ARRAY[0.1, 0.2, 0.3]::real[]
LIMIT 10;

-- Find vectors with largest inner product
SELECT id, embedding <#> ARRAY[0.1, 0.2, 0.3]::real[] AS neg_ip
FROM items
ORDER BY embedding <#> ARRAY[0.1, 0.2, 0.3]::real[]
LIMIT 10;

Parameters

Index Build Parameters

Parameter	Type	Default	Range	Description
`m`	integer	16	2-128	Maximum connections per layer
`ef_construction`	integer	64	4-1000	Size of dynamic candidate list during build
`metric`	string	'l2'	l2/cosine/ip	Distance metric

Parameter Tuning Guidelines:

m: Higher values improve recall but increase memory usage
- Low (8-16): Fast build, lower memory, good for small datasets
- Medium (16-32): Balanced performance
- High (32-64): Better recall, slower build, more memory
ef_construction: Higher values improve index quality but slow down build
- Low (32-64): Fast build, may sacrifice recall
- Medium (64-128): Balanced
- High (128-500): Best quality, slow build

Query-Time Parameters

Parameter	Type	Default	Description
`ruvector.ef_search`	integer	40	Size of dynamic candidate list during search

Setting ef_search:

-- Global setting (postgresql.conf or ALTER SYSTEM)
ALTER SYSTEM SET ruvector.ef_search = 100;

-- Session setting (per-connection)
SET ruvector.ef_search = 100;

-- Query with increased recall
SET LOCAL ruvector.ef_search = 200;
SELECT ... ORDER BY embedding <-> query LIMIT 10;

Distance Metrics

L2 (Euclidean) Distance

Operator: <->
Formula: √(Σ(a[i] - b[i])²)
Use case: General-purpose distance
Range: [0, ∞)

CREATE INDEX ON items USING hnsw (embedding hnsw_l2_ops);
SELECT * FROM items ORDER BY embedding <-> query_vector LIMIT 10;

Cosine Distance

Operator: <=>
Formula: 1 - (a·b)/(||a||·||b||)
Use case: Direction similarity (text embeddings)
Range: [0, 2]

CREATE INDEX ON items USING hnsw (embedding hnsw_cosine_ops);
SELECT * FROM items ORDER BY embedding <=> query_vector LIMIT 10;

Inner Product

Operator: <#>
Formula: -Σ(a[i] * b[i])
Use case: Maximum similarity (normalized vectors)
Range: (-∞, ∞)

CREATE INDEX ON items USING hnsw (embedding hnsw_ip_ops);
SELECT * FROM items ORDER BY embedding <#> query_vector LIMIT 10;

Performance

Build Performance

Time Complexity: O(N log N) with high probability
Space Complexity: O(N * M * L) where L is average layer count
Typical Build Rate: 1000-10000 vectors/sec (depends on dimensions)

Query Performance

Time Complexity: O(ef_search * log N)
Typical Query Time:
- <1ms for 100K vectors (128D)
- <5ms for 1M vectors (128D)
- <10ms for 10M vectors (128D)

Memory Usage

Memory per vector ≈ dimensions * 4 bytes + m * 8 bytes * average_layers
Average layers ≈ log₂(N) / log₂(m)

Example (1M vectors, 128D, m=16):
- Vector data: 1M * 128 * 4 = 512 MB
- Graph edges: 1M * 16 * 8 * 4 = 512 MB
- Total: ~1 GB

Operator Classes

hnsw_l2_ops

For L2 (Euclidean) distance on real[] vectors.

CREATE OPERATOR CLASS hnsw_l2_ops
    FOR TYPE real[] USING hnsw
    FAMILY hnsw_l2_ops AS
    OPERATOR 1 <-> (real[], real[]) FOR ORDER BY float_ops,
    FUNCTION 1 l2_distance_arr(real[], real[]);

hnsw_cosine_ops

For cosine distance on real[] vectors.

CREATE OPERATOR CLASS hnsw_cosine_ops
    FOR TYPE real[] USING hnsw
    FAMILY hnsw_cosine_ops AS
    OPERATOR 1 <=> (real[], real[]) FOR ORDER BY float_ops,
    FUNCTION 1 cosine_distance_arr(real[], real[]);

hnsw_ip_ops

For inner product on real[] vectors.

CREATE OPERATOR CLASS hnsw_ip_ops
    FOR TYPE real[] USING hnsw
    FAMILY hnsw_ip_ops AS
    OPERATOR 1 <#> (real[], real[]) FOR ORDER BY float_ops,
    FUNCTION 1 neg_inner_product_arr(real[], real[]);

Monitoring and Maintenance

Index Statistics

-- View memory usage
SELECT ruvector_memory_stats();

-- Check index size
SELECT pg_size_pretty(pg_relation_size('items_embedding_idx'));

-- View index definition
SELECT indexdef FROM pg_indexes WHERE indexname = 'items_embedding_idx';

Index Maintenance

-- Perform maintenance (optimize connections, rebuild degraded nodes)
SELECT ruvector_index_maintenance('items_embedding_idx');

-- Vacuum to reclaim space after deletes
VACUUM items;

-- Rebuild index if heavily modified
REINDEX INDEX items_embedding_idx;

Query Plan Analysis

-- Analyze query execution
EXPLAIN (ANALYZE, BUFFERS)
SELECT id, embedding <-> query AS distance
FROM items
ORDER BY embedding <-> query
LIMIT 10;

Best Practices

1. Index Creation

Build indexes on stable data when possible
Use higher ef_construction for better quality

Consider using maintenance_work_mem for large builds:

SET maintenance_work_mem = '2GB';
CREATE INDEX ...;

2. Query Optimization

Adjust ef_search based on recall requirements
Use prepared statements for repeated queries
Consider query result caching for common queries

3. Data Management

Normalize vectors for cosine similarity
Batch inserts when possible
Schedule index maintenance during low-traffic periods

4. Monitoring

Track index size growth
Monitor query performance metrics
Set up alerts for memory usage

Limitations

Current Version

Single column only: Multi-column indexes not supported
No parallel scans: Query parallelism not yet implemented
No index-only scans: Must access heap tuples
Array type only: Custom vector type support coming soon

PostgreSQL Version Requirements

PostgreSQL 14+
pgrx 0.12+

Troubleshooting

Index Build Fails

Problem: Out of memory during index build Solution: Increase maintenance_work_mem or reduce ef_construction

SET maintenance_work_mem = '4GB';

Slow Queries

Problem: Queries are slower than expected Solution: Increase ef_search or rebuild index with higher m

SET ruvector.ef_search = 100;

Low Recall

Problem: Not finding correct nearest neighbors Solution: Increase ef_search or rebuild with higher ef_construction

REINDEX INDEX items_embedding_idx;

Comparison with Other Methods

Feature	HNSW	IVFFlat	Brute Force
Search Time	O(log N)	O(√N)	O(N)
Build Time	O(N log N)	O(N)	O(1)
Memory	High	Medium	Low
Recall	>95%	>90%	100%
Updates	Good	Poor	Excellent

Future Enhancements

References

Malkov, Y. A., & Yashunin, D. A. (2018). "Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs." IEEE transactions on pattern analysis and machine intelligence.
PostgreSQL Index Access Method documentation: https://www.postgresql.org/docs/current/indexam.html
pgrx documentation: https://github.com/pgcentralfoundation/pgrx

License

MIT License - See LICENSE file for details.

FilesExpand file tree

HNSW_INDEX.md

Latest commit

History