Commit 15ebbbc

GigaMap JVector Index (#548)
1 parent ff60586 commit 15ebbbc

32 files changed, +15357 -0 lines changed

Lines changed: 199 additions & 0 deletions
@@ -0,0 +1,199 @@
= JVector Advanced Usage

This section covers advanced usage patterns for production deployments.

== On-Disk Index with Compression

For large datasets that exceed available memory, combine on-disk storage with Product Quantization (PQ) compression.

[source, java]
----
VectorIndexConfiguration config = VectorIndexConfiguration.builder()
    .dimension(768)
    .similarityFunction(VectorSimilarityFunction.COSINE)
    .maxDegree(32)
    .beamWidth(200)
    // On-disk storage
    .onDisk(true)
    .indexDirectory(Path.of("/data/vectors"))
    // PQ compression (reduces memory significantly)
    .enablePqCompression(true)
    .pqSubspaces(48) // Must divide dimension evenly
    .build();
----

NOTE: PQ compression automatically enforces `maxDegree=32` due to the FusedPQ algorithm constraint.
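
To make the "must divide dimension evenly" rule and the memory saving concrete, here is a minimal sketch. `PqSizing` is a hypothetical helper, not part of the library API. The size estimate assumes one byte per subspace (256 centroids per subspace), which is a common PQ layout but is not confirmed by this document, and it ignores codebooks and graph structure.

```java
// Hypothetical helper for illustration only; not part of the library API.
final class PqSizing
{
    private PqSizing() {}

    /** True if the dimension splits into equally sized subspaces. */
    static boolean dividesEvenly(int dimension, int subspaces)
    {
        return dimension > 0 && subspaces > 0 && dimension % subspaces == 0;
    }

    /** Bytes needed for one uncompressed float32 vector. */
    static long rawBytes(int dimension)
    {
        return (long) dimension * Float.BYTES;
    }

    /** Bytes for one PQ code, ASSUMING one byte per subspace (256 centroids each). */
    static long pqBytes(int subspaces)
    {
        return subspaces;
    }

    public static void main(String[] args)
    {
        int dimension = 768;
        int subspaces = 48;
        if (!dividesEvenly(dimension, subspaces))
        {
            throw new IllegalArgumentException("pqSubspaces must divide dimension evenly");
        }
        System.out.println("dimensions per subspace: " + dimension / subspaces); // 16
        System.out.println("raw bytes per vector:    " + rawBytes(dimension));   // 3072
        System.out.println("PQ bytes per vector:     " + pqBytes(subspaces));    // 48
    }
}
```

Under these assumptions, the configuration above stores 48 bytes per vector instead of 3072, a 64x reduction of the vector payload.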

== Production Configuration

For production systems with continuous updates, enable both background persistence and optimization.

[source, java]
----
VectorIndexConfiguration config = VectorIndexConfiguration.builder()
    .dimension(768)
    .similarityFunction(VectorSimilarityFunction.COSINE)
    // On-disk storage
    .onDisk(true)
    .indexDirectory(Path.of("/data/vectors"))
    // Background persistence (async, non-blocking, enabled by setting interval > 0)
    .persistenceIntervalMs(30_000)        // Enable, check every 30 seconds
    .minChangesBetweenPersists(100)       // Only persist if >= 100 changes
    .persistOnShutdown(true)              // Persist on close()
    // Background optimization (periodic cleanup, enabled by setting interval > 0)
    .optimizationIntervalMs(60_000)       // Enable, check every 60 seconds
    .minChangesBetweenOptimizations(1000) // Only optimize if >= 1000 changes
    .optimizeOnShutdown(false)            // Skip for faster shutdown
    .build();
----
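
The library's scheduler is not shown in this document. As a rough mental model of "check every N milliseconds, but only act if at least M changes accumulated", the following sketch may help. `ChangeGate` is a hypothetical class written for illustration and does not reflect the actual implementation.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical illustration of interval + minimum-change gating; not library code.
final class ChangeGate
{
    private final long minChanges;
    private final AtomicLong pending = new AtomicLong();

    ChangeGate(long minChanges)
    {
        this.minChanges = minChanges;
    }

    /** Called for every add, update, or remove. */
    void recordChange()
    {
        pending.incrementAndGet();
    }

    /**
     * Called once per interval tick. Runs the task only if enough changes
     * have accumulated, and returns whether it ran.
     */
    boolean onTick(Runnable task)
    {
        long seen = pending.get();
        if (seen < minChanges)
        {
            return false; // too few changes, skip this tick
        }
        task.run();
        pending.addAndGet(-seen); // keep changes that arrived while the task ran
        return true;
    }
}
```

With a gate created as `new ChangeGate(100)` and ticked every 30 seconds, a quiet index would skip persistence entirely, which matches the intent of the `minChangesBetweenPersists(100)` comment above.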

== Manual Optimization and Persistence

For fine-grained control, you can manually trigger optimization and persistence.

[source, java]
----
// Optimize graph (removes excess neighbors, improves query latency)
index.optimize();

// Persist to disk (for on-disk indices)
index.persistToDisk();

// Close index (runs shutdown hooks based on config)
index.close();
----

== Multiple Vector Indices

You can create multiple vector indices for different embedding types on the same entity.

[source, java]
----
GigaMap<Document> gigaMap = GigaMap.New();
VectorIndices<Document> vectorIndices = gigaMap.index().register(VectorIndices.Category());

// Title embeddings (smaller dimension)
VectorIndexConfiguration titleConfig = VectorIndexConfiguration.builder()
    .dimension(384)
    .similarityFunction(VectorSimilarityFunction.COSINE)
    .build();
VectorIndex<Document> titleIndex = vectorIndices.add("title", titleConfig, new TitleVectorizer());

// Content embeddings (larger dimension)
VectorIndexConfiguration contentConfig = VectorIndexConfiguration.builder()
    .dimension(768)
    .similarityFunction(VectorSimilarityFunction.COSINE)
    .build();
VectorIndex<Document> contentIndex = vectorIndices.add("content", contentConfig, new ContentVectorizer());
----

== Hybrid Search

Combine vector similarity search with traditional bitmap index filtering. The example below post-filters: it runs the vector search over the whole index and then keeps only hits whose entity ID passed the bitmap filter, so it can return fewer than 10 documents.

[source, java]
----
// First, filter by category using bitmap index.
// A HashSet (java.util) keeps the contains() check below at constant time.
Set<Long> categoryIds = new HashSet<>(
    bitmapIndex.query(
        categoryIndexer.is("technology")
    ).toList()
);

// Then run the vector search (top 10 over the whole index)
VectorSearchResult<Document> result = vectorIndex.search(queryVector, 10);

// Combine results: keep only hits that also match the category filter
List<Document> hybridResults = result.stream()
    .filter(e -> categoryIds.contains(e.entityId()))
    .map(VectorSearchResult.Entry::entity)
    .toList();
----
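
Because filtering after a top-k search can leave fewer than k hits, one common remedy is to over-fetch (request more than k neighbours), filter, and then truncate to k. The sketch below shows only the filter-and-truncate step on plain ID lists. `PostFilter` is a hypothetical helper, and how much to over-fetch is a tuning decision this document does not specify.

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical helper for illustration only; not part of the library API.
final class PostFilter
{
    private PostFilter() {}

    /**
     * Keeps the first k IDs from a similarity-ranked list that are also in
     * the allowed set. The ranking order of the input is preserved.
     */
    static List<Long> topKAllowed(List<Long> rankedIds, Set<Long> allowedIds, int k)
    {
        return rankedIds.stream()
            .filter(allowedIds::contains)
            .limit(k)
            .collect(Collectors.toList());
    }
}
```

For example, to end up with 10 filtered hits you might search for 50 candidates, pass their entity IDs in ranked order together with the bitmap filter's IDs, and call `topKAllowed(rankedIds, allowedIds, 10)`.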

== Vectorizer Implementations

=== Embedded Vectors

When the vector is stored directly in the entity, set `isEmbedded()` to `true` to avoid duplicate storage.

[source, java]
----
public class DocumentVectorizer extends Vectorizer<Document>
{
    @Override
    public float[] vectorize(Document entity)
    {
        return entity.embedding();
    }

    @Override
    public boolean isEmbedded()
    {
        return true;
    }
}
----

=== Computed Vectors

When vectors are computed on the fly or fetched from an external service, set `isEmbedded()` to `false`.

[source, java]
----
public class TextVectorizer extends Vectorizer<Document>
{
    private final EmbeddingService embeddingService;

    public TextVectorizer(EmbeddingService embeddingService)
    {
        this.embeddingService = embeddingService;
    }

    @Override
    public float[] vectorize(Document entity)
    {
        return embeddingService.embed(entity.text());
    }

    @Override
    public boolean isEmbedded()
    {
        return false; // Vector will be stored separately
    }
}
----

CAUTION: When using computed vectors with an external service, be aware of potential latency during indexing operations. Consider pre-computing embeddings and storing them in the entity for better performance.

IMPORTANT: The `vectorize()` method must never return `null`. If it does, an `IllegalStateException` is thrown when the entity is added, updated, or re-indexed. If some entities cannot produce a vector, exclude them before adding them to the GigaMap, or have the vectorizer provide a fallback vector.
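
One way to exclude entities without a usable vector before they reach the GigaMap is a small pre-filter. The sketch below is a hypothetical helper (`VectorGuard` is not part of the library API) that takes the embedding logic as a plain `Function`. The length check against the configured dimension is an extra safeguard assumed here, not a rule stated in this document.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Hypothetical helper for illustration only; not part of the library API.
final class VectorGuard
{
    private VectorGuard() {}

    /**
     * Returns only the entities for which the embedding function yields a
     * non-null vector of the expected dimension.
     */
    static <T> List<T> withUsableVector(List<T> entities, Function<? super T, float[]> embed, int dimension)
    {
        List<T> accepted = new ArrayList<>();
        for (T entity : entities)
        {
            float[] vector = embed.apply(entity);
            if (vector != null && vector.length == dimension)
            {
                accepted.add(entity);
            }
        }
        return accepted;
    }
}
```

Note that this calls the embedding function once per entity in addition to the call the index makes later, so with a slow external service it is worth caching the result or storing the embedding in the entity.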

== Benchmarking

The library includes benchmark tests that follow the https://ann-benchmarks.com/[ANN-Benchmarks] methodology.

[source, bash]
----
# Run benchmark tests (disabled by default).
# The property is quoted so the shell does not try to expand the '*'.
mvn test -Dtest=VectorIndexBenchmarkTest \
    '-Djunit.jupiter.conditions.deactivate=org.junit.*DisabledCondition'
----

=== Benchmark Results (10K vectors, 128 dimensions)

[options="header",cols="1,1"]
|===
|Metric |Result

|*Recall@10* (clustered data)
|94.3%

|*Recall@50* (clustered data)
|100%

|*QPS* (queries/second)
|~10,000+

|*Average latency*
|< 0.1ms

|*p99 latency*
|< 0.2ms
|===
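
For readers unfamiliar with the Recall@k metric in the table, the sketch below uses the usual set-overlap definition: the fraction of the true k nearest neighbours that also appear in the approximate top k. `Recall` is a hypothetical helper, and the benchmark test's exact computation is not shown in this document, so it may differ in detail.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper for illustration only; not part of the library API.
final class Recall
{
    private Recall() {}

    /**
     * Recall@k = |approximate top-k ∩ exact top-k| / |exact top-k|.
     * Both lists must be ordered from most to least similar.
     */
    static double recallAtK(List<Long> approximate, List<Long> exact, int k)
    {
        if (k <= 0 || exact.isEmpty())
        {
            throw new IllegalArgumentException("k must be positive and ground truth non-empty");
        }
        Set<Long> truth = new HashSet<>(exact.subList(0, Math.min(k, exact.size())));
        long hits = approximate.stream()
            .limit(k)
            .distinct()
            .filter(truth::contains)
            .count();
        return (double) hits / truth.size();
    }

    public static void main(String[] args)
    {
        List<Long> exact = List.of(1L, 2L, 5L, 6L);
        List<Long> approximate = List.of(1L, 2L, 3L, 4L);
        System.out.println(recallAtK(approximate, exact, 4)); // 0.5
    }
}
```

The exact neighbours are typically obtained by a brute-force scan over all vectors, which is feasible at the 10K scale used above.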
