Commit 94c2d0c

add parameter config
1 parent 3073c2a commit 94c2d0c

5 files changed, +291 -8 lines changed

paimon-diskann/PARAMETER_TUNING.md

Lines changed: 195 additions & 0 deletions
@@ -0,0 +1,195 @@
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->

# DiskANN Parameter Tuning Guide

This document provides guidance on tuning DiskANN vector index parameters for optimal performance in Apache Paimon.

## Overview

DiskANN is a graph-based approximate nearest neighbor (ANN) search algorithm designed for efficient billion-point vector search. The implementation in Paimon provides several parameters to control the trade-offs between accuracy, speed, and resource usage.

## Key Parameters

### 1. Graph Construction Parameters

#### `vector.diskann.max-degree` (R)
- **Default**: 64
- **Range**: 32-128
- **Description**: Maximum degree (number of connections) for each node in the graph
- **Impact**:
  - Higher values → Better recall, higher memory usage, longer build time
  - Lower values → Faster build, lower memory, potentially lower recall
- **Recommendations**:
  - **32**: For memory-constrained environments or when build time is critical
  - **64**: Balanced default (Microsoft recommended)
  - **128**: For maximum recall when resources permit

#### `vector.diskann.build-list-size` (L)
- **Default**: 100
- **Range**: 50-200
- **Description**: Size of the candidate list during graph construction
- **Impact**:
  - Higher values → Better graph quality, longer build time
  - Lower values → Faster build, potentially lower recall
- **Recommendations**:
  - Use default 100 for most cases
  - Increase to 150-200 for very high-dimensional data (>512 dimensions)

### 2. Search Parameters

#### `vector.diskann.search-list-size` (L)
- **Default**: 100
- **Range**: 16-500
- **Description**: Size of the candidate list during search
- **Impact**:
  - Higher values → Better recall, higher latency
  - Lower values → Lower latency, potentially lower recall
- **Dynamic Behavior**: The implementation automatically adjusts this to be at least equal to the requested `k` (number of results)
- **Recommendations**:
  - **16-32**: For latency-critical applications (QPS > 5000)
  - **100**: Balanced default
  - **200-500**: For maximum recall (recall > 95%)

#### `vector.search-factor`
- **Default**: 10
- **Range**: 5-20
- **Description**: Multiplier for search limit when row filtering is applied
- **Impact**: When filtering by row IDs, the search fetches `limit * search-factor` candidates so that enough matches remain after filtering
- **Recommendations**:
  - **5**: When the filter excludes only a small share of rows (<10% of data)
  - **10**: Default for typical filtering scenarios
  - **20**: When the filter excludes a large share of rows (>50% of data)

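The fetch-size arithmetic for `vector.search-factor` can be sketched as follows. This is a minimal illustration rather than Paimon's actual code: the class and method names are hypothetical, and the clamp to `Integer.MAX_VALUE` is an added assumption. The `Math.max` step mirrors the dynamic search-list sizing described under `vector.diskann.search-list-size`.

```java
/** Hypothetical helper; names are illustrative and not part of Paimon's API. */
final class FilteredSearchSizing {

    private FilteredSearchSizing() {}

    /** Number of candidates to fetch when a row-ID filter is present: limit * searchFactor. */
    static int fetchSize(int limit, int searchFactor) {
        if (limit <= 0 || searchFactor <= 0) {
            throw new IllegalArgumentException("limit and searchFactor must be positive");
        }
        // Multiply in long to avoid int overflow, then clamp (the clamp is an assumption).
        long fetch = (long) limit * searchFactor;
        return (int) Math.min(fetch, Integer.MAX_VALUE);
    }

    /** Search list size actually used: never smaller than the number of results requested. */
    static int effectiveSearchListSize(int configuredSearchListSize, int k) {
        return Math.max(configuredSearchListSize, k);
    }

    /** Expected matches after filtering if a fraction passRate of rows survives the filter. */
    static double expectedMatches(int fetchSize, double passRate) {
        return fetchSize * passRate;
    }
}
```

With `limit = 10` and the default factor of 10, 100 candidates are fetched. If only half of the rows pass the filter, about 50 of those candidates are expected to survive, which still covers the 10 requested results.
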
### 3. Data Configuration

#### `vector.dim`
- **Default**: 128
- **Description**: Dimension of the vectors
- **Recommendations**:
  - Must match your embedding model
  - Common values: 128, 256, 384, 512, 768, 1024

#### `vector.metric`
- **Default**: L2
- **Options**: L2, INNER_PRODUCT, COSINE
- **Description**: Distance metric for similarity computation
- **Recommendations**:
  - **L2**: For Euclidean distance (most common)
  - **INNER_PRODUCT**: For dot product similarity (use with normalized vectors)
  - **COSINE**: For cosine similarity

#### `vector.normalize`
- **Default**: false
- **Description**: Whether to L2-normalize vectors before indexing/searching
- **Recommendations**:
  - **true**: When using COSINE metric or when vectors have varying magnitudes
  - **false**: When vectors are already normalized or using L2 metric

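As an illustration of what L2 normalization does, here is a minimal sketch. The reader class in this commit has a `normalizeL2(float[] vector)` method, but its body is not shown in the diff, so the code below is an assumption about the general technique and not Paimon's implementation. Leaving zero vectors unchanged is also an assumed policy.

```java
/** Illustrative L2 normalization; not taken from the Paimon sources. */
final class L2Normalizer {

    private L2Normalizer() {}

    /** Scales the vector in place to unit L2 norm; zero vectors are left unchanged. */
    static void normalizeL2(float[] vector) {
        double sumOfSquares = 0.0;
        for (float v : vector) {
            sumOfSquares += (double) v * v;
        }
        if (sumOfSquares == 0.0) {
            return; // Avoid division by zero (assumed policy for zero vectors).
        }
        float inverseNorm = (float) (1.0 / Math.sqrt(sumOfSquares));
        for (int i = 0; i < vector.length; i++) {
            vector[i] *= inverseNorm;
        }
    }

    /** Plain dot product; on unit-length vectors this equals cosine similarity. */
    static float dot(float[] a, float[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("dimension mismatch");
        }
        float sum = 0f;
        for (int i = 0; i < a.length; i++) {
            sum += a[i] * b[i];
        }
        return sum;
    }
}
```

Once both the indexed vectors and the query are unit length, the inner product and cosine similarity rank neighbors identically, which is why normalization pairs naturally with the COSINE and INNER_PRODUCT metrics.
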
### 4. Index Organization

#### `vector.size-per-index`
- **Default**: 2,000,000
- **Description**: Number of vectors per index file
- **Impact**:
  - Larger values → Fewer files, higher memory per index, better search efficiency
  - Smaller values → More files, lower memory per index, more overhead
- **Recommendations**:
  - **500,000**: For small datasets or memory-constrained environments
  - **2,000,000**: Default for balanced performance
  - **5,000,000+**: For large-scale production systems with ample resources

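To get a feel for how `vector.size-per-index` drives memory per index, a back-of-envelope estimate can be sketched as below. It assumes 32-bit float components (the Java API in this commit works with `float[]`) and, as a pure assumption, 32-bit neighbor IDs with every node at full degree. Real index files carry additional overhead, so treat the numbers as rough lower-order guidance only. The class name is hypothetical.

```java
/** Rough, assumption-laden memory estimate; not derived from the native library. */
final class IndexMemoryEstimate {

    private IndexMemoryEstimate() {}

    /** Bytes for the raw vectors, assuming float32 components. */
    static long rawVectorBytes(long numVectors, int dim) {
        return Math.multiplyExact(Math.multiplyExact(numVectors, (long) dim), (long) Float.BYTES);
    }

    /** Upper bound on adjacency bytes, assuming 32-bit neighbor IDs and full degree. */
    static long graphBytesUpperBound(long numVectors, int maxDegree) {
        return Math.multiplyExact(
                Math.multiplyExact(numVectors, (long) maxDegree), (long) Integer.BYTES);
    }

    /** Sum of both parts; ignores metadata and allocator overhead. */
    static long totalBytes(long numVectors, int dim, int maxDegree) {
        return Math.addExact(
                rawVectorBytes(numVectors, dim), graphBytesUpperBound(numVectors, maxDegree));
    }
}
```

Under these assumptions, the defaults (2,000,000 vectors, 128 dimensions, max degree 64) come to 1,024,000,000 bytes of vectors plus at most 512,000,000 bytes of graph, about 1.5 GB per index file.
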
#### `vector.diskann.index-type`
- **Default**: MEMORY
- **Options**: MEMORY, DISK
- **Description**: Type of index structure
- **Recommendations**:
  - **MEMORY**: For datasets that fit in RAM (best performance)
  - **DISK**: For datasets exceeding RAM (requires SSD)

## Performance Tuning Guide

### High Recall (>95%)
```properties
vector.diskann.max-degree = 128
vector.diskann.build-list-size = 150
vector.diskann.search-list-size = 200
```

### Balanced (90-95% recall)
```properties
vector.diskann.max-degree = 64
vector.diskann.build-list-size = 100
vector.diskann.search-list-size = 100
```

### High QPS (Low Latency)
```properties
vector.diskann.max-degree = 32
vector.diskann.build-list-size = 75
vector.diskann.search-list-size = 32
```

### Memory-Constrained
```properties
vector.diskann.max-degree = 32
vector.diskann.build-list-size = 75
vector.size-per-index = 500000
vector.diskann.index-type = DISK
```

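Before applying a profile, it can help to check its values against the ranges listed in this guide. The sketch below parses a profile with `java.util.Properties` and validates each integer. The class and method names are hypothetical, and the ranges are this guide's recommendations; whether Paimon itself rejects out-of-range values is not stated here.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

/** Hypothetical profile checker; the ranges come from this guide, not from Paimon's code. */
final class DiskAnnProfileCheck {

    private DiskAnnProfileCheck() {}

    /** Parses "key = value" lines in standard Java properties syntax. */
    static Properties parse(String text) {
        Properties properties = new Properties();
        try {
            properties.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return properties;
    }

    /** Returns the configured value (or the default) and rejects values outside [min, max]. */
    static int intInRange(Properties properties, String key, int defaultValue, int min, int max) {
        String raw = properties.getProperty(key);
        int value = raw == null ? defaultValue : Integer.parseInt(raw.trim());
        if (value < min || value > max) {
            throw new IllegalArgumentException(
                    key + " = " + value + " is outside the guide's range [" + min + ", " + max + "]");
        }
        return value;
    }
}
```

For the High QPS profile, `intInRange(p, "vector.diskann.max-degree", 64, 32, 128)` returns 32, while a value such as 256 is rejected with an `IllegalArgumentException`.
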
## Best Practices

1. **Start with defaults**: The default parameters are tuned for balanced performance
2. **Measure first**: Profile your workload before tuning
3. **Tune incrementally**: Change one parameter at a time and measure impact
4. **Consider trade-offs**: Higher recall typically means higher latency and resource usage
5. **Test with production data**: Parameter effectiveness depends on data characteristics

## Advanced Parameters (Future Enhancement)

The following parameters are documented in the official Microsoft DiskANN implementation but are not yet exposed in the current Rust-based native library:

- **alpha** (default: 1.2): Controls the graph construction pruning strategy
- **saturate_graph** (default: true): Whether to saturate the graph during construction

These parameters may be added in future versions when the underlying Rust DiskANN crate exposes them through its configuration API.

## Performance Metrics

When tuning parameters, monitor these metrics:
- **Recall**: Percentage of true nearest neighbors found
- **QPS (Queries Per Second)**: Throughput of search operations
- **Latency**: Time to complete a single query (p50, p95, p99)
- **Memory Usage**: RAM consumed by indices
- **Build Time**: Time to construct the index

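Recall is usually measured against exact results from a brute-force search over the same data. A minimal sketch of that calculation follows; the class name is hypothetical, and the ground-truth IDs are assumed to come from such an exact search.

```java
import java.util.HashSet;
import java.util.Set;

/** Illustrative recall@k calculation; not part of Paimon. */
final class RecallAtK {

    private RecallAtK() {}

    /** Fraction of ground-truth IDs that appear in the returned IDs (duplicates count once). */
    static double recall(long[] groundTruthIds, long[] returnedIds) {
        if (groundTruthIds.length == 0) {
            throw new IllegalArgumentException("ground truth must not be empty");
        }
        Set<Long> truth = new HashSet<>();
        for (long id : groundTruthIds) {
            truth.add(id);
        }
        Set<Long> counted = new HashSet<>();
        int hits = 0;
        for (long id : returnedIds) {
            // Count each true neighbor at most once, even if the result list repeats it.
            if (truth.contains(id) && counted.add(id)) {
                hits++;
            }
        }
        return (double) hits / truth.size();
    }
}
```

Averaging this value over a representative set of queries gives the recall figure to compare across parameter settings.
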
## Recent Improvements

### Dynamic Search List Sizing (v1.0+)
The search list size is now automatically raised to at least the requested `k`, following the Milvus guidance that the search list should be no smaller than top-k. This keeps the candidate list from being smaller than the number of results requested; it does not by itself guarantee a particular recall level.

### Memory-Efficient Loading (v1.0+)
Indices are now copied to local temporary files before loading. The current implementation still reads each file fully into memory to deserialize it, so this is groundwork for file-based (mmap) loading rather than a memory saving today.

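For readers curious what a file-backed load could look like, here is a minimal sketch that spills a stream to a temporary file and maps it read-only with `FileChannel.map`. It illustrates the general JDK technique only. The class and method names are hypothetical, and whether the native DiskANN library can consume a mapped buffer is an open assumption that this commit does not address.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.nio.file.StandardOpenOption;

/** Illustrative temp-file spill plus read-only mapping; not Paimon's loader. */
final class TempFileMapping {

    private TempFileMapping() {}

    /** Copies the stream into a fresh temporary file and returns its path. */
    static Path spill(InputStream in) {
        try {
            Path tempFile = Files.createTempFile("diskann-sketch-", ".index");
            Files.copy(in, tempFile, StandardCopyOption.REPLACE_EXISTING);
            return tempFile;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Maps the whole file read-only; the mapping stays valid after the channel is closed. */
    static MappedByteBuffer mapReadOnly(Path file) {
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            // A single mapping is limited to Integer.MAX_VALUE bytes.
            return channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Because a single mapping cannot exceed `Integer.MAX_VALUE` bytes, larger index files would need to be mapped in several regions.
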
## References

- [Microsoft DiskANN Paper](https://proceedings.neurips.cc/paper/2019/file/09853c7fb1d3f8ee67a61b6bf4a7f8e6-Paper.pdf)
- [Microsoft DiskANN Library](https://github.com/microsoft/DiskANN)
- [Milvus DiskANN Documentation](https://milvus.io/docs/diskann.md)

paimon-diskann/paimon-diskann-index/src/main/java/org/apache/paimon/diskann/index/DiskAnnIndex.java

Lines changed: 42 additions & 4 deletions
@@ -37,14 +37,23 @@ public class DiskAnnIndex implements Closeable {
     private final int dimension;
     private final DiskAnnVectorMetric metric;
     private final DiskAnnIndexType indexType;
+    private final int maxDegree;
+    private final int buildListSize;
     private volatile boolean closed = false;

     private DiskAnnIndex(
-            Index index, int dimension, DiskAnnVectorMetric metric, DiskAnnIndexType indexType) {
+            Index index,
+            int dimension,
+            DiskAnnVectorMetric metric,
+            DiskAnnIndexType indexType,
+            int maxDegree,
+            int buildListSize) {
         this.index = index;
         this.dimension = dimension;
         this.metric = metric;
         this.indexType = indexType;
+        this.maxDegree = maxDegree;
+        this.buildListSize = buildListSize;
     }

     public static DiskAnnIndex create(
@@ -56,7 +65,7 @@ public static DiskAnnIndex create(
         MetricType metricType = metric.toMetricType();
         Index index =
                 Index.create(dimension, metricType, indexType.value(), maxDegree, buildListSize);
-        return new DiskAnnIndex(index, dimension, metric, indexType);
+        return new DiskAnnIndex(index, dimension, metric, indexType, maxDegree, buildListSize);
     }

     public void addWithIds(ByteBuffer vectorBuffer, ByteBuffer idBuffer, int n) {
@@ -66,7 +75,12 @@ public void addWithIds(ByteBuffer vectorBuffer, ByteBuffer idBuffer, int n) {
         index.addWithIds(n, vectorBuffer, idBuffer);
     }

-    public void build(int buildListSize) {
+    /**
+     * Build the index graph after adding vectors.
+     *
+     * <p>Uses the buildListSize parameter that was specified during index creation.
+     */
+    public void build() {
         ensureOpen();
         index.build(buildListSize);
     }
@@ -114,6 +128,14 @@ public DiskAnnIndexType indexType() {
         return indexType;
     }

+    public int maxDegree() {
+        return maxDegree;
+    }
+
+    public int buildListSize() {
+        return buildListSize;
+    }
+
     public long serializeSize() {
         ensureOpen();
         return index.serializeSize();
@@ -129,7 +151,23 @@ public long serialize(ByteBuffer buffer) {
129151

     public static DiskAnnIndex deserialize(byte[] data, DiskAnnVectorMetric metric) {
         Index index = Index.deserialize(data);
-        return new DiskAnnIndex(index, index.getDimension(), metric, DiskAnnIndexType.UNKNOWN);
+        return new DiskAnnIndex(
+                index, index.getDimension(), metric, DiskAnnIndexType.UNKNOWN, 64, 100);
+    }
+
+    /**
+     * Reset the index (remove all vectors).
+     *
+     * <p>Note: This is not supported in the current implementation. DiskANN indices are immutable
+     * once built. To "reset", you must create a new index.
+     *
+     * @throws UnsupportedOperationException always, as reset is not currently supported
+     */
+    public void reset() {
+        throw new UnsupportedOperationException(
+                "Reset is not supported for DiskANN indices. "
+                        + "DiskANN indices are immutable once built. "
+                        + "Please create a new index instead.");
     }

     public static ByteBuffer allocateVectorBuffer(int numVectors, int dimension) {

paimon-diskann/paimon-diskann-index/src/main/java/org/apache/paimon/diskann/index/DiskAnnVectorGlobalIndexReader.java

Lines changed: 49 additions & 3 deletions
@@ -31,23 +31,30 @@
 import org.apache.paimon.utils.IOUtils;
 import org.apache.paimon.utils.RoaringNavigableMap64;

+import java.io.File;
+import java.io.FileOutputStream;
 import java.io.IOException;
+import java.nio.file.Files;
 import java.util.ArrayList;
 import java.util.Comparator;
 import java.util.HashMap;
 import java.util.List;
 import java.util.Optional;
 import java.util.PriorityQueue;
+import java.util.UUID;

 /**
  * Vector global index reader using DiskANN.
  *
- * <p>This implementation uses DiskANN for efficient approximate nearest neighbor search.
+ * <p>This implementation uses DiskANN for efficient approximate nearest neighbor search. It
+ * supports lazy loading of indices and optional memory-mapped file loading for better memory
+ * efficiency with large indices.
  */
 public class DiskAnnVectorGlobalIndexReader implements GlobalIndexReader {

     private final List<DiskAnnIndex> indices;
     private final List<DiskAnnIndexMeta> indexMetas;
+    private final List<File> localIndexFiles;
     private final List<GlobalIndexIOMeta> ioMetas;
     private final GlobalIndexFileReader fileReader;
     private final DataType fieldType;
@@ -66,6 +73,7 @@ public DiskAnnVectorGlobalIndexReader(
         this.options = options;
         this.indices = new ArrayList<>();
         this.indexMetas = new ArrayList<>();
+        this.localIndexFiles = new ArrayList<>();
     }

     @Override
@@ -144,7 +152,10 @@ private GlobalIndexResult search(VectorSearch vectorSearch) throws IOException {
             float[] distances = new float[effectiveK];
             long[] labels = new long[effectiveK];

-            index.search(queryVector, 1, effectiveK, options.searchListSize(), distances, labels);
+            // Dynamic search list sizing: use max of configured value and effectiveK
+            // This follows Milvus best practice: search_list should be >= topk
+            int dynamicSearchListSize = Math.max(options.searchListSize(), effectiveK);
+            index.search(queryVector, 1, effectiveK, dynamicSearchListSize, distances, labels);

             for (int i = 0; i < effectiveK; i++) {
                 long rowId = labels[i];
@@ -259,7 +270,25 @@ private void loadIndexAt(int position) throws IOException {
     }

     private DiskAnnIndex loadIndex(SeekableInputStream in) throws IOException {
-        byte[] data = IOUtils.readFully(in, true);
+        // For better memory efficiency, write to a temporary file
+        // This allows the OS to manage memory more efficiently for large indices
+        File tempIndexFile =
+                Files.createTempFile("paimon-diskann-" + UUID.randomUUID(), ".index").toFile();
+        localIndexFiles.add(tempIndexFile);
+
+        // Copy index data to temp file
+        try (FileOutputStream fos = new FileOutputStream(tempIndexFile)) {
+            byte[] buffer = new byte[32768];
+            int bytesRead;
+            while ((bytesRead = in.read(buffer)) != -1) {
+                fos.write(buffer, 0, bytesRead);
+            }
+        }
+
+        // Load from file for potential mmap benefits
+        // Note: Current implementation still deserializes to memory
+        // Future enhancement: Add native file-based loading if supported
+        byte[] data = Files.readAllBytes(tempIndexFile.toPath());
         return DiskAnnIndex.deserialize(data, options.metric());
     }

@@ -280,6 +309,7 @@ private void normalizeL2(float[] vector) {
     public void close() throws IOException {
         Throwable firstException = null;

+        // Close all DiskANN indices
         for (DiskAnnIndex index : indices) {
             if (index == null) {
                 continue;
@@ -296,6 +326,22 @@ public void close() throws IOException {
         }
         indices.clear();

+        // Delete temporary files
+        for (File tempFile : localIndexFiles) {
+            try {
+                if (tempFile != null && tempFile.exists()) {
+                    tempFile.delete();
+                }
+            } catch (Throwable t) {
+                if (firstException == null) {
+                    firstException = t;
+                } else {
+                    firstException.addSuppressed(t);
+                }
+            }
+        }
+        localIndexFiles.clear();
+
         if (firstException != null) {
             if (firstException instanceof IOException) {
                 throw (IOException) firstException;
