| 1 | +<!-- |
| 2 | +Licensed to the Apache Software Foundation (ASF) under one |
| 3 | +or more contributor license agreements. See the NOTICE file |
| 4 | +distributed with this work for additional information |
| 5 | +regarding copyright ownership. The ASF licenses this file |
| 6 | +to you under the Apache License, Version 2.0 (the |
| 7 | +"License"); you may not use this file except in compliance |
| 8 | +with the License. You may obtain a copy of the License at |
| 9 | + |
| 10 | + http://www.apache.org/licenses/LICENSE-2.0 |
| 11 | + |
| 12 | +Unless required by applicable law or agreed to in writing, |
| 13 | +software distributed under the License is distributed on an |
| 14 | +"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY |
| 15 | +KIND, either express or implied. See the License for the |
| 16 | +specific language governing permissions and limitations |
| 17 | +under the License. |
| 18 | +--> |
| 19 | + |
| 20 | +# DiskANN Parameter Tuning Guide |
| 21 | + |
| 22 | +This document provides guidance on tuning DiskANN vector index parameters for optimal performance in Apache Paimon. |
| 23 | + |
| 24 | +## Overview |
| 25 | + |
| 26 | +DiskANN is a graph-based approximate nearest neighbor (ANN) search algorithm designed for efficient billion-point vector search. The implementation in Paimon provides several parameters to control the trade-offs between accuracy, speed, and resource usage. |
| 27 | + |
| 28 | +## Key Parameters |
| 29 | + |
| 30 | +### 1. Graph Construction Parameters |
| 31 | + |
| 32 | +#### `vector.diskann.max-degree` (R) |
| 33 | +- **Default**: 64 |
| 34 | +- **Range**: 32-128 |
| 35 | +- **Description**: Maximum degree (number of connections) for each node in the graph |
| 36 | +- **Impact**: |
| 37 | + - Higher values → Better recall, higher memory usage, longer build time |
| 38 | + - Lower values → Faster build, lower memory, potentially lower recall |
| 39 | +- **Recommendations**: |
| 40 | + - **32**: For memory-constrained environments or when build time is critical |
| 41 | + - **64**: Balanced default (Microsoft recommended) |
| 42 | + - **128**: For maximum recall when resources permit |
| 43 | + |
| 44 | +#### `vector.diskann.build-list-size` (L_build) |
| 45 | +- **Default**: 100 |
| 46 | +- **Range**: 50-200 |
| 47 | +- **Description**: Size of the candidate list during graph construction |
| 48 | +- **Impact**: |
| 49 | + - Higher values → Better graph quality, longer build time |
| 50 | + - Lower values → Faster build, potentially lower recall |
| 51 | +- **Recommendations**: |
| 52 | + - Use default 100 for most cases |
| 53 | + - Increase to 150-200 for very high-dimensional data (>512 dimensions) |
| 54 | + |
| 55 | +### 2. Search Parameters |
| 56 | + |
| 57 | +#### `vector.diskann.search-list-size` (L_search) |
| 58 | +- **Default**: 100 |
| 59 | +- **Range**: 16-500 |
| 60 | +- **Description**: Size of the candidate list during search |
| 61 | +- **Impact**: |
| 62 | + - Higher values → Better recall, higher latency |
| 63 | + - Lower values → Lower latency, potentially lower recall |
| 64 | +- **Dynamic Behavior**: The implementation automatically adjusts this to be at least equal to the requested `k` (number of results) |
| 65 | +- **Recommendations**: |
| 66 | + - **16-32**: For latency-critical or high-throughput applications (for example, when targeting QPS > 5000) |
| 67 | + - **100**: Balanced default |
| 68 | + - **200-500**: For high recall targets (recall > 95%), at the cost of higher latency |
| 69 | + |
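
The dynamic behavior described above amounts to clamping the configured list size from below by `k`. A minimal sketch of that rule, assuming a hypothetical helper name (`effective_search_list_size` is illustrative and not part of the Paimon API):

```python
def effective_search_list_size(configured_l: int, k: int) -> int:
    """Return the candidate-list size used for a top-k query.

    Hypothetical helper: mirrors the rule that the search list is
    raised to at least k, since a shorter list could not hold k results.
    """
    if configured_l <= 0 or k <= 0:
        raise ValueError("search list size and k must be positive")
    return max(configured_l, k)


# The configured value is kept when it already covers k ...
print(effective_search_list_size(100, 10))
# ... and is raised to k when it does not.
print(effective_search_list_size(32, 50))
```

In practice this means a small `search-list-size` such as 16 only takes effect for queries with `k` of 16 or less.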
| 70 | +#### `vector.search-factor` |
| 71 | +- **Default**: 10 |
| 72 | +- **Range**: 5-20 |
| 73 | +- **Description**: Multiplier for search limit when row filtering is applied |
| 74 | +- **Impact**: When filtering by row IDs, fetches `limit * search-factor` results to ensure sufficient matches after filtering |
| 75 | +- **Recommendations**: |
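| 76 | + - **5**: When the filter excludes only a small share of rows (<10% of data), so most fetched candidates survive |
| 77 | + - **10**: Default for typical filtering scenarios |
| 78 | + - **20**: When the filter excludes a large share of rows (>50% of data), so many fetched candidates are discarded |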
| 79 | + |
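
The over-fetch-then-filter pattern behind `vector.search-factor` can be sketched as follows. This is an illustration under stated assumptions, not Paimon's implementation: the function names and the toy candidate list are hypothetical, and candidates are assumed to arrive sorted by ascending distance.

```python
from typing import Iterable, List, Set, Tuple


def fetch_size(limit: int, search_factor: int) -> int:
    """Number of candidates to request from the index before filtering."""
    return limit * search_factor


def filtered_top_k(
    candidates: Iterable[Tuple[int, float]],
    allowed_row_ids: Set[int],
    limit: int,
) -> List[Tuple[int, float]]:
    """Keep the first `limit` candidates whose row ID passes the filter.

    `candidates` holds (row_id, distance) pairs sorted by distance.
    """
    kept = [(rid, dist) for rid, dist in candidates if rid in allowed_row_ids]
    return kept[:limit]


# Toy data: 30 candidates, of which every fifth row ID passes the filter.
limit, search_factor = 3, 10
n = fetch_size(limit, search_factor)
candidates = [(row_id, float(row_id)) for row_id in range(n)]
allowed = {row_id for row_id in range(n) if row_id % 5 == 0}

result = filtered_top_k(candidates, allowed, limit)
print(result)
```

If matches were spread evenly among the candidates, a filter that keeps a fraction `p` of rows would need a factor of roughly `1 / p` to return `limit` results on average; a larger factor adds a safety margin at the cost of more work per query.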
| 80 | +### 3. Data Configuration |
| 81 | + |
| 82 | +#### `vector.dim` |
| 83 | +- **Default**: 128 |
| 84 | +- **Description**: Dimension of the vectors |
| 85 | +- **Recommendations**: |
| 86 | + - Must match your embedding model |
| 87 | + - Common values: 128, 256, 384, 512, 768, 1024 |
| 88 | + |
| 89 | +#### `vector.metric` |
| 90 | +- **Default**: L2 |
| 91 | +- **Options**: L2, INNER_PRODUCT, COSINE |
| 92 | +- **Description**: Distance metric for similarity computation |
| 93 | +- **Recommendations**: |
| 94 | + - **L2**: For Euclidean distance (most common) |
| 95 | + - **INNER_PRODUCT**: For dot product similarity (use with normalized vectors) |
| 96 | + - **COSINE**: For cosine similarity |
| 97 | + |
| 98 | +#### `vector.normalize` |
| 99 | +- **Default**: false |
| 100 | +- **Description**: Whether to L2-normalize vectors before indexing/searching |
| 101 | +- **Recommendations**: |
| 102 | + - **true**: When using the COSINE metric, or when vector magnitudes vary and should not influence similarity |
| 103 | + - **false**: When vectors are already normalized, or when using the L2 metric on raw magnitudes |
| 104 | + |
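
To see why normalization and metric choice interact, the sketch below checks two standard identities on unit-length vectors: the inner product equals the cosine similarity of the original vectors, and the squared L2 distance equals `2 - 2 * cosine`. The helper names are illustrative, and leaving a zero vector unchanged is a choice made for this sketch only.

```python
import math
from typing import List, Sequence


def l2_normalize(v: Sequence[float]) -> List[float]:
    """Scale v to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        return list(v)
    return [x / norm for x in v]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def squared_l2(a: Sequence[float], b: Sequence[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


a, b = [1.0, 2.0, 3.0], [4.0, -5.0, 6.0]
na, nb = l2_normalize(a), l2_normalize(b)

# On normalized vectors, inner product and cosine similarity coincide.
print(math.isclose(dot(na, nb), cosine(a, b)))
# Squared L2 distance is then a monotone function of cosine similarity.
print(math.isclose(squared_l2(na, nb), 2 - 2 * cosine(a, b)))
```

One consequence is that on normalized data all three metrics produce the same neighbor ranking, so the choice mainly matters for unnormalized vectors.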
| 105 | +### 4. Index Organization |
| 106 | + |
| 107 | +#### `vector.size-per-index` |
| 108 | +- **Default**: 2,000,000 |
| 109 | +- **Description**: Number of vectors per index file |
| 110 | +- **Impact**: |
| 111 | + - Larger values → Fewer files, higher memory per index, better search efficiency |
| 112 | + - Smaller values → More files, lower memory per index, more overhead |
| 113 | +- **Recommendations**: |
| 114 | + - **500,000**: For small datasets or memory-constrained environments |
| 115 | + - **2,000,000**: Default for balanced performance |
| 116 | + - **5,000,000+**: For large-scale production systems with ample resources |
| 117 | + |
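
A rough way to reason about `vector.size-per-index` is to estimate how many index files a dataset produces and how much raw vector data each one holds. The sketch below assumes vectors are split evenly into files of at most `size-per-index` entries and stored as 4-byte floats; both are assumptions for illustration, and the estimate ignores graph edges and metadata.

```python
def index_file_count(total_vectors: int, size_per_index: int = 2_000_000) -> int:
    """Estimated number of index files (ceiling division)."""
    if total_vectors < 0 or size_per_index <= 0:
        raise ValueError("total_vectors must be >= 0 and size_per_index > 0")
    return -(-total_vectors // size_per_index)


def raw_vector_bytes(num_vectors: int, dim: int, bytes_per_value: int = 4) -> int:
    """Raw vector payload only; graph edges and metadata are not counted."""
    return num_vectors * dim * bytes_per_value


# 10 million 128-dimensional vectors with the default size-per-index.
print(index_file_count(10_000_000))
print(raw_vector_bytes(2_000_000, 128))

# The same dataset with the memory-constrained setting.
print(index_file_count(10_000_000, 500_000))
print(raw_vector_bytes(500_000, 128))
```

Under these assumptions, lowering `size-per-index` from 2,000,000 to 500,000 cuts the raw payload per file by a factor of four while quadrupling the number of files a search has to visit.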
| 118 | +#### `vector.diskann.index-type` |
| 119 | +- **Default**: MEMORY |
| 120 | +- **Options**: MEMORY, DISK |
| 121 | +- **Description**: Type of index structure |
| 122 | +- **Recommendations**: |
| 123 | + - **MEMORY**: For datasets that fit in RAM (best performance) |
| 124 | + - **DISK**: For datasets exceeding RAM (requires SSD) |
| 125 | + |
| 126 | +## Performance Tuning Guide |
| 127 | + |
| 128 | +### High Recall (>95%) |
| 129 | +```properties |
| 130 | +vector.diskann.max-degree = 128 |
| 131 | +vector.diskann.build-list-size = 150 |
| 132 | +vector.diskann.search-list-size = 200 |
| 133 | +``` |
| 134 | + |
| 135 | +### Balanced (90-95% recall) |
| 136 | +```properties |
| 137 | +vector.diskann.max-degree = 64 |
| 138 | +vector.diskann.build-list-size = 100 |
| 139 | +vector.diskann.search-list-size = 100 |
| 140 | +``` |
| 141 | + |
| 142 | +### High QPS (Low Latency) |
| 143 | +```properties |
| 144 | +vector.diskann.max-degree = 32 |
| 145 | +vector.diskann.build-list-size = 75 |
| 146 | +vector.diskann.search-list-size = 32 |
| 147 | +``` |
| 148 | + |
| 149 | +### Memory-Constrained |
| 150 | +```properties |
| 151 | +vector.diskann.max-degree = 32 |
| 152 | +vector.diskann.build-list-size = 75 |
| 153 | +vector.size-per-index = 500000 |
| 154 | +vector.diskann.index-type = DISK |
| 155 | +``` |
| 156 | + |
| 157 | +## Best Practices |
| 158 | + |
| 159 | +1. **Start with defaults**: The default parameters are tuned for balanced performance |
| 160 | +2. **Measure first**: Profile your workload before tuning |
| 161 | +3. **Tune incrementally**: Change one parameter at a time and measure impact |
| 162 | +4. **Consider trade-offs**: Higher recall typically means higher latency and resource usage |
| 163 | +5. **Test with production data**: Parameter effectiveness depends on data characteristics |
| 164 | + |
| 165 | +## Advanced Parameters (Future Enhancement) |
| 166 | + |
| 167 | +The following parameters are documented in the official Microsoft DiskANN implementation but are not yet exposed in the current Rust-based native library: |
| 168 | + |
| 169 | +- **alpha** (default: 1.2): Controls the graph construction pruning strategy |
| 170 | +- **saturate_graph** (default: true): Whether to saturate the graph during construction |
| 171 | + |
| 172 | +These parameters may be added in future versions when the underlying Rust DiskANN crate exposes them through its configuration API. |
| 173 | + |
| 174 | +## Performance Metrics |
| 175 | + |
| 176 | +When tuning parameters, monitor these metrics: |
| 177 | +- **Recall**: Percentage of true nearest neighbors found |
| 178 | +- **QPS (Queries Per Second)**: Throughput of search operations |
| 179 | +- **Latency**: Time to complete a single query (p50, p95, p99) |
| 180 | +- **Memory Usage**: RAM consumed by indices |
| 181 | +- **Build Time**: Time to construct the index |
| 182 | + |
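
Two of the metrics above, recall and latency percentiles, are simple to compute offline once exact nearest neighbors and per-query timings are available. The following is a minimal sketch with hypothetical helper names; it uses the nearest-rank percentile definition, which is one of several common conventions.

```python
from typing import List, Sequence


def recall_at_k(retrieved: Sequence[int], ground_truth: Sequence[int], k: int) -> float:
    """Fraction of the true top-k neighbors present in the retrieved top-k."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = len(set(retrieved[:k]) & set(ground_truth[:k]))
    return hits / k


def percentile(samples: Sequence[float], p: int) -> float:
    """Nearest-rank percentile for an integer p in 1..100."""
    if not samples or not 1 <= p <= 100:
        raise ValueError("need at least one sample and 1 <= p <= 100")
    ordered: List[float] = sorted(samples)
    rank = (p * len(ordered) + 99) // 100  # ceil(p * n / 100)
    return ordered[rank - 1]


# Two of the four true neighbors were found.
print(recall_at_k([1, 2, 3, 4], [1, 2, 5, 6], k=4))

# Per-query latencies in milliseconds.
latencies_ms = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
print(percentile(latencies_ms, 50), percentile(latencies_ms, 95), percentile(latencies_ms, 99))
```

Averaging `recall_at_k` over a representative query set gives a single number to compare across parameter settings.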
| 183 | +## Recent Improvements |
| 184 | + |
| 185 | +### Dynamic Search List Sizing (v1.0+) |
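| 186 | +The search list size is now automatically raised to be at least equal to the requested `k`. This mirrors the constraint documented for Milvus's DiskANN index, where the search list may not be smaller than the requested top-k, and guarantees that the candidate list can hold `k` results. Recall above that baseline still depends on tuning `vector.diskann.search-list-size`. |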
| 187 | + |
| 188 | +### Memory-Efficient Loading (v1.0+) |
| 189 | +Indices are now loaded through temporary files, allowing the OS to manage memory more efficiently for large indices. This is a step toward full mmap support. |
| 190 | + |
| 191 | +## References |
| 192 | + |
| 193 | +- [Microsoft DiskANN Paper](https://proceedings.neurips.cc/paper/2019/file/09853c7fb1d3f8ee67a61b6bf4a7f8e6-Paper.pdf) |
| 194 | +- [Microsoft DiskANN Library](https://github.com/microsoft/DiskANN) |
| 195 | +- [Milvus DiskANN Documentation](https://milvus.io/docs/diskann.md) |