
Commit ed35bd6

Apply suggestions from code review
Co-authored-by: Nathan Bower <[email protected]>
Signed-off-by: kolchfa-aws <[email protected]>
1 parent 09ef2b0 commit ed35bd6


_posts/2025-02-28-a-practical-guide-for-selecting-HNSW-hyperparameters.md

Lines changed: 25 additions & 25 deletions
@@ -1,6 +1,6 @@
---
layout: post
-title: "A practical guide for selecting HNSW hyperparameters"
+title: "A practical guide to selecting HNSW hyperparameters"
authors:
  - huibishe
  - jmazane
@@ -13,7 +13,7 @@ meta_keywords: HNSW, hyperparameters
meta_description: Learn how to select optimal HNSW hyperparameters by balancing recall and throughput. This guide explores portfolio learning, evaluation methods, and practical implementation in OpenSearch.
---

-Vector search plays a crucial role in many machine learning (ML) and data science pipelines. In the context of large language models (LLMs), vector search powers [retrieval-augmented generation (RAG)](https://aws.amazon.com/what-is/retrieval-augmented-generation/), a technique that retrieves relevant content from a large document collection to improve LLM responses. Because finding exact k-nearest neighbors (k-NN) is computationally expensive on large datasets, approximate nearest neighbor (ANN) search methods, such as [Hierarchical Navigable Small Worlds (HNSW)](https://arxiv.org/pdf/1603.09320), are often used to improve efficiency [1].
+Vector search plays a crucial role in many machine learning (ML) and data science pipelines. In the context of large language models (LLMs), vector search powers [retrieval-augmented generation (RAG)](https://aws.amazon.com/what-is/retrieval-augmented-generation/), a technique that retrieves relevant content from a large document collection to improve LLM responses. Because finding exact k-nearest neighbors (k-NN) is computationally expensive for large datasets, approximate nearest neighbor (ANN) search methods, such as [Hierarchical Navigable Small Worlds (HNSW)](https://arxiv.org/pdf/1603.09320), are often used to improve efficiency [1].

### Optimizing HNSW: Balancing search quality and speed

@@ -22,7 +22,7 @@ Configuring HNSW effectively is a multi-objective problem. This blog post focuse
- **Search quality**, measured by recall@k---the fraction of the top k ground truth neighbors that appear in the k results returned by HNSW.
- **Search speed**, measured by query throughput---the number of queries executed per second.

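To make the recall@k definition concrete, here is a minimal Python sketch (the function and the example IDs are illustrative placeholders, not code from the post):

```python
def recall_at_k(retrieved_ids, ground_truth_ids, k=10):
    """Fraction of the top k ground truth neighbors that appear in the top k retrieved results."""
    return len(set(retrieved_ids[:k]) & set(ground_truth_ids[:k])) / k

# Example: 8 of the 10 exact nearest neighbors were returned, so recall@10 = 0.8
print(recall_at_k([1, 2, 3, 4, 5, 6, 7, 8, 21, 22], list(range(1, 11))))
```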
-While index build time and index size are also important, we leave those aspects for future discussion.
+While index build time and index size are also important, we will leave those aspects for future discussion.

The structure of the HNSW graph is controlled by its hyperparameters, which determine how densely vectors are connected. A denser graph generally improves recall but reduces query throughput, while a sparser graph has the opposite effect. Finding the right balance requires testing multiple configurations, yet there is limited guidance on how to do this efficiently.

@@ -32,13 +32,13 @@ The three most important hyperparameters in HNSW are:

- **`M`** – The maximum number of graph edges per vector. Higher values increase memory usage but may improve search quality.
- **`efSearch`** – The size of the candidate queue during search. Larger values may improve search quality but increase search latency.
-- **`efConstruction`** – Similar to `efSearch`, but used during index construction. Higher values improve search quality but increase index build time.
+- **`efConstruction`** – Similar to `efSearch` but used during index construction. Higher values improve search quality but increase index build time.

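For orientation, the sketch below shows where these three parameters typically appear when creating an OpenSearch k-NN index. It is an illustrative assumption only: the field name `my_vector`, the dimension, and the parameter values are placeholders, and the post's actual index request appears later in the diff.

```python
# Illustrative OpenSearch index body (placeholder names and values).
index_body = {
    "settings": {
        "index": {
            "knn": True,
            # efSearch: size of the candidate queue at query time
            "knn.algo_param.ef_search": 100,
        }
    },
    "mappings": {
        "properties": {
            "my_vector": {
                "type": "knn_vector",
                "dimension": 768,  # placeholder vector dimension
                "method": {
                    "name": "hnsw",
                    "space_type": "l2",
                    "engine": "faiss",
                    "parameters": {
                        "m": 16,  # M: maximum number of graph edges per vector
                        "ef_construction": 100,  # efConstruction: candidate queue size at build time
                    },
                },
            }
        }
    },
}
```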
### Finding effective configurations

One approach to tuning these hyperparameters is **hyperparameter optimization (HPO)**, an automated technique that searches for the optimal configuration of a black-box function [5, 6]. However, HPO can be computationally expensive while providing limited benefits [3], especially in cases where the underlying algorithm is well understood.

-An alternative is **transfer learning**, where knowledge from optimizing one dataset is applied to another. This approach helps identify configurations that approximate optimal results while maintaining efficiency [3, 4].
+An alternative is **transfer learning**, where knowledge gained from optimizing one dataset is applied to another. This approach helps identify configurations that approximate optimal results while maintaining efficiency [3, 4].

### Recommended HNSW configurations

@@ -56,9 +56,9 @@ To optimize your search performance, you can **evaluate these five configuration

## Portfolio learning for HNSW

-Portfolio learning [2, 3, 4] selects a set of complementary configurations so that at least one performs well on average when evaluated across different scenarios. Applying this approach to HNSW, we aim to identify a set of configurations that balance recall and query throughput.
+Portfolio learning [2, 3, 4] selects a set of complementary configurations so that at least one performs well on average when evaluated across different scenarios. Applying this approach to HNSW, we aimed to identify a set of configurations that balance recall and query throughput.

-To achieve this, we use 15 vector search datasets spanning various modalities, embedding models, and distance functions. For each dataset, we establish ground truth by computing the top 10 nearest neighbors for every query in the test set using exact k-NN search.
+To achieve this, we used 15 vector search datasets spanning various modalities, embedding models, and distance functions, presented in the following table. For each dataset, we established ground truth by computing the top 10 nearest neighbors for every query in the test set using exact k-NN search.

| Dataset | Dimensions | Train size | Test size | Neighbors | Distance | Embedding | Domain |
|:---------------|:-------------|:-------------|:------------|------------:|:--------------|:-----------------------------------------------|:------------------------------|
@@ -89,14 +89,14 @@ search_space = {
}
```

-For these experiments, we used an OpenSearch 2.15 cluster with three cluster manager nodes and six data nodes, each running on an `r6g.4xlarge.search` instance. We evaluated test vectors in batches of 100 and recorded query throughput and recall@10 for each HNSW configuration. In the next section, we'll introduce the algorithm used to learn the portfolio.
+For these experiments, we used an OpenSearch 2.15 cluster with three cluster manager nodes and six data nodes, each running on an `r6g.4xlarge.search` instance. We evaluated test vectors in batches of 100 and recorded query throughput and recall@10 for each HNSW configuration. In the next section, we introduce the algorithm used to learn the portfolio.

### Method

To capture different trade-offs between recall and throughput, we used a simple linearization approach, assigning values between 0 and 1, inclusive, to both recall and throughput. Given a specific weighting, we identified the configuration that maximizes the linearized objective using the following four steps:

1. **Normalize recall and throughput** – Apply min-max scaling within each dataset so that recall and throughput values are comparable.
-2. **Compute weighted metric** – Combine the normalized recall and throughput using the assigned weights into a new weighted metric.
+2. **Compute weighted metric** – Using the assigned weights, combine the normalized recall and throughput into a new weighted metric.
3. **Average across datasets** – Calculate the average weighted metric across datasets.
4. **Select the best configuration** – Identify the configuration that maximizes the average weighted metric.

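The following minimal sketch illustrates these four steps, assuming recall and throughput results are collected into arrays of shape `(num_datasets, num_configs)`. The function and variable names are illustrative, not the post's actual code.

```python
import numpy as np

def select_config(recall, throughput, recall_weight):
    """Return the index of the configuration maximizing the weighted, normalized metric averaged over datasets."""
    def min_max(x):
        # Step 1: min-max scale within each dataset (row-wise)
        lo, hi = x.min(axis=1, keepdims=True), x.max(axis=1, keepdims=True)
        return (x - lo) / (hi - lo)

    # Step 2: combine the normalized metrics using the assigned weights
    weighted = recall_weight * min_max(recall) + (1.0 - recall_weight) * min_max(throughput)
    # Step 3: average across datasets; step 4: pick the configuration maximizing the average
    return int(np.argmax(weighted.mean(axis=0)))

# Example: 2 datasets, 3 candidate configurations, recall weighted at 0.9
recall = np.array([[0.80, 0.90, 0.95], [0.70, 0.85, 0.92]])
throughput = np.array([[900.0, 600.0, 300.0], [1200.0, 800.0, 400.0]])
print(select_config(recall, throughput, recall_weight=0.9))
```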
@@ -113,38 +113,38 @@ We used the following weighting profiles for recall and throughput. We did not a

## Evaluation

-We evaluated our method in two scenarios:
+We evaluated our method using two scenarios:

-1) **Leave-one-out evaluation** – One of the 15 datasets is used as the test dataset while the remaining datasets serve as the training set.
-2) **Deployment evaluation** – All 15 datasets are used for training, and the method is tested on four additional datasets using a new embedding model, [Cohere-embed-english-v3](https://huggingface.co/Cohere/Cohere-embed-english-v3.0), which was not part of the training set.
+1. **Leave-one-out evaluation** – One of the 15 datasets is used as the test dataset while the remaining datasets serve as the training set.
+2. **Deployment evaluation** – All 15 datasets are used for training, and the method is tested on 4 additional datasets using a new embedding model, [Cohere-embed-english-v3](https://huggingface.co/Cohere/Cohere-embed-english-v3.0), that was not part of the training set.

-The first scenario mimics cross-validation in machine learning, while the second simulates an evaluation in which the complete training dataset is used for model deployment.
+The first scenario mimics cross-validation in ML, while the second simulates an evaluation in which the complete training dataset is used for model deployment.

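A rough sketch of this leave-one-out protocol, reusing the hypothetical `select_config` helper sketched earlier (the dataset names and results structure are placeholders):

```python
import numpy as np

def leave_one_out(results, recall_weight):
    """results: dict mapping dataset name -> (recall, throughput) 1-D arrays over the shared configurations."""
    comparisons = {}
    for held_out in results:
        train = [name for name in results if name != held_out]
        # Learn the configuration from the training datasets only
        recall = np.stack([results[name][0] for name in train])
        throughput = np.stack([results[name][1] for name in train])
        predicted = select_config(recall, throughput, recall_weight)
        # Ground truth: the configuration selected directly on the held-out dataset
        truth = select_config(results[held_out][0][None, :], results[held_out][1][None, :], recall_weight)
        comparisons[held_out] = (predicted, truth)
    return comparisons
```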
### Leave-one-out evaluation

-For this evaluation, we determined the ground-truth configurations under different weightings by applying our method to the test dataset. We then compared these with the predicted configurations derived from the training datasets using the same method.
+For this evaluation, we determined the ground truth configurations under different weightings by applying our method to the test dataset. We then compared these with the predicted configurations derived from the training datasets using the same method.

-We calculated the mean absolute error (MAE) between the predicted and ground-truth configurations for normalized (min-max scaled) recall and throughput. The following bar plot shows the average MAE across all 15 datasets in the leave-one-out evaluation:
+We calculated the mean absolute error (MAE) between the predicted and ground truth configurations for normalized (min-max scaled) recall and throughput. The following bar plot shows the average MAE across all 15 datasets in the leave-one-out evaluation.

![MAE compared with ground truth](/assets/media/blog-images/2025-02-28-a-pratical-guide-for-selecting-HNSW-hyperparameters/mae.png){:class="img-centered"}

-The results show that the average MAEs for normalized recall are below 0.1. For context, if recall values range from 0.5 to 0.95 on a dataset, an MAE of 0.1 translates to a raw recall difference of only 0.045. This suggests that the predicted configurations closely match the ground-truth configurations, particularly for high-recall weightings.
+The results show that the average MAEs for normalized recall are below 0.1. For context, if dataset recall values range from 0.5 to 0.95, an MAE of 0.1 translates to a raw recall difference of only 0.045. This suggests that the predicted configurations closely match the ground truth configurations, particularly for high-recall weightings.

The MAEs for throughput are larger, likely because throughput measurements tend to be noisier than recall measurements. However, the MAEs decrease when higher weight is assigned to throughput.

### Deployment evaluation

-For this evaluation, we applied our method to the 15 training datasets and tested the resulting configurations on three datasets using the Cohere-embed-english-v3 embedding model. Our goal was to ensure that the learned configurations align with the Pareto front, representing different trade-offs between recall and throughput.
+For this evaluation, we applied our method to the 15 training datasets and tested the resulting configurations on 3 datasets using the Cohere-embed-english-v3 embedding model. Our goal was to ensure that the learned configurations align with the Pareto front, representing different trade-offs between recall and throughput.

-The following plot depicts the recall and throughput for the learned configurations in different colors, with other configurations displayed in gray.
+The following bar plot depicts the recall and throughput for the learned configurations in different colors, with other configurations displayed in gray.

![Trade-off of the 5 configurations](/assets/media/blog-images/2025-02-28-a-pratical-guide-for-selecting-HNSW-hyperparameters/tradeoff.png){:class="img-centered"}

-The results show that the five selected configurations effectively cover the high-recall and high-throughput regions. Since we did not assign high weights to throughput, the learned configurations do not extend into the low-recall, high-throughput area.
+The results show that the five selected configurations effectively cover the high-recall and high-throughput regions. Because we did not assign high weights to throughput, the learned configurations do not extend into the low-recall, high-throughput area.

## How to apply the configurations in OpenSearch

-To try out these configurations, first create an index. You must specify the index build parameters when creating the index because the parameters are not dynamic:
+To try these configurations, first create an index. You must specify the index build parameters when creating the index because the parameters are not dynamic:

```json
curl -X PUT "localhost:9200/test-index" -H 'Content-Type: application/json' -d'
@@ -187,7 +187,7 @@ curl -X PUT "localhost:9200/_bulk" -H 'Content-Type: application/json' -d'
'
```

-Last, run a search:
+Lastly, run a search:

```json
curl -X GET "localhost:9200/test-index/_search?pretty&_source_excludes=my_vector" -H 'Content-Type: application/json' -d'
@@ -603,10 +603,10 @@ def eval_config(

#### Define the variables for your experiments

-For your experiments, define the following variables:
+Define the following variables for your experiments:

- OpenSearch domain and engine
-- AWS region and AWS profile
+- AWS Region and AWS profile
- Local file path
- Dataset dimension
- Space to use in HNSW
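The sketch below shows what these definitions might look like. Every name and value here is a placeholder assumption, not taken from the post's notebook.

```python
# Placeholder values only -- replace them with your own domain, credentials, and dataset details.
DOMAIN_ENDPOINT = "https://my-opensearch-domain.us-east-1.es.amazonaws.com"  # OpenSearch domain
ENGINE = "faiss"                      # k-NN engine used for the HNSW index
AWS_REGION = "us-east-1"              # AWS Region hosting the domain
AWS_PROFILE = "default"               # AWS profile providing credentials
DATA_PATH = "./data/my-dataset.hdf5"  # local file path to the vector dataset
DIMENSION = 768                       # dataset dimension
SPACE_TYPE = "l2"                     # space (distance function) to use in HNSW
```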
@@ -639,7 +639,7 @@ for i, config in enumerate([
    metrics.append(metric)
```

-You will see following output in the terminal:
+You will see the following output in the terminal:

```
Indexing vectors: 100%|██████████| 8674/8674 [00:26<00:00, 321.72it/s]
@@ -667,7 +667,7 @@ The following image depicts an example metric visualization.

## Limitations and future work

-This blog post focused on optimizing for two key objectives: recall and throughput. However, to further adjust the size of the HNSW graph, exploring different values for `ef_construction` could provide additional insights.
+This blog post focuses on optimizing HNSW for two key objectives: recall and throughput. However, to further adjust the size of the HNSW graph, exploring different values for `ef_construction` could provide additional insights.

Our method currently generates the same set of configurations across all datasets, but this approach can potentially be improved. By considering the specific characteristics of each dataset, we could create more targeted recommendations. Additionally, the current set of configurations is based on 15 datasets. Incorporating a broader range of datasets into the training process would enhance the generalization of the learned configurations.
