meta_description: Learn how to select optimal HNSW hyperparameters by balancing recall and throughput. This guide explores portfolio learning, evaluation methods, and practical implementation in OpenSearch.
---
Vector search plays a crucial role in many machine learning (ML) and data science pipelines. In the context of large language models (LLMs), vector search powers [retrieval-augmented generation (RAG)](https://aws.amazon.com/what-is/retrieval-augmented-generation/), a technique that retrieves relevant content from a large document collection to improve LLM responses. Because finding exact k-nearest neighbors (k-NN) is computationally expensive for large datasets, approximate nearest neighbor (ANN) search methods, such as [Hierarchical Navigable Small Worlds (HNSW)](https://arxiv.org/pdf/1603.09320), are often used to improve efficiency [1].
### Optimizing HNSW: Balancing search quality and speed
Configuring HNSW effectively is a multi-objective problem. This blog post focuses on two key metrics:
- **Search quality**, measured by recall@k: the fraction of the top k ground truth neighbors that appear in the k results returned by HNSW (see the short sketch after this list).
- **Search speed**, measured by query throughput: the number of queries executed per second.
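To make the recall@k definition concrete, here is a minimal sketch (ours, not from the original post) that computes it for a single query, given the IDs returned by the ANN index and the ground-truth IDs from exact k-NN:

```python
def recall_at_k(retrieved_ids, ground_truth_ids, k=10):
    """Fraction of the top-k ground-truth neighbors that appear in the top-k ANN results."""
    return len(set(retrieved_ids[:k]) & set(ground_truth_ids[:k])) / k

# Example: 8 of the 10 true neighbors were returned, so recall@10 = 0.8.
print(recall_at_k(list(range(8)) + [100, 101], list(range(10))))
```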
While index build time and index size are also important, we will leave those aspects for future discussion.
The structure of the HNSW graph is controlled by its hyperparameters, which determine how densely vectors are connected. A denser graph generally improves recall but reduces query throughput, while a sparser graph has the opposite effect. Finding the right balance requires testing multiple configurations, yet there is limited guidance on how to do this efficiently.
The three most important hyperparameters in HNSW are:
- **`M`** – The maximum number of graph edges per vector. Higher values increase memory usage but may improve search quality.
- **`efSearch`** – The size of the candidate queue during search. Larger values may improve search quality but increase search latency.
- **`efConstruction`** – Similar to `efSearch` but used during index construction. Higher values improve search quality but increase index build time.
### Finding effective configurations
One approach to tuning these hyperparameters is **hyperparameter optimization (HPO)**, an automated technique that searches for the optimal configuration of a black-box function [5, 6]. However, HPO can be computationally expensive while providing limited benefits [3], especially in cases where the underlying algorithm is well understood.
An alternative is **transfer learning**, where knowledge gained from optimizing one dataset is applied to another. This approach helps identify configurations that approximate optimal results while maintaining efficiency [3, 4].
### Recommended HNSW configurations
To optimize your search performance, you can **evaluate these five configurations** and choose the one that best matches your recall and throughput requirements.
## Portfolio learning for HNSW
Portfolio learning [2, 3, 4] selects a set of complementary configurations so that at least one performs well on average when evaluated across different scenarios. Applying this approach to HNSW, we aimed to identify a set of configurations that balance recall and query throughput.
To achieve this, we used 15 vector search datasets spanning various modalities, embedding models, and distance functions, presented in the following table. For each dataset, we established ground truth by computing the top 10 nearest neighbors for every query in the test set using exact k-NN search.
For these experiments, we used an OpenSearch 2.15 cluster with three cluster manager nodes and six data nodes, each running on an `r6g.4xlarge.search` instance. We evaluated test vectors in batches of 100 and recorded query throughput and recall@10 for each HNSW configuration. In the next section, we introduce the algorithm used to learn the portfolio.
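As an illustration of how such ground truth can be computed, the following sketch (our own, not from the original post; it assumes NumPy arrays and Euclidean distance) finds the exact top 10 neighbors for each test query by brute force, processing queries in batches of 100:

```python
import numpy as np

def ground_truth_top_k(corpus: np.ndarray, queries: np.ndarray, k: int = 10,
                       batch_size: int = 100) -> np.ndarray:
    """Exact k-NN by brute force; returns the indices of the k closest corpus vectors per query."""
    corpus_sq = (corpus ** 2).sum(axis=1)             # precompute ||c||^2 for every corpus vector
    results = []
    for start in range(0, len(queries), batch_size):  # evaluate queries in batches of 100
        q = queries[start:start + batch_size]
        # ||q - c||^2 = ||q||^2 - 2 q.c + ||c||^2; the ||q||^2 term is constant per query,
        # so it can be dropped without changing the ranking.
        dist = corpus_sq[None, :] - 2.0 * q @ corpus.T
        results.append(np.argpartition(dist, k, axis=1)[:, :k])  # k smallest distances, unordered
    return np.vstack(results)
```

Comparing these exact neighbor lists with the IDs returned by each HNSW configuration yields recall@10 for that configuration.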
### Method
To capture different trade-offs between recall and throughput, we used a simple linearization approach, assigning recall and throughput weights between 0 and 1, inclusive. Given a specific weighting, we identified the configuration that maximizes the linearized objective using the following four steps (a brief code sketch follows the list):
1. **Normalize recall and throughput** – Apply min-max scaling within each dataset so that recall and throughput values are comparable.
2. **Compute weighted metric** – Using the assigned weights, combine the normalized recall and throughput into a new weighted metric.
3. **Average across datasets** – Calculate the average weighted metric across datasets.
4. **Select the best configuration** – Identify the configuration that maximizes the average weighted metric.
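A compact sketch of these four steps (our own illustration, not the post's implementation) might look as follows. It takes per-dataset recall and throughput measurements for every candidate configuration and returns the index of the configuration selected for one weighting:

```python
import numpy as np

def select_configuration(recall, throughput, w_recall, w_throughput):
    """recall and throughput have shape (n_datasets, n_configurations); the two weights sum to 1."""
    def minmax(x):
        # Step 1: min-max scale within each dataset so the two metrics are comparable.
        lo = x.min(axis=1, keepdims=True)
        hi = x.max(axis=1, keepdims=True)
        return (x - lo) / (hi - lo)

    # Step 2: combine the normalized metrics using the assigned weights.
    weighted = w_recall * minmax(recall) + w_throughput * minmax(throughput)
    # Step 3: average the weighted metric across datasets.
    mean_score = weighted.mean(axis=0)
    # Step 4: pick the configuration that maximizes the average weighted metric.
    return int(np.argmax(mean_score))
```

Repeating this selection once per weighting profile yields the portfolio of recommended configurations.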
We used the following weighting profiles for recall and throughput. We did not assign high weights to throughput.
113
113
114
114
## Evaluation
We evaluated our method using two scenarios:
1. **Leave-one-out evaluation** – One of the 15 datasets is used as the test dataset while the remaining datasets serve as the training set.
2. **Deployment evaluation** – All 15 datasets are used for training, and the method is tested on four additional datasets using a new embedding model, [Cohere-embed-english-v3](https://huggingface.co/Cohere/Cohere-embed-english-v3.0), which was not part of the training set.
The first scenario mimics cross-validation in ML, while the second simulates an evaluation in which the complete training dataset is used for model deployment.
### Leave-one-out evaluation
For this evaluation, we determined the ground-truth configurations under different weightings by applying our method to the test dataset. We then compared these with the predicted configurations derived from the training datasets using the same method.
We calculated the mean absolute error (MAE) between the predicted and ground-truth configurations for normalized (min-max scaled) recall and throughput. The following bar plot shows the average MAE across all 15 datasets in the leave-one-out evaluation.
{:class="img-centered"}
The results show that the average MAEs for normalized recall are below 0.1. For context, if dataset recall values range from 0.5 to 0.95, an MAE of 0.1 translates to a raw recall difference of only 0.1 × (0.95 − 0.5) = 0.045. This suggests that the predicted configurations closely match the ground-truth configurations, particularly for high-recall weightings.
The MAEs for throughput are larger, likely because throughput measurements tend to be noisier than recall measurements. However, the MAEs decrease when higher weight is assigned to throughput.
### Deployment evaluation
For this evaluation, we applied our method to the 15 training datasets and tested the resulting configurations on three datasets using the Cohere-embed-english-v3 embedding model. Our goal was to ensure that the learned configurations align with the Pareto front, representing different trade-offs between recall and throughput.
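For reference, a configuration lies on the Pareto front if no other configuration is at least as good on both metrics and strictly better on one. A small sketch of this check (our own, not from the post), given per-configuration recall and throughput arrays:

```python
import numpy as np

def pareto_front(recall: np.ndarray, throughput: np.ndarray) -> list:
    """Indices of configurations that are not dominated on (recall, throughput)."""
    points = np.column_stack([recall, throughput])
    front = []
    for i, p in enumerate(points):
        # p is dominated if some point is >= p on both metrics and > p on at least one.
        dominated = np.any(np.all(points >= p, axis=1) & np.any(points > p, axis=1))
        if not dominated:
            front.append(i)
    return front
```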
The following plot depicts the recall and throughput for the learned configurations in different colors, with other configurations displayed in gray.
{:class="img-centered"}
The results show that the five selected configurations effectively cover the high-recall and high-throughput regions. Because we did not assign high weights to throughput, the learned configurations do not extend into the low-recall, high-throughput area.
## How to apply the configurations in OpenSearch
To try these configurations, first create an index. You must specify the index build parameters when creating the index because the parameters are not dynamic:
```json
curl -X PUT "localhost:9200/test-index" -H 'Content-Type: application/json' -d'
{ ... }
'
```
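Because the full request body is not shown in the excerpt above, here is an illustrative sketch of what such an index definition can look like with the OpenSearch k-NN plugin. The field name (`my_vector`), dimension, space type, engine, and parameter values below are placeholders rather than the post's recommended configurations; substitute the `m` and `ef_construction` values of the configuration you want to evaluate:

```json
curl -X PUT "localhost:9200/test-index" -H 'Content-Type: application/json' -d'
{
  "settings": {
    "index": {
      "knn": true
    }
  },
  "mappings": {
    "properties": {
      "my_vector": {
        "type": "knn_vector",
        "dimension": 768,
        "method": {
          "name": "hnsw",
          "space_type": "l2",
          "engine": "faiss",
          "parameters": {
            "m": 16,
            "ef_construction": 100
          }
        }
      }
    }
  }
}
'
```

The search-time parameter `efSearch` is not fixed at index creation. Depending on the engine and OpenSearch version, it is typically controlled through the dynamic index setting `index.knn.algo_param.ef_search` or through query-time method parameters, so it can be adjusted without rebuilding the index; consult the k-NN plugin documentation for the option available in your setup.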
## Limitations and future work
This blog post focuses on optimizing HNSW for two key objectives: recall and throughput. However, to further adjust the size of the HNSW graph, exploring different values for `ef_construction` could provide additional insights.
Our method currently generates the same set of configurations across all datasets, but this approach can potentially be improved. By considering the specific characteristics of each dataset, we could create more targeted recommendations. Additionally, the current set of configurations is based on 15 datasets. Incorporating a broader range of datasets into the training process would enhance the generalization of the learned configurations.