Commit 997cb46

Fixups, pt. III
1 parent 8e4517e commit 997cb46

8 files changed

+102
-67
lines changed

docs/en/engines/table-engines/mergetree-family/annindexes.md

Lines changed: 45 additions & 33 deletions
Original file line number | Diff line number | Diff line change
@@ -78,7 +78,7 @@ ClickHouse provides a special "vector similarity" index to perform approximate n
7878
:::note
7979
Vector similarity indexes are currently experimental.
8080
To enable them, please first run `SET allow_experimental_vector_similarity_index = 1`.
81-
If you run into problems, kindly open an issue at github.com/clickhouse/clickhouse/issues.
81+
If you run into problems, kindly open an issue in the [ClickHouse repository](github.com/clickhouse/clickhouse/issues).
8282
:::
8383

8484
### Creating a Vector Similarity Index {#creating-a-vector-similarity-index}
@@ -174,13 +174,13 @@ LIMIT <N>
174174
ClickHouse's query optimizer tries to match above query template and make use of available vector similarity indexes.
175175
A query can only use a vector similarity index if the distance function in the SELECT query is the same as the distance function in the index definition.
176176

177-
Advanced users may provide a custom value for setting [hnsw_candidate_list_size_for_search](../../../operations/settings/settings.md#hnsw_candidate_list_size_for_search) (also know as HNSW hyperparameter `ef_search`) to tune the size of the candidate list during search (e.g. `SELECT [...] SETTINGS hnsw_candidate_list_size_for_search = <value>`).
177+
Advanced users may provide a custom value for setting [hnsw_candidate_list_size_for_search](../../../operations/settings/settings.md#hnsw_candidate_list_size_for_search) (also known as HNSW hyperparameter "ef_search") to tune the size of the candidate list during search (e.g. `SELECT [...] SETTINGS hnsw_candidate_list_size_for_search = <value>`).
178178
The default value of the setting, 256, works well in the majority of use cases.
179179
Higher setting values mean better accuracy at the cost of slower performance.
180180

181181
If the query can use a vector similarity index, ClickHouse checks that the LIMIT `<N>` provided in SELECT queries is within reasonable bounds.
182182
More specifically, an error is returned if `<N>` is bigger than the value of setting [max_limit_for_vector_search_queries](../../../operations/settings/settings.md#max_limit_for_vector_search_queries) with default value 100.
183-
Too large LIMITs can slow down searches and usually indicate a usage error.
183+
Too large LIMIT values can slow down searches and usually indicate a usage error.
184184

185185
To check if a SELECT query uses a vector similarity index, you can prefix the query with `EXPLAIN indexes = 1`.
186186

@@ -231,21 +231,22 @@ To enforce index usage, you can run the SELECT query with setting [force_data_sk
231231

232232
**Post-filtering and Pre-filtering**
233233

234-
Users may optionally specify a `WHERE` clause with additional filter conditions in SELECT queries.
235-
Depending on these filter conditions, ClickHouse will utilize post-filtering or pre-filtering.
236-
These two strategies determine the order in which the filters are evaluated:
237-
- With post-filtering, the vector similarity index is evaluated first, afterwards ClickHouse evaluates the additional filter(s) specified of the `WHERE` clause.
238-
- With pre-filtering, the filter evaluation order is the other way round.
234+
Users may optionally specify a `WHERE` clause with additional filter conditions for the SELECT query.
235+
ClickHouse will evaluate these filter conditions using either a post-filtering or a pre-filtering strategy.
236+
In short, both strategies determine the order in which the filters are evaluated:
237+
- Post-filtering means that the vector similarity index is evaluated first; afterwards, ClickHouse evaluates the additional filter(s) specified in the `WHERE` clause.
238+
- Pre-filtering means that the filter evaluation order is the other way round.
239239

240-
Both strategies have different trade-offs:
241-
- Post-filtering has the general problem that it may return less than the number of rows requested in the `LIMIT <N>` clause. This situation happens when at least one of the result rows returned by the vector similarity index fails to satisfy the additional filters.
242-
- Pre-filtering is generally an unsolved problem. Some specialized vector databases implement it but most databases including ClickHouse will fall back to exact neighbor search, i.e., a brute-force scan without index.
240+
The strategies have different trade-offs:
241+
- Post-filtering has the general problem that it may return fewer rows than requested in the `LIMIT <N>` clause. This situation happens when one or more result rows returned by the vector similarity index fail to satisfy the additional filters.
242+
- Pre-filtering is generally an unsolved problem. Certain specialized vector databases provide pre-filtering algorithms but most relational databases (including ClickHouse) will fall back to exact neighbor search, i.e., a brute-force scan without index.
243243

244-
What strategy is used comes down to whether ClickHouse can use indexes for the additional filter conditions.
245-
If no index can be used, post-filtering will be applied.
244+
What strategy is used depends on the filter condition.
245+
246+
*Additional filters are part of the partition key*
246247

247248
If the additional filter condition is part of the partition key, then ClickHouse will apply partition pruning.
248-
For example, assuming that the table is range-partitioned by `year`:
249+
As an example, a table is range-partitioned by column `year` and the following query is run:
249250

250251
```sql
251252
WITH [0., 2.] AS reference_vec
@@ -256,20 +257,30 @@ ORDER BY L2Distance(vec, reference_vec) ASC
256257
LIMIT 3;
257258
```
258259

259-
ClickHouse will ignore all partitions but the one for year 2025.
260-
Within this partition, a post-filtering strategy will be applied.
260+
ClickHouse will prune all partitions except the 2025 one.
261+
262+
*Additional filters cannot be evaluated using indexes*
263+
264+
If additional filter conditions cannot be evaluated using indexes (primary key index, skipping index), ClickHouse will apply post-filtering.
265+
266+
*Additional filters can be evaluated using the primary key index*
261267

262-
If the additional filter condition is on the primary key columns and the filter selects some but not all ranges of a part, then Clickhouse will fall back to exact neighbour search (brute-force scan without index) on the selected ranges of the part.
263-
If the primary key filter selects entire parts, Clickhouse will use the vector similarity index on those parts to retrieve results.
268+
If additional filter conditions can be evaluated using the [primary key](mergetree.md#primary-key) (i.e., they form a prefix of the primary key) and
269+
- the filter condition eliminates at least one row within a part, then ClickHouse will fall back to pre-filtering for the "surviving" ranges within the part,
270+
- the filter condition eliminates no rows within a part, then ClickHouse will perform post-filtering for the part.
264271

265-
In case additional filter conditions on columns can make use of skip indexes (minmax, set etc), Clickhouse by default chooses a post-filtering strategy.
266-
Clickhouse gives higher priority to the vector similarity index because the vector index is expected to deliver business value by accelerating semantic search response times.
272+
In practical use cases, the latter case is rather unlikely.
267273

268-
Clickhouse provides 2 settings for finer control on post-filtering and pre-filtering:
274+
*Additional filters can be evaluated using skipping index*
269275

270-
When the additional filter conditions are extremely selective, it is possible that brute force search on a small filtered set of rows gives better results then post-filtering using the vector search.
271-
Users can force pre-filtering by setting [vector_search_filter_strategy](../../../operations/settings/settings#vector_search_filter_strategy) to `prefilter` (default is `auto` which is equivalent to `postfilter`).
272-
An example query where pre-filtering could be a good choice is
276+
If additional filter conditions can be evaluated using [skipping indexes](mergetree.md#table_engine-mergetree-data_skipping-indexes) (minmax index, set index, etc.), ClickHouse performs post-filtering.
277+
In such cases, the vector similarity index is evaluated first as it is expected to remove the most rows relative to other skipping indexes.
278+
279+
For finer control over post-filtering vs. pre-filtering, two settings can be used:
280+
281+
Setting [vector_search_filter_strategy](../../../operations/settings/settings#vector_search_filter_strategy) (default: `auto`, which implements the above heuristics) may be set to `prefilter`.
282+
This is useful to force pre-filtering in cases where the additional filter conditions are extremely selective.
283+
As an example, the following query may benefit from pre-filtering:
273284

274285
```sql
275286
SELECT bookid, author, title
@@ -279,24 +290,25 @@ ORDER BY cosineDistance(book_vector, getEmbedding('Books on ancient Asian empire
279290
LIMIT 10
280291
```
281292

282-
Assuming that only very few books cost less than $2, post-filtering may return zero rows because the top 10 matches returned by the vector index could all be priced above $2.
283-
By forcing pre-filtering (add `SETTINGS vector_search_filter_strategy = 'prefilter'` to the query), ClickHouse first finds all books with a price of less than $2 and then executes a brute-force vector search on the matches.
293+
Assuming that only a very small number of books cost less than $2, post-filtering may return zero rows because the top 10 matches returned by the vector index could all be priced above $2.
294+
By forcing pre-filtering (add `SETTINGS vector_search_filter_strategy = 'prefilter'` to the query), ClickHouse first finds all books with a price of less than $2 and then executes a brute-force vector search for the found books.
284295

285-
As mentioned above, post-filtering may return less matches then specified in the `LIMIT <N>` clause.
286-
Consider query
296+
As an alternative approach to resolve the above issue, setting [vector_search_postfilter_multiplier](../../../operations/settings/settings#vector_search_postfilter_multiplier) (default: `1.0`) may be configured to a value > `1.0` (for example, `2.0`).
297+
The number of nearest neighbors fetched from the vector index is multiplied by the setting value, and the additional filter is then applied to those rows to return LIMIT-many rows.
298+
As an example, we can query again but with multiplier `3.0`:
287299

288300
```sql
289301
SELECT bookid, author, title
290302
FROM books
291-
WHERE published_year <= 2000
303+
WHERE price < 2.00
292304
ORDER BY cosineDistance(book_vector, getEmbedding('Books on ancient Asian empires'))
293305
LIMIT 10
306+
SETTINGS vector_search_postfilter_multiplier = 3.0;
294307
```
295308

296-
With post-filtering, some of the 10 nearest matching books returned by the vector index may be pruned from the result because they were published later than in the year 2000.
297-
As a result, the query may return less rows than the user requested.
298-
For such cases, you can set parameter [vector_search_postfilter_multiplier](../../../operations/settings/settings#vector_search_postfilter_multiplier) to a value > 1.0 (for example, 2.0) to indicate that N times this factor many matches should be returned by the vector index and then the additional filter to be applied on those rows to return the result of 10 rows.
299-
We note that this method can mitigate the problem with post-filtering but in extreme cases (extremely selective WHERE condition), there may still less than N requested rows returned.
309+
ClickHouse will fetch 3.0 x 10 = 30 nearest neighbors from the vector index in each part and afterwards evaluate the additional filters.
310+
Only the ten closest neighbors will be returned.
311+
We note that setting `vector_search_postfilter_multiplier` can mitigate the problem but in extreme cases (very selective WHERE condition), it is still possible that fewer than the N requested rows are returned.
300312

301313
### Performance Tuning {#performance-tuning}
302314
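
The trade-off between the two strategies described in the documentation change above can be illustrated with a toy model. The following is only a sketch under stated assumptions, not ClickHouse code: `Row`, `nearest`, `keepMatching`, `postFilter` and `preFilter` are hypothetical names, the "index" is replaced by an exact sort over a one-dimensional value, and everything happens in a single part.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <functional>
#include <iterator>
#include <vector>

/// Hypothetical toy row: a one-dimensional "vector" plus a price attribute for the additional filter.
struct Row
{
    int id;
    double vec;
    int price;
};

using Filter = std::function<bool(const Row &)>;

/// Exact k-nearest-neighbour search by absolute distance; stands in for the (approximate) vector index.
inline std::vector<Row> nearest(std::vector<Row> rows, double reference, std::size_t k)
{
    std::stable_sort(rows.begin(), rows.end(), [reference](const Row & a, const Row & b)
    {
        return std::abs(a.vec - reference) < std::abs(b.vec - reference);
    });
    if (rows.size() > k)
        rows.resize(k);
    return rows;
}

/// Keeps only the rows that satisfy the additional filter.
inline std::vector<Row> keepMatching(const std::vector<Row> & rows, const Filter & filter)
{
    std::vector<Row> result;
    std::copy_if(rows.begin(), rows.end(), std::back_inserter(result), filter);
    return result;
}

/// Post-filtering: fetch limit * multiplier neighbours first, then filter, then cut to limit.
inline std::vector<Row> postFilter(const std::vector<Row> & rows, double reference,
                                   std::size_t limit, double multiplier, const Filter & filter)
{
    const auto fetch = static_cast<std::size_t>(limit * multiplier);
    std::vector<Row> result = keepMatching(nearest(rows, reference, fetch), filter);
    if (result.size() > limit)
        result.resize(limit);
    return result;
}

/// Pre-filtering: filter first, then run a brute-force search over the surviving rows.
inline std::vector<Row> preFilter(const std::vector<Row> & rows, double reference,
                                  std::size_t limit, const Filter & filter)
{
    return nearest(keepMatching(rows, filter), reference, limit);
}
```

With six rows whose three nearest neighbours all fail the filter, `postFilter` with `LIMIT 2` returns no rows at multiplier `1.0`, one row at `2.0`, and two rows at `3.0`, while `preFilter` always returns two rows. This mirrors the book-price example in the documentation, at the cost of a full scan over the filtered rows.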

src/Core/Settings.cpp

Lines changed: 5 additions & 11 deletions
@@ -6567,10 +6567,7 @@ Enable experimental hash functions
65676567
Allow the obsolete Object data type
65686568
)", EXPERIMENTAL) \
65696569
DECLARE(Bool, allow_experimental_time_series_table, false, R"(
6570-
Allows creation of tables with the [TimeSeries](../../engines/table-engines/integrations/time-series.md) table engine.
6571-
6572-
Possible values:
6573-
6570+
Allows creation of tables with the [TimeSeries](../../engines/table-engines/integrations/time-series.md) table engine. Possible values:
65746571
- 0 — the [TimeSeries](../../engines/table-engines/integrations/time-series.md) table engine is disabled.
65756572
- 1 — the [TimeSeries](../../engines/table-engines/integrations/time-series.md) table engine is enabled.
65766573
)", EXPERIMENTAL) \
@@ -6587,13 +6584,10 @@ SELECT queries with LIMIT bigger than this setting cannot use vector similarity
65876584
The size of the dynamic candidate list when searching the vector similarity index, also known as 'ef_search'.
65886585
)", EXPERIMENTAL) \
65896586
DECLARE(VectorSearchFilterStrategy, vector_search_filter_strategy, VectorSearchFilterStrategy::AUTO, R"(
6590-
If a vector search query has a WHERE clause, this setting determines if it is evaluated first (pre-filtering) OR if the vector similarity index is checked first (post-filtering).
6591-
6592-
Possible values:
6593-
6594-
'auto' - Postfiltering (the exact semantics may change in future).
6595-
'postfilter' - Use vector similarity index to identify the nearest neighbours, then apply other filters
6596-
'prefilter' - Evaluate other filters first, then perform brute-force search to identify neighbours.
6587+
If a vector search query has a WHERE clause, this setting determines if it is evaluated first (pre-filtering) OR if the vector similarity index is checked first (post-filtering). Possible values:
6588+
- 'auto' - Postfiltering (the exact semantics may change in future).
6589+
- 'postfilter' - Use vector similarity index to identify the nearest neighbours, then apply other filters
6590+
- 'prefilter' - Evaluate other filters first, then perform brute-force search to identify neighbours.
65976591
)", EXPERIMENTAL) \
65986592
DECLARE(Float, vector_search_postfilter_multiplier, 1.0, R"(
65996593
Multiply the fetched nearest neighbors from the vector similarity index by this number before performing post-filtering on other predicates.

src/Storages/MergeTree/MergeTreeIndexVectorSimilarity.cpp

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -467,7 +467,7 @@ std::vector<UInt64> MergeTreeIndexConditionVectorSimilarity::calculateApproximat
467467

468468
size_t limit = parameters->limit;
469469
if (parameters->additional_filters_present)
470-
/// Post-filters may remove matches. Allow to fetch more rows by a factor to compensate.
470+
/// Additional filters mean post-filtering which means that matches may be removed. To compensate, allow to fetch more rows by a factor.
471471
limit = std::min(static_cast<size_t>(limit * postfilter_multiplier), max_limit);
472472

473473
/// We want to run the search with the user-provided value for setting hnsw_candidate_list_size_for_search (aka. expansion_search).
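
The limit adjustment in the hunk above can be isolated into a small standalone function to make the arithmetic explicit. This is a minimal sketch, not the actual ClickHouse code: `effectiveLimit` is a hypothetical name, and it is an assumption here that `max_limit` corresponds to the setting `max_limit_for_vector_search_queries` (default 100 according to the documentation).

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

/// Number of neighbours to fetch from the vector index (hypothetical helper).
/// With additional filters, the requested limit is scaled by the multiplier
/// and capped at max_limit; without them, the limit is used unchanged.
constexpr std::size_t effectiveLimit(std::size_t limit, double postfilter_multiplier,
                                     std::size_t max_limit, bool additional_filters_present)
{
    if (additional_filters_present)
        return std::min(static_cast<std::size_t>(limit * postfilter_multiplier), max_limit);
    return limit;
}

/// LIMIT 10 with multiplier 3.0 fetches 30 neighbours, as in the documentation example.
static_assert(effectiveLimit(10, 3.0, 100, true) == 30, "3.0 x 10 = 30");
/// The scaled value is capped at max_limit.
static_assert(effectiveLimit(50, 3.0, 100, true) == 100, "capped at max_limit");
/// Without a WHERE or PREWHERE clause the multiplier has no effect.
static_assert(effectiveLimit(10, 3.0, 100, false) == 10, "no filters, no scaling");
```

Note that the cast truncates towards zero, so a fractional product such as `3 * 2.5 = 7.5` yields 7 in this sketch.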

src/Storages/MergeTree/MergeTreeIndices.h

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -105,7 +105,7 @@ struct VectorSearchParameters
105105
std::vector<Float64> reference_vector;
106106

107107
/// Other metadata
108-
bool additional_filters_present;
108+
bool additional_filters_present; /// SELECT contains a WHERE or PREWHERE clause
109109
};
110110

111111
/// Stores some info about a single block of data.

tests/queries/0_stateless/02354_vector_search_postfiltering_bug.reference

Whitespace-only changes.
Lines changed: 20 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,20 @@
1+
-- Tags: no-fasttest, long, no-asan, no-ubsan, no-debug
2+
-- Test for Bug 78161
3+
4+
SET allow_experimental_vector_similarity_index = 1;
5+
SET enable_analyzer = 1;
6+
7+
CREATE TABLE tab (id Int32, vec Array(Float32)) ENGINE = MergeTree() ORDER BY id SETTINGS index_granularity = 128;
8+
INSERT INTO tab SELECT number, [randCanonical(), randCanonical()] FROM numbers(100000);
9+
10+
-- Create index
11+
ALTER TABLE tab ADD INDEX idx_vec vec TYPE vector_similarity('hnsw', 'cosineDistance', 2, 'f32', 64, 400);
12+
ALTER TABLE tab MATERIALIZE INDEX idx_vec SETTINGS mutations_sync=2;
13+
14+
WITH [1., 2.] AS reference_vec
15+
SELECT *
16+
FROM tab
17+
PREWHERE id < 5000
18+
ORDER BY cosineDistance(vec, reference_vec) ASC
19+
LIMIT 10
20+
FORMAT Null;

tests/queries/0_stateless/02354_vector_search_pre_and_post_filtering.reference

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -22,3 +22,4 @@ The first 3 neighbours returned by vector index dont pass the attr2 >= 1008 filt
2222
10
2323
11
2424
12
25+
-- Negative parameter values throw an exception

tests/queries/0_stateless/02354_vector_search_pre_and_post_filtering.sql

Lines changed: 29 additions & 21 deletions
Original file line numberDiff line numberDiff line change
@@ -1,6 +1,6 @@
11
-- Tags: no-fasttest, no-ordinary-database
22

3-
-- Tests for vector search and pre-filtering/post-filtering.
3+
-- Tests pre vs. post-filtering for vector search.
44

55
SET allow_experimental_vector_similarity_index = 1;
66
SET enable_analyzer = 1;
@@ -11,15 +11,15 @@ DROP TABLE IF EXISTS tab;
1111
CREATE TABLE tab
1212
(
1313
id Int32,
14-
dt Date,
14+
date Date,
1515
attr1 Int32,
1616
attr2 Int32,
17-
vector Array(Float32),
18-
INDEX attr1_index attr1 TYPE minmax,
19-
INDEX vector_index vector TYPE vector_similarity('hnsw', 'L2Distance', 2) GRANULARITY 10000
17+
vec Array(Float32),
18+
INDEX idx_attr1 attr1 TYPE minmax,
19+
INDEX idx_vec vec TYPE vector_similarity('hnsw', 'L2Distance', 2) GRANULARITY 10000
2020
)
2121
ENGINE = MergeTree
22-
PARTITION BY dt
22+
PARTITION BY date
2323
ORDER BY id
2424
SETTINGS index_granularity = 3;
2525

@@ -44,7 +44,7 @@ SELECT trimLeft(explain) FROM (
4444
EXPLAIN indexes = 1
4545
SELECT id
4646
FROM tab
47-
ORDER BY L2Distance(vector, [1.0, 1.0])
47+
ORDER BY L2Distance(vec, [1.0, 1.0])
4848
LIMIT 2
4949
SETTINGS vector_search_filter_strategy = 'prefilter'
5050
)
@@ -56,7 +56,7 @@ SELECT trimLeft(explain) FROM (
5656
SELECT id
5757
FROM tab
5858
WHERE attr2 >= 1006
59-
ORDER BY L2Distance(vector, [1.0, 1.0])
59+
ORDER BY L2Distance(vec, [1.0, 1.0])
6060
LIMIT 2
6161
SETTINGS vector_search_filter_strategy = 'prefilter'
6262
)
@@ -68,7 +68,7 @@ SELECT trimLeft(explain) FROM (
6868
SELECT id
6969
FROM tab
7070
WHERE attr1 <= 105
71-
ORDER BY L2Distance(vector, [1.0, 1.0])
71+
ORDER BY L2Distance(vec, [1.0, 1.0])
7272
LIMIT 2
7373
SETTINGS vector_search_filter_strategy = 'prefilter'
7474
)
@@ -80,7 +80,7 @@ SELECT trimLeft(explain) FROM (
8080
SELECT id
8181
FROM tab
8282
WHERE id <= 6
83-
ORDER BY L2Distance(vector, [1.0, 1.0])
83+
ORDER BY L2Distance(vec, [1.0, 1.0])
8484
LIMIT 2
8585
SETTINGS vector_search_filter_strategy = 'prefilter'
8686
)
@@ -93,7 +93,7 @@ SELECT trimLeft(explain) FROM (
9393
EXPLAIN indexes = 1
9494
SELECT id
9595
FROM tab
96-
ORDER BY L2Distance(vector, [1.0, 1.0])
96+
ORDER BY L2Distance(vec, [1.0, 1.0])
9797
LIMIT 2
9898
SETTINGS vector_search_filter_strategy = 'postfilter'
9999
)
@@ -104,8 +104,8 @@ SELECT trimLeft(explain) FROM (
104104
EXPLAIN indexes = 1
105105
SELECT id
106106
FROM tab
107-
WHERE dt <= '2025-01-02'
108-
ORDER BY L2Distance(vector, [1.0, 1.0])
107+
WHERE date <= '2025-01-02'
108+
ORDER BY L2Distance(vec, [1.0, 1.0])
109109
LIMIT 2
110110
SETTINGS vector_search_filter_strategy = 'postfilter'
111111
)
@@ -116,9 +116,9 @@ SELECT trimLeft(explain) FROM (
116116
EXPLAIN indexes = 1
117117
SELECT id
118118
FROM tab
119-
WHERE dt = '2025-01-03'
119+
WHERE date = '2025-01-03'
120120
AND attr1 = 110
121-
ORDER BY L2Distance(vector, [1.0, 1.0])
121+
ORDER BY L2Distance(vec, [1.0, 1.0])
122122
LIMIT 2
123123
SETTINGS vector_search_filter_strategy = 'postfilter'
124124
)
@@ -127,8 +127,8 @@ WHERE explain LIKE '%vector_similarity%';
127127
SELECT '-- Additional WHERE clauses present, 2 full parts selected by partition key / 1 part partially selected by PK, index usage not expected';
128128
SELECT id
129129
FROM tab
130-
WHERE dt = '2025-01-03' AND id <= 9
131-
ORDER BY L2Distance(vector, [1.0, 1.0])
130+
WHERE date = '2025-01-03' AND id <= 9
131+
ORDER BY L2Distance(vec, [1.0, 1.0])
132132
LIMIT 2
133133
SETTINGS log_comment = '02354_vector_search_post_filter_strategy_query1';
134134

@@ -143,16 +143,24 @@ AND type = 'QueryFinish';
143143
SELECT 'The first 3 neighbours returned by vector index dont pass the attr2 >= 1008 filter. Hence no rows returned by the query...';
144144
SELECT id
145145
FROM tab
146-
WHERE dt = '2025-01-03' AND attr2 >= 1008
147-
ORDER BY L2Distance(vector, [1.0, 1.0])
146+
WHERE date = '2025-01-03' AND attr2 >= 1008
147+
ORDER BY L2Distance(vec, [1.0, 1.0])
148148
LIMIT 3;
149149

150150
SELECT '... but there are results for the same query with postfilter multiplier = 2.0';
151151
SELECT id
152152
FROM tab
153-
WHERE dt = '2025-01-03' AND attr2 >= 1008
154-
ORDER BY L2Distance(vector, [1.0, 1.0])
153+
WHERE date = '2025-01-03' AND attr2 >= 1008
154+
ORDER BY L2Distance(vec, [1.0, 1.0])
155155
LIMIT 3
156156
SETTINGS vector_search_postfilter_multiplier = 2.0;
157157

158+
SELECT '-- Negative parameter values throw an exception';
159+
SELECT id
160+
FROM tab
161+
WHERE date = '2025-01-03' AND attr2 >= 1008
162+
ORDER BY L2Distance(vec, [1.0, 1.0])
163+
LIMIT 3
164+
SETTINGS vector_search_postfilter_multiplier = -1.0; -- { serverError INVALID_SETTING_VALUE }
165+
158166
DROP TABLE tab;
