docs/en/engines/table-engines/mergetree-family/annindexes.md
Lines changed: 20 additions & 14 deletions
@@ -238,16 +238,14 @@ These two strategies determine the order in which the filters are evaluated:
238
238
- With pre-filtering, the filter evaluation order is the other way round.
239
239
240
240
Both strategies have different trade-offs:
241
-
- Post-filtering has the general problem that it may return less than the number of rows requested in the `LIMIT <N>` clause. This happens when at least one of the result rows returned by the vector similarity index fails to satisfy the additional filters.
242
-
- Pre-filtering is an unsolved problem. Some specialized vector databases implement it but most databases including ClickHouse will fall back to exact neighbor search, i.e., a brute-force scan without index.
241
+
- Post-filtering has the general problem that it may return fewer rows than requested in the `LIMIT <N>` clause. This happens when at least one of the result rows returned by the vector similarity index fails to satisfy the additional filters.
242
+
- Pre-filtering is a generally unsolved problem. Some specialized vector databases implement it, but most databases, including ClickHouse, fall back to exact neighbor search, i.e., a brute-force scan without an index.
243
243
244
244
What strategy is used comes down to whether ClickHouse can use indexes for the additional filter conditions.
245
-
246
245
If no index can be used, post-filtering will be applied.
247
246
248
247
If the additional filter condition is part of the partition key, then ClickHouse will apply partition pruning.
249
-
250
-
Example, assuming that the table is range-partitioned by `year`:
248
+
For example, assuming that the table is range-partitioned by `year`:
251
249
252
250
```sql
253
251
WITH [0., 2.] AS reference_vec
@@ -261,14 +259,17 @@ LIMIT 3;
261
259
ClickHouse will ignore all partitions but the one for year 2025.
262
260
Within this partition, a post-filtering strategy will be applied.
263
261
264
-
If the additional filter condition is on the primary key and the filter selects some but not all ranges of a part, then Clickhouse will fall back to exact neighbour search i.e brute force scan without index, on the selected ranges of the part. If the primary key filter selects entire parts, Clickhouse will use the vector similarity index on those parts to retrieve results.
262
+
If the additional filter condition is on the primary key columns and the filter selects some but not all ranges of a part, then ClickHouse will fall back to exact neighbor search (a brute-force scan without an index) on the selected ranges of the part.
263
+
If the primary key filter selects entire parts, ClickHouse will use the vector similarity index on those parts to retrieve results.
265
264
266
-
In case additional filter conditions on columns can make use of skip indexes (minmax, set etc), Clickhouse by default chooses a post-filtering strategy. Clickhouse gives higher priority to the vector similarity index because the vector index is expected to deliver business value by accelerating semantic search response times.
265
+
If additional filter conditions can make use of skip indexes (minmax, set, etc.), ClickHouse chooses a post-filtering strategy by default.
266
+
ClickHouse gives higher priority to the vector similarity index because the vector index is expected to deliver business value by accelerating semantic search response times.
267
267
268
-
Clickhouse provides 2 settings for finer control on post-filtering and pre-filtering -
268
+
ClickHouse provides two settings for finer control over post-filtering and pre-filtering:
269
269
270
-
- vector_search_filtering
271
-
When the additional filter conditions are extremely selective, it is possible that brute force search on a small filtered set of rows gives better results then post-filtering using the vector search. Users can request explicit pre-filtering by setting ```vector_search_filtering``` to "prefilter" (default is "auto" which equates to "postfilter"). An example query where pre-filtering could be a good choice is -
270
+
When the additional filter conditions are extremely selective, a brute-force search on a small filtered set of rows may give better results than post-filtering using the vector search.
271
+
Users can force pre-filtering by setting [vector_search_filter_strategy](../../../operations/settings/settings#vector_search_filter_strategy) to `prefilter` (the default is `auto`, which is equivalent to `postfilter`).
272
+
An example query where pre-filtering could be a good choice is:
272
273
273
274
```sql
274
275
SELECT bookid, author, title
@@ -278,10 +279,11 @@ ORDER BY cosineDistance(book_vector, getEmbedding('Books on ancient Asian empire
278
279
LIMIT 10
279
280
```
280
281
281
-
Assuming books priced less that $2 are a tiny portion, post-filtering approach may return 0 rows because the top `LIMIT <N>` matches returned by the vector index could all be priced above $2. By opting for explicit pre-filtering, the subset of all books priced less than $2 are shortlisted and then brute-force vector search executed on the subset to return the closest matches.
282
+
Assuming that only very few books cost less than $2, post-filtering may return zero rows because the top 10 matches returned by the vector index could all be priced above $2.
283
+
By forcing pre-filtering (add `SETTINGS vector_search_filter_strategy = 'prefilter'` to the query), ClickHouse first finds all books with a price of less than $2 and then executes a brute-force vector search on the matches.
282
284
283
-
- vector_search_postfilter_multiplier
284
-
As explained above in the trade-offs, post-filtering could return lesser number of rows then specified in the `LIMIT <N>` clause. Consider this query -
285
+
As mentioned above, post-filtering may return fewer matches than specified in the `LIMIT <N>` clause.
286
+
Consider the query:
285
287
286
288
```sql
287
289
SELECT bookid, author, title
@@ -290,7 +292,11 @@ WHERE published_year <= 2000
290
292
ORDER BY cosineDistance(book_vector, getEmbedding('Books on ancient Asian empires'))
291
293
LIMIT 10
292
294
```
293
-
One or more of the 10 nearest matching books returned by the vector index could be published after year 2000. Hence the query will end up returning less than 10 rows, contrary to user expectations. For such cases, the parameter ```vector_search_postfilter_multiplier``` can be set to a value like 2 or 10 to indicate that 20 or 100 nearest matching books should be returned by the vector index and then the additional filter to be applied on those rows to return the result of 10 rows.
295
+
296
+
With post-filtering, some of the 10 nearest matching books returned by the vector index may be pruned from the result because they were published after the year 2000.
297
+
As a result, the query may return fewer rows than the user requested.
298
+
For such cases, you can set the setting [vector_search_postfilter_multiplier](../../../operations/settings/settings#vector_search_postfilter_multiplier) to a value greater than 1.0 (for example, 2.0). The vector index then returns N times this factor many matches (here N = 10), and the additional filter is applied to those rows to produce the final result of up to 10 rows.
299
+
Note that this method can mitigate the problem with post-filtering, but in extreme cases (an extremely selective WHERE condition), fewer than the N requested rows may still be returned.
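The effect of the multiplier can be illustrated with a small simulation. This is only a sketch, not ClickHouse code: the helper names `nearest` and `postfilter_search` are hypothetical, an exact sort stands in for the vector similarity index, and rounding the fetch count up is an assumption.

```python
import math

def nearest(rows, query, k):
    """Stand-in for the vector index: the k rows closest to query (exact search)."""
    return sorted(rows, key=lambda r: math.dist(r["vec"], query))[:k]

def postfilter_search(rows, query, predicate, limit, multiplier=1.0):
    """Fetch limit * multiplier candidates first, then apply the extra filter."""
    fetch = math.ceil(limit * multiplier)  # rounding up is an assumption of this sketch
    candidates = nearest(rows, query, fetch)
    return [r for r in candidates if predicate(r)][:limit]

# Toy data: ten books on a line; books with an odd id were published after 2000.
books = [
    {"id": i, "vec": (float(i), 0.0), "year": 1995 if i % 2 == 0 else 2010}
    for i in range(10)
]
query = (0.0, 0.0)

def published_by_2000(row):
    return row["year"] <= 2000

def only_book_9(row):
    return row["id"] == 9  # extremely selective filter

plain = postfilter_search(books, query, published_by_2000, limit=3)
boosted = postfilter_search(books, query, published_by_2000, limit=3, multiplier=2.0)
extreme = postfilter_search(books, query, only_book_9, limit=3, multiplier=2.0)

print([r["id"] for r in plain])    # [0, 2]: one of the 3 candidates was filtered out
print([r["id"] for r in boosted])  # [0, 2, 4]: 6 candidates leave 3 matches
print([r["id"] for r in extreme])  # []: the only match is not among the 6 candidates
```

In this toy setup a multiplier of 2.0 is enough to fill the limit for the year filter, while the very selective filter still returns nothing, which mirrors the caveat above.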
If a vector search query has a WHERE clause, this parameter determines if the predicates are evaluated first (pre-filtering) OR if the vector similarity index is looked up first (post-filtering). Please check documentation for additional specifics.
If a vector search query has a WHERE clause, this setting determines if it is evaluated first (pre-filtering) or if the vector similarity index is checked first (post-filtering).
6591
6591
6592
6592
Possible values:
6593
6593
6594
-
AUTO - Currently maps to POSTFILTER semantics.
6595
-
PREFILTER - Evaluate other column predicates first and then perform brute-force search to identify neighbours.
6596
-
POSTFILTER - Use vector similarity index to identify neighbours and then apply other column predicates.
6594
+
'auto' - Postfiltering (the exact semantics may change in future).
6595
+
'postfilter' - Use vector similarity index to identify the nearest neighbours, then apply other filters.
6596
+
'prefilter' - Evaluate other filters first, then perform brute-force search to identify neighbours.
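The difference between the strategy values can be sketched with a toy model. This is an illustration under stated assumptions, not the ClickHouse implementation: the function `search` and the data are hypothetical, an exact sort stands in for the vector similarity index, and `auto` is mapped to post-filtering as described above.

```python
import math

def search(rows, query, predicate, limit, strategy="auto"):
    """Toy model of the two evaluation orders for a filtered vector search."""
    def distance(row):
        return math.dist(row["vec"], query)

    if strategy == "prefilter":
        # Evaluate the other filters first, then brute-force search the matches.
        matching = [r for r in rows if predicate(r)]
        return sorted(matching, key=distance)[:limit]
    if strategy in ("auto", "postfilter"):
        # Take the nearest neighbours first (index stand-in), then filter them.
        candidates = sorted(rows, key=distance)[:limit]
        return [r for r in candidates if predicate(r)]
    raise ValueError(f"unknown strategy: {strategy}")

# 100 books on a line; only the five farthest from the query cost less than $2.
books = [
    {"id": i, "vec": (float(i), 0.0), "price": 1.5 if i >= 95 else 20.0}
    for i in range(100)
]
query = (0.0, 0.0)

def is_cheap(row):
    return row["price"] < 2

post = search(books, query, is_cheap, limit=3, strategy="postfilter")
pre = search(books, query, is_cheap, limit=3, strategy="prefilter")

print([r["id"] for r in post])  # []: none of the 3 nearest books is cheap
print([r["id"] for r in pre])   # [95, 96, 97]: nearest among the cheap books
```

With a highly selective filter, the pre-filtering branch scans only the few matching rows, which is why a brute-force search can be acceptable there.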
Determines the number of neighbours to fetch from the vector similarity index before performing post-filtering on other predicates. The number of neighbours fetched is (LIMIT n × vector_search_postfilter_multiplier).
{"vector_search_filtering", "auto", "auto", "Vector search related "},
102
-
{"vector_search_postfilter_multiplier", 1, 1, "Vector search related "},
101
+
{"vector_search_filter_strategy", "auto", "auto", "New setting"},
102
+
{"vector_search_postfilter_multiplier", 1, 1, "New setting"},
103
103
{"compile_expressions", false, true, "We believe that the LLVM infrastructure behind the JIT compiler is stable enough to enable this setting by default."},
104
104
{"use_legacy_to_time", false, false, "New setting. Allows for user to use the old function logic for toTime, which works as toTimeWithFixedDate."},