
Commit 754ff07

authored
Merge pull request ClickHouse#79854 from shankar-iyer/vector_search_pre_and_post_filtering
Vector search: Pre-filtering & post-filtering
2 parents 16d17d0 + e0095a8 commit 754ff07

21 files changed

+391
-41
lines changed

ci/jobs/scripts/check_style/aspell-ignore/en/aspell-dict.txt

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -2484,6 +2484,7 @@ positionCaseInsensitiveUTF
24842484
positionUTF
24852485
positiveModulo
24862486
positiveModuloOrNull
2487+
postfilter
24872488
postfix
24882489
postfixes
24892490
postgres
@@ -2501,6 +2502,7 @@ prefetched
25012502
prefetches
25022503
prefetching
25032504
prefetchsize
2505+
prefilter
25042506
preflight
25052507
preimage
25062508
preloaded

docs/en/engines/table-engines/mergetree-family/annindexes.md

Lines changed: 65 additions & 18 deletions
Original file line numberDiff line numberDiff line change
@@ -78,7 +78,7 @@ ClickHouse provides a special "vector similarity" index to perform approximate n
7878
:::note
7979
Vector similarity indexes are currently experimental.
8080
To enable them, please first run `SET allow_experimental_vector_similarity_index = 1`.
81-
If you run into problems, kindly open an issue at github.com/clickhouse/clickhouse/issues.
81+
If you run into problems, kindly open an issue in the [ClickHouse repository](https://github.com/clickhouse/clickhouse/issues).
8282
:::
8383

8484
### Creating a Vector Similarity Index {#creating-a-vector-similarity-index}
@@ -174,13 +174,13 @@ LIMIT <N>
174174
ClickHouse's query optimizer tries to match above query template and make use of available vector similarity indexes.
175175
A query can only use a vector similarity index if the distance function in the SELECT query is the same as the distance function in the index definition.
176176

177-
Advanced users may provide a custom value for setting [hnsw_candidate_list_size_for_search](../../../operations/settings/settings.md#hnsw_candidate_list_size_for_search) (also know as HNSW hyperparameter `ef_search`) to tune the size of the candidate list during search (e.g. `SELECT [...] SETTINGS hnsw_candidate_list_size_for_search = <value>`).
177+
Advanced users may provide a custom value for setting [hnsw_candidate_list_size_for_search](../../../operations/settings/settings.md#hnsw_candidate_list_size_for_search) (also known as HNSW hyperparameter "ef_search") to tune the size of the candidate list during search (e.g. `SELECT [...] SETTINGS hnsw_candidate_list_size_for_search = <value>`).
178178
The default value of the setting 256 works well in the majority of use cases.
179179
Higher setting values mean better accuracy at the cost of slower performance.
180180

181181
If the query can use a vector similarity index, ClickHouse checks that the LIMIT `<N>` provided in SELECT queries is within reasonable bounds.
182182
More specifically, an error is returned if `<N>` is bigger than the value of setting [max_limit_for_vector_search_queries](../../../operations/settings/settings.md#max_limit_for_vector_search_queries) with default value 100.
183-
Too large LIMITs can slow down searches and usually indicate a usage error.
183+
Too large LIMIT values can slow down searches and usually indicate a usage error.
184184

185185
To check if a SELECT query uses a vector similarity index, you can prefix the query with `EXPLAIN indexes = 1`.
186186

@@ -231,23 +231,22 @@ To enforce index usage, you can run the SELECT query with setting [force_data_sk
231231

232232
**Post-filtering and Pre-filtering**
233233

234-
Users may optionally specify a `WHERE` clause with additional filter conditions in SELECT queries.
235-
Depending on these filter conditions, ClickHouse will utilize post-filtering or pre-filtering.
236-
These two strategies determine the order in which the filters are evaluated:
237-
- With post-filtering, the vector similarity index is evaluated first, afterwards ClickHouse evaluates the additional filter(s) specified of the `WHERE` clause.
238-
- With pre-filtering, the filter evaluation order is the other way round.
234+
Users may optionally specify a `WHERE` clause with additional filter conditions for the SELECT query.
235+
ClickHouse will evaluate these filter conditions using either a post-filtering or a pre-filtering strategy.
236+
In short, both strategies determine the order in which the filters are evaluated:
237+
- Post-filtering means that the vector similarity index is evaluated first, afterwards ClickHouse evaluates the additional filter(s) specified in the `WHERE` clause.
238+
- Pre-filtering means that the filter evaluation order is the other way round.
239239

240-
Both strategies have different trade-offs:
241-
- Post-filtering has the general problem that it may return less than the number of rows requested in the `LIMIT <N>` clause. This happens when at least one of the result rows returned by the vector similarity index fails to satisfy the additional filters. In ClickHouse, this situation is luckily unlikely to happen because vector similarity indexes do not return rows but blocks with thousands of rows (see "Differences to Regular Skipping Indexes" below).
242-
- Pre-filtering is an unsolved problem. Some specialized vector databases implement it but most databases including ClickHouse will fall back to exact neighbor search, i.e., a brute-force scan without index.
240+
The strategies have different trade-offs:
241+
- Post-filtering has the general problem that it may return fewer rows than requested in the `LIMIT <N>` clause. This happens when one or more result rows returned by the vector similarity index fail to satisfy the additional filters.
242+
- Pre-filtering is generally an unsolved problem. Certain specialized vector databases provide pre-filtering algorithms but most relational databases (including ClickHouse) will fall back to exact neighbor search, i.e., a brute-force scan without index.
243243

244-
What strategy is used comes down to whether ClickHouse can use indexes for the additional filter conditions.
244+
What strategy is used depends on the filter condition.
245245

246-
If no index can be used, post-filtering will be applied.
246+
*Additional filters are part of the partition key*
247247

248248
If the additional filter condition is part of the partition key, then ClickHouse will apply partition pruning.
249-
250-
Example, assuming that the table is range-partitioned by `year`:
249+
As an example, a table is range-partitioned by column `year` and the following query is run:
251250

252251
```sql
253252
WITH [0., 2.] AS reference_vec
@@ -258,10 +257,58 @@ ORDER BY L2Distance(vec, reference_vec) ASC
258257
LIMIT 3;
259258
```
260259

261-
ClickHouse will ignore all partitions but the one for year 2025.
262-
Within this partition, a post-filtering strategy will be applied.
260+
ClickHouse will prune all partitions except the 2025 one.
261+
262+
*Additional filters cannot be evaluated using indexes*
263+
264+
If additional filter conditions cannot be evaluated using indexes (primary key index, skipping index), ClickHouse will apply post-filtering.
265+
266+
*Additional filters can be evaluated using the primary key index*
267+
268+
If additional filter conditions can be evaluated using the [primary key](mergetree.md#primary-key) (i.e., they form a prefix of the primary key) and
269+
- the filter condition eliminates at least one row within a part, then ClickHouse will fall back to pre-filtering for the "surviving" ranges within the part,
270+
- the filter condition eliminates no rows within a part, then ClickHouse will perform post-filtering for the part.
271+
272+
In practical use cases, the latter case is rather unlikely.
273+
274+
*Additional filters can be evaluated using skipping index*
275+
276+
If additional filter conditions can be evaluated using [skipping indexes](mergetree.md#table_engine-mergetree-data_skipping-indexes) (minmax index, set index, etc.), ClickHouse performs post-filtering.
277+
In such cases, the vector similarity index is evaluated first as it is expected to remove the most rows relative to other skipping indexes.
278+
279+
For finer control over post-filtering vs. pre-filtering, two settings can be used:
280+
281+
Setting [vector_search_filter_strategy](../../../operations/settings/settings#vector_search_filter_strategy) (default: `auto` which implements above heuristics) may be set to `prefilter`.
282+
This is useful to force pre-filtering in cases where the additional filter conditions are extremely selective.
283+
As an example, the following query may benefit from pre-filtering:
284+
285+
```sql
286+
SELECT bookid, author, title
287+
FROM books
288+
WHERE price < 2.00
289+
ORDER BY cosineDistance(book_vector, getEmbedding('Books on ancient Asian empires'))
290+
LIMIT 10
291+
```
292+
293+
Assuming that only a very small number of books cost less than $2, post-filtering may return zero rows because the top 10 matches returned by the vector index could all be priced above $2.
294+
By forcing pre-filtering (add `SETTINGS vector_search_filter_strategy = 'prefilter'` to the query), ClickHouse first finds all books with a price of less than $2 and then executes a brute-force vector search for the found books.
295+
296+
As an alternative approach to resolve above issue, setting [vector_search_postfilter_multiplier](../../../operations/settings/settings#vector_search_postfilter_multiplier) (default: `1.0`) may be configured to a value > `1.0` (for example, `2.0`).
297+
The number of nearest neighbors fetched from the vector index is multiplied by the setting value, and the additional filter is then applied to those rows to return LIMIT-many rows.
298+
As an example, we can query again but with multiplier `3.0`:
299+
300+
```sql
301+
SELECT bookid, author, title
302+
FROM books
303+
WHERE price < 2.00
304+
ORDER BY cosineDistance(book_vector, getEmbedding('Books on ancient Asian empires'))
305+
LIMIT 10
306+
SETTINGS vector_search_postfilter_multiplier = 3.0;
307+
```
263308

264-
If the additional filter condition is part of the primary key, then ClickHouse will always apply pre-filtering.
309+
ClickHouse will fetch 3.0 x 10 = 30 nearest neighbors from the vector index in each part and afterwards evaluate the additional filters.
310+
Only the ten closest neighbors will be returned.
311+
Note that setting `vector_search_postfilter_multiplier` can mitigate the problem, but in extreme cases (a very selective WHERE condition) it is still possible that fewer than the N requested rows are returned.
265312

266313
### Performance Tuning {#performance-tuning}
267314
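
The trade-off described in the documentation change above can be illustrated with a small standalone sketch. It is not ClickHouse code: the `Book` struct and the functions `postFilter` and `preFilter` are hypothetical, an exact sort stands in for the approximate vector index, and rounding the multiplied fetch count up with `std::ceil` is an assumption rather than documented behavior.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical row type for illustration only; not a ClickHouse structure.
struct Book
{
    int id;
    double price;
    double distance; // precomputed distance to the reference vector
};

// Order rows by ascending distance (stand-in for a nearest-neighbor lookup).
static void sortByDistance(std::vector<Book> & rows)
{
    std::sort(rows.begin(), rows.end(),
              [](const Book & a, const Book & b) { return a.distance < b.distance; });
}

// Post-filtering: fetch (limit * multiplier) nearest rows first, then apply the price filter.
// Assumption: the fetch count is rounded up.
std::vector<int> postFilter(std::vector<Book> rows, std::size_t limit, double multiplier, double max_price)
{
    sortByDistance(rows);
    const std::size_t fetch = std::min(rows.size(), static_cast<std::size_t>(std::ceil(limit * multiplier)));
    std::vector<int> result;
    for (std::size_t i = 0; i < fetch && result.size() < limit; ++i)
        if (rows[i].price < max_price)
            result.push_back(rows[i].id);
    return result;
}

// Pre-filtering: apply the price filter first, then brute-force search the survivors.
std::vector<int> preFilter(std::vector<Book> rows, std::size_t limit, double max_price)
{
    rows.erase(std::remove_if(rows.begin(), rows.end(),
                              [max_price](const Book & b) { return b.price >= max_price; }),
               rows.end());
    sortByDistance(rows);
    std::vector<int> result;
    for (std::size_t i = 0; i < rows.size() && result.size() < limit; ++i)
        result.push_back(rows[i].id);
    return result;
}
```

With a selective filter, `postFilter` with multiplier `1.0` can return fewer than `limit` rows (possibly none), a larger multiplier recovers more of them, and `preFilter` always returns up to `limit` matching rows at the cost of scanning every row that passes the filter.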

src/Core/Settings.cpp

Lines changed: 10 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -6567,10 +6567,7 @@ Enable experimental hash functions
65676567
Allow the obsolete Object data type
65686568
)", EXPERIMENTAL) \
65696569
DECLARE(Bool, allow_experimental_time_series_table, false, R"(
6570-
Allows creation of tables with the [TimeSeries](../../engines/table-engines/integrations/time-series.md) table engine.
6571-
6572-
Possible values:
6573-
6570+
Allows creation of tables with the [TimeSeries](../../engines/table-engines/integrations/time-series.md) table engine. Possible values:
65746571
- 0 — the [TimeSeries](../../engines/table-engines/integrations/time-series.md) table engine is disabled.
65756572
- 1 — the [TimeSeries](../../engines/table-engines/integrations/time-series.md) table engine is enabled.
65766573
)", EXPERIMENTAL) \
@@ -6585,6 +6582,15 @@ SELECT queries with LIMIT bigger than this setting cannot use vector similarity
65856582
)", EXPERIMENTAL) \
65866583
DECLARE(UInt64, hnsw_candidate_list_size_for_search, 256, R"(
65876584
The size of the dynamic candidate list when searching the vector similarity index, also known as 'ef_search'.
6585+
)", EXPERIMENTAL) \
6586+
DECLARE(VectorSearchFilterStrategy, vector_search_filter_strategy, VectorSearchFilterStrategy::AUTO, R"(
6587+
If a vector search query has a WHERE clause, this setting determines if it is evaluated first (pre-filtering) OR if the vector similarity index is checked first (post-filtering). Possible values:
6588+
- 'auto' - Postfiltering (the exact semantics may change in future).
6589+
- 'postfilter' - Use vector similarity index to identify the nearest neighbours, then apply other filters
6590+
- 'prefilter' - Evaluate other filters first, then perform brute-force search to identify neighbours.
6591+
)", EXPERIMENTAL) \
6592+
DECLARE(Float, vector_search_postfilter_multiplier, 1.0, R"(
6593+
Multiply the fetched nearest neighbors from the vector similarity index by this number before performing post-filtering on other predicates.
65886594
)", EXPERIMENTAL) \
65896595
DECLARE(Bool, throw_on_unsupported_query_inside_transaction, true, R"(
65906596
Throw exception if unsupported query is used inside transaction

src/Core/Settings.h

Lines changed: 2 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -103,7 +103,8 @@ class WriteBuffer;
103103
M(CLASS_NAME, TransactionsWaitCSNMode) \
104104
M(CLASS_NAME, UInt64) \
105105
M(CLASS_NAME, UInt64Auto) \
106-
M(CLASS_NAME, URI)
106+
M(CLASS_NAME, URI) \
107+
M(CLASS_NAME, VectorSearchFilterStrategy)
107108

108109

109110
COMMON_SETTINGS_SUPPORTED_TYPES(Settings, DECLARE_SETTING_TRAIT)

src/Core/SettingsChangesHistory.cpp

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -98,6 +98,8 @@ const VersionToSettingsChangesMap & getSettingsChangesHistory()
9898
{"allow_experimental_lightweight_update", false, false, "A new setting"},
9999
{"allow_experimental_delta_kernel_rs", true, true, "New setting"},
100100
{"allow_experimental_database_hms_catalog", false, false, "Allow experimental database engine DataLakeCatalog with catalog_type = 'hive'"},
101+
{"vector_search_filter_strategy", "auto", "auto", "New setting"},
102+
{"vector_search_postfilter_multiplier", 1, 1, "New setting"},
101103
{"compile_expressions", false, true, "We believe that the LLVM infrastructure behind the JIT compiler is stable enough to enable this setting by default."},
102104
{"use_legacy_to_time", false, false, "New setting. Allows for user to use the old function logic for toTime, which works as toTimeWithFixedDate."},
103105
{"input_format_parquet_allow_geoparquet_parser", false, true, "A new setting to use geo columns in parquet file"},

src/Core/SettingsEnums.cpp

Lines changed: 7 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -305,4 +305,11 @@ IMPLEMENT_SETTING_ENUM(
305305
{"glue", DatabaseDataLakeCatalogType::GLUE},
306306
{"hive", DatabaseDataLakeCatalogType::ICEBERG_HIVE}})
307307

308+
IMPLEMENT_SETTING_ENUM(
309+
VectorSearchFilterStrategy,
310+
ErrorCodes::BAD_ARGUMENTS,
311+
{{"auto", VectorSearchFilterStrategy::AUTO},
312+
{"prefilter", VectorSearchFilterStrategy::PREFILTER},
313+
{"postfilter", VectorSearchFilterStrategy::POSTFILTER}})
314+
308315
}

src/Core/SettingsEnums.h

Lines changed: 9 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -396,4 +396,13 @@ enum class DatabaseDataLakeCatalogType : uint8_t
396396

397397
DECLARE_SETTING_ENUM(DatabaseDataLakeCatalogType)
398398

399+
enum class VectorSearchFilterStrategy : uint8_t
400+
{
401+
AUTO,
402+
PREFILTER,
403+
POSTFILTER,
404+
};
405+
406+
DECLARE_SETTING_ENUM(VectorSearchFilterStrategy)
407+
399408
}
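
As a rough illustration of what the string-to-enum mapping declared above amounts to, the following standalone sketch maps the three setting values to the enum. The function `parseFilterStrategy` is hypothetical and only for illustration; in the commit the mapping is generated by `IMPLEMENT_SETTING_ENUM` with `ErrorCodes::BAD_ARGUMENTS`, whereas this sketch simply returns an empty optional for unknown names.

```cpp
#include <cstdint>
#include <optional>
#include <string_view>

// Mirrors the enum added in SettingsEnums.h.
enum class VectorSearchFilterStrategy : std::uint8_t
{
    AUTO,
    PREFILTER,
    POSTFILTER,
};

// Hypothetical helper: map a setting value to the enum.
// Unknown names yield std::nullopt here (an assumption made for this sketch).
std::optional<VectorSearchFilterStrategy> parseFilterStrategy(std::string_view name)
{
    if (name == "auto")
        return VectorSearchFilterStrategy::AUTO;
    if (name == "prefilter")
        return VectorSearchFilterStrategy::PREFILTER;
    if (name == "postfilter")
        return VectorSearchFilterStrategy::POSTFILTER;
    return std::nullopt;
}
```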

src/Processors/QueryPlan/Optimizations/Optimizations.h

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -32,6 +32,7 @@ struct Optimization
3232
struct ExtraSettings
3333
{
3434
size_t max_limit_for_vector_search_queries;
35+
VectorSearchFilterStrategy vector_search_filter_strategy;
3536
size_t use_index_for_in_with_subqueries_max_values;
3637
SizeLimits network_transfer_limits;
3738
};

src/Processors/QueryPlan/Optimizations/QueryPlanOptimizationSettings.cpp

Lines changed: 11 additions & 8 deletions
Original file line numberDiff line numberDiff line change
@@ -37,21 +37,22 @@ namespace Setting
3737
extern const SettingsBool query_plan_convert_join_to_in;
3838
extern const SettingsBool use_query_condition_cache;
3939
extern const SettingsBool query_condition_cache_store_conditions_as_plaintext;
40+
extern const SettingsBool collect_hash_table_stats_during_joins;
41+
extern const SettingsBool query_plan_join_shard_by_pk_ranges;
42+
extern const SettingsBool query_plan_optimize_lazy_materialization;
4043
extern const SettingsBoolAuto query_plan_join_swap_table;
4144
extern const SettingsMaxThreads max_threads;
45+
extern const SettingsOverflowMode transfer_overflow_mode;
4246
extern const SettingsSeconds lock_acquire_timeout;
4347
extern const SettingsString force_optimize_projection_name;
44-
extern const SettingsUInt64 max_limit_for_vector_search_queries;
45-
extern const SettingsUInt64 query_plan_max_optimizations_to_apply;
46-
extern const SettingsBool query_plan_optimize_lazy_materialization;
47-
extern const SettingsUInt64 query_plan_max_limit_for_lazy_materialization;
48-
extern const SettingsBool query_plan_join_shard_by_pk_ranges;
4948
extern const SettingsUInt64 max_bytes_to_transfer;
49+
extern const SettingsUInt64 max_limit_for_vector_search_queries;
5050
extern const SettingsUInt64 max_rows_to_transfer;
51-
extern const SettingsOverflowMode transfer_overflow_mode;
52-
extern const SettingsUInt64 use_index_for_in_with_subqueries_max_values;
5351
extern const SettingsUInt64 max_size_to_preallocate_for_joins;
54-
extern const SettingsBool collect_hash_table_stats_during_joins;
52+
extern const SettingsUInt64 query_plan_max_limit_for_lazy_materialization;
53+
extern const SettingsUInt64 query_plan_max_optimizations_to_apply;
54+
extern const SettingsUInt64 use_index_for_in_with_subqueries_max_values;
55+
extern const SettingsVectorSearchFilterStrategy vector_search_filter_strategy;
5556
}
5657

5758
namespace ServerSetting
@@ -106,7 +107,9 @@ QueryPlanOptimizationSettings::QueryPlanOptimizationSettings(
106107
optimize_lazy_materialization = from[Setting::query_plan_optimize_lazy_materialization];
107108
max_limit_for_lazy_materialization = from[Setting::query_plan_max_limit_for_lazy_materialization];
108109

110+
vector_search_filter_strategy = from[Setting::vector_search_filter_strategy].value;
109111
max_limit_for_vector_search_queries = from[Setting::max_limit_for_vector_search_queries].value;
112+
110113
query_plan_join_shard_by_pk_ranges = from[Setting::query_plan_join_shard_by_pk_ranges].value;
111114

112115
network_transfer_limits = SizeLimits(from[Setting::max_rows_to_transfer], from[Setting::max_bytes_to_transfer], from[Setting::transfer_overflow_mode]);

src/Processors/QueryPlan/Optimizations/QueryPlanOptimizationSettings.h

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -1,5 +1,6 @@
11
#pragma once
22

3+
#include <Core/SettingsEnums.h>
34
#include <Interpreters/Context_fwd.h>
45
#include <Interpreters/ExpressionActionsSettings.h>
56
#include <QueryPipeline/SizeLimits.h>
@@ -87,6 +88,7 @@ struct QueryPlanOptimizationSettings
8788
bool optimize_lazy_materialization = false;
8889
size_t max_limit_for_lazy_materialization = 0;
8990

91+
VectorSearchFilterStrategy vector_search_filter_strategy;
9092
size_t max_limit_for_vector_search_queries;
9193

9294
/// Setting needed for Sets (JOIN -> IN optimization)
