Commit 80b2e72

Some fixups
1 parent fb26a35 commit 80b2e72

18 files changed, +208 -183 lines changed

docs/en/engines/table-engines/mergetree-family/annindexes.md

Lines changed: 20 additions & 14 deletions
Original file line numberDiff line numberDiff line change
@@ -238,16 +238,14 @@ These two strategies determine the order in which the filters are evaluated:
238238
- With pre-filtering, the filter evaluation order is the other way round.
239239

240240
Both strategies have different trade-offs:
241-
- Post-filtering has the general problem that it may return less than the number of rows requested in the `LIMIT <N>` clause. This happens when at least one of the result rows returned by the vector similarity index fails to satisfy the additional filters.
242-
- Pre-filtering is an unsolved problem. Some specialized vector databases implement it but most databases including ClickHouse will fall back to exact neighbor search, i.e., a brute-force scan without index.
241+
- Post-filtering has the general problem that it may return less than the number of rows requested in the `LIMIT <N>` clause. This situation happens when at least one of the result rows returned by the vector similarity index fails to satisfy the additional filters.
242+
- Pre-filtering is generally unsolved problem. Some specialized vector databases implement it but most databases including ClickHouse will fall back to exact neighbor search, i.e., a brute-force scan without index.
243243

244244
What strategy is used comes down to whether ClickHouse can use indexes for the additional filter conditions.
245-
246245
If no index can be used, post-filtering will be applied.
247246

248247
If the additional filter condition is part of the partition key, then ClickHouse will apply partition pruning.
249-
250-
Example, assuming that the table is range-partitioned by `year`:
248+
For example, assuming that the table is range-partitioned by `year`:
251249

252250
```sql
253251
WITH [0., 2.] AS reference_vec
@@ -261,14 +259,17 @@ LIMIT 3;
261259
ClickHouse will ignore all partitions but the one for year 2025.
262260
Within this partition, a post-filtering strategy will be applied.
263261

264-
If the additional filter condition is on the primary key and the filter selects some but not all ranges of a part, then Clickhouse will fall back to exact neighbour search i.e brute force scan without index, on the selected ranges of the part. If the primary key filter selects entire parts, Clickhouse will use the vector similarity index on those parts to retrieve results.
262+
If the additional filter condition is on the primary key columns and the filter selects some but not all ranges of a part, then Clickhouse will fall back to exact neighbour search (brute-force scan without index) on the selected ranges of the part.
263+
If the primary key filter selects entire parts, Clickhouse will use the vector similarity index on those parts to retrieve results.
265264

266-
In case additional filter conditions on columns can make use of skip indexes (minmax, set etc), Clickhouse by default chooses a post-filtering strategy. Clickhouse gives higher priority to the vector similarity index because the vector index is expected to deliver business value by accelerating semantic search response times.
265+
In case additional filter conditions on columns can make use of skip indexes (minmax, set etc), Clickhouse by default chooses a post-filtering strategy.
266+
Clickhouse gives higher priority to the vector similarity index because the vector index is expected to deliver business value by accelerating semantic search response times.
267267

268-
Clickhouse provides 2 settings for finer control on post-filtering and pre-filtering -
268+
Clickhouse provides 2 settings for finer control on post-filtering and pre-filtering:
269269

270-
- vector_search_filtering
271-
When the additional filter conditions are extremely selective, it is possible that brute force search on a small filtered set of rows gives better results then post-filtering using the vector search. Users can request explicit pre-filtering by setting ```vector_search_filtering``` to "prefilter" (default is "auto" which equates to "postfilter"). An example query where pre-filtering could be a good choice is -
270+
When the additional filter conditions are extremely selective, it is possible that brute force search on a small filtered set of rows gives better results then post-filtering using the vector search.
271+
Users can force pre-filtering by setting [vector_search_filter_strategy](../../../operations/settings/settings#vector_search_filter_strategy) to `prefilter` (default is `auto` which is equivalent to `postfilter`).
272+
An example query where pre-filtering could be a good choice is
272273

273274
```sql
274275
SELECT bookid, author, title
@@ -278,10 +279,11 @@ ORDER BY cosineDistance(book_vector, getEmbedding('Books on ancient Asian empire
278279
LIMIT 10
279280
```
280281

281-
Assuming books priced less that $2 are a tiny portion, post-filtering approach may return 0 rows because the top `LIMIT <N>` matches returned by the vector index could all be priced above $2. By opting for explicit pre-filtering, the subset of all books priced less than $2 are shortlisted and then brute-force vector search executed on the subset to return the closest matches.
282+
Assuming that only very few books cost less than $2, post-filtering may return zero rows because the top 10 matches returned by the vector index could all be priced above $2.
283+
By forcing pre-filtering (add `SETTINGS vector_search_filter_strategy = 'prefilter'` to the query), ClickHouse first finds all books with a price of less than $2 and then executes a brute-force vector search on the matches.
282284

283-
- vector_search_postfilter_multiplier
284-
As explained above in the trade-offs, post-filtering could return lesser number of rows then specified in the `LIMIT <N>` clause. Consider this query -
285+
As mentioned above, post-filtering may return less matches then specified in the `LIMIT <N>` clause.
286+
Consider query
285287

286288
```sql
287289
SELECT bookid, author, title
@@ -290,7 +292,11 @@ WHERE published_year <= 2000
290292
ORDER BY cosineDistance(book_vector, getEmbedding('Books on ancient Asian empires'))
291293
LIMIT 10
292294
```
293-
One or more of the 10 nearest matching books returned by the vector index could be published after year 2000. Hence the query will end up returning less than 10 rows, contrary to user expectations. For such cases, the parameter ```vector_search_postfilter_multiplier``` can be set to a value like 2 or 10 to indicate that 20 or 100 nearest matching books should be returned by the vector index and then the additional filter to be applied on those rows to return the result of 10 rows.
295+
296+
With post-filtering, some of the 10 nearest matching books returned by the vector index may be pruned from the result because they were published later than in the year 2000.
297+
As a result, the query may return less rows than the user requested.
298+
For such cases, you can set parameter [vector_search_postfilter_multiplier](../../../operations/settings/settings#vector_search_postfilter_multiplier) to a value > 1.0 (for example, 2.0) to indicate that N times this factor many matches should be returned by the vector index and then the additional filter to be applied on those rows to return the result of 10 rows.
299+
We note that this method can mitigate the problem with post-filtering but in extreme cases (extremely selective WHERE condition), there may still less than N requested rows returned.
294300

295301
### Performance Tuning {#performance-tuning}
296302
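
The documentation change above describes how `vector_search_postfilter_multiplier` widens the candidate set before the additional filter runs. The following is a minimal sketch of that idea, not ClickHouse code: `Row`, `nearestFromIndex`, and `postFilterSearch` are hypothetical names, the "index" is an exact sort rather than an approximate HNSW lookup, and rounding the fetch count up with `ceil` is an assumption.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical row type for illustration; not a ClickHouse structure.
struct Row
{
    int id;
    double distance;     // distance to the reference vector
    int published_year;  // column used by the additional filter
};

// Stand-in for the vector similarity index: returns the k rows with the
// smallest distance. A real HNSW index is approximate; this one is exact.
std::vector<Row> nearestFromIndex(std::vector<Row> rows, size_t k)
{
    std::sort(rows.begin(), rows.end(),
              [](const Row & a, const Row & b) { return a.distance < b.distance; });
    if (rows.size() > k)
        rows.resize(k);
    return rows;
}

// Post-filtering: fetch ceil(limit * multiplier) candidates from the index,
// apply the additional filter, and keep at most `limit` survivors.
std::vector<Row> postFilterSearch(
    const std::vector<Row> & rows,
    size_t limit,
    double multiplier,
    const std::function<bool(const Row &)> & filter)
{
    // Assumption: the fetch count is rounded up.
    const size_t fetch = static_cast<size_t>(std::ceil(static_cast<double>(limit) * multiplier));

    std::vector<Row> result;
    for (const Row & row : nearestFromIndex(rows, fetch))
        if (filter(row) && result.size() < limit)
            result.push_back(row);
    return result;
}
```

With six rows of which the two nearest fail the filter, a multiplier of 1.0 and a limit of 2 yields no rows, 2.0 yields one row, and 3.0 yields the full two rows. This matches the caveat in the text that the multiplier mitigates the problem but does not guarantee N rows.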

src/Core/Settings.cpp

Lines changed: 7 additions & 7 deletions
@@ -6586,17 +6586,17 @@ SELECT queries with LIMIT bigger than this setting cannot use vector similarity
65866586
DECLARE(UInt64, hnsw_candidate_list_size_for_search, 256, R"(
65876587
The size of the dynamic candidate list when searching the vector similarity index, also known as 'ef_search'.
65886588
)", EXPERIMENTAL) \
6589-
DECLARE(VectorSearchFilteringType, vector_search_filtering, VectorSearchFilteringType::AUTO, R"(
6590-
If a vector search query has a WHERE clause, this parameter determines if the predicates are evaluated first (pre-filtering) OR if the vector similarity index is looked up first (post-filtering). Please check documentation for additional specifics.
6589+
DECLARE(VectorSearchFilterStrategy, vector_search_filter_strategy, VectorSearchFilterStrategy::AUTO, R"(
6590+
If a vector search query has a WHERE clause, this setting determines if it is evaluated first (pre-filtering) OR if the vector similarity index is checked first (post-filtering).
65916591
65926592
Possible values:
65936593
6594-
AUTO - Currently maps to POSTFILTER semantics.
6595-
PREFILTER - Evaluate other column predicates first and then perform brute-force search to identify neighbours.
6596-
POSTFILTER - Use vector similarity index to identify neighbours and then apply other column predicates.
6594+
'auto' - Postfiltering (the exact semantics may change in future).
6595+
'postfilter' - Use vector similarity index to identify the nearest neighbours, then apply other filters
6596+
'prefilter' - Evaluate other filters first, then perform brute-force search to identify neighbours.
65976597
)", EXPERIMENTAL) \
6598-
DECLARE(UInt64, vector_search_postfilter_multiplier, 1, R"(
6599-
Determines the number of neighbours to fetch from the vector similarity index before performing post-filtering on other predicates. The number of neighbours fetched is (LIMIT n X ann_post_filter_multiplier).
6598+
DECLARE(Float, vector_search_postfilter_multiplier, 1.0, R"(
6599+
Multiply the fetched nearest neighbors from the vector similarity index by this number before performing post-filtering on other predicates.
66006600
)", EXPERIMENTAL) \
66016601
DECLARE(Bool, throw_on_unsupported_query_inside_transaction, true, R"(
66026602
Throw exception if unsupported query is used inside transaction

src/Core/Settings.h

Lines changed: 1 addition & 1 deletion
@@ -104,7 +104,7 @@ class WriteBuffer;
104104
M(CLASS_NAME, UInt64) \
105105
M(CLASS_NAME, UInt64Auto) \
106106
M(CLASS_NAME, URI) \
107-
M(CLASS_NAME, VectorSearchFilteringType)
107+
M(CLASS_NAME, VectorSearchFilterStrategy)
108108

109109

110110
COMMON_SETTINGS_SUPPORTED_TYPES(Settings, DECLARE_SETTING_TRAIT)

src/Core/SettingsChangesHistory.cpp

Lines changed: 2 additions & 2 deletions
@@ -98,8 +98,8 @@ const VersionToSettingsChangesMap & getSettingsChangesHistory()
9898
{"allow_experimental_lightweight_update", false, false, "A new setting"},
9999
{"allow_experimental_delta_kernel_rs", true, true, "New setting"},
100100
{"allow_experimental_database_hms_catalog", false, false, "Allow experimental database engine DataLakeCatalog with catalog_type = 'hive'"},
101-
{"vector_search_filtering", "auto", "auto", "Vector search related "},
102-
{"vector_search_postfilter_multiplier", 1, 1, "Vector search related "},
101+
{"vector_search_filter_strategy", "auto", "auto", "New setting"},
102+
{"vector_search_postfilter_multiplier", 1, 1, "New setting"},
103103
{"compile_expressions", false, true, "We believe that the LLVM infrastructure behind the JIT compiler is stable enough to enable this setting by default."},
104104
{"use_legacy_to_time", false, false, "New setting. Allows for user to use the old function logic for toTime, which works as toTimeWithFixedDate."},
105105
});

src/Core/SettingsEnums.cpp

Lines changed: 6 additions & 4 deletions
@@ -305,9 +305,11 @@ IMPLEMENT_SETTING_ENUM(
305305
{"glue", DatabaseDataLakeCatalogType::GLUE},
306306
{"hive", DatabaseDataLakeCatalogType::ICEBERG_HIVE}})
307307

308-
IMPLEMENT_SETTING_ENUM(VectorSearchFilteringType, ErrorCodes::BAD_ARGUMENTS,
309-
{{"auto", VectorSearchFilteringType::AUTO},
310-
{"prefilter", VectorSearchFilteringType::PREFILTER},
311-
{"postfilter", VectorSearchFilteringType::POSTFILTER}})
308+
IMPLEMENT_SETTING_ENUM(
309+
VectorSearchFilterStrategy,
310+
ErrorCodes::BAD_ARGUMENTS,
311+
{{"auto", VectorSearchFilterStrategy::AUTO},
312+
{"prefilter", VectorSearchFilterStrategy::PREFILTER},
313+
{"postfilter", VectorSearchFilterStrategy::POSTFILTER}})
312314

313315
}
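
For readers unfamiliar with the setting-enum macros, the following sketch shows roughly what the renamed enum amounts to from a user's point of view: a string value is mapped to `VectorSearchFilterStrategy`, and `auto` currently behaves like `postfilter`. It is an illustration only. `parseFilterStrategy` and `resolveFilterStrategy` are hypothetical helpers, not functions from the ClickHouse code base. The real mapping is generated by `IMPLEMENT_SETTING_ENUM` and reports `ErrorCodes::BAD_ARGUMENTS`, whereas this sketch throws `std::invalid_argument` and assumes exact string matching.

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>
#include <string>

// Mirrors the enum declared in SettingsEnums.h in this commit.
enum class VectorSearchFilterStrategy : uint8_t
{
    AUTO,
    PREFILTER,
    POSTFILTER,
};

// Hypothetical helper: maps the setting's string value to the enum.
// Unknown values are rejected (the real code uses ErrorCodes::BAD_ARGUMENTS).
VectorSearchFilterStrategy parseFilterStrategy(const std::string & value)
{
    if (value == "auto")
        return VectorSearchFilterStrategy::AUTO;
    if (value == "prefilter")
        return VectorSearchFilterStrategy::PREFILTER;
    if (value == "postfilter")
        return VectorSearchFilterStrategy::POSTFILTER;
    throw std::invalid_argument("Unknown vector_search_filter_strategy: " + value);
}

// Hypothetical helper: per the setting description, 'auto' currently means
// post-filtering (the description notes this may change in future).
VectorSearchFilterStrategy resolveFilterStrategy(VectorSearchFilterStrategy strategy)
{
    if (strategy == VectorSearchFilterStrategy::AUTO)
        return VectorSearchFilterStrategy::POSTFILTER;
    return strategy;
}
```

Keeping `AUTO` as a distinct enum value, rather than aliasing it to `POSTFILTER` at parse time, leaves room to change the default behaviour later without changing the setting's accepted values.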

src/Core/SettingsEnums.h

Lines changed: 2 additions & 2 deletions
@@ -396,13 +396,13 @@ enum class DatabaseDataLakeCatalogType : uint8_t
396396

397397
DECLARE_SETTING_ENUM(DatabaseDataLakeCatalogType)
398398

399-
enum class VectorSearchFilteringType : uint8_t
399+
enum class VectorSearchFilterStrategy : uint8_t
400400
{
401401
AUTO,
402402
PREFILTER,
403403
POSTFILTER,
404404
};
405405

406-
DECLARE_SETTING_ENUM(VectorSearchFilteringType)
406+
DECLARE_SETTING_ENUM(VectorSearchFilterStrategy)
407407

408408
}

src/Processors/QueryPlan/Optimizations/Optimizations.h

Lines changed: 1 addition & 1 deletion
@@ -32,9 +32,9 @@ struct Optimization
3232
struct ExtraSettings
3333
{
3434
size_t max_limit_for_vector_search_queries;
35+
VectorSearchFilterStrategy vector_search_filter_strategy;
3536
size_t use_index_for_in_with_subqueries_max_values;
3637
SizeLimits network_transfer_limits;
37-
VectorSearchFilteringType vector_search_filtering;
3838
};
3939

4040
using Function = size_t (*)(QueryPlan::Node *, QueryPlan::Nodes &, const ExtraSettings &);

src/Processors/QueryPlan/Optimizations/QueryPlanOptimizationSettings.cpp

Lines changed: 10 additions & 10 deletions
@@ -37,22 +37,22 @@ namespace Setting
3737
extern const SettingsBool query_plan_convert_join_to_in;
3838
extern const SettingsBool use_query_condition_cache;
3939
extern const SettingsBool query_condition_cache_store_conditions_as_plaintext;
40+
extern const SettingsBool collect_hash_table_stats_during_joins;
41+
extern const SettingsBool query_plan_join_shard_by_pk_ranges;
42+
extern const SettingsBool query_plan_optimize_lazy_materialization;
4043
extern const SettingsBoolAuto query_plan_join_swap_table;
4144
extern const SettingsMaxThreads max_threads;
45+
extern const SettingsOverflowMode transfer_overflow_mode;
4246
extern const SettingsSeconds lock_acquire_timeout;
4347
extern const SettingsString force_optimize_projection_name;
44-
extern const SettingsUInt64 max_limit_for_vector_search_queries;
45-
extern const SettingsUInt64 query_plan_max_optimizations_to_apply;
46-
extern const SettingsBool query_plan_optimize_lazy_materialization;
47-
extern const SettingsUInt64 query_plan_max_limit_for_lazy_materialization;
48-
extern const SettingsBool query_plan_join_shard_by_pk_ranges;
4948
extern const SettingsUInt64 max_bytes_to_transfer;
49+
extern const SettingsUInt64 max_limit_for_vector_search_queries;
5050
extern const SettingsUInt64 max_rows_to_transfer;
51-
extern const SettingsOverflowMode transfer_overflow_mode;
52-
extern const SettingsUInt64 use_index_for_in_with_subqueries_max_values;
5351
extern const SettingsUInt64 max_size_to_preallocate_for_joins;
54-
extern const SettingsBool collect_hash_table_stats_during_joins;
55-
extern const SettingsVectorSearchFilteringType vector_search_filtering;
52+
extern const SettingsUInt64 query_plan_max_limit_for_lazy_materialization;
53+
extern const SettingsUInt64 query_plan_max_optimizations_to_apply;
54+
extern const SettingsUInt64 use_index_for_in_with_subqueries_max_values;
55+
extern const SettingsVectorSearchFilterStrategy vector_search_filter_strategy;
5656
}
5757

5858
namespace ServerSetting
@@ -107,7 +107,7 @@ QueryPlanOptimizationSettings::QueryPlanOptimizationSettings(
107107
optimize_lazy_materialization = from[Setting::query_plan_optimize_lazy_materialization];
108108
max_limit_for_lazy_materialization = from[Setting::query_plan_max_limit_for_lazy_materialization];
109109

110-
vector_search_filtering = from[Setting::vector_search_filtering].value;
110+
vector_search_filter_strategy = from[Setting::vector_search_filter_strategy].value;
111111
max_limit_for_vector_search_queries = from[Setting::max_limit_for_vector_search_queries].value;
112112

113113
query_plan_join_shard_by_pk_ranges = from[Setting::query_plan_join_shard_by_pk_ranges].value;

src/Processors/QueryPlan/Optimizations/QueryPlanOptimizationSettings.h

Lines changed: 2 additions & 2 deletions
@@ -1,9 +1,9 @@
11
#pragma once
22

3+
#include <Core/SettingsEnums.h>
34
#include <Interpreters/Context_fwd.h>
45
#include <Interpreters/ExpressionActionsSettings.h>
56
#include <QueryPipeline/SizeLimits.h>
6-
#include <Core/Settings.h>
77

88
#include <cstddef>
99

@@ -88,7 +88,7 @@ struct QueryPlanOptimizationSettings
8888
bool optimize_lazy_materialization = false;
8989
size_t max_limit_for_lazy_materialization = 0;
9090

91-
VectorSearchFilteringType vector_search_filtering;
91+
VectorSearchFilterStrategy vector_search_filter_strategy;
9292
size_t max_limit_for_vector_search_queries;
9393

9494
/// Setting needed for Sets (JOIN -> IN optimization)

src/Processors/QueryPlan/Optimizations/optimizeTree.cpp

Lines changed: 1 addition & 1 deletion
@@ -48,9 +48,9 @@ void optimizeTreeFirstPass(const QueryPlanOptimizationSettings & optimization_se
4848

4949
Optimization::ExtraSettings extra_settings = {
5050
optimization_settings.max_limit_for_vector_search_queries,
51+
optimization_settings.vector_search_filter_strategy,
5152
optimization_settings.use_index_for_in_with_subqueries_max_values,
5253
optimization_settings.network_transfer_limits,
53-
optimization_settings.vector_search_filtering,
5454
};
5555

5656
while (!stack.empty())
