Commit decf72f

authored
Merge pull request ClickHouse#80262 from ClickHouse/ahmadov/gin-tokenizer-support
Support explicit parameters for GIN index definition
2 parents b8213f2 + b5a43fc commit decf72f

20 files changed

+478
-130
lines changed

docs/en/engines/table-engines/mergetree-family/invertedindexes.md

Lines changed: 15 additions & 9 deletions
Original file line numberDiff line numberDiff line change
@@ -54,28 +54,34 @@ CREATE TABLE tab
5454
(
5555
`key` UInt64,
5656
`str` String,
57-
INDEX inv_idx(str) TYPE gin(0) GRANULARITY 1
57+
INDEX inv_idx(str) TYPE gin(tokenizer = 'default|ngram|noop' [, ngram_size = N] [, max_rows_per_postings_list = M]) GRANULARITY 1
5858
)
5959
ENGINE = MergeTree
6060
ORDER BY key
6161
```
6262

63-
where `N` specifies the tokenizer:
63+
where `tokenizer` specifies the tokenizer:
6464

65-
- `gin(0)` (or shorter: `gin()`) set the tokenizer to "tokens", i.e. split strings along spaces,
66-
- `gin(N)` with `N` between 2 and 8 sets the tokenizer to "ngrams(N)"
65+
- `default` sets the tokenizer to "tokens('default')", i.e. splits strings along non-alphanumeric characters.
66+
- `ngram` sets the tokenizer to "tokens('ngram')", i.e. splits strings into terms of equal size.
67+
- `noop` sets the tokenizer to "tokens('noop')", i.e. every value itself is a term.
6768

68-
The maximum rows per postings list can be specified as the second parameter. This parameter can be used to control postings list sizes to avoid generating huge postings list files. The following variants exist:
69+
The ngram size can be specified via the `ngram_size` parameter. This is an optional parameter. The following variants exist:
6970

70-
- `gin(ngrams, max_rows_per_postings_list)`: Use given max_rows_per_postings_list (assuming it is not 0)
71-
- `gin(ngrams, 0)`: No limitation of maximum rows per postings list
72-
- `gin(ngrams)`: Use a default maximum rows which is 64K.
71+
- `ngram_size = N`: with `N` between 2 and 8 sets the tokenizer to "tokens('ngram', N)".
72+
- If not specified: The default ngram size of 3 is used.
73+
74+
The maximum number of rows per postings list can be specified via the optional `max_rows_per_postings_list` parameter. This parameter can be used to control postings list sizes to avoid generating huge postings list files. The following variants exist:
75+
76+
- `max_rows_per_postings_list = 0`: No limitation of maximum rows per postings list.
77+
- `max_rows_per_postings_list = M`: `M` must be at least 8192.
78+
- If not specified: The default maximum of 64K rows is used.
7379

7480
Being a type of skipping index, full-text indexes can be dropped or added to a column after table creation:
7581

7682
```sql
7783
ALTER TABLE tab DROP INDEX inv_idx;
78-
ALTER TABLE tab ADD INDEX inv_idx(s) TYPE gin(2);
84+
ALTER TABLE tab ADD INDEX inv_idx(str) TYPE gin(tokenizer = 'default');
7985
```
8086

8187
To use the index, no special functions or syntax are required. Typical string search predicates automatically leverage the index. As
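The parameter rules described in this documentation hunk can be summarized as a small standalone check. The following is only a sketch under stated assumptions: `GinOptionsSketch`, `validateGinOptionsSketch`, and the `*_SKETCH` constants are hypothetical names, not ClickHouse code. The numeric values (default ngram size 3, default limit 64K, minimum 8192, 0 meaning unlimited) are taken from the documentation text and `GinFilter.h` in this commit.

```cpp
#include <cstdint>
#include <optional>
#include <string>

// Hypothetical stand-in for the parsed index arguments; not a ClickHouse type.
struct GinOptionsSketch
{
    std::string tokenizer;                                    // required
    std::optional<std::uint64_t> ngram_size;                  // only read for 'ngram'
    std::optional<std::uint64_t> max_rows_per_postings_list;  // optional
};

// Values as stated in the documentation text and GinFilter.h of this commit.
constexpr std::uint64_t DEFAULT_NGRAM_SIZE_SKETCH = 3;
constexpr std::uint64_t DEFAULT_MAX_ROWS_SKETCH = 64 * 1024;
constexpr std::uint64_t MIN_ROWS_SKETCH = 8192;
constexpr std::uint64_t UNLIMITED_ROWS_SKETCH = 0;

// Returns an error message, or std::nullopt when the options are acceptable.
std::optional<std::string> validateGinOptionsSketch(const GinOptionsSketch & options)
{
    const bool is_ngram = options.tokenizer == "ngram";
    if (options.tokenizer != "default" && !is_ngram && options.tokenizer != "noop")
        return "tokenizer must be 'default', 'ngram' or 'noop'";

    // ngram_size is only checked when the ngram tokenizer is selected.
    if (is_ngram)
    {
        const std::uint64_t ngram_size = options.ngram_size.value_or(DEFAULT_NGRAM_SIZE_SKETCH);
        if (ngram_size < 2 || ngram_size > 8)
            return "ngram_size must be between 2 and 8";
    }

    // 0 means "unlimited"; any other value must reach the minimum.
    const std::uint64_t max_rows = options.max_rows_per_postings_list.value_or(DEFAULT_MAX_ROWS_SKETCH);
    if (max_rows != UNLIMITED_ROWS_SKETCH && max_rows < MIN_ROWS_SKETCH)
        return "max_rows_per_postings_list must be 0 or at least 8192";

    return std::nullopt;
}
```

As in the validator added by this commit, the sketch ignores `ngram_size` for the `default` and `noop` tokenizers rather than rejecting it.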

src/Interpreters/GinFilter.cpp

Lines changed: 2 additions & 10 deletions
@@ -16,20 +16,12 @@
1616
namespace DB
1717
{
1818

19-
namespace ErrorCodes
20-
{
21-
extern const int BAD_ARGUMENTS;
22-
}
23-
24-
GinFilterParameters::GinFilterParameters(size_t ngrams_, UInt64 max_rows_per_postings_list_)
25-
: ngrams(ngrams_)
19+
GinFilterParameters::GinFilterParameters(String tokenizer_, UInt64 max_rows_per_postings_list_)
20+
: tokenizer(std::move(tokenizer_))
2621
, max_rows_per_postings_list(max_rows_per_postings_list_)
2722
{
2823
if (max_rows_per_postings_list == UNLIMITED_ROWS_PER_POSTINGS_LIST)
2924
max_rows_per_postings_list = std::numeric_limits<UInt64>::max();
30-
31-
if (ngrams > 8)
32-
throw Exception(ErrorCodes::BAD_ARGUMENTS, "The size of full-text index filter cannot be greater than 8");
3325
}
3426

3527
GinFilter::GinFilter(const GinFilterParameters & params_)

src/Interpreters/GinFilter.h

Lines changed: 2 additions & 2 deletions
@@ -15,9 +15,9 @@ static inline constexpr UInt64 DEFAULT_MAX_ROWS_PER_POSTINGS_LIST = 64 * 1024;
1515

1616
struct GinFilterParameters
1717
{
18-
GinFilterParameters(size_t ngrams_, UInt64 max_rows_per_postings_list_);
18+
GinFilterParameters(String tokenizer_, UInt64 max_rows_per_postings_list_);
1919

20-
size_t ngrams;
20+
String tokenizer;
2121
UInt64 max_rows_per_postings_list;
2222
};
2323

src/Storages/IndicesDescription.cpp

Lines changed: 43 additions & 0 deletions
@@ -28,6 +28,45 @@ namespace ErrorCodes
2828
namespace
2929
{
3030
using ReplaceAliasToExprVisitor = InDepthNodeVisitor<ReplaceAliasByExpressionMatcher, true>;
31+
32+
33+
Tuple parseGinIndexArgumentFromAST(const ASTPtr & arguments)
34+
{
35+
const auto & identifier = arguments->children[0]->template as<ASTIdentifier>();
36+
if (identifier == nullptr)
37+
throw Exception(ErrorCodes::INCORRECT_QUERY, "Expected identifier");
38+
39+
const auto & literal = arguments->children[1]->template as<ASTLiteral>();
40+
if (literal == nullptr)
41+
throw Exception(ErrorCodes::INCORRECT_QUERY, "Expected literal");
42+
43+
Tuple key_value_pair{};
44+
key_value_pair.emplace_back(identifier->name());
45+
key_value_pair.emplace_back(literal->value);
46+
return key_value_pair;
47+
}
48+
49+
bool parseGinIndexArgumentsFromAST(const ASTPtr & arguments, FieldVector & parsed_arguments)
50+
{
51+
parsed_arguments.reserve(arguments->children.size());
52+
53+
for (const auto & argument : arguments->children)
54+
{
55+
if (const auto * ast_function = argument->template as<ASTFunction>();
56+
ast_function && ast_function->name == "equals" && ast_function->arguments->children.size() == 2)
57+
{
58+
parsed_arguments.emplace_back(parseGinIndexArgumentFromAST(ast_function->arguments));
59+
}
60+
else
61+
{
62+
if (!parsed_arguments.empty())
63+
throw Exception(ErrorCodes::INCORRECT_QUERY, "Cannot mix key-value pair and single argument as GIN index arguments");
64+
return false;
65+
}
66+
}
67+
68+
return true;
69+
}
3170
}
3271

3372
IndexDescription::IndexDescription(const IndexDescription & other)
@@ -129,6 +168,10 @@ IndexDescription IndexDescription::getIndexFromAST(const ASTPtr & definition_ast
129168

130169
if (index_type && index_type->arguments)
131170
{
171+
bool is_gin_index = index_type->name == "gin" || index_type->name == "inverted" || index_type->name == "full_text";
172+
if (is_gin_index && parseGinIndexArgumentsFromAST(index_type->arguments, result.arguments))
173+
return result;
174+
132175
for (size_t i = 0; i < index_type->arguments->children.size(); ++i)
133176
{
134177
const auto & child = index_type->arguments->children[i];
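The control flow of `parseGinIndexArgumentsFromAST` can be illustrated without the ClickHouse AST types. This is a simplified sketch, not the actual implementation: `KeyValueSketch`, `PositionalSketch`, `ArgumentSketch`, and `parseKeyValueArgumentsSketch` are hypothetical names, and values are modeled as plain strings. It mirrors the behavior visible in the hunk: a leading positional argument makes the function return `false` so the caller falls back to the generic argument loop, while a positional argument after at least one key-value pair is an error.

```cpp
#include <stdexcept>
#include <string>
#include <utility>
#include <variant>
#include <vector>

// Hypothetical models of the two argument shapes: `key = value` and a bare literal.
using KeyValueSketch = std::pair<std::string, std::string>;
using PositionalSketch = std::string;
using ArgumentSketch = std::variant<KeyValueSketch, PositionalSketch>;

// Returns true when every argument is a key-value pair (all collected in `parsed`).
// Returns false when the first argument is positional, so the caller can use
// the legacy positional path. Throws when the two styles are mixed.
bool parseKeyValueArgumentsSketch(const std::vector<ArgumentSketch> & arguments, std::vector<KeyValueSketch> & parsed)
{
    parsed.reserve(arguments.size());

    for (const auto & argument : arguments)
    {
        if (const auto * pair = std::get_if<KeyValueSketch>(&argument))
        {
            parsed.push_back(*pair);
        }
        else
        {
            if (!parsed.empty())
                throw std::invalid_argument("Cannot mix key-value pair and single argument");
            return false;
        }
    }

    return true;
}
```

Note that, as in the hunk, an empty argument list yields `true` with nothing parsed; rejecting a missing `tokenizer` is left to the validator.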

src/Storages/MergeTree/MergeTreeIndexGin.cpp

Lines changed: 94 additions & 24 deletions
@@ -16,10 +16,13 @@
1616
#include <Interpreters/misc.h>
1717
#include <Storages/MergeTree/MergeTreeData.h>
1818
#include <Storages/MergeTree/RPNBuilder.h>
19+
#include <boost/algorithm/string/predicate.hpp>
1920
#include <Common/OptimizedRegularExpression.h>
21+
#include <Core/Field.h>
22+
#include <Interpreters/ITokenExtractor.h>
23+
#include <base/types.h>
2024
#include <algorithm>
2125

22-
2326
namespace DB
2427
{
2528
namespace ErrorCodes
@@ -30,6 +33,10 @@ namespace ErrorCodes
3033
extern const int LOGICAL_ERROR;
3134
}
3235

36+
static const String ARGUMENT_TOKENIZER = "tokenizer";
37+
static const String ARGUMENT_NGRAM_SIZE = "ngram_size";
38+
static const String ARGUMENT_MAX_ROWS = "max_rows_per_postings_list";
39+
3340
MergeTreeIndexGranuleGin::MergeTreeIndexGranuleGin(
3441
const String & index_name_,
3542
size_t columns_number,
@@ -758,44 +765,107 @@ MergeTreeIndexConditionPtr MergeTreeIndexGin::createIndexCondition(const Actions
758765
return std::make_shared<MergeTreeIndexConditionGin>(predicate, context, index.sample_block, gin_filter_params, token_extractor.get());
759766
}
760767

761-
MergeTreeIndexPtr ginIndexCreator(const IndexDescription & index)
768+
namespace
762769
{
763-
size_t ngram_length = index.arguments.empty() ? 0 : index.arguments[0].safeGet<size_t>();
764-
UInt64 max_rows = index.arguments.size() < 2 ? DEFAULT_MAX_ROWS_PER_POSTINGS_LIST : index.arguments[1].safeGet<UInt64>();
765-
GinFilterParameters gin_filter_params(ngram_length, max_rows);
766770

767-
if (ngram_length == 0)
771+
std::unordered_map<String, Field> convertArgumentsToOptionsMap(const FieldVector & arguments)
772+
{
773+
std::unordered_map<String, Field> options;
774+
for (const Field & argument : arguments)
768775
{
769-
auto tokenizer = std::make_unique<SplitTokenExtractor>();
770-
return std::make_shared<MergeTreeIndexGin>(index, gin_filter_params, std::move(tokenizer));
776+
if (argument.getType() != Field::Types::Tuple)
777+
throw Exception(ErrorCodes::INCORRECT_QUERY, "Arguments of GIN index must be key-value pair (identifier = literal)");
778+
Tuple tuple = argument.template safeGet<Tuple>();
779+
String key = tuple[0].safeGet<String>();
780+
if (options.contains(key))
781+
throw Exception(ErrorCodes::INCORRECT_QUERY, "GIN index '{}' argument is specified more than once", key);
782+
options[key] = tuple[1];
771783
}
772-
else
784+
return options;
785+
}
786+
787+
template <typename Type>
788+
std::optional<Type> getOption(const std::unordered_map<String, Field> & options, const String & option)
789+
{
790+
if (auto it = options.find(option); it != options.end())
773791
{
774-
auto tokenizer = std::make_unique<NgramTokenExtractor>(ngram_length);
775-
return std::make_shared<MergeTreeIndexGin>(index, gin_filter_params, std::move(tokenizer));
792+
const Field & value = it->second;
793+
Field::Types::Which expected_type = Field::TypeToEnum<Type>::value;
794+
if (value.getType() != expected_type)
795+
throw Exception(
796+
ErrorCodes::INCORRECT_QUERY,
797+
"GIN index argument '{}' expected to be {}, but got {}",
798+
option,
799+
fieldTypeToString(expected_type),
800+
value.getTypeName());
801+
return value.safeGet<Type>();
776802
}
803+
return std::nullopt;
804+
}
805+
806+
}
807+
808+
MergeTreeIndexPtr ginIndexCreator(const IndexDescription & index)
809+
{
810+
std::unordered_map<String, Field> options = convertArgumentsToOptionsMap(index.arguments);
811+
812+
String tokenizer = getOption<String>(options, ARGUMENT_TOKENIZER).value();
813+
814+
std::unique_ptr<ITokenExtractor> token_extractor;
815+
if (tokenizer == SplitTokenExtractor::getExternalName())
816+
token_extractor = std::make_unique<SplitTokenExtractor>();
817+
else if (tokenizer == NoOpTokenExtractor::getExternalName())
818+
token_extractor = std::make_unique<NoOpTokenExtractor>();
819+
else if (tokenizer == NgramTokenExtractor::getExternalName())
820+
{
821+
UInt64 ngram_size = getOption<UInt64>(options, ARGUMENT_NGRAM_SIZE).value_or(3);
822+
token_extractor = std::make_unique<NgramTokenExtractor>(ngram_size);
823+
}
824+
else
825+
throw Exception(ErrorCodes::LOGICAL_ERROR, "Tokenizer {} not supported", tokenizer);
826+
827+
UInt64 max_rows_per_postings_list = getOption<UInt64>(options, ARGUMENT_MAX_ROWS).value_or(DEFAULT_MAX_ROWS_PER_POSTINGS_LIST);
828+
829+
GinFilterParameters params(tokenizer, max_rows_per_postings_list);
830+
return std::make_shared<MergeTreeIndexGin>(index, params, std::move(token_extractor));
777831
}
778832

779833
void ginIndexValidator(const IndexDescription & index, bool /*attach*/)
780834
{
781-
/// Check number and type of arguments
782-
if (index.arguments.size() > 2)
783-
throw Exception(ErrorCodes::INCORRECT_QUERY, "GIN index must have less than two arguments");
835+
std::unordered_map<String, Field> options = convertArgumentsToOptionsMap(index.arguments);
784836

785-
if (!index.arguments.empty() && index.arguments[0].getType() != Field::Types::UInt64)
786-
throw Exception(ErrorCodes::INCORRECT_QUERY, "First argument of GIN index (tokenizer) must be of type UInt64");
837+
/// Check that tokenizer is present and supported
838+
std::optional<String> tokenizer = getOption<String>(options, ARGUMENT_TOKENIZER);
839+
if (!tokenizer)
840+
throw Exception(ErrorCodes::INCORRECT_QUERY, "GIN index must have an '{}' argument", ARGUMENT_TOKENIZER);
787841

788-
if (index.arguments.size() == 2)
842+
const bool is_supported_tokenizer = (tokenizer.value() == SplitTokenExtractor::getExternalName()
843+
|| tokenizer.value() == NoOpTokenExtractor::getExternalName()
844+
|| tokenizer.value() == NgramTokenExtractor::getExternalName());
845+
if (!is_supported_tokenizer)
846+
throw Exception(
847+
ErrorCodes::INCORRECT_QUERY,
848+
"GIN index '{}' argument supports only 'default', 'ngram', and 'noop', but got {}",
849+
ARGUMENT_TOKENIZER,
850+
tokenizer.value());
851+
852+
if (tokenizer.value() == NgramTokenExtractor::getExternalName())
789853
{
790-
if (index.arguments[1].getType() != Field::Types::UInt64)
791-
throw Exception(ErrorCodes::INCORRECT_QUERY, "Second argument of GIN index (max_rows_per_postings_list) must be of type UInt64");
792-
if (index.arguments[1].safeGet<UInt64>() != UNLIMITED_ROWS_PER_POSTINGS_LIST && index.arguments[1].safeGet<UInt64>() < MIN_ROWS_PER_POSTINGS_LIST)
793-
throw Exception(ErrorCodes::INCORRECT_QUERY, "Second argument of GIN index (max_rows_per_postings_list) must not be less than {}", MIN_ROWS_PER_POSTINGS_LIST);
854+
UInt64 ngram_size = getOption<UInt64>(options, ARGUMENT_NGRAM_SIZE).value_or(3);
855+
if (ngram_size < 2 || ngram_size > 8)
856+
throw Exception(
857+
ErrorCodes::INCORRECT_QUERY,
858+
"GIN index '{}' argument must be between 2 and 8, but got {}", ARGUMENT_NGRAM_SIZE, ngram_size);
794859
}
795860

796-
size_t ngram_length = index.arguments.empty() ? 0 : index.arguments[0].safeGet<size_t>();
797-
UInt64 max_rows_per_postings_list = index.arguments.size() < 2 ? DEFAULT_MAX_ROWS_PER_POSTINGS_LIST : index.arguments[1].safeGet<UInt64>();
798-
GinFilterParameters gin_filter_params(ngram_length, max_rows_per_postings_list); /// Just validate
861+
/// Check that max_rows_per_postings_list is valid (if present)
862+
UInt64 max_rows_per_postings_list = getOption<UInt64>(options, ARGUMENT_MAX_ROWS).value_or(DEFAULT_MAX_ROWS_PER_POSTINGS_LIST);
863+
if (max_rows_per_postings_list != UNLIMITED_ROWS_PER_POSTINGS_LIST && max_rows_per_postings_list < MIN_ROWS_PER_POSTINGS_LIST)
864+
throw Exception(
865+
ErrorCodes::INCORRECT_QUERY,
866+
"GIN index '{}' should not be less than {}", ARGUMENT_MAX_ROWS, MIN_ROWS_PER_POSTINGS_LIST);
867+
868+
GinFilterParameters gin_filter_params(tokenizer.value(), max_rows_per_postings_list); /// Just validate
799869

800870
/// Check that the index is created on a single column
801871
if (index.column_names.size() != 1 || index.data_types.size() != 1)
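The pattern behind `convertArgumentsToOptionsMap` and `getOption<Type>` (build a map of named options, reject duplicates, then read each option with a type check) can be sketched in standard C++. This is an illustration under assumptions, not the ClickHouse code: `ValueSketch` uses `std::variant` in place of `Field`, and `toOptionsMapSketch` and `getOptionSketch` are hypothetical names.

```cpp
#include <cstdint>
#include <optional>
#include <stdexcept>
#include <string>
#include <unordered_map>
#include <utility>
#include <variant>
#include <vector>

// Hypothetical replacement for Field: an option value is a string or an unsigned integer.
using ValueSketch = std::variant<std::string, std::uint64_t>;
using OptionsSketch = std::unordered_map<std::string, ValueSketch>;

// Builds the options map and rejects a key that appears more than once.
OptionsSketch toOptionsMapSketch(const std::vector<std::pair<std::string, ValueSketch>> & arguments)
{
    OptionsSketch options;
    for (const auto & [key, value] : arguments)
    {
        if (!options.emplace(key, value).second)
            throw std::invalid_argument("argument '" + key + "' is specified more than once");
    }
    return options;
}

// Returns std::nullopt when the option is absent, the value when it is present
// with the requested type, and throws when it is present with another type.
template <typename T>
std::optional<T> getOptionSketch(const OptionsSketch & options, const std::string & name)
{
    const auto it = options.find(name);
    if (it == options.end())
        return std::nullopt;

    const T * value = std::get_if<T>(&it->second);
    if (value == nullptr)
        throw std::invalid_argument("argument '" + name + "' has an unexpected type");
    return *value;
}
```

Returning `std::optional` lets the caller pick the default at the call site, for example `getOptionSketch<std::uint64_t>(options, "ngram_size").value_or(3)`, which matches how the commit applies the defaults for `ngram_size` and `max_rows_per_postings_list`.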

tests/queries/0_stateless/02346_gin_index_bug47393.sql

Lines changed: 1 addition & 1 deletion
@@ -7,7 +7,7 @@ CREATE TABLE tab
77
(
88
id UInt64,
99
str String,
10-
INDEX idx str TYPE gin(3) GRANULARITY 1
10+
INDEX idx str TYPE gin(tokenizer = 'ngram', ngram_size = 3) GRANULARITY 1
1111
)
1212
ENGINE = MergeTree
1313
ORDER BY tuple()

tests/queries/0_stateless/02346_gin_index_bug52019.sql

Lines changed: 1 addition & 1 deletion
@@ -8,7 +8,7 @@ DROP TABLE IF EXISTS tab;
88
CREATE TABLE tab (
99
id UInt64,
1010
str Map(String, String),
11-
INDEX idx mapKeys(str) TYPE gin(2) GRANULARITY 1)
11+
INDEX idx mapKeys(str) TYPE gin(tokenizer = 'ngram', ngram_size = 2) GRANULARITY 1)
1212
ENGINE = MergeTree
1313
ORDER BY id
1414
SETTINGS index_granularity = 2, index_granularity_bytes = '10Mi',

tests/queries/0_stateless/02346_gin_index_bug54541.sql

Lines changed: 1 addition & 1 deletion
@@ -8,7 +8,7 @@ CREATE TABLE tab
88
(
99
id UInt32,
1010
str String,
11-
INDEX idx str TYPE gin
11+
INDEX idx str TYPE gin(tokenizer = 'default')
1212
)
1313
ENGINE = MergeTree
1414
ORDER BY id

tests/queries/0_stateless/02346_gin_index_bug59039.sql

Lines changed: 1 addition & 1 deletion
@@ -9,7 +9,7 @@ CREATE TABLE tab
99
(
1010
id UInt64,
1111
doc String,
12-
INDEX text_idx doc TYPE gin
12+
INDEX text_idx doc TYPE gin(tokenizer = 'default')
1313
)
1414
ENGINE = MergeTree
1515
ORDER BY id
Lines changed: 13 additions & 0 deletions
@@ -0,0 +1,13 @@
1+
Must not have no arguments.
2+
Test single tokenizer argument.
3+
-- tokenizer must be default, ngram or noop.
4+
Test ngram_size argument.
5+
-- ngram size must be between 2 and 8.
6+
Test max_rows_per_postings_list argument.
7+
-- max_rows_per_posting_list is set to unlimited rows.
8+
-- max_rows_per_posting_list should be at least 8192.
9+
Parameters are shuffled.
10+
Types are incorrect.
11+
Same argument appears >1 times.
12+
Must be created on single column.
13+
Must be created on String or FixedString or Array(String) or Array(FixedString) or LowCardinality(String) or LowCardinality(FixedString) columns.
