docs/en/engines/table-engines/mergetree-family/invertedindexes.md
+15 -9 (15 additions & 9 deletions)
@@ -54,28 +54,34 @@ CREATE TABLE tab
 (
     `key` UInt64,
     `str` String,
-    INDEX inv_idx(str) TYPE gin(0) GRANULARITY 1
+    INDEX inv_idx(str) TYPE gin(tokenizer = 'default|ngram|noop' [, ngram_size = N] [, max_rows_per_postings_list = M]) GRANULARITY 1
 )
 ENGINE = MergeTree
 ORDER BY key
 ```

-where `N` specifies the tokenizer:
+where `tokenizer` specifies the tokenizer:

-- `gin(0)` (or shorter: `gin()`) set the tokenizer to "tokens", i.e. split strings along spaces,
-- `gin(N)` with `N` between 2 and 8 sets the tokenizer to "ngrams(N)"
+- `default` sets the tokenizer to "tokens('default')", i.e. splits strings along non-alphanumeric characters.
+- `ngram` sets the tokenizer to "tokens('ngram')", i.e. splits strings into equal-size terms.
+- `noop` sets the tokenizer to "tokens('noop')", i.e. every value itself is a term.

-The maximum rows per postings list can be specified as the second parameter. This parameter can be used to control postings list sizes to avoid generating huge postings list files. The following variants exist:
+The ngram size can be specified via the optional `ngram_size` parameter. The following variants exist:

-- `gin(ngrams, max_rows_per_postings_list)`: Use given max_rows_per_postings_list (assuming it is not 0)
-- `gin(ngrams, 0)`: No limitation of maximum rows per postings list
-- `gin(ngrams)`: Use a default maximum rows which is 64K.
+- `ngram_size = N`: with `N` between 2 and 8, sets the tokenizer to "tokens('ngram', N)".
+- If not specified: uses the default ngram size of 3.
+
+The maximum rows per postings list can be specified via the optional `max_rows_per_postings_list` parameter. This parameter can be used to control postings list sizes to avoid generating huge postings list files. The following variants exist:
+
+- `max_rows_per_postings_list = 0`: No limit on the maximum rows per postings list.
+- `max_rows_per_postings_list = M`: `M` must be at least 8192.
+- If not specified: uses a default maximum of 64K rows.

 Being a type of skipping index, full-text indexes can be dropped or added to a column after table creation:

 ```sql
 ALTER TABLE tab DROP INDEX inv_idx;
-ALTER TABLE tab ADD INDEX inv_idx(s) TYPE gin(2);
+ALTER TABLE tab ADD INDEX inv_idx(s) TYPE gin(tokenizer = 'default');
 ```

 To use the index, no special functions or syntax are required. Typical string search predicates automatically leverage the index. As
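To make the three tokenizer options from the documentation change above concrete, here is a minimal standalone C++ sketch. It is not ClickHouse's implementation: the function names `tokenizeDefault`, `tokenizeNgram`, and `tokenizeNoop` are hypothetical, the input is treated as ASCII bytes, and the sketch assumes that `ngram` yields overlapping fixed-size substrings, which the documentation text ("equal size terms") does not state explicitly.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: split on non-alphanumeric ASCII characters,
// following the documented description of the 'default' tokenizer.
std::vector<std::string> tokenizeDefault(const std::string & s)
{
    std::vector<std::string> terms;
    std::string current;
    for (unsigned char c : s)
    {
        if (std::isalnum(c))
            current.push_back(static_cast<char>(c));
        else if (!current.empty())
        {
            terms.push_back(current);
            current.clear();
        }
    }
    if (!current.empty())
        terms.push_back(current);
    return terms;
}

// Hypothetical helper: terms of fixed length n.
// Assumption: overlapping (sliding-window) n-grams over bytes.
std::vector<std::string> tokenizeNgram(const std::string & s, std::size_t n)
{
    std::vector<std::string> terms;
    if (n == 0 || s.size() < n)
        return terms;
    for (std::size_t i = 0; i + n <= s.size(); ++i)
        terms.push_back(s.substr(i, n));
    return terms;
}

// Hypothetical helper: the whole value is a single term ('noop').
std::vector<std::string> tokenizeNoop(const std::string & s)
{
    return {s};
}
```

Under these assumptions, `default` suits word-level search, `ngram` suits substring search, and `noop` suits exact matching on whole values.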
"GIN index '{}' argument supports only 'default', 'ngram', and 'noop', but got {}",
+            ARGUMENT_TOKENIZER,
+            tokenizer.value());
+
+    if (tokenizer.value() == NgramTokenExtractor::getExternalName())
     {
-        if (index.arguments[1].getType() != Field::Types::UInt64)
-            throw Exception(ErrorCodes::INCORRECT_QUERY, "Second argument of GIN index (max_rows_per_postings_list) must be of type UInt64");
-        if (index.arguments[1].safeGet<UInt64>() != UNLIMITED_ROWS_PER_POSTINGS_LIST && index.arguments[1].safeGet<UInt64>() < MIN_ROWS_PER_POSTINGS_LIST)
-            throw Exception(ErrorCodes::INCORRECT_QUERY, "Second argument of GIN index (max_rows_per_postings_list) must not be less than {}", MIN_ROWS_PER_POSTINGS_LIST);
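
The argument rules described in this change can be sketched as a small standalone validator: the tokenizer must be `default`, `ngram`, or `noop`; `ngram_size` must lie between 2 and 8 and defaults to 3; `max_rows_per_postings_list` must be 0 or at least 8192 and defaults to 64K. This is an illustration only and does not reproduce the ClickHouse code. `GinParams`, `validateGinParams`, and the constant names are hypothetical, 64K is assumed to mean 65536, `std::invalid_argument` stands in for ClickHouse's exception type, and checking `ngram_size` only for the `ngram` tokenizer is an assumption.

```cpp
#include <cstdint>
#include <optional>
#include <stdexcept>
#include <string>

// Hypothetical constants; the values come from the documentation text.
constexpr std::uint64_t kDefaultNgramSize = 3;
constexpr std::uint64_t kMinNgramSize = 2;
constexpr std::uint64_t kMaxNgramSize = 8;
constexpr std::uint64_t kUnlimitedRows = 0;
constexpr std::uint64_t kMinRows = 8192;
constexpr std::uint64_t kDefaultRows = 64 * 1024; // assumes "64K" means 65536

struct GinParams
{
    std::string tokenizer;
    std::uint64_t ngram_size = kDefaultNgramSize;
    std::uint64_t max_rows_per_postings_list = kDefaultRows;
};

// Validates named index arguments and fills in defaults for omitted ones.
GinParams validateGinParams(
    const std::string & tokenizer,
    std::optional<std::uint64_t> ngram_size = std::nullopt,
    std::optional<std::uint64_t> max_rows = std::nullopt)
{
    if (tokenizer != "default" && tokenizer != "ngram" && tokenizer != "noop")
        throw std::invalid_argument("tokenizer must be 'default', 'ngram', or 'noop'");

    GinParams params;
    params.tokenizer = tokenizer;

    // Assumption: ngram_size is only checked for the 'ngram' tokenizer.
    if (tokenizer == "ngram" && ngram_size)
    {
        if (*ngram_size < kMinNgramSize || *ngram_size > kMaxNgramSize)
            throw std::invalid_argument("ngram_size must be between 2 and 8");
        params.ngram_size = *ngram_size;
    }

    if (max_rows)
    {
        // 0 means "no limit"; any other value must reach the minimum.
        if (*max_rows != kUnlimitedRows && *max_rows < kMinRows)
            throw std::invalid_argument("max_rows_per_postings_list must be 0 or at least 8192");
        params.max_rows_per_postings_list = *max_rows;
    }
    return params;
}
```

Using `std::optional` for the two numeric arguments keeps "not specified" distinct from an explicit `0`, which matters here because `0` carries its own meaning (no limit).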