
Commit cfe05ec

Merge pull request #31121 from ishansrivastava90/token-maxlength-note
[Azure Search] making token length note more visible
2 parents 0828d62 + 01c4ae1 commit cfe05ec

File tree

1 file changed (+4, -4 lines changed)


articles/search/index-add-custom-analyzers.md

Lines changed: 4 additions & 4 deletions
@@ -221,6 +221,9 @@ For analyzers, index attributes vary depending on the whether you're using prede
 |Tokenizer|Required. Set to either one of predefined tokenizers listed in the [Tokenizers](#Tokenizers) table below or a custom tokenizer specified in the index definition.|
 |TokenFilters|Set to either one of predefined token filters listed in the [Token filters](#TokenFilters) table or a custom token filter specified in the index definition.|
 
+> [!NOTE]
+> It's required that you configure your custom analyzer to not produce tokens longer than 300 characters. Indexing fails for documents with such tokens. To trim them or ignore them, use the **TruncateTokenFilter** and the **LengthTokenFilter** respectively. Check [**Token filters**](#TokenFilters) for reference.
+
 <a name="CharFilter"></a>
 
 ### Char Filters
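To make the relocated note concrete, here is a minimal sketch (not part of the commit) of an index definition fragment that wires a custom analyzer to a **TruncateTokenFilter** so that no token exceeds 300 characters. The field, analyzer, and filter names (`description`, `my_analyzer`, `my_truncate`) are illustrative; the `@odata.type` values and the `length` option follow the token filters table in the underlying article.

```json
{
  "fields": [
    { "name": "id", "type": "Edm.String", "key": true },
    { "name": "description", "type": "Edm.String", "analyzer": "my_analyzer" }
  ],
  "analyzers": [
    {
      "name": "my_analyzer",
      "@odata.type": "#Microsoft.Azure.Search.CustomAnalyzer",
      "tokenizer": "standard_v2",
      "tokenFilters": [ "lowercase", "my_truncate" ]
    }
  ],
  "tokenFilters": [
    {
      "name": "my_truncate",
      "@odata.type": "#Microsoft.Azure.Search.TruncateTokenFilter",
      "length": 300
    }
  ]
}
```

Swapping `my_truncate` for a **LengthTokenFilter** (with `min`/`max` options) would drop over-long tokens instead of trimming them, which is the other remedy the note mentions.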
@@ -313,7 +316,7 @@ In the table below, the tokenizers that are implemented using Apache Lucene are
 | microsoft_language_stemming_tokenizer | MicrosoftLanguageStemmingTokenizer| Divides text using language-specific rules and reduces words to their base forms<br /><br /> **Options**<br /><br />maxTokenLength (type: int) - The maximum token length, default: 255, maximum: 300. Tokens longer than the maximum length are split. Tokens longer than 300 characters are first split into tokens of length 300 and then each of those tokens is split based on the maxTokenLength set.<br /><br /> isSearchTokenizer (type: bool) - Set to true if used as the search tokenizer, set to false if used as the indexing tokenizer.<br /><br /> language (type: string) - Language to use, default "english". Allowed values include:<br />"arabic", "bangla", "bulgarian", "catalan", "croatian", "czech", "danish", "dutch", "english", "estonian", "finnish", "french", "german", "greek", "gujarati", "hebrew", "hindi", "hungarian", "icelandic", "indonesian", "italian", "kannada", "latvian", "lithuanian", "malay", "malayalam", "marathi", "norwegianBokmaal", "polish", "portuguese", "portugueseBrazilian", "punjabi", "romanian", "russian", "serbianCyrillic", "serbianLatin", "slovak", "slovenian", "spanish", "swedish", "tamil", "telugu", "turkish", "ukrainian", "urdu" |
 |[nGram](https://lucene.apache.org/core/4_10_3/analyzers-common/org/apache/lucene/analysis/ngram/NGramTokenizer.html)|NGramTokenizer|Tokenizes the input into n-grams of the given size(s).<br /><br /> **Options**<br /><br /> minGram (type: int) - Default: 1, maximum: 300.<br /><br /> maxGram (type: int) - Default: 2, maximum: 300. Must be greater than minGram. <br /><br /> tokenChars (type: string array) - Character classes to keep in the tokens. Allowed values: "letter", "digit", "whitespace", "punctuation", "symbol". Defaults to an empty array - keeps all characters. |
 |[path_hierarchy_v2](https://lucene.apache.org/core/4_10_3/analyzers-common/org/apache/lucene/analysis/path/PathHierarchyTokenizer.html)|PathHierarchyTokenizerV2|Tokenizer for path-like hierarchies.<br /><br /> **Options**<br /><br /> delimiter (type: string) - Default: '/.<br /><br /> replacement (type: string) - If set, replaces the delimiter character. Default same as the value of delimiter.<br /><br /> maxTokenLength (type: int) - The maximum token length. Default: 300, maximum: 300. Paths longer than maxTokenLength are ignored.<br /><br /> reverse (type: bool) - If true, generates token in reverse order. Default: false.<br /><br /> skip (type: bool) - Initial tokens to skip. The default is 0.|
-|[pattern](https://lucene.apache.org/core/4_10_3/analyzers-common/org/apache/lucene/analysis/pattern/PatternTokenizer.html)|PatternTokenizer|This tokenizer uses regex pattern matching to construct distinct tokens.<br /><br /> **Options**<br /><br /> pattern (type: string) - Regular expression pattern. The default is \w+.<br /><br /> [flags](https://docs.oracle.com/javase/6/docs/api/java/util/regex/Pattern.html#field_summary) (type: string) - Regular expression flags. The default is an empty string. Allowed values: CANON_EQ, CASE_INSENSITIVE, COMMENTS, DOTALL, LITERAL, MULTILINE, UNICODE_CASE, UNIX_LINES<br /><br /> group (type: int) - Which group to extract into tokens. The default is -1 (split).|
+|[pattern](https://lucene.apache.org/core/4_10_3/analyzers-common/org/apache/lucene/analysis/pattern/PatternTokenizer.html)|PatternTokenizer|This tokenizer uses regex pattern matching to construct distinct tokens.<br /><br /> **Options**<br /><br /> [pattern](https://docs.oracle.com/javase/6/docs/api/java/util/regex/Pattern.html) (type: string) - Regular expression pattern. The default is \W+. <br /><br /> [flags](https://docs.oracle.com/javase/6/docs/api/java/util/regex/Pattern.html#field_summary) (type: string) - Regular expression flags. The default is an empty string. Allowed values: CANON_EQ, CASE_INSENSITIVE, COMMENTS, DOTALL, LITERAL, MULTILINE, UNICODE_CASE, UNIX_LINES<br /><br /> group (type: int) - Which group to extract into tokens. The default is -1 (split).|
 |[standard_v2](https://lucene.apache.org/core/4_10_3/analyzers-common/org/apache/lucene/analysis/standard/StandardTokenizer.html)|StandardTokenizerV2|Breaks text following the [Unicode Text Segmentation rules](https://unicode.org/reports/tr29/).<br /><br /> **Options**<br /><br /> maxTokenLength (type: int) - The maximum token length. Default: 255, maximum: 300. Tokens longer than the maximum length are split.|
 |[uax_url_email](https://lucene.apache.org/core/4_10_3/analyzers-common/org/apache/lucene/analysis/standard/UAX29URLEmailTokenizer.html)|UaxUrlEmailTokenizer|Tokenizes urls and emails as one token.<br /><br /> **Options**<br /><br /> maxTokenLength (type: int) - The maximum token length. Default: 255, maximum: 300. Tokens longer than the maximum length are split.|
 |[whitespace](https://lucene.apache.org/core/4_10_3/analyzers-common/org/apache/lucene/analysis/core/WhitespaceTokenizer.html)|(type applies only when options are available) |Divides text at whitespace. Tokens that are longer than 255 characters are split.|
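The corrected pattern row above changes the documented default pattern from `\w+` to `\W+`; with the default split behavior (`group` of -1), the regex describes the separators rather than the tokens. As an illustrative, hedged sketch (the names `tags_analyzer` and `comma_tokenizer` are made up, not from the commit), a custom analyzer that overrides `pattern` to split a comma-delimited field might look like this:

```json
{
  "analyzers": [
    {
      "name": "tags_analyzer",
      "@odata.type": "#Microsoft.Azure.Search.CustomAnalyzer",
      "tokenizer": "comma_tokenizer",
      "tokenFilters": [ "lowercase" ]
    }
  ],
  "tokenizers": [
    {
      "name": "comma_tokenizer",
      "@odata.type": "#Microsoft.Azure.Search.PatternTokenizer",
      "pattern": ",",
      "group": -1
    }
  ]
}
```

Per the table, `group: -1` keeps the split behavior, so each comma-delimited value becomes its own token; a value of 0 or higher would instead emit only the matched group as tokens.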
@@ -372,9 +375,6 @@ In the table below, the token filters that are implemented using Apache Lucene a
 
 <sup>1</sup> Token Filter Types are always prefixed in code with "#Microsoft.Azure.Search" such that "ArabicNormalizationTokenFilter" would actually be specified as "#Microsoft.Azure.Search.ArabicNormalizationTokenFilter". We removed the prefix to reduce the width of the table, but please remember to include it in your code.
 
-> [!NOTE]
-> It's required that you configure your custom analyzer to not produce tokens longer than 300 characters. Indexing fails for documents with such tokens. To trim them or ignore them, use the **TruncateTokenFilter** and the **LengthTokenFilter** respectively.
-
 
 ## See also
 [Azure Search Service REST](https://docs.microsoft.com/rest/api/searchservice/)
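The surviving footnote is the operative rule when declaring filters: the short names in the table always carry the full "#Microsoft.Azure.Search." prefix in an actual index definition. A hedged sketch of the **LengthTokenFilter** alternative mentioned in the relocated note (the filter name `drop_long_tokens` and the values are assumptions for illustration; the `min`/`max` options come from the token filters table):

```json
{
  "tokenFilters": [
    {
      "name": "drop_long_tokens",
      "@odata.type": "#Microsoft.Azure.Search.LengthTokenFilter",
      "min": 0,
      "max": 300
    }
  ]
}
```

Unlike the truncate filter sketched earlier, this discards any token longer than `max` characters rather than trimming it.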
