diff --git a/docs/ai/text-search/custom-analyzer.md b/docs/ai/text-search/custom-analyzer.md
index f5fd87820fb7d..39523b41c7ec6 100644
--- a/docs/ai/text-search/custom-analyzer.md
+++ b/docs/ai/text-search/custom-analyzer.md
@@ -29,6 +29,11 @@ PROPERTIES (
 - Parameters
   - `char_filter_pattern`: characters to replace
   - `char_filter_replacement`: replacement characters (default: space)
+`icu_normalizer`: Preprocess text using ICU normalization.
+- Parameters
+  - `name`: Normalization form (default `nfkc_cf`). Options: `nfc`, `nfkc`, `nfkc_cf`, `nfd`, `nfkd`
+  - `mode`: Normalization mode (default `compose`). Options: `compose`, `decompose`
+  - `unicode_set_filter`: Specify the character set to normalize (e.g. `[a-z]`)
 
 #### 2. Creating a tokenizer
 
@@ -77,6 +82,9 @@ Available token filters:
 - **ascii_folding**: Converts non-ASCII characters to ASCII equivalents
 - **lowercase**: Converts tokens to lowercase
 - **pinyin**: Converts Chinese characters to pinyin after tokenization. For parameter details, refer to the **pinyin** tokenizer above.
+- **icu_normalizer**: Process tokens using ICU normalization.
+  - `name`: Normalization form (default `nfkc_cf`). Options: `nfc`, `nfkc`, `nfkc_cf`, `nfd`, `nfkd`
+  - `unicode_set_filter`: Specify the character set to normalize
 
 #### 4. Creating an analyzer
 
diff --git a/docs/ai/text-search/custom-normalizer.md b/docs/ai/text-search/custom-normalizer.md
new file mode 100644
index 0000000000000..985b1853c6479
--- /dev/null
+++ b/docs/ai/text-search/custom-normalizer.md
@@ -0,0 +1,107 @@
+---
+{
+    "title": "Custom Normalizer",
+    "language": "en"
+}
+---
+
+## Overview
+
+A custom normalizer applies unified preprocessing to text, typically in scenarios that need normalization but not tokenization (such as keyword search). Unlike an analyzer, a normalizer does not split the text; it processes the entire input as a single token. It supports combining character filters and token filters to implement case conversion, character normalization, and similar functions.
+
+## Using a Custom Normalizer
+
+### Create
+
+A custom normalizer consists of character filters (`char_filter`) and token filters (`token_filter`).
+
+> Note: For details on creating `char_filter` and `token_filter`, refer to the [Custom Analyzer](./custom-analyzer.md) documentation.
+
+```sql
+CREATE INVERTED INDEX NORMALIZER IF NOT EXISTS x_normalizer
+PROPERTIES (
+    "char_filter" = "x_char_filter",         -- Optional, one or more character filters
+    "token_filter" = "x_filter1, x_filter2"  -- Optional, one or more token filters, executed in order
+);
+```
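+
+For example, the `icu_normalizer` token filter documented in [Custom Analyzer](./custom-analyzer.md) can serve as a normalizer's only filter. The sketch below is illustrative, based on the parameters documented there; `my_icu_filter` and `icu_nfkc_normalizer` are example names, not built-ins:
+
+```sql
+-- Illustrative token filter: ICU normalization with the nfkc_cf form,
+-- which also case-folds, so no separate lowercase filter is needed
+CREATE INVERTED INDEX TOKEN_FILTER IF NOT EXISTS my_icu_filter
+PROPERTIES
+(
+    "type" = "icu_normalizer",
+    "name" = "nfkc_cf"
+);
+
+-- Normalizer that applies the ICU filter to the whole input as one token
+CREATE INVERTED INDEX NORMALIZER IF NOT EXISTS icu_nfkc_normalizer
+PROPERTIES
+(
+    "token_filter" = "my_icu_filter"
+);
+```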
+
+### View
+
+```sql
+SHOW INVERTED INDEX NORMALIZER;
+```
+
+### Drop
+
+```sql
+DROP INVERTED INDEX NORMALIZER IF EXISTS x_normalizer;
+```
+
+## Usage in Table Creation
+
+Specify the custom normalizer using `normalizer` in the inverted index properties.
+
+**Note**: `normalizer` and `analyzer` are mutually exclusive and cannot be specified on the same index at the same time.
+
+```sql
+CREATE TABLE tbl (
+    `id` bigint NOT NULL,
+    `code` text NULL,
+    INDEX idx_code (`code`) USING INVERTED PROPERTIES("normalizer" = "x_custom_normalizer")
+)
+...
+```
+
+## Limitations
+
+1. The names referenced in `char_filter` and `token_filter` must exist (either built-in or created).
+2. A normalizer can only be dropped when no table is using it.
+3. A `char_filter` or `token_filter` can only be dropped when no normalizer is using it.
+4. A newly created normalizer takes about 10 seconds to sync to the BEs; once the sync completes, imports that reference it run without errors.
+
+## Complete Example
+
+### Example: Ignoring Case and Special Accents
+
+This example demonstrates how to create a normalizer that converts text to lowercase and removes accents (e.g., normalizing `Café` to `cafe`), suitable for exact matching that is case-insensitive and accent-insensitive.
+
+```sql
+-- 1. Create a custom token filter (if specific parameters are needed)
+-- Create an ascii_folding filter here
+CREATE INVERTED INDEX TOKEN_FILTER IF NOT EXISTS my_ascii_folding
+PROPERTIES
+(
+    "type" = "ascii_folding",
+    "preserve_original" = "false"
+);
+
+-- 2. Create the normalizer
+-- Combine lowercase (built-in) and my_ascii_folding
+CREATE INVERTED INDEX NORMALIZER IF NOT EXISTS lowercase_ascii_normalizer
+PROPERTIES
+(
+    "token_filter" = "lowercase, my_ascii_folding"
+);
+
+-- 3. Use in table creation
+CREATE TABLE product_table (
+    `id` bigint NOT NULL,
+    `product_name` text NULL,
+    INDEX idx_name (`product_name`) USING INVERTED PROPERTIES("normalizer" = "lowercase_ascii_normalizer")
+) ENGINE=OLAP
+DUPLICATE KEY(`id`)
+DISTRIBUTED BY RANDOM BUCKETS 1
+PROPERTIES (
+    "replication_allocation" = "tag.location.default: 1"
+);
+
+-- 4. Verify and test
+SELECT tokenize('Café-Products', '"normalizer"="lowercase_ascii_normalizer"');
+```
+
+Result:
+```json
+[
+    {"token":"cafe-products"}
+]
+```
diff --git a/docs/sql-manual/sql-functions/scalar-functions/string-functions/unicode_normalize.md b/docs/sql-manual/sql-functions/scalar-functions/string-functions/unicode_normalize.md
new file mode 100644
index 0000000000000..8e39b826a9171
--- /dev/null
+++ b/docs/sql-manual/sql-functions/scalar-functions/string-functions/unicode_normalize.md
@@ -0,0 +1,136 @@
+---
+{
+    "title": "UNICODE_NORMALIZE",
+    "language": "en"
+}
+---
+
+## Description
+
+Performs [Unicode normalization](https://unicode-org.github.io/icu/userguide/transforms/normalization/) on the input string.
+
+Unicode normalization is the process of converting equivalent Unicode character sequences into a unified form. For example, the character "é" can be represented by a single code point (U+00E9) or by "e" plus a combining acute accent (U+0065 + U+0301). Normalization ensures that these equivalent representations are handled uniformly.
+
+## Syntax
+
+```sql
+UNICODE_NORMALIZE(<str>, <mode>)
+```
+
+## Parameters
+
+| Parameter | Description |
+|-----------|-------------|
+| `<str>` | The input string to be normalized. Type: VARCHAR |
+| `<mode>` | The normalization mode, must be a constant string (case-insensitive). Supported modes:<br/>- `NFC`: Canonical Decomposition, followed by Canonical Composition<br/>- `NFD`: Canonical Decomposition<br/>- `NFKC`: Compatibility Decomposition, followed by Canonical Composition<br/>- `NFKD`: Compatibility Decomposition<br/>- `NFKC_CF`: NFKC followed by Case Folding |
+
+## Return Value
+
+Returns VARCHAR, the normalized result of the input string.
+
+## Examples
+
+1. Difference between NFC and NFD (composed vs decomposed characters)
+
+```sql
+-- In 'Café', é may be in composed form; NFD decomposes it into e + a combining accent
+SELECT length(unicode_normalize('Café', 'NFC')) AS nfc_len, length(unicode_normalize('Café', 'NFD')) AS nfd_len;
+```
+
+```text
++---------+---------+
+| nfc_len | nfd_len |
++---------+---------+
+|       4 |       5 |
++---------+---------+
+```
+
+2. NFKC_CF for case folding
+
+```sql
+SELECT unicode_normalize('ABC 123', 'nfkc_cf') AS result;
+```
+
+```text
++---------+
+| result  |
++---------+
+| abc 123 |
++---------+
+```
+
+3. NFKC handling fullwidth characters (compatibility decomposition)
+
+```sql
+-- Fullwidth characters '１２３ＡＢＣ' are converted to halfwidth '123ABC'
+SELECT unicode_normalize('１２３ＡＢＣ', 'NFKC') AS result;
+```
+
+```text
++--------+
+| result |
++--------+
+| 123ABC |
++--------+
+```
+
+4. NFKD handling special symbols (compatibility decomposition)
+
+```sql
+-- ℃ (degree Celsius symbol) is decomposed to °C
+SELECT unicode_normalize('25℃', 'NFKD') AS result;
+```
+
+```text
++--------+
+| result |
++--------+
+| 25°C   |
++--------+
+```
+
+5. Handling circled numbers
+
+```sql
+-- Circled numbers such as ①②③ are converted to regular digits
+SELECT unicode_normalize('①②③', 'NFKC') AS result;
+```
+
+```text
++--------+
+| result |
++--------+
+| 123    |
++--------+
+```
+
+6. Comparing different modes on the same string
+
+```sql
+-- The ligature 'ﬁ' (U+FB01) is preserved by NFC but decomposed to 'fi' by NFKC
+SELECT
+    unicode_normalize('ﬁ', 'NFC') AS nfc_result,
+    unicode_normalize('ﬁ', 'NFKC') AS nfkc_result;
+```
+
+```text
++------------+-------------+
+| nfc_result | nfkc_result |
++------------+-------------+
+| ﬁ          | fi          |
++------------+-------------+
+```
+
+7. String equality comparison scenario
+
+```sql
+-- Use normalization to compare visually identical but differently encoded strings
+SELECT unicode_normalize('café', 'NFC') = unicode_normalize('café', 'NFC') AS is_equal;
+```
+
+```text
++----------+
+| is_equal |
++----------+
+|        1 |
++----------+
+```
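+
+In practice, normalization is applied to both sides of a comparison so that equivalent encodings match. A brief illustrative sketch — the `products` table and `name` column are hypothetical, not part of the function's interface:
+
+```sql
+-- Match rows whose name equals the input regardless of composed/decomposed
+-- encoding, compatibility variants (ligatures, fullwidth forms), and case
+SELECT id, name
+FROM products
+WHERE unicode_normalize(name, 'NFKC_CF') = unicode_normalize('ﬁlm café', 'NFKC_CF');
+```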
diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/current/ai/text-search/custom-analyzer.md b/i18n/zh-CN/docusaurus-plugin-content-docs/current/ai/text-search/custom-analyzer.md
index 69b5749dd19be..aff791eb211ae 100644
--- a/i18n/zh-CN/docusaurus-plugin-content-docs/current/ai/text-search/custom-analyzer.md
+++ b/i18n/zh-CN/docusaurus-plugin-content-docs/current/ai/text-search/custom-analyzer.md
@@ -29,6 +29,11 @@ PROPERTIES (
 - 参数
   - `char_filter_pattern`:需要替换的字符列表
   - `char_filter_replacement`:替换后的字符(默认空格)
+`icu_normalizer`:使用 ICU 标准化对文本进行预处理。
+- 参数
+  - `name`:标准化形式(默认 `nfkc_cf`)。可选:`nfc`、`nfkc`、`nfkc_cf`、`nfd`、`nfkd`
+  - `mode`:标准化模式(默认 `compose`)。可选:`compose`(组合)、`decompose`(分解)
+  - `unicode_set_filter`:指定需要标准化的字符集(如 `[a-z]`)
 
 #### 2. tokenizer(分词器)
 
@@ -81,6 +86,9 @@ PROPERTIES (
 - `type_table`:自定义字符类型映射(如 `[+ => ALPHA, - => ALPHA]`),类型含 `ALPHA`、`ALPHANUM`、`DIGIT`、`LOWER`、`SUBWORD_DELIM`、`UPPER`
 - `ascii_folding`:将非 ASCII 字符映射为等效 ASCII
 - `lowercase`:将 token 文本转为小写
+- `icu_normalizer`:使用 ICU 标准化对词元进行处理。
+  - `name`:标准化形式(默认 `nfkc_cf`)。可选:`nfc`、`nfkc`、`nfkc_cf`、`nfd`、`nfkd`
+  - `unicode_set_filter`:指定需要标准化的字符集
 
 #### 4. analyzer(分析器)
 
diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/current/ai/text-search/custom-normalizer.md b/i18n/zh-CN/docusaurus-plugin-content-docs/current/ai/text-search/custom-normalizer.md
new file mode 100644
index 0000000000000..22ad616647686
--- /dev/null
+++ b/i18n/zh-CN/docusaurus-plugin-content-docs/current/ai/text-search/custom-normalizer.md
@@ -0,0 +1,107 @@
+---
+{
+    "title": "自定义标准化",
+    "language": "zh-CN"
+}
+---
+
+## 概述
+
+自定义标准化器(Normalizer)用于对文本进行统一的预处理,适用于不需要分词但需要标准化的场景(如关键字搜索)。与分词器(Analyzer)不同,Normalizer 不会对文本进行切分,而是将整个文本作为一个完整的词项(Token)进行处理;它支持组合字符过滤器和词元过滤器,以实现大小写转换、字符归一化等功能。
+
+## 使用自定义标准化
+
+### 创建
+
+自定义标准化器主要由字符过滤器(`char_filter`)和词元过滤器(`token_filter`)组成。
+
+> 注意:`char_filter` 和 `token_filter` 的详细创建方式请参考[自定义分词](./custom-analyzer.md)文档。
+
+```sql
+CREATE INVERTED INDEX NORMALIZER IF NOT EXISTS x_normalizer
+PROPERTIES (
+    "char_filter" = "x_char_filter",         -- 可选,一个或多个字符过滤器
+    "token_filter" = "x_filter1, x_filter2"  -- 可选,一个或多个词元过滤器,按顺序执行
+);
+```
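+
+例如,可以将[自定义分词](./custom-analyzer.md)中介绍的 `icu_normalizer` 词元过滤器作为标准化器的唯一过滤器。下面是一个基于该文档参数的最小示意,其中 `my_icu_filter` 与 `icu_nfkc_normalizer` 仅为示例名称:
+
+```sql
+-- 示意词元过滤器:使用 nfkc_cf 形式的 ICU 标准化,
+-- 该形式自带大小写折叠,因此无需再单独使用 lowercase
+CREATE INVERTED INDEX TOKEN_FILTER IF NOT EXISTS my_icu_filter
+PROPERTIES
+(
+    "type" = "icu_normalizer",
+    "name" = "nfkc_cf"
+);
+
+-- 将整个输入作为单个词项应用 ICU 标准化的标准化器
+CREATE INVERTED INDEX NORMALIZER IF NOT EXISTS icu_nfkc_normalizer
+PROPERTIES
+(
+    "token_filter" = "my_icu_filter"
+);
+```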
+
+### 查看
+
+```sql
+SHOW INVERTED INDEX NORMALIZER;
+```
+
+### 删除
+
+```sql
+DROP INVERTED INDEX NORMALIZER IF EXISTS x_normalizer;
+```
+
+## 建表中使用自定义标准化
+
+在倒排索引属性中使用 `normalizer` 指定自定义标准化器。
+
+**注意**:`normalizer` 与 `analyzer` 互斥,不能同时在同一个索引中指定。
+
+```sql
+CREATE TABLE tbl (
+    `id` bigint NOT NULL,
+    `code` text NULL,
+    INDEX idx_code (`code`) USING INVERTED PROPERTIES("normalizer" = "x_custom_normalizer")
+)
+...
+```
+
+## 使用限制
+
+1. `char_filter` 和 `token_filter` 中引用的名称必须存在(内置或已创建)。
+2. 只有在没有任何表使用某个 normalizer 时才能删除它。
+3. 只有在没有任何 normalizer 使用某个 char_filter 或 token_filter 时才能删除对应的 filter。
+4. 自定义标准化器创建后约 10 秒会同步到 BE;同步完成后,引用它的导入即可正常执行,不会报错。
+
+## 完整示例
+
+### 示例:忽略大小写与特殊重音符号
+
+本示例展示如何创建一个标准化器,将文本转换为小写并移除重音符号(例如将 `Café` 标准化为 `cafe`),适用于不区分大小写和重音的精确匹配。
+
+```sql
+-- 1. 创建自定义词元过滤器(如果需要特定参数)
+-- 此处创建一个 ascii_folding 过滤器
+CREATE INVERTED INDEX TOKEN_FILTER IF NOT EXISTS my_ascii_folding
+PROPERTIES
+(
+    "type" = "ascii_folding",
+    "preserve_original" = "false"
+);
+
+-- 2. 创建标准化器
+-- 组合使用 lowercase(内置)和 my_ascii_folding
+CREATE INVERTED INDEX NORMALIZER IF NOT EXISTS lowercase_ascii_normalizer
+PROPERTIES
+(
+    "token_filter" = "lowercase, my_ascii_folding"
+);
+
+-- 3. 建表使用
+CREATE TABLE product_table (
+    `id` bigint NOT NULL,
+    `product_name` text NULL,
+    INDEX idx_name (`product_name`) USING INVERTED PROPERTIES("normalizer" = "lowercase_ascii_normalizer")
+) ENGINE=OLAP
+DUPLICATE KEY(`id`)
+DISTRIBUTED BY RANDOM BUCKETS 1
+PROPERTIES (
+    "replication_allocation" = "tag.location.default: 1"
+);
+
+-- 4. 验证测试
+SELECT tokenize('Café-Products', '"normalizer"="lowercase_ascii_normalizer"');
+```
+
+返回结果:
+```json
+[
+    {"token":"cafe-products"}
+]
+```
diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/current/sql-manual/sql-functions/scalar-functions/string-functions/unicode_normalize.md b/i18n/zh-CN/docusaurus-plugin-content-docs/current/sql-manual/sql-functions/scalar-functions/string-functions/unicode_normalize.md
new file mode 100644
index 0000000000000..e712b115f2de2
--- /dev/null
+++ b/i18n/zh-CN/docusaurus-plugin-content-docs/current/sql-manual/sql-functions/scalar-functions/string-functions/unicode_normalize.md
@@ -0,0 +1,136 @@
+---
+{
+    "title": "UNICODE_NORMALIZE",
+    "language": "zh-CN"
+}
+---
+
+## 描述
+
+对输入字符串进行 [Unicode 标准化(归一化)](https://unicode-org.github.io/icu/userguide/transforms/normalization/)。
+
+Unicode 标准化是将等价的 Unicode 字符序列转换为统一形式的过程。例如,字符 "é" 可以用单个码点(U+00E9)表示,也可以用 "e" + 组合重音符号(U+0065 + U+0301)两个码点表示。标准化确保这些等价的表示形式能被统一处理。
+
+## 语法
+
+```sql
+UNICODE_NORMALIZE(<str>, <mode>)
+```
+
+## 参数
+
+| 参数 | 说明 |
+|------|------|
+| `<str>` | 需要进行标准化的输入字符串。类型:VARCHAR |
+| `<mode>` | 标准化模式,必须是常量字符串(不区分大小写)。支持的模式:<br/>- `NFC`: 标准分解后进行标准组合(Canonical Decomposition, followed by Canonical Composition)<br/>- `NFD`: 标准分解(Canonical Decomposition)<br/>- `NFKC`: 兼容分解后进行标准组合(Compatibility Decomposition, followed by Canonical Composition)<br/>- `NFKD`: 兼容分解(Compatibility Decomposition)<br/>- `NFKC_CF`: NFKC 后进行大小写折叠(Case Folding) |
+
+## 返回值
+
+返回 VARCHAR 类型,表示输入字符串标准化后的结果。
+
+## 示例
+
+1. NFC 与 NFD 的区别(组合字符 vs 分解字符)
+
+```sql
+-- 'Café' 中 é 可能是组合形式,NFD 会将其分解为 e + 组合重音符
+SELECT length(unicode_normalize('Café', 'NFC')) AS nfc_len, length(unicode_normalize('Café', 'NFD')) AS nfd_len;
+```
+
+```text
++---------+---------+
+| nfc_len | nfd_len |
++---------+---------+
+|       4 |       5 |
++---------+---------+
+```
+
+2. NFKC_CF 进行大小写折叠
+
+```sql
+SELECT unicode_normalize('ABC 123', 'nfkc_cf') AS result;
+```
+
+```text
++---------+
+| result  |
++---------+
+| abc 123 |
++---------+
+```
+
+3. NFKC 处理全角字符(兼容分解)
+
+```sql
+-- 全角字符 '１２３ＡＢＣ' 会被转换为半角 '123ABC'
+SELECT unicode_normalize('１２３ＡＢＣ', 'NFKC') AS result;
+```
+
+```text
++--------+
+| result |
++--------+
+| 123ABC |
++--------+
+```
+
+4. NFKD 处理特殊符号(兼容分解)
+
+```sql
+-- ℃ (摄氏度符号) 会被分解为 °C
+SELECT unicode_normalize('25℃', 'NFKD') AS result;
+```
+
+```text
++--------+
+| result |
++--------+
+| 25°C   |
++--------+
+```
+
+5. 处理带圈数字
+
+```sql
+-- ①②③ 等带圈数字会被转换为普通数字
+SELECT unicode_normalize('①②③', 'NFKC') AS result;
+```
+
+```text
++--------+
+| result |
++--------+
+| 123    |
++--------+
+```
+
+6. 比较不同模式对同一字符串的处理
+
+```sql
+-- 连字 'ﬁ'(U+FB01)在 NFC 下保持不变,在 NFKC 下被分解为 'fi'
+SELECT
+    unicode_normalize('ﬁ', 'NFC') AS nfc_result,
+    unicode_normalize('ﬁ', 'NFKC') AS nfkc_result;
+```
+
+```text
++------------+-------------+
+| nfc_result | nfkc_result |
++------------+-------------+
+| ﬁ          | fi          |
++------------+-------------+
+```
+
+7. 字符串相等性比较场景
+
+```sql
+-- 使用标准化来比较视觉上相同但编码不同的字符串
+SELECT unicode_normalize('café', 'NFC') = unicode_normalize('café', 'NFC') AS is_equal;
+```
+
+```text
++----------+
+| is_equal |
++----------+
+|        1 |
++----------+
+```
diff --git a/sidebars.ts b/sidebars.ts
index 9e415a60d2b2e..9a9832876ff26 100644
--- a/sidebars.ts
+++ b/sidebars.ts
@@ -400,6 +400,7 @@ const sidebars: SidebarsConfig = {
                 'ai/text-search/search-operators',
                 'ai/text-search/search-function',
                 'ai/text-search/custom-analyzer',
+                'ai/text-search/custom-normalizer',
                 'ai/text-search/scoring',
             ],
         },
@@ -1338,6 +1339,7 @@ const sidebars: SidebarsConfig = {
                 'sql-manual/sql-functions/scalar-functions/string-functions/uncompress',
                 'sql-manual/sql-functions/scalar-functions/string-functions/unhex',
                 'sql-manual/sql-functions/scalar-functions/string-functions/ucase',
+                'sql-manual/sql-functions/scalar-functions/string-functions/unicode_normalize',
                 'sql-manual/sql-functions/scalar-functions/string-functions/url-decode',
                 'sql-manual/sql-functions/scalar-functions/string-functions/url-encode',
                 'sql-manual/sql-functions/scalar-functions/string-functions/uuid',