8 changes: 8 additions & 0 deletions docs/ai/text-search/custom-analyzer.md
@@ -29,6 +29,11 @@ PROPERTIES (
- Parameters
- `char_filter_pattern`: characters to replace
- `char_filter_replacement`: replacement characters (default: space)
- **icu_normalizer**: Preprocesses text using ICU normalization (see the sketch after this list).
- Parameters
- `name`: normalization form (default `nfkc_cf`). Options: `nfc`, `nfkc`, `nfkc_cf`, `nfd`, `nfkd`
- `mode`: normalization mode (default `compose`). Options: `compose`, `decompose`
- `unicode_set_filter`: the character set to normalize (e.g. `[a-z]`)
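For illustration, an `icu_normalizer` character filter could be registered as follows. This is a minimal sketch that assumes a `CREATE INVERTED INDEX CHAR_FILTER` statement analogous to the documented `CREATE INVERTED INDEX TOKEN_FILTER` syntax; the name `my_icu_char_filter` and the statement form are hypothetical:

```sql
-- Hypothetical: assumes CHAR_FILTER creation mirrors the documented TOKEN_FILTER syntax
CREATE INVERTED INDEX CHAR_FILTER IF NOT EXISTS my_icu_char_filter
PROPERTIES
(
    "type" = "icu_normalizer",
    "name" = "nfkc_cf",               -- normalization form (the default)
    "mode" = "compose",               -- compose or decompose
    "unicode_set_filter" = "[a-z]"    -- only normalize characters in this set
);
```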

#### 2. Creating a tokenizer

@@ -77,6 +82,9 @@ Available token filters:
- **ascii_folding**: Converts non-ASCII characters to ASCII equivalents
- **lowercase**: Converts tokens to lowercase
- **pinyin**: Converts Chinese characters to pinyin after tokenization. For parameter details, refer to the **pinyin** tokenizer above.
- **icu_normalizer**: Processes tokens using ICU normalization (see the sketch after this list).
- `name`: normalization form (default `nfkc_cf`). Options: `nfc`, `nfkc`, `nfkc_cf`, `nfd`, `nfkd`
- `unicode_set_filter`: the character set to normalize
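For example, a token filter based on `icu_normalizer` can be created with the `CREATE INVERTED INDEX TOKEN_FILTER` statement shown in the Custom Normalizer example below; the filter name `my_icu_filter` is illustrative:

```sql
-- Token filter that applies ICU NFKC normalization plus case folding to each token
CREATE INVERTED INDEX TOKEN_FILTER IF NOT EXISTS my_icu_filter
PROPERTIES
(
    "type" = "icu_normalizer",
    "name" = "nfkc_cf"    -- the default form: NFKC followed by case folding
);
```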

#### 4. Creating an analyzer

107 changes: 107 additions & 0 deletions docs/ai/text-search/custom-normalizer.md
@@ -0,0 +1,107 @@
---
{
"title": "Custom Normalizer",
"language": "en"
}
---

## Overview

A custom normalizer applies uniform text preprocessing, typically in scenarios that need normalization but not tokenization (such as keyword search). Unlike an analyzer, a normalizer does not split the text; it processes the entire input as a single token. It supports combining character filters and token filters to perform case conversion, character normalization, and similar transformations.

## Using a Custom Normalizer

### Create

A custom normalizer consists mainly of character filters (`char_filter`) and token filters (`token_filter`).

> Note: For detailed creation methods of `char_filter` and `token_filter`, please refer to the [Custom Analyzer] documentation.

```sql
CREATE INVERTED INDEX NORMALIZER IF NOT EXISTS x_normalizer
PROPERTIES (
"char_filter" = "x_char_filter", -- Optional, one or more character filters
"token_filter" = "x_filter1, x_filter2" -- Optional, one or more token filters, executed in order
);
```

### View

```sql
SHOW INVERTED INDEX NORMALIZER;
```

### Drop

```sql
DROP INVERTED INDEX NORMALIZER IF EXISTS x_normalizer;
```

## Usage in Table Creation

Specify the custom normalizer using `normalizer` in the inverted index properties.

**Note**: `normalizer` and `analyzer` are mutually exclusive and cannot be specified in the same index simultaneously.

```sql
CREATE TABLE tbl (
`id` bigint NOT NULL,
`code` text NULL,
INDEX idx_code (`code`) USING INVERTED PROPERTIES("normalizer" = "x_custom_normalizer")
)
...
```

## Limitations

1. The names referenced in `char_filter` and `token_filter` must exist (either built-in or already created).
2. A normalizer can only be dropped when no table uses it.
3. A `char_filter` or `token_filter` can only be dropped when no normalizer uses it (see the drop-order sketch after this list).
4. After a normalizer is created, it takes about 10 seconds for the definition to sync to the BE nodes; imports that reference it proceed normally once the sync completes.
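As a sketch of the drop order implied by limitations 2 and 3, using the objects from the Complete Example below (the `DROP INVERTED INDEX TOKEN_FILTER` statement is assumed by analogy with the documented `DROP INVERTED INDEX NORMALIZER` syntax):

```sql
-- Drop the table (or index) that uses the normalizer first,
-- then the normalizer, and only then the token filter it references
DROP TABLE IF EXISTS product_table;
DROP INVERTED INDEX NORMALIZER IF EXISTS lowercase_ascii_normalizer;
DROP INVERTED INDEX TOKEN_FILTER IF EXISTS my_ascii_folding;  -- assumed syntax
```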

## Complete Example

### Example: Ignoring Case and Special Accents

This example demonstrates how to create a normalizer that converts text to lowercase and removes accents (e.g., normalizing `Café` to `cafe`), suitable for exact matching that is case-insensitive and accent-insensitive.

```sql
-- 1. Create a custom token filter (if specific parameters are needed)
-- Create an ascii_folding filter here
CREATE INVERTED INDEX TOKEN_FILTER IF NOT EXISTS my_ascii_folding
PROPERTIES
(
"type" = "ascii_folding",
"preserve_original" = "false"
);

-- 2. Create the normalizer
-- Combine lowercase (built-in) and my_ascii_folding
CREATE INVERTED INDEX NORMALIZER IF NOT EXISTS lowercase_ascii_normalizer
PROPERTIES
(
"token_filter" = "lowercase, my_ascii_folding"
);

-- 3. Use in table creation
CREATE TABLE product_table (
`id` bigint NOT NULL,
`product_name` text NULL,
INDEX idx_name (`product_name`) USING INVERTED PROPERTIES("normalizer" = "lowercase_ascii_normalizer")
) ENGINE=OLAP
DUPLICATE KEY(`id`)
DISTRIBUTED BY RANDOM BUCKETS 1
PROPERTIES (
"replication_allocation" = "tag.location.default: 1"
);

-- 4. Verify and test
SELECT tokenize('Café-Products', '"normalizer"="lowercase_ascii_normalizer"');
```

Result:
```json
[
{"token":"cafe-products"}
]
```
@@ -0,0 +1,136 @@
---
{
"title": "UNICODE_NORMALIZE",
"language": "en"
}
---

## Description

Performs [Unicode Normalization](https://unicode-org.github.io/icu/userguide/transforms/normalization/) on the input string.

Unicode normalization is the process of converting equivalent Unicode character sequences into a unified form. For example, the character "é" can be represented by a single code point (U+00E9) or by "e" + a combining acute accent (U+0065 + U+0301). Normalization ensures that these equivalent representations are handled uniformly.

## Syntax

```sql
UNICODE_NORMALIZE(<str>, <mode>)
```

## Parameters

| Parameter | Description |
|-----------|-------------|
| `<str>` | The input string to be normalized. Type: VARCHAR |
| `<mode>` | The normalization mode, must be a constant string (case-insensitive). Supported modes:<br/>- `NFC`: Canonical Decomposition, followed by Canonical Composition<br/>- `NFD`: Canonical Decomposition<br/>- `NFKC`: Compatibility Decomposition, followed by Canonical Composition<br/>- `NFKD`: Compatibility Decomposition<br/>- `NFKC_CF`: NFKC followed by Case Folding |

## Return Value

Returns VARCHAR type, representing the normalized result of the input string.

## Examples

1. Difference between NFC and NFD (composed vs decomposed characters)

```sql
-- Here é in 'Café' is in composed form (one code point); NFD decomposes it into e + a combining accent
SELECT length(unicode_normalize('Café', 'NFC')) AS nfc_len, length(unicode_normalize('Café', 'NFD')) AS nfd_len;
```

```text
+---------+---------+
| nfc_len | nfd_len |
+---------+---------+
| 4 | 5 |
+---------+---------+
```

2. NFKC_CF for case folding

```sql
SELECT unicode_normalize('ABC 123', 'nfkc_cf') AS result;
```

```text
+---------+
| result |
+---------+
| abc 123 |
+---------+
```

3. NFKC handling fullwidth characters (compatibility decomposition)

```sql
-- Fullwidth characters '１２３ＡＢＣ' are converted to halfwidth '123ABC'
SELECT unicode_normalize('１２３ＡＢＣ', 'NFKC') AS result;
```

```text
+--------+
| result |
+--------+
| 123ABC |
+--------+
```

4. NFKD handling special symbols (compatibility decomposition)

```sql
-- ℃ (degree Celsius symbol) will be decomposed to °C
SELECT unicode_normalize('25℃', 'NFKD') AS result;
```

```text
+--------+
| result |
+--------+
| 25°C |
+--------+
```

5. Handling circled numbers

```sql
-- ① ② ③ circled numbers will be converted to regular digits
SELECT unicode_normalize('①②③', 'NFKC') AS result;
```

```text
+--------+
| result |
+--------+
| 123 |
+--------+
```

6. Comparing different modes on the same string

```sql
-- The input is the single ligature character ﬁ (U+FB01); NFC preserves it,
-- while NFKC's compatibility decomposition expands it to the two letters fi
SELECT
    unicode_normalize('ﬁ', 'NFC') AS nfc_result,
    unicode_normalize('ﬁ', 'NFKC') AS nfkc_result;
```

```text
+------------+-------------+
| nfc_result | nfkc_result |
+------------+-------------+
| ﬁ          | fi          |
+------------+-------------+
```

7. String equality comparison scenario

```sql
-- Compare visually identical but differently encoded strings: the first 'café'
-- uses the precomposed é (U+00E9), the second uses e + a combining acute accent (U+0301)
SELECT unicode_normalize('café', 'NFC') = unicode_normalize('café', 'NFC') AS is_equal;
```

```text
+----------+
| is_equal |
+----------+
| 1 |
+----------+
```
Expand Up @@ -29,6 +29,11 @@ PROPERTIES (
- Parameters
- `char_filter_pattern`: the characters to replace
- `char_filter_replacement`: the replacement character (default: space)
- `icu_normalizer`: Preprocesses text using ICU normalization (see the sketch after this list).
- Parameters
- `name`: normalization form (default `nfkc_cf`). Options: `nfc`, `nfkc`, `nfkc_cf`, `nfd`, `nfkd`
- `mode`: normalization mode (default `compose`). Options: `compose`, `decompose`
- `unicode_set_filter`: the character set to normalize (e.g. `[a-z]`)
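For illustration, an `icu_normalizer` character filter might be registered as follows — a minimal sketch assuming a `CREATE INVERTED INDEX CHAR_FILTER` statement analogous to the documented `CREATE INVERTED INDEX TOKEN_FILTER` syntax; the name `my_icu_char_filter` and the statement form are hypothetical:

```sql
-- Hypothetical: assumes CHAR_FILTER creation mirrors the documented TOKEN_FILTER syntax
CREATE INVERTED INDEX CHAR_FILTER IF NOT EXISTS my_icu_char_filter
PROPERTIES
(
    "type" = "icu_normalizer",
    "name" = "nfkc_cf",               -- normalization form (the default)
    "mode" = "decompose",             -- compose or decompose
    "unicode_set_filter" = "[a-z]"    -- only normalize characters in this set
);
```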

#### 2. Tokenizer

@@ -81,6 +86,9 @@ PROPERTIES (
- `type_table`: custom character type mappings (e.g. `[+ => ALPHA, - => ALPHA]`); types include `ALPHA`, `ALPHANUM`, `DIGIT`, `LOWER`, `SUBWORD_DELIM`, `UPPER`
- `ascii_folding`: maps non-ASCII characters to equivalent ASCII characters
- `lowercase`: converts token text to lowercase
- `icu_normalizer`: processes tokens using ICU normalization (see the sketch after this list).
- `name`: normalization form (default `nfkc_cf`). Options: `nfc`, `nfkc`, `nfkc_cf`, `nfd`, `nfkd`
- `unicode_set_filter`: the character set to normalize
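For example, a token filter based on `icu_normalizer` can be created with the `CREATE INVERTED INDEX TOKEN_FILTER` statement shown in the Custom Normalizer example below; the name `my_icu_filter` is illustrative:

```sql
-- Token filter that applies ICU NFKC normalization plus case folding to each token
CREATE INVERTED INDEX TOKEN_FILTER IF NOT EXISTS my_icu_filter
PROPERTIES
(
    "type" = "icu_normalizer",
    "name" = "nfkc_cf"    -- the default form: NFKC followed by case folding
);
```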

#### 4. Analyzer

@@ -0,0 +1,107 @@
---
{
"title": "自定义标准化",
"language": "zh-CN"
}
---

## Overview

A custom normalizer applies uniform text preprocessing, typically in scenarios that need normalization but not tokenization (such as keyword search). Unlike an analyzer, a normalizer does not split the text; it processes the entire input as a single token. It supports combining character filters and token filters to perform case conversion, character normalization, and similar transformations.

## Using a Custom Normalizer

### Create

A custom normalizer consists mainly of character filters (`char_filter`) and token filters (`token_filter`).

> Note: For details on creating `char_filter` and `token_filter`, see the [Custom Analyzer] documentation.

```sql
CREATE INVERTED INDEX NORMALIZER IF NOT EXISTS x_normalizer
PROPERTIES (
"char_filter" = "x_char_filter", -- 可选,一个或多个字符过滤器
"token_filter" = "x_filter1, x_filter2" -- 可选,一个或多个词元过滤器,按顺序执行
);
```

### View

```sql
SHOW INVERTED INDEX NORMALIZER;
```

### Drop

```sql
DROP INVERTED INDEX NORMALIZER IF EXISTS x_normalizer;
```

## Usage in Table Creation

Specify the custom normalizer with `normalizer` in the inverted index properties.

**Note**: `normalizer` and `analyzer` are mutually exclusive and cannot be specified on the same index at the same time.

```sql
CREATE TABLE tbl (
`id` bigint NOT NULL,
`code` text NULL,
INDEX idx_code (`code`) USING INVERTED PROPERTIES("normalizer" = "x_custom_normalizer")
)
...
```

## Limitations

1. The names referenced in `char_filter` and `token_filter` must exist (either built-in or already created).
2. A normalizer can only be dropped when no table uses it.
3. A `char_filter` or `token_filter` can only be dropped when no normalizer uses it (see the drop-order sketch after this list).
4. After a normalizer is created, it takes about 10 seconds for the definition to sync to the BE nodes; imports that reference it proceed normally once the sync completes.
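A sketch of the drop order implied by limitations 2 and 3, using the objects from the Complete Example below (`DROP INVERTED INDEX TOKEN_FILTER` is assumed by analogy with the documented `DROP INVERTED INDEX NORMALIZER` syntax):

```sql
-- Drop the table (or index) that uses the normalizer first,
-- then the normalizer, and only then the token filter it references
DROP TABLE IF EXISTS product_table;
DROP INVERTED INDEX NORMALIZER IF EXISTS lowercase_ascii_normalizer;
DROP INVERTED INDEX TOKEN_FILTER IF EXISTS my_ascii_folding;  -- assumed syntax
```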

## Complete Example

### Example: Ignoring Case and Special Accents

This example demonstrates how to create a normalizer that converts text to lowercase and removes accents (e.g. normalizing `Café` to `cafe`), suitable for case-insensitive, accent-insensitive exact matching.

```sql
-- 1. Create a custom token filter (when specific parameters are needed)
-- Here, create an ascii_folding filter
CREATE INVERTED INDEX TOKEN_FILTER IF NOT EXISTS my_ascii_folding
PROPERTIES
(
"type" = "ascii_folding",
"preserve_original" = "false"
);

-- 2. Create the normalizer
-- Combine the built-in lowercase filter with my_ascii_folding
CREATE INVERTED INDEX NORMALIZER IF NOT EXISTS lowercase_ascii_normalizer
PROPERTIES
(
"token_filter" = "lowercase, my_ascii_folding"
);

-- 3. Use in table creation
CREATE TABLE product_table (
`id` bigint NOT NULL,
`product_name` text NULL,
INDEX idx_name (`product_name`) USING INVERTED PROPERTIES("normalizer" = "lowercase_ascii_normalizer")
) ENGINE=OLAP
DUPLICATE KEY(`id`)
DISTRIBUTED BY RANDOM BUCKETS 1
PROPERTIES (
"replication_allocation" = "tag.location.default: 1"
);

-- 4. Verify and test
SELECT tokenize('Café-Products', '"normalizer"="lowercase_ascii_normalizer"');
```

Result:
```json
[
{"token":"cafe-products"}
]
```