8 changes: 8 additions & 0 deletions docs/ai/text-search/custom-analyzer.md
@@ -29,6 +29,11 @@ PROPERTIES (
- Parameters
- `char_filter_pattern`: characters to replace
- `char_filter_replacement`: replacement characters (default: space)
- **icu_normalizer**: Preprocesses text using ICU normalization (see the sketch after this list).
- Parameters
- `name`: normalization form (default `nfkc_cf`). Options: `nfc`, `nfkc`, `nfkc_cf`, `nfd`, `nfkd`
- `mode`: normalization mode (default `compose`). Options: `compose`, `decompose`
- `unicode_set_filter`: the character set to normalize (e.g. `[a-z]`)
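For illustration, an `icu_normalizer` character filter could be registered as follows. This is a minimal sketch that assumes a `CREATE INVERTED INDEX CHAR_FILTER` statement analogous to the documented `CREATE INVERTED INDEX TOKEN_FILTER` syntax; the name `my_icu_char_filter` and the statement form are hypothetical:

```sql
-- Hypothetical: assumes CHAR_FILTER creation mirrors the documented TOKEN_FILTER syntax
CREATE INVERTED INDEX CHAR_FILTER IF NOT EXISTS my_icu_char_filter
PROPERTIES
(
    "type" = "icu_normalizer",
    "name" = "nfkc_cf",               -- normalization form (the default)
    "mode" = "compose",               -- compose or decompose
    "unicode_set_filter" = "[a-z]"    -- only normalize characters in this set
);
```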

#### 2. Creating a tokenizer

@@ -77,6 +82,9 @@ Available token filters:
- **ascii_folding**: Converts non-ASCII characters to ASCII equivalents
- **lowercase**: Converts tokens to lowercase
- **pinyin**: Converts Chinese characters to pinyin after tokenization. For parameter details, refer to the **pinyin** tokenizer above.
- **icu_normalizer**: Processes tokens using ICU normalization (see the sketch after this list).
- `name`: normalization form (default `nfkc_cf`). Options: `nfc`, `nfkc`, `nfkc_cf`, `nfd`, `nfkd`
- `unicode_set_filter`: the character set to normalize
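For example, a token filter based on `icu_normalizer` can be created with the `CREATE INVERTED INDEX TOKEN_FILTER` statement shown in the Custom Normalizer example below; the filter name `my_icu_filter` is illustrative:

```sql
-- Token filter that applies ICU NFKC normalization plus case folding to each token
CREATE INVERTED INDEX TOKEN_FILTER IF NOT EXISTS my_icu_filter
PROPERTIES
(
    "type" = "icu_normalizer",
    "name" = "nfkc_cf"    -- the default form: NFKC followed by case folding
);
```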

#### 4. Creating an analyzer

107 changes: 107 additions & 0 deletions docs/ai/text-search/custom-normalizer.md
@@ -0,0 +1,107 @@
---
{
"title": "Custom Normalizer",
"language": "en"
}
---

## Overview

A custom normalizer applies uniform text preprocessing, typically in scenarios that need normalization but not tokenization (such as keyword search). Unlike an analyzer, a normalizer does not split the text; it processes the entire input as a single token. It supports combining character filters and token filters to perform case conversion, character normalization, and similar transformations.

## Using a Custom Normalizer

### Create

A custom normalizer consists mainly of character filters (`char_filter`) and token filters (`token_filter`).

> Note: For detailed creation methods of `char_filter` and `token_filter`, please refer to the [Custom Analyzer] documentation.

```sql
CREATE INVERTED INDEX NORMALIZER IF NOT EXISTS x_normalizer
PROPERTIES (
"char_filter" = "x_char_filter", -- Optional, one or more character filters
"token_filter" = "x_filter1, x_filter2" -- Optional, one or more token filters, executed in order
);
```

### View

```sql
SHOW INVERTED INDEX NORMALIZER;
```

### Drop

```sql
DROP INVERTED INDEX NORMALIZER IF EXISTS x_normalizer;
```

## Usage in Table Creation

Specify the custom normalizer using `normalizer` in the inverted index properties.

**Note**: `normalizer` and `analyzer` are mutually exclusive and cannot be specified in the same index simultaneously.

```sql
CREATE TABLE tbl (
`id` bigint NOT NULL,
`code` text NULL,
INDEX idx_code (`code`) USING INVERTED PROPERTIES("normalizer" = "x_custom_normalizer")
)
...
```

## Limitations

1. The names referenced in `char_filter` and `token_filter` must exist (either built-in or already created).
2. A normalizer can only be dropped when no table uses it.
3. A `char_filter` or `token_filter` can only be dropped when no normalizer uses it (see the drop-order sketch after this list).
4. After a normalizer is created, it takes about 10 seconds for the definition to sync to the BE nodes; imports that reference it proceed normally once the sync completes.
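As a sketch of the drop order implied by limitations 2 and 3, using the objects from the Complete Example below (the `DROP INVERTED INDEX TOKEN_FILTER` statement is assumed by analogy with the documented `DROP INVERTED INDEX NORMALIZER` syntax):

```sql
-- Drop the table (or index) that uses the normalizer first,
-- then the normalizer, and only then the token filter it references
DROP TABLE IF EXISTS product_table;
DROP INVERTED INDEX NORMALIZER IF EXISTS lowercase_ascii_normalizer;
DROP INVERTED INDEX TOKEN_FILTER IF EXISTS my_ascii_folding;  -- assumed syntax
```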

## Complete Example

### Example: Ignoring Case and Special Accents

This example demonstrates how to create a normalizer that converts text to lowercase and removes accents (e.g., normalizing `Café` to `cafe`), suitable for exact matching that is case-insensitive and accent-insensitive.

```sql
-- 1. Create a custom token filter (if specific parameters are needed)
-- Create an ascii_folding filter here
CREATE INVERTED INDEX TOKEN_FILTER IF NOT EXISTS my_ascii_folding
PROPERTIES
(
"type" = "ascii_folding",
"preserve_original" = "false"
);

-- 2. Create the normalizer
-- Combine lowercase (built-in) and my_ascii_folding
CREATE INVERTED INDEX NORMALIZER IF NOT EXISTS lowercase_ascii_normalizer
PROPERTIES
(
"token_filter" = "lowercase, my_ascii_folding"
);

-- 3. Use in table creation
CREATE TABLE product_table (
`id` bigint NOT NULL,
`product_name` text NULL,
INDEX idx_name (`product_name`) USING INVERTED PROPERTIES("normalizer" = "lowercase_ascii_normalizer")
) ENGINE=OLAP
DUPLICATE KEY(`id`)
DISTRIBUTED BY RANDOM BUCKETS 1
PROPERTIES (
"replication_allocation" = "tag.location.default: 1"
);

-- 4. Verify and test
SELECT tokenize('Café-Products', '"normalizer"="lowercase_ascii_normalizer"');
```

Result:
```json
[
{"token":"cafe-products"}
]
```
@@ -0,0 +1,136 @@
---
{
"title": "UNICODE_NORMALIZE",
"language": "en"
}
---

## Description

Performs [Unicode Normalization](https://unicode-org.github.io/icu/userguide/transforms/normalization/) on the input string.

Unicode normalization is the process of converting equivalent Unicode character sequences into a unified form. For example, the character "é" can be represented by a single code point (U+00E9) or by "e" + a combining acute accent (U+0065 + U+0301). Normalization ensures that these equivalent representations are handled uniformly.

## Syntax

```sql
UNICODE_NORMALIZE(<str>, <mode>)
```

## Parameters

| Parameter | Description |
|-----------|-------------|
| `<str>` | The input string to be normalized. Type: VARCHAR |
| `<mode>` | The normalization mode, must be a constant string (case-insensitive). Supported modes:<br/>- `NFC`: Canonical Decomposition, followed by Canonical Composition<br/>- `NFD`: Canonical Decomposition<br/>- `NFKC`: Compatibility Decomposition, followed by Canonical Composition<br/>- `NFKD`: Compatibility Decomposition<br/>- `NFKC_CF`: NFKC followed by Case Folding |

## Return Value

Returns VARCHAR type, representing the normalized result of the input string.

## Examples

1. Difference between NFC and NFD (composed vs decomposed characters)

```sql
-- Here é in 'Café' is in composed form (one code point); NFD decomposes it into e + a combining accent
SELECT length(unicode_normalize('Café', 'NFC')) AS nfc_len, length(unicode_normalize('Café', 'NFD')) AS nfd_len;
```

```text
+---------+---------+
| nfc_len | nfd_len |
+---------+---------+
| 4 | 5 |
+---------+---------+
```

2. NFKC_CF for case folding

```sql
SELECT unicode_normalize('ABC 123', 'nfkc_cf') AS result;
```

```text
+---------+
| result |
+---------+
| abc 123 |
+---------+
```

3. NFKC handling fullwidth characters (compatibility decomposition)

```sql
-- Fullwidth characters '１２３ＡＢＣ' are converted to halfwidth '123ABC'
SELECT unicode_normalize('１２３ＡＢＣ', 'NFKC') AS result;
```

```text
+--------+
| result |
+--------+
| 123ABC |
+--------+
```

4. NFKD handling special symbols (compatibility decomposition)

```sql
-- ℃ (degree Celsius symbol) will be decomposed to °C
SELECT unicode_normalize('25℃', 'NFKD') AS result;
```

```text
+--------+
| result |
+--------+
| 25°C |
+--------+
```

5. Handling circled numbers

```sql
-- ① ② ③ circled numbers will be converted to regular digits
SELECT unicode_normalize('①②③', 'NFKC') AS result;
```

```text
+--------+
| result |
+--------+
| 123 |
+--------+
```

6. Comparing different modes on the same string

```sql
-- The input is the single ligature character ﬁ (U+FB01); NFC preserves it,
-- while NFKC's compatibility decomposition expands it to the two letters fi
SELECT
    unicode_normalize('ﬁ', 'NFC') AS nfc_result,
    unicode_normalize('ﬁ', 'NFKC') AS nfkc_result;
```

```text
+------------+-------------+
| nfc_result | nfkc_result |
+------------+-------------+
| ﬁ          | fi          |
+------------+-------------+
```

7. String equality comparison scenario

```sql
-- Compare visually identical but differently encoded strings: the first 'café'
-- uses the precomposed é (U+00E9), the second uses e + a combining acute accent (U+0301)
SELECT unicode_normalize('café', 'NFC') = unicode_normalize('café', 'NFC') AS is_equal;
```

```text
+----------+
| is_equal |
+----------+
| 1 |
+----------+
```
Expand Up @@ -29,6 +29,11 @@ PROPERTIES (
- Parameters
- `char_filter_pattern`: the characters to replace
- `char_filter_replacement`: the replacement character (default: space)
- `icu_normalizer`: Preprocesses text using ICU normalization (see the sketch after this list).
- Parameters
- `name`: normalization form (default `nfkc_cf`). Options: `nfc`, `nfkc`, `nfkc_cf`, `nfd`, `nfkd`
- `mode`: normalization mode (default `compose`). Options: `compose`, `decompose`
- `unicode_set_filter`: the character set to normalize (e.g. `[a-z]`)
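For illustration, an `icu_normalizer` character filter might be registered as follows — a minimal sketch assuming a `CREATE INVERTED INDEX CHAR_FILTER` statement analogous to the documented `CREATE INVERTED INDEX TOKEN_FILTER` syntax; the name `my_icu_char_filter` and the statement form are hypothetical:

```sql
-- Hypothetical: assumes CHAR_FILTER creation mirrors the documented TOKEN_FILTER syntax
CREATE INVERTED INDEX CHAR_FILTER IF NOT EXISTS my_icu_char_filter
PROPERTIES
(
    "type" = "icu_normalizer",
    "name" = "nfkc_cf",               -- normalization form (the default)
    "mode" = "decompose",             -- compose or decompose
    "unicode_set_filter" = "[a-z]"    -- only normalize characters in this set
);
```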

#### 2. Tokenizer

@@ -81,6 +86,9 @@ PROPERTIES (
- `type_table`: custom character type mappings (e.g. `[+ => ALPHA, - => ALPHA]`); types include `ALPHA`, `ALPHANUM`, `DIGIT`, `LOWER`, `SUBWORD_DELIM`, `UPPER`
- `ascii_folding`: maps non-ASCII characters to equivalent ASCII characters
- `lowercase`: converts token text to lowercase
- `icu_normalizer`: processes tokens using ICU normalization (see the sketch after this list).
- `name`: normalization form (default `nfkc_cf`). Options: `nfc`, `nfkc`, `nfkc_cf`, `nfd`, `nfkd`
- `unicode_set_filter`: the character set to normalize
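For example, a token filter based on `icu_normalizer` can be created with the `CREATE INVERTED INDEX TOKEN_FILTER` statement shown in the Custom Normalizer example below; the name `my_icu_filter` is illustrative:

```sql
-- Token filter that applies ICU NFKC normalization plus case folding to each token
CREATE INVERTED INDEX TOKEN_FILTER IF NOT EXISTS my_icu_filter
PROPERTIES
(
    "type" = "icu_normalizer",
    "name" = "nfkc_cf"    -- the default form: NFKC followed by case folding
);
```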

#### 4. Analyzer

@@ -0,0 +1,107 @@
---
{
"title": "自定义标准化",
"language": "zh-CN"
}
---

## Overview

A custom normalizer applies uniform text preprocessing, typically in scenarios that need normalization but not tokenization (such as keyword search). Unlike an analyzer, a normalizer does not split the text; it processes the entire input as a single token. It supports combining character filters and token filters to perform case conversion, character normalization, and similar transformations.

## Using a Custom Normalizer

### Create

A custom normalizer consists mainly of character filters (`char_filter`) and token filters (`token_filter`).

> Note: For details on creating `char_filter` and `token_filter`, see the [Custom Analyzer] documentation.

```sql
CREATE INVERTED INDEX NORMALIZER IF NOT EXISTS x_normalizer
PROPERTIES (
"char_filter" = "x_char_filter", -- 可选,一个或多个字符过滤器
"token_filter" = "x_filter1, x_filter2" -- 可选,一个或多个词元过滤器,按顺序执行
);
```

### View

```sql
SHOW INVERTED INDEX NORMALIZER;
```

### Drop

```sql
DROP INVERTED INDEX NORMALIZER IF EXISTS x_normalizer;
```

## Usage in Table Creation

Specify the custom normalizer with `normalizer` in the inverted index properties.

**Note**: `normalizer` and `analyzer` are mutually exclusive and cannot be specified on the same index at the same time.

```sql
CREATE TABLE tbl (
`id` bigint NOT NULL,
`code` text NULL,
INDEX idx_code (`code`) USING INVERTED PROPERTIES("normalizer" = "x_custom_normalizer")
)
...
```

## Limitations

1. The names referenced in `char_filter` and `token_filter` must exist (either built-in or already created).
2. A normalizer can only be dropped when no table uses it.
3. A `char_filter` or `token_filter` can only be dropped when no normalizer uses it (see the drop-order sketch after this list).
4. After a normalizer is created, it takes about 10 seconds for the definition to sync to the BE nodes; imports that reference it proceed normally once the sync completes.
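A sketch of the drop order implied by limitations 2 and 3, using the objects from the Complete Example below (`DROP INVERTED INDEX TOKEN_FILTER` is assumed by analogy with the documented `DROP INVERTED INDEX NORMALIZER` syntax):

```sql
-- Drop the table (or index) that uses the normalizer first,
-- then the normalizer, and only then the token filter it references
DROP TABLE IF EXISTS product_table;
DROP INVERTED INDEX NORMALIZER IF EXISTS lowercase_ascii_normalizer;
DROP INVERTED INDEX TOKEN_FILTER IF EXISTS my_ascii_folding;  -- assumed syntax
```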

## Complete Example

### Example: Ignoring Case and Special Accents

This example demonstrates how to create a normalizer that converts text to lowercase and removes accents (e.g. normalizing `Café` to `cafe`), suitable for case-insensitive, accent-insensitive exact matching.

```sql
-- 1. Create a custom token filter (when specific parameters are needed)
-- Here, create an ascii_folding filter
CREATE INVERTED INDEX TOKEN_FILTER IF NOT EXISTS my_ascii_folding
PROPERTIES
(
"type" = "ascii_folding",
"preserve_original" = "false"
);

-- 2. Create the normalizer
-- Combine the built-in lowercase filter with my_ascii_folding
CREATE INVERTED INDEX NORMALIZER IF NOT EXISTS lowercase_ascii_normalizer
PROPERTIES
(
"token_filter" = "lowercase, my_ascii_folding"
);

-- 3. Use in table creation
CREATE TABLE product_table (
`id` bigint NOT NULL,
`product_name` text NULL,
INDEX idx_name (`product_name`) USING INVERTED PROPERTIES("normalizer" = "lowercase_ascii_normalizer")
) ENGINE=OLAP
DUPLICATE KEY(`id`)
DISTRIBUTED BY RANDOM BUCKETS 1
PROPERTIES (
"replication_allocation" = "tag.location.default: 1"
);

-- 4. Verify and test
SELECT tokenize('Café-Products', '"normalizer"="lowercase_ascii_normalizer"');
```

Result:
```json
[
{"token":"cafe-products"}
]
```