Commit c97be71

Remove tokenizer and normalizer section
1 parent 2c36837 commit c97be71

File tree

1 file changed (+1 -26)


guides/multilingual-datasets.mdx

Lines changed: 1 addition & 26 deletions
@@ -5,31 +5,6 @@ description: This guide covers indexing strategies, language-specific tokenizers
 
 When working with datasets that include content in multiple languages, it’s important to ensure that both documents and queries are processed correctly. This guide explains how to index and search multilingual datasets in Meilisearch, highlighting best practices, useful features, and what to avoid.
 
-## Tokenizers and language differences
-
-Search quality in Meilisearch depends heavily on how text is broken down into tokens. Since each language has its own writing system and rules, different languages require different tokenization strategies:
-
-- **Space-separated languages** (English, French, Spanish):
-
-  Words are clearly separated by spaces, making them straightforward to tokenize.
-
-- **Non-space-separated languages** (Chinese, Japanese):
-
-  Words are written continuously without spaces. These languages require specialized tokenizers to correctly split text into searchable units.
-
-- **Languages with compound words** (German, Swedish):
-
-  Words can be combined to form long terms, such as _Donaudampfschifffahrtsgesellschaft_ (German for Danube Steamship Company). Meilisearch provides specialized tokenizers to process them correctly.
-
-### Normalization differences
-
-Normalization ensures that different spellings or character variations (like accents or case differences) are treated consistently during search.
-
-- **Accents and diacritics**:
-
-  In many languages, accents can often be ignored without losing meaning (e.g., éléphant vs elephant).
-
-  In other languages, like Swedish, diacritics may represent entirely different letters, so they must be preserved.
 
 ## Recommended indexing strategy
 
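An aside on the normalization passage removed above: the accent folding it describes can be sketched with Python's standard unicodedata module. The helper below is hypothetical and is not Meilisearch's actual normalizer; it only illustrates why folding is safe for French but lossy for Swedish.

```python
import unicodedata

def strip_diacritics(text: str) -> str:
    """Illustrative accent folding: decompose, then drop combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

print(strip_diacritics("éléphant"))  # -> "elephant" (meaning preserved)
# Caution: the same folding maps Swedish "å" to "a", conflating distinct
# letters -- which is why diacritics must sometimes be preserved.
print(strip_diacritics("får"))       # -> "far" (meaning lost)
```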

@@ -65,7 +40,7 @@ In some cases, you may prefer to keep multiple languages in a **single index**.
 
 #### Limitations
 
-- Languages with compound words (like German) or diacritics that change meaning (like Swedish), as well as non-space-separated writing systems (like Chinese or Japanese), work better in their own indexes since they require specialized tokenizers.
+- Languages with compound words (like German) or diacritics that change meaning (like Swedish), as well as non-space-separated writing systems (like Chinese or Japanese), work better in their own indexes since they require specialized [tokenizers](/learn/indexing/tokenization).
 
 - Chinese and Japanese documents should not be mixed in the same field, since distinguishing between them automatically is very difficult. Each of these languages works best in its own dedicated index. However, if fields are strictly separated by language (e.g., title_zh always Chinese, title_ja always Japanese), it is possible to store them in the same index.
 
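To make the per-language recommendation concrete, here is a minimal sketch of that index layout using the official meilisearch Python client; the connection details, index names, and documents are illustrative assumptions, not part of the guide.

```python
import meilisearch

# Illustrative connection details -- adjust for your own deployment.
client = meilisearch.Client("http://localhost:7700", "masterKey")

# One dedicated index per language, so each can be tokenized with the
# strategy suited to its writing system (index names are hypothetical).
movies_zh = client.index("movies-zh")
movies_ja = client.index("movies-ja")

# Chinese and Japanese documents stay in separate indexes, never mixed
# in the same field. Indexing is asynchronous, so wait for each task.
task = movies_zh.add_documents([{"id": 1, "title": "千与千寻"}])
client.wait_for_task(task.task_uid)
task = movies_ja.add_documents([{"id": 1, "title": "千と千尋の神隠し"}])
client.wait_for_task(task.task_uid)

# Route each query to the index that matches the user's language.
print(movies_ja.search("千尋")["hits"])
```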
Comments (0)