When working with datasets that include content in multiple languages, it’s important to ensure that both documents and queries are processed correctly. This guide explains how to index and search multilingual datasets in Meilisearch, highlighting best practices, useful features, and what to avoid.
## Tokenizers and language differences

Search quality in Meilisearch depends heavily on how text is broken down into tokens. Since each language has its own writing system and rules, different languages call for different tokenization strategies:

- **Non-space-separated languages** (Chinese, Japanese): Words are written continuously without spaces. These languages require specialized tokenizers to correctly split text into searchable units.

- **Languages with compound words** (German, Swedish): Words can be combined to form long terms, such as _Donaudampfschifffahrtsgesellschaft_ (German for Danube Steamship Company). Meilisearch provides specialized tokenizers to process them correctly.
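A quick way to see why tokenizer choice matters is to try naive whitespace splitting, which is roughly what a tokenizer tuned for English does. The sketch below is purely illustrative (Meilisearch handles this internally with its own tokenizer); it shows how whitespace splitting works for English but yields a single opaque token for Japanese text and leaves German compounds undecomposed:

```python
def whitespace_tokenize(text: str) -> list[str]:
    """Naive tokenizer: split on whitespace only."""
    return text.split()

# English: whitespace is a reliable word boundary.
print(whitespace_tokenize("danube steamship company"))
# → ['danube', 'steamship', 'company']

# Japanese: no spaces, so the whole phrase becomes one token --
# a query for 東京 alone would not match this token as a word.
print(whitespace_tokenize("東京スカイツリーに行きました"))
# → ['東京スカイツリーに行きました']

# German: the compound stays as one long token, so a query for
# "Dampfschiff" (steamship) would not match it as a whole word.
print(whitespace_tokenize("Donaudampfschifffahrtsgesellschaft"))
# → ['Donaudampfschifffahrtsgesellschaft']
```

This is exactly the failure mode that language-specific tokenizers exist to avoid.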
### Normalization differences
Normalization ensures that different spellings or character variations (like accents or case differences) are treated consistently during search.

- **Accents and diacritics**: In many languages, accents can often be ignored without losing meaning (e.g., _éléphant_ vs. _elephant_). In other languages, such as Swedish, diacritics may represent entirely different letters, so they must be preserved.
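The standard accent-stripping trick (Unicode NFD decomposition followed by dropping combining marks) illustrates why this decision must be made per language. This is a sketch of the general technique, not of Meilisearch's internal normalizer:

```python
import unicodedata

def strip_accents(text: str) -> str:
    """Drop combining diacritical marks after NFD decomposition."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

# French: stripping the accent keeps the same word, so
# accent-insensitive matching is harmless here.
print(strip_accents("éléphant"))  # → elephant

# Swedish: å is a distinct letter, not an accented "a".
# Stripping it conflates different words ("får" vs. "far"),
# which is why these diacritics must be preserved.
print(strip_accents("får"))  # → far
```

The same transformation is safe in one language and lossy in another, which is why normalization cannot be applied uniformly across a multilingual dataset.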
## Recommended indexing strategy

In some cases, you may prefer to keep multiple languages in a **single index**.

#### Limitations

- Languages with compound words (like German) or diacritics that change meaning (like Swedish), as well as non-space-separated writing systems (like Chinese or Japanese), work better in their own index since they require specialized [tokenizers](/learn/indexing/tokenization).
- Chinese and Japanese documents should not be mixed in the same field, since distinguishing between them automatically is very difficult. Each of these languages works best in its own dedicated index. However, if fields are strictly separated by language (e.g., `title_zh` always Chinese, `title_ja` always Japanese), it is possible to store them in the same index.
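If you take the per-field approach, you can pin each field to a language explicitly instead of relying on automatic detection. A sketch of such a settings payload, assuming a Meilisearch version that supports the `localizedAttributes` index setting with ISO-639-3 locale codes (the index name `articles` is illustrative):

```python
# Illustrative index settings payload: pin each field to a locale so
# Meilisearch applies the appropriate tokenizer instead of guessing
# between Chinese and Japanese.
settings = {
    "localizedAttributes": [
        {"attributePatterns": ["title_zh"], "locales": ["cmn"]},  # Mandarin Chinese
        {"attributePatterns": ["title_ja"], "locales": ["jpn"]},  # Japanese
    ]
}

# With the official Python client this would be sent as, e.g.:
#   client.index("articles").update_settings(settings)
```

Pairing this with `locales` on the search request side keeps queries and documents tokenized consistently for each field.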