
Commit 8e2a561

Update luis-language-support.md
Adding new supported languages and a new tokenizer for Dutch. Removed the "tokenized entity returned" column from the tokenization table, as I'm not sure what it refers to: the link sends you to a sentence with a link that points back to the table. Perhaps I'm missing something.
1 parent 246d38d commit 8e2a561

File tree

1 file changed (+39, -23 lines)


articles/cognitive-services/LUIS/luis-language-support.md

Lines changed: 39 additions & 23 deletions
@@ -30,18 +30,25 @@ LUIS understands utterances in the following languages:
 | American English |`en-US` |||||
 | Arabic (preview - modern standard Arabic) |`ar-AR`|-|-|-|-|
 | *[Chinese](#chinese-support-notes) |`zh-CN` ||||-|
-| Dutch |`nl-NL` || - |-||
+| Dutch |`nl-NL` ||-|-||
 | French (France) |`fr-FR` |||||
-| French (Canada) |`fr-CA` |-| - |-||
+| French (Canada) |`fr-CA` |-|-|-||
 | German |`de-DE` |||||
-| Hindi | `hi-IN`|-|-|-|-|
+| Gujarati | `gu-IN`|-|-|-|-|
+| Hindi | `hi-IN`|-||-|-|
 | Italian |`it-IT` |||||
 | *[Japanese](#japanese-support-notes) |`ja-JP` ||||Key phrase only|
-| Korean |`ko-KR` || - |-|Key phrase only|
+| Korean |`ko-KR` ||-|-|Key phrase only|
+| Marathi | `mr-IN`|-|-|-|-|
 | Portuguese (Brazil) |`pt-BR` ||||not all sub-cultures|
 | Spanish (Spain) |`es-ES` |||||
-| Spanish (Mexico)|`es-MX` |-| - |||
-| Turkish | `tr-TR` ||-|-|Sentiment only|
+| Spanish (Mexico)|`es-MX` |-|-|||
+| Tamil | `ta-IN`|-|-|-|-|
+| Telugu | `te-IN`|-|-|-|-|
+| Turkish | `tr-TR` |||-|Sentiment only|
+
+
+

 Language support varies for [prebuilt entities](luis-reference-prebuilt-entities.md) and [prebuilt domains](luis-reference-prebuilt-domains.md).

@@ -72,22 +79,28 @@ Hybrid languages combine words from two cultures such as English and Chinese. Th
 ## Tokenization
 To perform machine learning, LUIS breaks an utterance into [tokens](luis-glossary.md#token) based on culture.

-|Language| every space or special character | character level|compound words|[tokenized entity returned](luis-concept-data-extraction.md#tokenized-entity-returned)
-|--|:--:|:--:|:--:|:--:|
-|Arabic|||||
-|Chinese|||||
-|Dutch|||||
-|English (en-us)|||||
-|French (fr-FR)|||||
-|French (fr-CA)|||||
-|German|||||
-| Hindi ||-|-|-|-|
-|Italian|||||
-|Japanese|||||
-|Korean|||||
-|Portuguese (Brazil)|||||
-|Spanish (es-ES)|||||
-|Spanish (es-MX)|||||
+|Language| every space or special character | character level|compound words
+|--|:--:|:--:|:--:|
+|Arabic||||
+|Chinese||||
+|Dutch||||
+|English (en-us)||||
+|French (fr-FR)||||
+|French (fr-CA)||||
+|German||||
+|Gujarati||||
+|Hindi||||
+|Italian||||
+|Japanese|||✔
+|Korean||||
+|Marathi||||
+|Portuguese (Brazil)||||
+|Spanish (es-ES)||||
+|Spanish (es-MX)||||
+|Tamil||||
+|Telugu||||
+|Turkish||||
+

 ### Custom tokenizer versions

@@ -96,7 +109,10 @@ The following cultures have custom tokenizer versions:
 |Culture|Version|Purpose|
 |--|--|--|
 |German<br>`de-de`|1.0.0|Tokenizes words by splitting them using a machine learning-based tokenizer that tries to break down composite words into their single components.<br>If a user enters `Ich fahre einen krankenwagen` as an utterance, it is turned to `Ich fahre einen kranken wagen`. Allowing the marking of `kranken` and `wagen` independently as different entities.|
-|German<br>`de-de`|1.0.2|Tokenizes words by splitting them on spaces.<br> if a user enters `Ich fahre einen krankenwagen` as an utterance, it remains a single token. Thus `krankenwagen` is marked as a single entity. |
+|German<br>`de-de`|1.0.2|Tokenizes words by splitting them on spaces.<br> If a user enters `Ich fahre einen krankenwagen` as an utterance, it remains a single token. Thus `krankenwagen` is marked as a single entity. |
+|Dutch<br>`nl-nl`|1.0.0|Tokenizes words by splitting them using a machine learning-based tokenizer that tries to break down composite words into their single components.<br>If a user enters `Ik ga naar de kleuterschool` as an utterance, it is turned to `Ik ga naar de kleuter school`. Allowing the marking of `kleuter` and `school` independently as different entities.|
+|Dutch<br>`nl-nl`|1.0.1|Tokenizes words by splitting them on spaces.<br> If a user enters `Ik ga naar de kleuterschool` as an utterance, it remains a single token. Thus `kleuterschool` is marked as a single entity. |
+

 ### Migrating between tokenizer versions
 <!--
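The tokenization table in the diff above distinguishes three strategies: splitting on every space or special character, character-level splitting (used for languages such as Chinese), and compound-word splitting. As a rough illustration only (this is not LUIS code; both helper functions are hypothetical), the first two strategies could look like:

```python
# Illustration only -- not LUIS source code. Two of the strategies the
# tokenization table distinguishes, as simple stand-alone functions.
import re

def tokenize_space_and_special(utterance: str) -> list[str]:
    # Split on whitespace and treat each special character as its own token.
    return re.findall(r"\w+|[^\w\s]", utterance)

def tokenize_character_level(utterance: str) -> list[str]:
    # Every non-space character becomes its own token (e.g. for Chinese text).
    return [ch for ch in utterance if not ch.isspace()]

print(tokenize_space_and_special("Book 2 tickets to Cairo!"))
# ['Book', '2', 'tickets', 'to', 'Cairo', '!']
print(tokenize_character_level("我要去北京"))
# ['我', '要', '去', '北', '京']
```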

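The custom tokenizer versions described for German and Dutch differ only in whether compound words are split before entity labeling. A minimal sketch of that difference, assuming a toy lookup table in place of the machine learning-based splitter (the `COMPOUND_SPLITS` dictionary and `tokenize` helper are hypothetical, not part of LUIS):

```python
# Illustration only -- the real compound splitter is a machine learning model;
# a toy lookup table stands in for it here.
COMPOUND_SPLITS = {
    "krankenwagen": ["kranken", "wagen"],     # de-de example from the table
    "kleuterschool": ["kleuter", "school"],   # nl-nl example from the table
}

def tokenize(utterance: str, split_compounds: bool) -> list[str]:
    tokens: list[str] = []
    for word in utterance.split():
        if split_compounds and word.lower() in COMPOUND_SPLITS:
            # Like tokenizer version 1.0.0: the compound becomes separate tokens,
            # so each part can be labeled as its own entity.
            tokens.extend(COMPOUND_SPLITS[word.lower()])
        else:
            # Like de-de 1.0.2 / nl-nl 1.0.1: split on spaces only,
            # so the compound stays a single token and a single entity.
            tokens.append(word)
    return tokens

print(tokenize("Ik ga naar de kleuterschool", split_compounds=True))
# ['Ik', 'ga', 'naar', 'de', 'kleuter', 'school']
print(tokenize("Ik ga naar de kleuterschool", split_compounds=False))
# ['Ik', 'ga', 'naar', 'de', 'kleuterschool']
```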