Adding new supported languages and new tokenizer for Dutch.
Removed the tokenized entity returned column from the tokenization table, as I'm not sure what it refers to: the link leads to a sentence containing a link that points back to the table. Perhaps I'm missing something.
| Portuguese (Brazil) |`pt-BR`|✔| ✔ |✔ |not all sub-cultures|
| Spanish (Spain) |`es-ES`|✔|✔|✔|✔|
| Spanish (Mexico)|`es-MX`|-|-|✔|✔|
| Tamil |`ta-IN`|-|-|-|-|
| Telugu |`te-IN`|-|-|-|-|
| Turkish |`tr-TR`|✔|✔|-|Sentiment only|
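An app declares one of these culture codes when it is created, so a client can validate the code up front. A minimal sketch using only the codes visible in this excerpt of the table (the full table lists more cultures):

```python
# Culture codes from the rows shown above. Tamil and Telugu appear in
# the table but currently have no feature columns checked.
SUPPORTED_CULTURES = {"pt-br", "es-es", "es-mx", "tr-tr", "ta-in", "te-in"}

def normalize_culture(code: str) -> str:
    """Validate a culture code against the table, case-insensitively."""
    normalized = code.strip().lower()
    if normalized not in SUPPORTED_CULTURES:
        raise ValueError(f"Unsupported culture: {code!r}")
    return normalized

print(normalize_culture("es-MX"))  # -> es-mx
```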
Language support varies for [prebuilt entities](luis-reference-prebuilt-entities.md) and [prebuilt domains](luis-reference-prebuilt-domains.md).
Hybrid languages combine words from two cultures, such as English and Chinese.
## Tokenization
To perform machine learning, LUIS breaks an utterance into [tokens](luis-glossary.md#token) based on culture.
|Language|Every space or special character|Character level|Compound words|
|--|:--:|:--:|:--:|
|Arabic|✔|||
|Chinese||✔||
|Dutch|✔||✔|
|English (en-us)|✔|||
|French (fr-FR)|✔|||
|French (fr-CA)|✔|||
|German|✔||✔|
|Gujarati|✔|||
|Hindi|✔|||
|Italian|✔|||
|Japanese|||✔|
|Korean||✔||
|Marathi|✔|||
|Portuguese (Brazil)|✔|||
|Spanish (es-ES)|✔|||
|Spanish (es-MX)|✔|||
|Tamil|✔|||
|Telugu|✔|||
|Turkish|✔|||
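As an illustration only (not LUIS's actual implementation), the difference between the two main schemes in the table, splitting on every space or special character versus character-level tokenization, can be sketched as:

```python
import re

def tokenize_by_space(utterance: str) -> list[str]:
    """Split on runs of whitespace and special characters, the scheme
    most cultures in the table use (e.g. English, Dutch, Turkish)."""
    return [t for t in re.split(r"[\s\W]+", utterance) if t]

def tokenize_by_character(utterance: str) -> list[str]:
    """Character-level tokenization, as used for Chinese and Korean:
    every non-space character becomes its own token."""
    return [ch for ch in utterance if not ch.isspace()]

print(tokenize_by_space("Ich fahre einen krankenwagen"))
# -> ['Ich', 'fahre', 'einen', 'krankenwagen']
print(tokenize_by_character("书一本"))
# -> ['书', '一', '本']
```

Note that compound-word cultures (Dutch, German, Japanese) additionally split composite words into components, which is what the custom tokenizer versions below control.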
### Custom tokenizer versions
The following cultures have custom tokenizer versions:
|Culture|Version|Purpose|
|--|--|--|
|German<br>`de-de`|1.0.0|Tokenizes words by splitting them with a machine learning-based tokenizer that tries to break composite words into their components.<br>If a user enters `Ich fahre einen krankenwagen` as an utterance, it is turned into `Ich fahre einen kranken wagen`. This allows `kranken` and `wagen` to be marked independently as different entities.|
|German<br>`de-de`|1.0.2|Tokenizes words by splitting them on spaces.<br>If a user enters `Ich fahre einen krankenwagen` as an utterance, it remains a single token. Thus `krankenwagen` is marked as a single entity.|
|Dutch<br>`nl-nl`|1.0.0|Tokenizes words by splitting them with a machine learning-based tokenizer that tries to break composite words into their components.<br>If a user enters `Ik ga naar de kleuterschool` as an utterance, it is turned into `Ik ga naar de kleuter school`. This allows `kleuter` and `school` to be marked independently as different entities.|
|Dutch<br>`nl-nl`|1.0.1|Tokenizes words by splitting them on spaces.<br>If a user enters `Ik ga naar de kleuterschool` as an utterance, it remains a single token. Thus `kleuterschool` is marked as a single entity.|
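A tokenizer version is selected by editing the `tokenizerVersion` property of an exported app version's JSON and re-importing it as a new version. A minimal sketch of that edit (the sample app dictionary here is illustrative, not a full export):

```python
import json

def set_tokenizer_version(exported_app: dict, version: str) -> dict:
    """Return a copy of an exported LUIS app JSON with its tokenizer
    pinned to the given version; re-importing the edited JSON as a new
    app version applies the chosen tokenizer."""
    app = dict(exported_app)  # shallow copy; original export untouched
    app["tokenizerVersion"] = version
    return app

# Pin the Dutch space-splitting tokenizer (version 1.0.1 in the table).
exported = {"culture": "nl-nl", "tokenizerVersion": "1.0.0", "intents": []}
updated = set_tokenizer_version(exported, "1.0.1")
print(json.dumps(updated))
```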