Query on Multi language model #11984
Replies: 1 comment
You can use a config like the one you have for input in multiple languages, and you actually don't even have to use the xx language code. The question is whether that'll do what you want. The main thing the language code setting does is change the tokenizer. The xx tokenizer is the same as the English tokenizer, but without any of the tokenizer exceptions and so on, so the output will be quite different sometimes. It can also change a number of lexical features.

Note there are potential issues with training on input in multiple languages. For example, a word with different meanings in each language might confuse the model, or if the languages share too few words the model might be over-extended and have trouble learning. xlm-roberta is trained with these issues in mind, but you still might expect decreased performance compared to a monolingual transformer.

Also note it probably won't work well for languages that really need a custom tokenizer because they don't use spaces to separate words, like Japanese.
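A quick way to see the tokenizer difference for yourself (a minimal sketch, assuming spaCy is installed; the sample sentence is just an illustration):

```python
import spacy

# Blank pipelines differ only in their language defaults, including the tokenizer.
nlp_en = spacy.blank("en")  # English tokenizer, with exceptions for contractions etc.
nlp_xx = spacy.blank("xx")  # multi-language tokenizer, no English-specific exceptions

text = "Don't tokenize this the same way."
print([t.text for t in nlp_en(text)])
print([t.text for t in nlp_xx(text)])
# The two token lists can differ, since "xx" drops the English tokenizer exceptions.
```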
Hi,
My query is: can I train a blank spaCy model to support multiple languages using the xlm-roberta-base pretrained model? If so, is the config below correct?

Complete sample config file:
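(The actual config is not shown here; as a minimal sketch, the relevant sections of such a config might look like the following, in the spacy-transformers config style. The pipeline layout and values are illustrative assumptions, not a complete or verified file.)

```ini
# Illustrative fragment only: the remaining sections (training, ner component,
# etc.) would follow the output of spaCy's quickstart / `spacy init config`.
[nlp]
lang = "xx"
pipeline = ["transformer","ner"]

[components.transformer]
factory = "transformer"

[components.transformer.model]
@architectures = "spacy-transformers.TransformerModel.v3"
name = "xlm-roberta-base"
```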