Training Thai transformer & tokenizer #10542
-
I'm not exactly sure what would cause that, but I think what is happening is that because your tokenizer doesn't match the one used for xlm-roberta-base, you're getting OOV tokens or something similar. Arbitrarily long sequences should be handled by striding, so it's possible something is off with your striding settings. xlm-roberta-base probably uses a simple tokenizer that will give different results than properly word-segmented or character-tokenized Thai.
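For reference, here is a rough sketch of the span getter block that controls striding in a typical spaCy transformer config; the `window`/`stride` values below are the common defaults, so the ones in your cloned project may differ:

```ini
# Long documents are split into overlapping spans before being passed
# to the transformer, so the model's maximum sequence length isn't exceeded.
[components.transformer.model.get_spans]
@span_getters = "spacy-transformers.strided_spans.v1"
window = 128
stride = 96
```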
Yes, it looks like ThaiTokenizer is the default tokenizer in our Thai configuration. I don't think any of us are very familiar with it, so you'd have to confirm that its settings match whatever model you're using, but it should be similar.
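If you want to double-check or set this in your own training config, the Thai tokenizer is registered as `spacy.th.ThaiTokenizer` (it needs `pythainlp` installed); a minimal sketch of the relevant config block:

```ini
# Use spaCy's built-in Thai word segmenter instead of spacy.Tokenizer.v1
[nlp]
lang = "th"

[nlp.tokenizer]
@tokenizers = "spacy.th.ThaiTokenizer"
```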
The model in the quickstart for Thai, like many languages we don't provide full pipelines for, is one that seemed popular and that we confirmed would run in spaCy. However, we didn't confirm the quality of the results or compare it with other models, so we can't really give you a recommendation there.
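If you do want to experiment, swapping in a different Hugging Face checkpoint is usually just a matter of changing `name` in the transformer component of your config; a rough sketch (the exact architecture version string depends on your spacy-transformers release):

```ini
# Any compatible Hugging Face model, e.g. monsoon-nlp/bert-base-thai,
# can be substituted here for xlm-roberta-base.
[components.transformer.model]
@architectures = "spacy-transformers.TransformerModel.v3"
name = "xlm-roberta-base"
tokenizer_config = {"use_fast": true}
```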
-
I am trying to train a spaCy pipeline for the Thai language consisting of a transformer, tagger, and parser, and I was wondering which model is best for the task. I have cloned the UD Benchmark pipeline and removed the experimental character-based NER tokenizer and the experimental edit tree lemmatizer, basically leaving only the transformer component. My dataset consists of 10,000 sentences with words that were segmented properly.

As you know, the default transformer in the UD Benchmark pipeline is `xlm-roberta-base`. When I train it on my dataset together with the default tokenizer `spacy.Tokenizer.v1`, I run into the following error message on the `spacy project run evaluate` step:

The sentence length in the dataset varies from 1 to 33 words. Could this be a problem? Is it possible to edit the maximum sequence length for the `xlm-roberta-base` model?

Also, the training quickstart recommends `monsoon-nlp/bert-base-thai` as the language-specific transformer model. However, there is the following warning message on the Hugging Face page of this transformer model:

Is it possible to use `ThaiTokenizer` instead of the experimental character-based NER tokenizer in the spaCy pipeline?

And overall, which transformer model would you advise training the Thai dataset (with properly segmented words) on?