Training Thai transformer & tokenizer #10542
-
I'm not exactly sure what would cause that, but I think what is happening is that because your tokenizer doesn't match the one used for xlm-roberta-base, you're getting OOV tokens or something similar. Arbitrarily long sequences should be handled by striding, so it's possible something is off with your striding settings. xlm-roberta-base probably uses a simple tokenizer that will give different results than properly word-segmented or character-tokenized Thai.
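For reference, here is a rough sketch of the span getter block that controls striding in a typical spaCy transformer config; the `window`/`stride` values below are the common defaults, so the ones in your cloned project may differ:

```ini
# Long documents are split into overlapping spans before being passed
# to the transformer, so the model's maximum sequence length isn't exceeded.
[components.transformer.model.get_spans]
@span_getters = "spacy-transformers.strided_spans.v1"
window = 128
stride = 96
```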
Yes, it looks like ThaiTokenizer is the default tokenizer in our Thai configuration. I don't think any of us are very familiar with it, so you'd have to confirm that its settings match whatever model you're using, but it should be similar.
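If you want to double-check or set this in your own training config, the Thai tokenizer is registered as `spacy.th.ThaiTokenizer` (it needs `pythainlp` installed); a minimal sketch of the relevant config block:

```ini
# Use spaCy's built-in Thai word segmenter instead of spacy.Tokenizer.v1
[nlp]
lang = "th"

[nlp.tokenizer]
@tokenizers = "spacy.th.ThaiTokenizer"
```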
The model in the quickstart for Thai, like many languages we don't provide full pipelines for, is one that seemed popular and that we confirmed would run in spaCy. However, we didn't confirm the quality of the results or compare it with other models, so we can't really give you a recommendation there.
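If you do want to experiment, swapping in a different Hugging Face checkpoint is usually just a matter of changing `name` in the transformer component of your config; a rough sketch (the exact architecture version string depends on your spacy-transformers release):

```ini
# Any compatible Hugging Face model, e.g. monsoon-nlp/bert-base-thai,
# can be substituted here for xlm-roberta-base.
[components.transformer.model]
@architectures = "spacy-transformers.TransformerModel.v3"
name = "xlm-roberta-base"
tokenizer_config = {"use_fast": true}
```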
-
I am trying to train a spaCy pipeline for the Thai language consisting of a transformer, tagger, and parser, and I was wondering which model is best for the task. I have cloned the UD Benchmark pipeline and removed the experimental character-based NER tokenizer and the experimental edit tree lemmatizer, basically leaving only the transformer component. My dataset consists of 10,000 sentences with words that were segmented properly.

As you know, the default transformer in the UD Benchmark pipeline is `xlm-roberta-base`. When I train it on my dataset together with the default tokenizer `spacy.Tokenizer.v1`, I run into the following error message on the `spacy project run evaluate` step:

The sentence length in the dataset varies from 1 to 33 words. Could this be a problem? Is it possible to edit the maximum sequence length for the `xlm-roberta-base` model?

Also, the training quickstart recommends `monsoon-nlp/bert-base-thai` as the language-specific transformer model. However, there is the following warning message on the Hugging Face page of this transformer model:

Is it possible to use `ThaiTokenizer` instead of the experimental character-based NER tokenizer in the spaCy pipeline?

And overall, which transformer model would you advise training the Thai dataset (with properly segmented words) on?