Question about segmentation #10792
Replies: 2 comments
-
When sharing text, please copy and paste it as text; do not share screenshots of text.

No, the `#text` metadata is not taken into account during training and evaluation. That said, your Korean tokenization doesn't look like it would be a problem in particular.
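A minimal sketch of how to check this yourself, assuming spaCy v3.x: feed a CoNLL-U string to `conllu_to_docs` (the converter behind `spacy convert`) and inspect the resulting `Doc`. The analysis columns below are placeholders, not a real Korean parse.

```python
# A sketch (assuming spaCy v3.x): the Doc's text is rebuilt from the token
# FORMs plus SpaceAfter=No; the "# text" comment line is not consulted.
from spacy.training.converters import conllu_to_docs

conllu_data = """\
# text = 언제부터
1\t언제\t_\tADV\t_\t_\t0\troot\t_\tSpaceAfter=No
2\t부터\t_\tADP\t_\t_\t1\tcase\t_\tSpaceAfter=No

"""

doc = next(conllu_to_docs(conllu_data, n_sents=1, no_print=True))
print(doc.text)               # 언제부터 -- reconstructed from the tokens
print([t.text for t in doc])  # ['언제', '부터']
```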
-
I think there are two separate issues/questions here:
-
Hello! I'm developing a Korean dataset and I was wondering if the `#text` metadata part of the `conllu` file is taken into account during the training & evaluation processes. Would it cause an error if the number of words in the `#text` does not match the actual number of words in the analysis? For example, suppose the sentence in the `#text` were 언제부터 언제까지 주문 배송함? (From when to when will the order be delivered?) and it were analyzed into separate tokens such as 언제, 부터, 언제, 까지, 주문, and 배송함, as shown below:
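For illustration, a hypothetical CoNLL-U rendering of that analysis (the LEMMA, UPOS, HEAD, and DEPREL columns are placeholders, and splitting the final `?` into its own token is an assumption; `SpaceAfter=No` marks where the original text has no space):

```
# text = 언제부터 언제까지 주문 배송함?
1   언제     _   ADV    _   _   6   advmod   _   SpaceAfter=No
2   부터     _   ADP    _   _   1   case     _   _
3   언제     _   ADV    _   _   6   advmod   _   SpaceAfter=No
4   까지     _   ADP    _   _   3   case     _   _
5   주문     _   NOUN   _   _   6   obj      _   _
6   배송함   _   VERB   _   _   0   root     _   SpaceAfter=No
7   ?        _   PUNCT  _   _   6   punct    _   _
```

(In an actual `.conllu` file the columns are tab-separated.) Because `SpaceAfter=No` records exactly where the `#text` has no space, the token FORMs still concatenate back to the original sentence even though the token count differs from the space-separated word count.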
I saw a similar analysis happening in the Japanese GSD dataset, and I was wondering: does breaking a sentence without spaces into separate tokens cause an error during the training & evaluation processes, or does it help the `spacy-transformers` model (`xlm-roberta-base`) learn some form of segmentation?
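On the second question: `xlm-roberta-base` applies its own SentencePiece segmentation to the raw text, and `spacy-transformers` aligns those subword pieces to spaCy's word-level tokens internally, so the two segmentations don't need to match. A quick way to inspect the model's own pieces (a sketch assuming the Hugging Face `transformers` package is installed; the tokenizer is downloaded on first use):

```python
# Show how xlm-roberta-base's SentencePiece tokenizer segments the raw
# sentence, independently of the treebank's word-level tokens.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
print(tokenizer.tokenize("언제부터 언제까지 주문 배송함?"))
```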