Word spacing/word segmentation model #10973
I was wondering if there is a way to train the Tokenizer to split text that doesn't contain any spaces. For example, in Korean people often type without any whitespace, so an input can look like "맛있는음식먹어싶다", which translates to "I wanna eat tasty food." Is there a way to train the model to split "맛있는음식먹어싶다" into "맛있는 음식 먹어 싶다"?
-
If you want to train a model to do this from scratch in spaCy, the easiest way would be to put a space between every character and use NER labels on them, essentially labelling each space-delimited chunk as an entity. You could then train the model to reproduce those labels, which would let you figure out where spaces should go in new inputs. It would be easy to set that up, but the NER model wasn't really designed to be used that way, so I'm not sure it would work well. You could maybe also use the SentenceRecognizer, which handles a similar problem where you have a single class. In the past I think there was an experimental tokenization component, but it hasn't been worked on in a while.
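If it helps, here is a rough sketch of how the data preparation for that approach could look. This is illustrative code, not an official recipe: the `WORD` label and the helper name are arbitrary choices, and `blank("xx")` is used only to sidestep the optional Korean tokenizer dependencies.

```python
import spacy
from spacy.tokens import DocBin, Span

def make_char_ner_doc(nlp, spaced_text, label="WORD"):
    """Turn a correctly spaced sentence into a character-level NER example:
    every character becomes its own whitespace-delimited token, and each
    original word is annotated as one entity span."""
    words = spaced_text.split()
    chars = [ch for word in words for ch in word]
    # Joining with spaces gives one token per character
    doc = nlp.make_doc(" ".join(chars))
    spans, start = [], 0
    for word in words:
        # Token indices line up 1:1 with characters, so a word of
        # length n covers tokens [start, start + n)
        spans.append(Span(doc, start, start + len(word), label=label))
        start += len(word)
    doc.ents = spans
    return doc

# blank("xx") is a dependency-free stand-in; use a Korean pipeline in practice
nlp = spacy.blank("xx")
doc = make_char_ner_doc(nlp, "맛있는 음식 먹어 싶다")
print([(ent.text, ent.label_) for ent in doc.ents])
# [('맛 있 는', 'WORD'), ('음 식', 'WORD'), ('먹 어', 'WORD'), ('싶 다', 'WORD')]

# Serialize for `spacy train`
DocBin(docs=[doc]).to_disk("train.spacy")
```

At inference time you would space out every character of the unspaced input the same way, run the trained NER model over it, and read the word boundaries off the predicted entity spans.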
-
To add a few notes: there is an experimental trainable tokenizer using NER like Paul described: https://github.com/explosion/spacy-experimental/tree/v0.5.0#character-based-ner-tokenizer. It works well for languages with whitespace, where it's mainly learning how to split off punctuation, but when I tried it for Chinese it didn't work very well. It might work a little better for Korean, where grammatical suffixes could be easier to identify as word boundaries, but I'm not sure.
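Wiring that up looks roughly like the sketch below. The registry and factory names (`spacy-experimental.char_pretokenizer.v1` and `experimental_char_ner_tokenizer`) are taken from the linked v0.5.0 README, so double-check them against the version you install.

```python
# Assumes `pip install spacy-experimental==0.5.0`
import spacy
import spacy_experimental  # noqa: F401 (factories also register via entry points)

# The char pre-tokenizer splits text into single-character tokens; the
# trainable component then predicts word spans over those characters.
nlp = spacy.blank(
    "ko",
    config={
        "nlp": {
            "tokenizer": {"@tokenizers": "spacy-experimental.char_pretokenizer.v1"}
        }
    },
)
nlp.add_pipe("experimental_char_ner_tokenizer")
```

The component only does something useful after training it (e.g. with `spacy train`) on docs whose gold tokenization encodes the intended word boundaries.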