Custom tokenization based on the sentence structure #11083
orglce started this conversation in Language Support
This isn't possible without custom code, because the regex-based tokenizer analyzes each whitespace-separated string independently and has no access to any information about sentence boundaries. With custom code, you could potentially have a solution where the tokenizer always splits cases like |
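To illustrate the limitation described above: a suffix-driven regex tokenizer sees each whitespace-separated chunk in isolation, so an ordinal like "2." mid-sentence and a plain number like "1998." at the end of a sentence look identical to it. This is a minimal stdlib-only sketch of that principle, not spaCy's actual tokenizer code:

```python
import re

# A toy suffix-driven tokenizer: repeatedly strip trailing punctuation
# from each whitespace-separated chunk, the way a regex-based suffix
# pass would, with no sentence context available.
SUFFIX = re.compile(r"[.,!?]$")

def tokenize_chunk(chunk):
    """Split trailing punctuation off one chunk, seen in isolation."""
    suffixes = []
    while True:
        m = SUFFIX.search(chunk)
        if not m:
            break
        suffixes.insert(0, m.group())
        chunk = chunk[:m.start()]
    return ([chunk] if chunk else []) + suffixes

def tokenize(text):
    # Each chunk is processed independently: the tokenizer cannot know
    # whether "1998." ends a sentence or "2." marks an ordinal.
    return [tok for chunk in text.split() for tok in tokenize_chunk(chunk)]

print(tokenize("He finished in the 2. place."))
# → ['He', 'finished', 'in', 'the', '2', '.', 'place', '.']
```

Note that "2." is split exactly like "place." here: without sentence boundaries, the two cases cannot be distinguished by suffix rules alone.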
I'm adding custom rules for the tokenization of Slovene. In Slovene, a number followed by a dot denotes an ordinal (the same as n-th in English, so 4. means fourth or 4th). My first thought was simply not to split numbers and dots. But sometimes the number appears at the end of the sentence, and in that case I always want to split it, regardless of whether it denotes an ordinal or a plain number.
Example (I will write the ordinals in Slovene notation):
TEXT: He finished in the 2. place.
TOKENIZATION: He - finished - in - the - 2. - place - .
TEXT: It was built in 1998.
TOKENIZATION: It - was - built - in - 1998 - .
Is there a way to keep numbers and dots together unless they appear at the end of the sentence? I would like to solve this with the files in spacy/lang, without running custom code, so that the tokenization works out of the box.
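As noted in the reply above, the spacy/lang files alone cannot do this, because they have no access to sentence boundaries. A custom-code workaround is a two-pass approach: keep digit+dot sequences together during tokenization, then split the dot off the final token of each sentence. This is a stdlib-only sketch with hypothetical helper names, not a spaCy API:

```python
import re

# Pass 1: tokenize one sentence, keeping "<digits>." together as an
# ordinal candidate; other word characters and punctuation split as usual.
TOKEN = re.compile(r"\d+\.|\w+|[^\w\s]")

def tokenize_keep_ordinals(sentence):
    return TOKEN.findall(sentence)

# Pass 2: a trailing "<digits>." at the end of a sentence is a plain
# number followed by the sentence-final period, so split it there.
def fix_sentence_end(tokens):
    if tokens and re.fullmatch(r"\d+\.", tokens[-1]):
        return tokens[:-1] + [tokens[-1][:-1], "."]
    return tokens

print(fix_sentence_end(tokenize_keep_ordinals("He finished in the 2. place.")))
# → ['He', 'finished', 'in', 'the', '2.', 'place', '.']
print(fix_sentence_end(tokenize_keep_ordinals("It was built in 1998.")))
# → ['It', 'was', 'built', 'in', '1998', '.']
```

This matches both desired tokenizations from the examples, but it only works once the text has already been split into sentences, which is exactly the information the out-of-the-box tokenizer lacks.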