Custom tokenization based on the sentence structure #11083
orglce started this conversation in Language Support
This isn't possible without custom code, because the regex-based tokenizer analyzes each whitespace-separated string independently and has no access to any information about sentence boundaries. With custom code, you could potentially have a solution where the tokenizer always splits cases like |
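To illustrate the limitation described above: a suffix-driven regex tokenizer sees each whitespace-separated chunk in isolation, so an ordinal like "2." mid-sentence and a plain number like "1998." at the end of a sentence look identical to it. This is a minimal stdlib-only sketch of that principle, not spaCy's actual tokenizer code:

```python
import re

# A toy suffix-driven tokenizer: repeatedly strip trailing punctuation
# from each whitespace-separated chunk, the way a regex-based suffix
# pass would, with no sentence context available.
SUFFIX = re.compile(r"[.,!?]$")

def tokenize_chunk(chunk):
    """Split trailing punctuation off one chunk, seen in isolation."""
    suffixes = []
    while True:
        m = SUFFIX.search(chunk)
        if not m:
            break
        suffixes.insert(0, m.group())
        chunk = chunk[:m.start()]
    return ([chunk] if chunk else []) + suffixes

def tokenize(text):
    # Each chunk is processed independently: the tokenizer cannot know
    # whether "1998." ends a sentence or "2." marks an ordinal.
    return [tok for chunk in text.split() for tok in tokenize_chunk(chunk)]

print(tokenize("He finished in the 2. place."))
# → ['He', 'finished', 'in', 'the', '2', '.', 'place', '.']
```

Note that "2." is split exactly like "place." here: without sentence boundaries, the two cases cannot be distinguished by suffix rules alone.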
I'm adding custom rules for the tokenization of Slovene. In Slovene, a number followed by a dot denotes an ordinal (the same as n-th in English, so 4. means fourth or 4th). My first thought was simply not to split numbers and dots. But sometimes the number appears at the end of the sentence, and in that case I always want to split it, regardless of whether it denotes an ordinal or a plain number.
Example (I will write the ordinals in Slovene notation):
TEXT: He finished in the 2. place.
TOKENIZATION: He - finished - in - the - 2. - place - .
TEXT: It was built in 1998.
TOKENIZATION: It - was - built - in - 1998 - .
Is there a way to keep numbers and dots together unless they appear at the end of the sentence? I would like to solve this with the files in spacy/lang, without running custom code, so that the tokenization works out of the box.
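As noted in the reply above, the spacy/lang files alone cannot do this, because they have no access to sentence boundaries. A custom-code workaround is a two-pass approach: keep digit+dot sequences together during tokenization, then split the dot off the final token of each sentence. This is a stdlib-only sketch with hypothetical helper names, not a spaCy API:

```python
import re

# Pass 1: tokenize one sentence, keeping "<digits>." together as an
# ordinal candidate; other word characters and punctuation split as usual.
TOKEN = re.compile(r"\d+\.|\w+|[^\w\s]")

def tokenize_keep_ordinals(sentence):
    return TOKEN.findall(sentence)

# Pass 2: a trailing "<digits>." at the end of a sentence is a plain
# number followed by the sentence-final period, so split it there.
def fix_sentence_end(tokens):
    if tokens and re.fullmatch(r"\d+\.", tokens[-1]):
        return tokens[:-1] + [tokens[-1][:-1], "."]
    return tokens

print(fix_sentence_end(tokenize_keep_ordinals("He finished in the 2. place.")))
# → ['He', 'finished', 'in', 'the', '2.', 'place', '.']
print(fix_sentence_end(tokenize_keep_ordinals("It was built in 1998.")))
# → ['It', 'was', 'built', 'in', '1998', '.']
```

This matches both desired tokenizations from the examples, but it only works once the text has already been split into sentences, which is exactly the information the out-of-the-box tokenizer lacks.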