Tokenizing Special cases with whitespace #13118
-
I want to tell the tokenizer not to split tokens like pi 123 into two separate tokens, but to treat them as one token. I tried a custom infix 'PI \d+', but apparently infixes are applied after whitespace tokenisation. What worked was adding pi 123 as a special case (see the sketch below):
But since there are a lot of combinations, I need to use regular expressions.
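For reference, a minimal sketch of that exact-string special case approach, assuming spaCy v3; the blank English pipeline and the "pi 123" string are just illustrative:

```python
import spacy

nlp = spacy.blank("en")

# Register the exact string, including the space, as a single-token special case.
nlp.tokenizer.add_special_case("pi 123", [{"ORTH": "pi 123"}])

doc = nlp("the constant pi 123 is defined here")
print([t.text for t in doc])
# ['the', 'constant', 'pi 123', 'is', 'defined', 'here']
```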
Replies: 1 comment
-
The tokenizer only supports exact string matches as special cases. Instead, you can add a custom component at the beginning of your pipeline that matches based on regexes and uses the retokenizer to adjust the tokenization before the rest of the components are run: https://spacy.io/usage/linguistic-features#retokenization
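A minimal sketch of such a component, assuming spaCy v3; the component name `merge_pi_numbers` and the `pi \d+` pattern are illustrative:

```python
import re

import spacy
from spacy.language import Language


@Language.component("merge_pi_numbers")
def merge_pi_numbers(doc):
    # Find "pi <digits>" matches by character offset, map them to token
    # spans, then merge each span into a single token.
    spans = []
    for match in re.finditer(r"\bpi \d+\b", doc.text):
        span = doc.char_span(match.start(), match.end())
        if span is not None:  # None if offsets don't align with token boundaries
            spans.append(span)
    with doc.retokenize() as retokenizer:
        for span in spans:
            retokenizer.merge(span)
    return doc


nlp = spacy.blank("en")
nlp.add_pipe("merge_pi_numbers", first=True)

doc = nlp("the value pi 123 and pi 456 appear here")
print([t.text for t in doc])
# ['the', 'value', 'pi 123', 'and', 'pi 456', 'appear', 'here']
```

Because the component runs first, all later pipeline components see the merged tokens.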