Tokenizer exception with regex does not work as expected #11327
I want spaCy to recognize CVE identifiers as single tokens. I wrote a regex for this pattern and added it to the tokenizer via `token_match`. This works fine when there is whitespace between the words, and I used the `tokenizer.explain` function to look at what happens. Why does spaCy not recognize the pattern in a string without whitespace? I tried the regex on its own with the given string and it works. Could you please suggest a working solution to this problem? Thank you in advance!
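For reference, here is a minimal sketch of the setup described above, assuming a hypothetical CVE-style regex (the exact pattern and example strings from the original post are not shown):

```python
import re
import spacy

nlp = spacy.blank("en")

# Hypothetical CVE pattern; the exact regex from the post is not shown here.
cve_re = re.compile(r"CVE-\d{4}-\d{4,}")

# token_match keeps a matching string from being split by infix rules.
nlp.tokenizer.token_match = cve_re.match

# Works when the CVE id is delimited by whitespace:
print([t.text for t in nlp("see CVE-2021-44228 for details")])
# ['see', 'CVE-2021-44228', 'for', 'details']

# With no whitespace, the id is not split out of the larger string:
print([t.text for t in nlp("seeCVE-2021-44228fordetails")])

# tokenizer.explain shows which rule produced each token:
print(nlp.tokenizer.explain("seeCVE-2021-44228fordetails"))
```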
Replies: 1 comment 1 reply
`token_match` is just designed to prevent things from being split if they would otherwise be split by infixes; it doesn't split strings out of larger tokens. You might be able to add your regex to the infixes list to get the behaviour you want, but looking at strings like yours, where there are a bunch of concatenated tokens, I think the best solution will probably be a preprocessing step that runs a regex over your raw text and surrounds any CVE strings with spaces.
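A minimal sketch of that preprocessing approach, again assuming a hypothetical CVE regex (swap in the pattern from the question):

```python
import re
import spacy

nlp = spacy.blank("en")

# Hypothetical CVE pattern; swap in the regex from the question.
cve_re = re.compile(r"CVE-\d{4}-\d{4,}")

# Keep whitespace-delimited CVE ids whole once they are separated.
nlp.tokenizer.token_match = cve_re.match

def presplit_cves(text: str) -> str:
    """Surround every CVE match with spaces as a preprocessing step."""
    spaced = cve_re.sub(lambda m: f" {m.group(0)} ", text)
    # Collapse doubled spaces so they don't become whitespace tokens.
    return re.sub(r" {2,}", " ", spaced).strip()

raw = "seeCVE-2021-44228fordetails"
print([t.text for t in nlp(presplit_cves(raw))])
# ['see', 'CVE-2021-44228', 'fordetails']
```

Setting `token_match` as well keeps the id whole after it has been separated, since the default infix rules would otherwise likely split it on the hyphen between the two digit groups.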