Tokenizer exception with regex does not work as expected #11327
I want spaCy to recognize CVE identifiers as single tokens. I wrote a regex for this pattern and added it to the tokenizer via `token_match`. This works fine when there is whitespace between the words, and I used the `tokenizer.explain` function to look at what happens. Why does spaCy not recognize the pattern in a string without whitespace? I tried the regex on its own with the given string and it works. Could you please suggest a working solution to this problem? Thank you in advance!
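For reference, here is a minimal sketch of the setup described above, assuming a hypothetical CVE-style regex (the exact pattern and example strings from the original post are not shown):

```python
import re
import spacy

nlp = spacy.blank("en")

# Hypothetical CVE pattern; the exact regex from the post is not shown here.
cve_re = re.compile(r"CVE-\d{4}-\d{4,}")

# token_match keeps a matching string from being split by infix rules.
nlp.tokenizer.token_match = cve_re.match

# Works when the CVE id is delimited by whitespace:
print([t.text for t in nlp("see CVE-2021-44228 for details")])
# ['see', 'CVE-2021-44228', 'for', 'details']

# With no whitespace, the id is not split out of the larger string:
print([t.text for t in nlp("seeCVE-2021-44228fordetails")])

# tokenizer.explain shows which rule produced each token:
print(nlp.tokenizer.explain("seeCVE-2021-44228fordetails"))
```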
Replies: 1 comment 1 reply
`token_match` is just designed to prevent things from being split if they would otherwise be split by infixes; it doesn't split strings out of larger tokens. You might be able to add your regex to the infixes list to get the behaviour you want, but looking at strings like yours, where there are a bunch of concatenated tokens, I think the best solution will probably be a preprocessing step that runs a regex over your raw text and surrounds any CVE strings with spaces.
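A minimal sketch of that preprocessing approach, again assuming a hypothetical CVE regex (swap in the pattern from the question):

```python
import re
import spacy

nlp = spacy.blank("en")

# Hypothetical CVE pattern; swap in the regex from the question.
cve_re = re.compile(r"CVE-\d{4}-\d{4,}")

# Keep whitespace-delimited CVE ids whole once they are separated.
nlp.tokenizer.token_match = cve_re.match

def presplit_cves(text: str) -> str:
    """Surround every CVE match with spaces as a preprocessing step."""
    spaced = cve_re.sub(lambda m: f" {m.group(0)} ", text)
    # Collapse doubled spaces so they don't become whitespace tokens.
    return re.sub(r" {2,}", " ", spaced).strip()

raw = "seeCVE-2021-44228fordetails"
print([t.text for t in nlp(presplit_cves(raw))])
# ['see', 'CVE-2021-44228', 'fordetails']
```

Setting `token_match` as well keeps the id whole after it has been separated, since the default infix rules would otherwise likely split it on the hyphen between the two digit groups.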