
`token_match` is only designed to prevent things from being split when they would otherwise be split by infixes; it doesn't split strings out of larger tokens. You might be able to add your regex to the infixes list to get the behaviour you want, but for strings like the one you have, where a number of tokens are concatenated together, I think the best solution will be a preprocessing step that runs a regex over your raw text and surrounds any CVE strings with spaces.
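As a rough sketch of that preprocessing step (the exact CVE regex and the blank English pipeline are assumptions, substitute your own pattern and pipeline):

```python
import re
import spacy

# Assumed CVE pattern; adjust to your data (e.g. lowercase "cve" or longer IDs).
CVE_RE = re.compile(r"CVE-\d{4}-\d{4,}")

def add_spaces_around_cves(text: str) -> str:
    """Surround every CVE identifier with spaces so the tokenizer
    sees it as a separate token even inside concatenated strings."""
    return CVE_RE.sub(lambda m: f" {m.group(0)} ", text)

nlp = spacy.blank("en")  # or your own pipeline

raw = "Patched issues:CVE-2021-44228,CVE-2022-22965(see advisory)"
doc = nlp(add_spaces_around_cves(raw))
print([t.text for t in doc])
# The CVE strings now come out as standalone tokens.
```

Since this only touches the raw text before it reaches the tokenizer, you don't need to modify the tokenizer settings at all.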
