Tokenizer not splitting by infix in some cases #13085
-
How to reproduce the behaviour
I would like the tokenizer to split on nearly any punctuation symbol, but I am running into issues in some odd cases. I initialize the tokenizer this way:
But although the dot is set as an infix, I get this:
I can't understand why '.2014' is output as a single token and is not split into '.' and '2014'. Is there something weird going on, or am I missing something? Any help is appreciated.
Your Environment
Replies: 3 comments
-
The infix matching skips matches that start at index 0 in the token string. Could you match this as a prefix instead (probably still in addition to the infix matching)?
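To make the rule above concrete, here is a toy model in plain Python. It is only a sketch of the behaviour described in this reply, not spaCy's actual implementation. The function name `split_chunk` and the two patterns are hypothetical and chosen for illustration.

```python
import re

PREFIX = re.compile(r"^\.")  # hypothetical prefix rule: a leading dot
INFIX = re.compile(r"\.")    # hypothetical infix rule: any dot


def split_chunk(chunk, prefix=None, infix=INFIX):
    """Toy model of splitting one whitespace-delimited chunk."""
    tokens = []
    # 1. Strip prefixes from the front, one match at a time.
    while prefix is not None and chunk:
        m = prefix.search(chunk)
        if m is None or m.end() == 0:
            break
        tokens.append(chunk[:m.end()])
        chunk = chunk[m.end():]
    # 2. Split the remainder on infixes, skipping any match that
    #    begins exactly where the current piece begins.
    start = 0
    for m in infix.finditer(chunk):
        if m.start() == start:
            continue  # nothing precedes the match, so it is not an infix
        tokens.append(chunk[start:m.start()])
        tokens.append(m.group())
        start = m.end()
    if chunk[start:]:
        tokens.append(chunk[start:])
    return tokens


print(split_chunk(".2014"))                 # ['.2014']
print(split_chunk(".2014", prefix=PREFIX))  # ['.', '2014']
print(split_chunk("31.12.2014"))            # ['31', '.', '12', '.', '2014']
```

With only the infix rule, the dot at index 0 is skipped and '.2014' survives as one token. Adding the prefix rule removes the leading dot before the infix step runs.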
-
Let me convert this to a discussion...
-
You are right. After adding the dot as a prefix as well, it worked as expected:
Thank you very much!
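The configuration itself is not preserved in this thread. For readers with the same problem, here is a minimal sketch of the kind of setup described, assuming spaCy v3 and a tokenizer built from scratch with only a dot rule. The variable names and the sample text are illustrative and are not the original poster's code.

```python
import re

import spacy
from spacy.tokenizer import Tokenizer

nlp = spacy.blank("en")

dot_infix = re.compile(r"\.")    # a dot anywhere inside a chunk
dot_prefix = re.compile(r"^\.")  # a dot at the very start of a chunk

# Tokenizer with the dot registered as an infix only.
infix_only = Tokenizer(nlp.vocab, infix_finditer=dot_infix.finditer)

# Tokenizer with the dot registered as both a prefix and an infix.
with_prefix = Tokenizer(
    nlp.vocab,
    prefix_search=dot_prefix.search,
    infix_finditer=dot_infix.finditer,
)

text = "released .2014"
print([t.text for t in infix_only(text)])
print([t.text for t in with_prefix(text)])
```

The first tokenizer should leave '.2014' intact, while the second should split it into '.' and '2014'. In a real pipeline you would more likely extend `nlp.Defaults.prefixes` and `nlp.Defaults.infixes` and compile them with `spacy.util.compile_prefix_regex` and `spacy.util.compile_infix_regex`, so that the default rules are kept.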