Custom Tokenizer Help #12018
Hi, I am creating a custom tokenizer in Spanish to execute the
I did this by adding all the available symbols as prefixes, infixes, and suffixes using regex:
NOTE: I'm not sure whether this regex syntax does exactly what I want. But in cases like this text fragment: I am pretty sure this happens because it interprets
Replies: 1 comment 4 replies
This is a bit tricky because what you want to do here is remove an existing special case.
First, you can figure out what's going on using the tokenizer's explain method. In its output, SPECIAL refers to a tokenizer exception in this case.
You can remove a tokenizer exception by modifying tokenizer.rules.
In your case, it sounds like you might want to do this for every lowercase letter. That should be easy to do in a for loop, but if you want to remove more exceptions then it might be better to make your own language subclass where you control the exceptions directly. See the guide to language subclassing or the Spanish definition for an example of what that looks like.
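The two steps in this reply (inspect with explain, then edit tokenizer.rules) can be sketched roughly as below. This is only a sketch under a few assumptions: spaCy v3 is installed, a blank Spanish pipeline is enough to reproduce the behaviour, and the exceptions to drop are the single-letter abbreviations "a." through "z.". The sample text is made up for illustration, since the original fragment from the question is not shown in the thread.

```python
import string

import spacy

# Blank Spanish pipeline: tokenizer only, no trained model download needed.
nlp = spacy.blank("es")

# Hypothetical sample text (assumption; not the fragment from the question).
text = "Opción a. Opción b."

# explain() returns (pattern_name, token_text) pairs. Pieces matched by a
# tokenizer exception are labelled with a name starting with "SPECIAL".
before = nlp.tokenizer.explain(text)
print(before)

# Copy the rules, drop "<letter>." for every ASCII lowercase letter, then
# assign the dict back so the tokenizer reloads its special cases.
rules = dict(nlp.tokenizer.rules)
for letter in string.ascii_lowercase:
    rules.pop(f"{letter}.", None)  # no error if a letter has no rule
nlp.tokenizer.rules = rules

# Run explain() again to see how the same text is handled now.
after = nlp.tokenizer.explain(text)
print(after)
```

Comparing the two printed lists should show whether the special case was what kept the letter and the period together. If you also need to drop accented or uppercase variants, extend the loop accordingly, or switch to a language subclass as suggested above.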