English tokenizer's behavior with "/" characters #10425
-
Can someone explain the logic behind how spaCy's English tokenizer handles "words" that contain '/' characters? In the example I tried, it splits a serial number on the first '/' but not on the second. Having also observed that

"12/AB/A3456-7" => [12] [/] [AB] [/] [A3456] [-] [7]

my current hypothesis is that the shape "XX/d*" won't be split, while "XX/[xX]*" will. What is the motivation behind this behavior? I also have to deal with "//" as separators, which don't seem to be split at all, so I guess I'll have to set up a custom tokenizer to handle this kind of case.
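For reference, a minimal sketch for reproducing this, assuming spaCy v3 (a blank English pipeline is enough, since tokenization is rule-based and needs no trained model; "AB//CD" is a made-up stand-in for my double-slash separators):

```python
import spacy

# Only the tokenizer is exercised here, so no trained model is needed.
nlp = spacy.blank("en")

for text in ["12/AB/A3456-7", "12/01/71", "AB//CD"]:
    doc = nlp(text)
    # token.shape_ shows the X/x/d shape notation used in the hypothesis above
    print(text, "=>", [(t.text, t.shape_) for t in doc])
```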
-
It's hard to say what the right tokenization is for complex sequences like this. The English tokenizer was designed to handle slashes with a particular awareness of dates, like 12/01/71, which are easier to handle if treated as a single token. If your data uses slashes in a particular way, or less predictably, it could certainly make sense to customize the tokenizer behavior there.
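A minimal sketch of one way to customize it, assuming spaCy v3: extend the default infix patterns and recompile them with spacy.util.compile_infix_regex. The added pattern here (split on any run of slashes between alphanumeric characters) is illustrative, not a spaCy default:

```python
import spacy
from spacy.util import compile_infix_regex

nlp = spacy.blank("en")

# Extend the default English infix patterns with a rule that splits on
# one or more '/' characters between alphanumeric characters. The exact
# pattern is an assumption for illustration; tighten it to fit your data.
infixes = list(nlp.Defaults.infixes) + [r"(?<=[0-9A-Za-z])/+(?=[0-9A-Za-z])"]
nlp.tokenizer.infix_finditer = compile_infix_regex(infixes).finditer

print([t.text for t in nlp("12/AB/A3456-7")])  # slashes split off
print([t.text for t in nlp("AB//CD")])         # "//" now split as well
```

Note that this broad pattern would also split dates like 12/01/71 into [12] [/] [01] [/] [71]; if you want to keep the default date behavior, exclude digit/digit contexts from the pattern.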