Incorrect infix tokenization of /" #10001
I came across a case where tokenization seems to fail on an adjacent slash-quote between two words, such as ALPHA/"BRAVO" or ALPHA/"BRAVO CHARLIE". This is easy enough to preprocess (insert a space between the adjacent slash and quote); I'm just unsure whether this is a bug or expected behaviour. If it looks like a bug I'm happy to submit a full bug report; I just don't want to waste anybody's time with a false alarm.
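A minimal sketch of what I'm seeing, assuming spaCy's default slash infix behaves like the simplified regex below (the real rule lives in spacy/lang/punctuation.py and uses broader character classes, so this is only an approximation):

```python
import re

# Simplified stand-in for spaCy's default "/" infix rule:
# "/" counts as an infix only when preceded by a letter or digit
# AND followed by a letter.
INFIX_SLASH = re.compile(r'(?<=[A-Za-z0-9])/(?=[A-Za-z])')

print(bool(INFIX_SLASH.search('ALPHA/BRAVO')))    # True: "/" would split here
print(bool(INFIX_SLASH.search('ALPHA/"BRAVO"')))  # False: "/" is followed by a quote
print(bool(INFIX_SLASH.search('01/01/2022')))     # False: "/" is followed by a digit
```

Under this approximation, the `/` in ALPHA/"BRAVO" is never treated as an infix because the character after it is a quote rather than a letter, which matches the behaviour I observed.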
Replies: 1 comment 1 reply
This is the expected behavior for the current English tokenizer defaults. It currently only splits on / as an infix between alpha+digit and alpha:

spaCy/spacy/lang/punctuation.py Line 44 in 5ba4171

I think these defaults are intended to treat dates like 01/01/2022 differently from ABC/DEF. You can certainly customize these settings for your own model, see: https://spacy.io/usage/linguistic-features#native-tokenizer-additions. Since this isn't the exact same tokenization the pipeline was trained on, you might see a few more errors in tags and parses if you change this for en_core_web_sm, but it probably only leads to minor di…
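A sketch of that customization, assuming you want / to also split when it sits between a letter and a quote character (the `[A-Za-z]` and `"` classes here are deliberate simplifications; spaCy's own rules use Unicode-aware character classes):

```python
import spacy
from spacy.util import compile_infix_regex

# A blank English pipeline for illustration; the same tokenizer tweak
# applies to a loaded pipeline such as en_core_web_sm.
nlp = spacy.blank("en")

# Extend the default infix patterns with a rule that splits "/" when it
# is preceded by a letter and followed by a quote (an assumed requirement).
infixes = list(nlp.Defaults.infixes) + [r'(?<=[A-Za-z])/(?=")']
nlp.tokenizer.infix_finditer = compile_infix_regex(infixes).finditer

print([t.text for t in nlp('ALPHA/"BRAVO"')])
```

With the extra pattern, ALPHA and / come out as separate tokens instead of one long token, at the cost of diverging slightly from the tokenization a pretrained pipeline was trained on.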