Pipeline issues when text has extra spaces between words #10726
Unanswered
jordireinsma asked this question in Help: Coding & Implementations
Replies: 1 comment 2 replies
-
The only way to get exactly the same results is to preprocess the text for spaCy while keeping track of the modifications, and then map the annotations back to the version of the text with the extra whitespace.
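For illustration, here is a minimal sketch of that approach in plain Python. The helper names (`collapse_whitespace`, `to_original_span`, `annotate_original`) are made up for this example and are not part of spaCy; the only spaCy attributes it relies on are `Token.idx`, `Token.text` and `Token.pos_`. It assumes that collapsing every whitespace run to a single space (and dropping leading/trailing whitespace) is an acceptable normalization for your data.

```python
def collapse_whitespace(text):
    """Collapse each whitespace run to a single space and strip the ends.

    Returns the cleaned text plus a list `offsets`, where offsets[i] is the
    index in the original text of the i-th character of the cleaned text.
    """
    clean_chars = []
    offsets = []
    prev_space = True  # start "after a space" so leading whitespace is dropped
    for i, ch in enumerate(text):
        if ch.isspace():
            if not prev_space:
                clean_chars.append(" ")
                offsets.append(i)
            prev_space = True
        else:
            clean_chars.append(ch)
            offsets.append(i)
            prev_space = False
    # Drop a single trailing space left over from trailing whitespace.
    if clean_chars and clean_chars[-1] == " ":
        clean_chars.pop()
        offsets.pop()
    return "".join(clean_chars), offsets


def to_original_span(start, end, offsets):
    """Map a non-empty [start, end) span of the cleaned text to the original text."""
    return offsets[start], offsets[end - 1] + 1


def annotate_original(nlp, text):
    """Run `nlp` on the cleaned text and report tokens with original offsets.

    `nlp` is assumed to be a loaded spaCy pipeline, e.g.
    spacy.load("pt_core_news_sm"); it is passed in so this sketch does not
    import spaCy itself.
    """
    clean, offsets = collapse_whitespace(text)
    doc = nlp(clean)
    results = []
    for token in doc:
        start, end = to_original_span(
            token.idx, token.idx + len(token.text), offsets
        )
        results.append((text[start:end], token.pos_, start, end))
    return results
```

With this, the pipeline only ever sees single spaces, so its predictions should not depend on how many spaces the raw text had, while the returned character offsets still point into the raw text.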
-
I'm currently stuck on this problem, which comes from the way tokenization works (pt_core_news_sm model):
How do we customize the spaCy Tokenizer to use that `filter_spaces`, so that the following pipeline components are not affected by the SPACE token in the Doc object? Or, in other words, how can we be sure that `get_info` returns the same results for non-space tokens regardless of extra spaces in the text?