SpanRuler on Retokenized tokens links back to original token text, not the token text with a split (space) introduced. #13404
Unanswered
dextde asked this question in Help: Coding & Implementations
Replies: 0 comments
Hi,
I add a custom tokenizer `splitter` as the first stage. It correctly splits the single token into two tokens. I then detect the two split tokens using a `SpanRuler`. Note that the SpanRuler matches a pattern of two separate tokens (i.e. `pattern=['abc', 'efg']`) and correctly detects nothing if the pattern is the original single token (`pattern='abcefg'`).

Problem: when I print the span text from the SpanRuler, it is the original single token's text, not the text of the two retokenized tokens (i.e. with a space in between).
Note that the custom retokenizer respects spaCy's non-destructive retokenization.
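A minimal sketch of the situation, under stated assumptions: the `Doc` is built directly from the two tokens with no trailing whitespace, as a stand-in for the output of the custom splitter (whose code is not shown here), and the label `DEMO` is hypothetical. It suggests why the span text has no space: `span.text` is reconstructed from each token's text plus its trailing whitespace, and a non-destructive split leaves that whitespace empty.

```python
import spacy
from spacy.tokens import Doc

nlp = spacy.blank("en")

# Assumption: stand-in for the splitter's result. Two tokens, neither
# followed by whitespace, so doc.text is still the original "abcefg".
doc = Doc(nlp.vocab, words=["abc", "efg"], spaces=[False, False])

# SpanRuler with a two-token pattern; "DEMO" is a hypothetical label.
ruler = nlp.add_pipe("span_ruler")
ruler.add_patterns(
    [{"label": "DEMO", "pattern": [{"ORTH": "abc"}, {"ORTH": "efg"}]}]
)
doc = ruler(doc)

span = doc.spans["ruler"][0]  # "ruler" is the default spans_key

# span.text mirrors the original text, because whitespace_ is "" for "abc".
span_text = span.text
print(span_text)  # abcefg

# Joining the token texts explicitly gives the space-separated form.
joined = " ".join(token.text for token in span)
print(joined)  # abc efg
```

If the space-separated form is what is needed downstream, joining the token texts as above may be simpler than changing the tokenization, since it leaves `doc.text` identical to the input.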
Actual Output:
Expected output:
Thanks for any help.