doc.set_ents uses tokens but Span uses char #7130
I'm trying to remove trailing special characters from entities. However, `doc.set_ents` works with token indices, while the `Span` offsets I'm using are character counts. What's a good way to resolve this conflict?
Hi! This is a good question for the discussion forum, so I'll move it there. You might get a message that this thread was closed/locked, but we can still continue the conversation on the open thread there, and it should automatically forward you.
EDIT: (Thanks to @adrianeboyd for pointing out an error in my original response) I am certain that you are long past this question, but your pattern in |
You can call `ent.start` and `ent.end` instead of `ent.start_char` and `ent.end_char` to obtain the token indices of the entity instead of the char indices.

Also, please note that your custom component "remove_specials" should return the `doc` at the end of its processing.

Finally, I'm not sure this function will work as you intend it when you apply it on a doc with multiple entities in `doc.ents`, because `doc.set_ents` always overwrites the entire set of entities. Instead, you probably want to build up the list of new entities and call `set_ents` once on the `doc` right before returning it.
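
To make the three points above concrete, here is a minimal sketch of how such a component might look. The original `remove_specials` code is not shown in this thread, so the `SPECIALS` set and the trimming rule (drop trailing tokens that consist only of those characters) are assumptions for illustration, not the asker's actual logic. It assumes spaCy v3.

```python
from spacy.language import Language
from spacy.tokens import Doc, Span
from spacy.vocab import Vocab

# Hypothetical set of "special" characters; adjust to your data.
SPECIALS = {"-", ":", ",", ".", ";", "/"}


@Language.component("remove_specials")
def remove_specials(doc: Doc) -> Doc:
    new_ents = []
    for ent in doc.ents:
        # ent.start / ent.end are token indices (end is exclusive),
        # unlike ent.start_char / ent.end_char.
        end = ent.end
        # Walk back over trailing tokens made up only of special characters.
        while end > ent.start and all(ch in SPECIALS for ch in doc[end - 1].text):
            end -= 1
        # Keep the entity only if at least one token survives.
        if end > ent.start:
            new_ents.append(Span(doc, ent.start, end, label=ent.label_))
    # Single call: set_ents replaces the whole entity set.
    doc.set_ents(new_ents)
    # A pipeline component must return the doc.
    return doc


# Small demo on a hand-built Doc, so no trained model is needed.
doc = Doc(Vocab(), words=["Acme", "Corp", "-", "hired", "Jane", "Doe", ",", "today"])
doc.set_ents([Span(doc, 0, 3, label="ORG"), Span(doc, 4, 7, label="PERSON")])
doc = remove_specials(doc)
print([(ent.text, ent.label_) for ent in doc.ents])
```

In a real pipeline you would register it after the entity recognizer, for example with `nlp.add_pipe("remove_specials", after="ner")`. Note that this sketch only trims whole trailing tokens; if a special character is attached to the last word inside a single token, token-level trimming cannot remove it.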