Span alignment in spacy-huggingface-pipelines #12998
-
Hi,
I understand the issues when aligning spaCy tokens with transformers tokens. From my understanding, … In addition, when using the default alignment mode (`strict`), many entities are not returned because of this issue.
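To make the alignment modes concrete, here is a minimal pure-Python sketch of how a character span predicted by a model could be snapped onto token boundaries. It only mimics the `strict` / `contract` / `expand` semantics documented for spaCy's `Doc.char_span`; it is not the library's implementation. The function name `align_char_span` and the example token offsets are hypothetical.

```python
from typing import List, Optional, Tuple


def align_char_span(
    token_offsets: List[Tuple[int, int]],
    start: int,
    end: int,
    mode: str = "strict",
) -> Optional[Tuple[int, int]]:
    """Map the character span [start, end) to a token span [i, j), or None.

    token_offsets holds one (start_char, end_char) pair per token, in order.
    """
    if mode == "strict":
        # Both character boundaries must coincide with token boundaries.
        starts = {s: i for i, (s, _) in enumerate(token_offsets)}
        ends = {e: i + 1 for i, (_, e) in enumerate(token_offsets)}
        if start in starts and end in ends and starts[start] < ends[end]:
            return starts[start], ends[end]
        return None
    if mode == "contract":
        # Keep only tokens that lie completely inside the character span.
        hits = [i for i, (s, e) in enumerate(token_offsets) if s >= start and e <= end]
    elif mode == "expand":
        # Keep every token that overlaps the character span at all.
        hits = [i for i, (s, e) in enumerate(token_offsets) if s < end and e > start]
    else:
        raise ValueError(f"unknown alignment mode: {mode}")
    return (hits[0], hits[-1] + 1) if hits else None


# Assumed tokenization of "Ana Carrasco." in which the final period stays
# attached to the name: "Ana" = (0, 3), "Carrasco." = (4, 13).
offsets = [(0, 3), (4, 13)]

# Suppose the model predicts the entity "Carrasco" at characters 4..12.
strict_result = align_char_span(offsets, 4, 12, "strict")
contract_result = align_char_span(offsets, 4, 12, "contract")
expand_result = align_char_span(offsets, 4, 12, "expand")
```

In this assumed case the predicted span ends in the middle of a token, so `strict` and `contract` both drop it, while `expand` grows it to cover the whole token.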
Replies: 1 comment 3 replies
-
Underneath, spans are defined over tokens rather than over characters, so there can still be misalignments with spans.

I think what you might be seeing with `expand` is that there's a previous annotation that's already been expanded over the token `Carrasco`, and to make the processing and output the same for `doc.ents` and `doc.spans`, currently this component won't return overlapping annotation. Also, none of the underlying models produce overlapping annotation, so I think that would be unexpected.

If you don't care about the tokenization otherwise and just want the character span results, you could replace the default tokenizer with a character tokenizer. I think at that point there's a good chance that you don't get much advantage from going through spaCy, but maybe it's useful? If you wanted to try it out, since it could at least be interesting for debugging, a character tokenizer is very simple to write.
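As a rough illustration of the character-tokenizer idea, a minimal sketch might look like the following. This is not the code from the original reply; the class name `CharTokenizer`, the example text, and the `PER` label are assumptions made for illustration. It relies only on the standard `Doc(vocab, words=..., spaces=...)` constructor and `Doc.char_span`.

```python
import spacy
from spacy.tokens import Doc


class CharTokenizer:
    """Hypothetical tokenizer: every character becomes its own token."""

    def __init__(self, vocab):
        self.vocab = vocab

    def __call__(self, text):
        words = list(text)
        # No token carries trailing whitespace; whitespace characters are
        # tokens themselves, so doc.text reproduces the input exactly.
        spaces = [False] * len(words)
        return Doc(self.vocab, words=words, spaces=spaces)


# A blank pipeline is enough to try this out; no trained model is needed.
nlp = spacy.blank("en")
nlp.tokenizer = CharTokenizer(nlp.vocab)

text = "Ana Carrasco."
doc = nlp(text)

# With one token per character, every character offset is a token boundary,
# so even strict alignment can represent the span for characters 4..12.
span = doc.char_span(4, 12, label="PER", alignment_mode="strict")
```

With this tokenizer, token indices and character offsets coincide, which is why it can be handy for debugging alignment, at the cost of losing any meaningful word-level tokenization.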