Span alignment in spacy-huggingface-pipelines #12998

omri374 · 2023-09-21T07:59:41Z

Hi,
I'm using spacy-huggingface-pipelines to plug a transformers NER model into a spaCy pipeline.
I'm getting warnings about entities not aligned in docs (using alignment_mode=expand):

Evaluating <class 'spacy_huggingface_pipelines/token_classification.py:129: UserWarning: Skipping annotation, 
{'entity_group': 'LOC', 'score': 0.66918, 'word': 'co', 'start': 66, 'end': 68} is overlapping or can't be aligned for doc 
'I missed flight A415, can I board flight B415 from Barra de Carrasco to Port Taiyoumouth?'

I understand the issues when aligning between spaCy tokens and transformers tokens. From my understanding, spacy-huggingface-pipelines is using Doc.spans and not tokens. Is that correct? If yes, why do we still see those alignment issues?

In addition, when using the default alignment mode (strict), many entities are not returned because of this issue.

Answered by adrianeboyd

adrianeboyd · 2023-09-21T09:50:34Z

Under the hood, spans are defined over tokens rather than over characters, so misalignments can still occur with spans.

I think what you might be seeing with expand is that a previous annotation has already been expanded over the token Carrasco. To keep the processing and output the same for doc.ents and doc.spans, this component currently won't return overlapping annotations. Also, none of the underlying models produce overlapping annotations, so I think that would be unexpected.
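To make the mechanism concrete, here is a minimal pure-Python sketch of token-based alignment. It is not the component's actual code: the regex tokenizer is a simplified stand-in for spaCy's tokenizer, the helper names (`token_offsets`, `align`, `resolve`) are hypothetical, and the first prediction (`Barra de Carras`, characters 51 to 66) is an assumed earlier model output used only to reproduce the overlap described above. The second prediction is the `co` span from the warning.

```python
import re
from typing import List, Optional, Tuple

CharSpan = Tuple[int, int]   # (start_char, end_char), end exclusive
TokenSpan = Tuple[int, int]  # (start_token, end_token), end exclusive

TEXT = ("I missed flight A415, can I board flight B415 "
        "from Barra de Carrasco to Port Taiyoumouth?")

# Hypothetical model output: the first span is assumed for illustration,
# the second is the 'co' span reported in the warning.
PREDICTIONS: List[CharSpan] = [(51, 66), (66, 68)]


def token_offsets(text: str) -> List[CharSpan]:
    """Simplified stand-in tokenizer: runs of word characters or single punctuation marks."""
    return [m.span() for m in re.finditer(r"\w+|[^\w\s]", text)]


def align(span: CharSpan, tokens: List[CharSpan], mode: str) -> Optional[TokenSpan]:
    """Map a character span onto token indices, or return None if it cannot be aligned."""
    start, end = span
    if mode == "strict":
        # Both character offsets must fall exactly on token boundaries.
        starts = {s: i for i, (s, _) in enumerate(tokens)}
        ends = {e: i for i, (_, e) in enumerate(tokens)}
        if start in starts and end in ends and starts[start] <= ends[end]:
            return (starts[start], ends[end] + 1)
        return None
    if mode == "expand":
        # Take every token that the character span touches.
        covered = [i for i, (s, e) in enumerate(tokens) if s < end and e > start]
        return (covered[0], covered[-1] + 1) if covered else None
    raise ValueError(f"unknown alignment mode: {mode}")


def resolve(preds: List[CharSpan], tokens: List[CharSpan], mode: str):
    """Keep aligned, non-overlapping spans in order; collect the rest as skipped."""
    accepted: List[TokenSpan] = []
    skipped: List[CharSpan] = []
    taken = set()
    for span in preds:
        aligned = align(span, tokens, mode)
        if aligned is None or taken & set(range(*aligned)):
            skipped.append(span)
            continue
        taken.update(range(*aligned))
        accepted.append(aligned)
    return accepted, skipped
```

With these inputs, strict mode skips both predictions because neither ends or starts on a token boundary, while expand mode keeps the first one (grown to cover `Barra de Carrasco`) and skips `co`, since expanding it would claim the `Carrasco` token a second time. In real spaCy code the corresponding call is `Doc.char_span(start, end, alignment_mode=...)`.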

If you don't otherwise care about the tokenization and just want the character span results, you could replace the default tokenizer with a character tokenizer. At that point there's a good chance you won't get much advantage from going through spaCy, but maybe it's useful? If you want to try it out, since it could at least be interesting for debugging, here's what that would look like (it's very simple):

https://github.com/explosion/spacy-experimental/blob/da581da6f4c5de8c63924642f7fc5f0bd281958c/spacy_experimental/char_tokenizer/char_pretokenizer.py#L9-L25
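As a rough illustration of why a character tokenizer removes the alignment problem, the sketch below compares token boundaries from a whitespace split with one-token-per-character boundaries. It is plain Python rather than the linked spaCy implementation, and the helper names (`whitespace_tokens`, `char_tokens`, `aligns_strictly`) are hypothetical.

```python
import re
from typing import List, Tuple

Span = Tuple[int, int]  # (start_char, end_char), end exclusive

TEXT = ("I missed flight A415, can I board flight B415 "
        "from Barra de Carrasco to Port Taiyoumouth?")


def whitespace_tokens(text: str) -> List[Span]:
    """Token boundaries from a plain whitespace split."""
    return [m.span() for m in re.finditer(r"\S+", text)]


def char_tokens(text: str) -> List[Span]:
    """One token per character, whitespace included."""
    return [(i, i + 1) for i in range(len(text))]


def aligns_strictly(span: Span, tokens: List[Span]) -> bool:
    """True if the span starts and ends exactly on token boundaries."""
    starts = {s for s, _ in tokens}
    ends = {e for _, e in tokens}
    return span[0] < span[1] and span[0] in starts and span[1] in ends


co_span = (66, 68)  # the 'co' span from the warning
print(aligns_strictly(co_span, whitespace_tokens(TEXT)))
print(aligns_strictly(co_span, char_tokens(TEXT)))
```

With one token per character, every non-empty character span inside the text falls on token boundaries, so nothing has to be skipped or expanded. The cost is that token-level attributes such as lemmas are no longer meaningful on such a document.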

3 replies

omri374 Sep 26, 2023
Author

Thank you for the detailed answer. The reason I'm going through spaCy is that I'd like to get several attributes out: NER (using Transformers), tokens, lemmas, etc. However, based on your answer, it seems the only way to get all of those would be to run them in parallel. Would that be the recommended approach for this case?

adrianeboyd Sep 27, 2023

If you want tokens and lemmas, then a normal spaCy pipeline makes sense. In most cases I think you can review a reasonable chunk of the warnings to see what's going on and consider some options:

- adjust the spaCy tokenizer to line up better with the wordpiece boundaries (typically split a bit more on punctuation)
- if the warnings are mostly cases like 'co' above and they don't affect the final results (much), then filter the warning for this task

The warning is a little noisy, but I also didn't want it to silently return different results than the plain huggingface pipeline. (If you have other suggestions, feedback is welcome!)
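For the second option, one way to filter the warning is Python's standard `warnings` module. The sketch below assumes the message always begins with `Skipping annotation` and is raised as a `UserWarning`, as in the log quoted in the question; the function name `suppress_alignment_warnings` is hypothetical.

```python
import warnings


def suppress_alignment_warnings() -> None:
    """Ignore UserWarnings whose message starts with 'Skipping annotation'.

    The `message` argument is a regular expression matched against the
    beginning of the warning text, so other UserWarnings are unaffected.
    """
    warnings.filterwarnings(
        "ignore",
        message=r"Skipping annotation",
        category=UserWarning,
    )


# Scoped usage: the filter only applies inside the `with` block,
# and the previous filter state is restored on exit.
with warnings.catch_warnings():
    suppress_alignment_warnings()
    # Run the pipeline here, e.g. docs = list(nlp.pipe(texts))
    pass
```

Scoping the filter with `warnings.catch_warnings()` keeps the warning visible elsewhere, which fits the concern above about not silently returning different results from the plain Hugging Face pipeline.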

omri374 Sep 27, 2023
Author

Thanks!
