Using merge_entities pipe with abbreviation detector #10658

sofdog-gh · 2022-04-14T13:35:58Z

Hi,

I have been having problems using the merge_entities pipe along with the abbreviation detector. I get the following error:

/usr/local/lib/python3.7/dist-packages/spacy/tokens/doc.pyx in spacy.tokens.doc.bounds_check()
**IndexError: [E026]** Error accessing token at position 129: out of bounds in Doc of length 120.

This is the part of the code that causes the error:

        for abbrev in doc._.abbreviations:
            long_form = abbrev._.long_form
            abbrev_key = str(abbrev).strip()  # this line raises the IndexError
            abbreviation_long_form[abbrev_key] = str(long_form).strip()

On a conceptual level, I understand that merging entities changes the length of the doc (the number of tokens, since entities made up of multiple words are merged into a single spaCy token), but on a technical level I don't know how to solve this. Also, the error is raised by the line abbrev_key = str(abbrev).strip(), and I don't understand why it is not raised earlier, when the for loop starts iterating.

This is what my pipeline looks like:

['tok2vec',
 'abbreviation_detector',
 'tagger',
 'attribute_ruler',
 'lemmatizer',
 'parser',
 'ner',
 'scispacy_linker',
 'merge_entities']

The model I am using: en_core_sci_md-0.4.0 (Scispacy)
OS is Linux, but the error occurs when running on Google Colab too.

Would be very grateful if someone could help me out. Thanks a lot.
S.

--- EDITS ---

FULL CODE:

import spacy
import scispacy
import en_core_sci_md
from scispacy.abbreviation import AbbreviationDetector

nlp = spacy.load("en_core_sci_md")

nlp.add_pipe("merge_entities")
nlp.add_pipe("abbreviation_detector", after='tok2vec')


abbreviation_long_form = {}
ents_info = {}

text = "Rift Valley Fever Virus (RVFV) is an emerging zoonotic pathogen transmitted to humans and livestock through mosquito bites, which was first isolated in Kenya in 1930. The virus is classified by the WHO among the pathogens for which there is an urgent need to develop research, diagnostics, and therapies. However, the efforts developed to control the virus remain limited, and the virus is not well characterized. In this article, we will introduce RVFV and then focus on its virulence factor, the nonstructural protein NSs. We will mainly discuss the ability of this viral protein to form amyloid-like fibrils and its implication in the neurotoxicity associated with RVFV infection."

doc = nlp(text)

for abbrev in doc._.abbreviations:
    long_form = abbrev._.long_form
    abbrev_key = str(abbrev).strip()
    abbreviation_long_form[abbrev_key] = str(long_form).strip()

for ent in doc.ents:
    #ent: <class 'spacy.tokens.span.Span'>
    ent_as_token = ent[0] #<class 'spacy.tokens.token.Token'>

    dep_tag = ent_as_token.dep_
    start_ch = int(ent.start_char)
    end_ch = int(ent.end_char)
    ent_tuple = (start_ch, end_ch, dep_tag)

    ent_key = str(ent).strip()
    
    if ent_key in ents_info:
        ents_info[ent_key].append(ent_tuple)
    else:
        ents_info[ent_key] = [ent_tuple]

FULL ERROR MESSAGE:

IndexError                                Traceback (most recent call last)
<ipython-input-15-6507bec5fcea> in <module>()
      7 for abbrev in doc._.abbreviations:
      8     long_form = abbrev._.long_form
----> 9     abbrev_key = str(abbrev).strip()
     10     abbreviation_long_form[abbrev_key] = str(long_form).strip()
     11 

5 frames
/usr/local/lib/python3.7/dist-packages/spacy/tokens/span.pyx in spacy.tokens.span.Span.__repr__()

/usr/local/lib/python3.7/dist-packages/spacy/tokens/span.pyx in spacy.tokens.span.Span.text.__get__()

/usr/local/lib/python3.7/dist-packages/spacy/tokens/span.pyx in spacy.tokens.span.Span.text_with_ws.__get__()

/usr/local/lib/python3.7/dist-packages/spacy/tokens/span.pyx in __iter__()

/usr/local/lib/python3.7/dist-packages/spacy/tokens/doc.pyx in spacy.tokens.doc.Doc.__getitem__()

/usr/local/lib/python3.7/dist-packages/spacy/tokens/doc.pyx in spacy.tokens.doc.bounds_check()

IndexError: [E026] Error accessing token at position 119: out of bounds in Doc of length 114.

Answered by polm


polm · 2022-04-15T04:49:12Z

Just as a note, scispacy is a separate project, and the abbreviation detector is a part of it. You might have more luck asking at their repo.

Also there's not enough information here to debug this I think.

What code did you actually run that gave you this error? Was it just an nlp call or accessing attributes later? Please include a sample we (or whoever) can run to reproduce your error.
Include the full stack trace (error output), not just the last part of it.

3 replies

sofdog-gh Apr 18, 2022
Author

Hi @polm,

Thanks a lot for wanting to help. I have updated my question so that it includes the full code (a sample that allows others to reproduce the problem) and the full error message. The problem arises when I use the merge_entities pipe together with the abbreviation_detector. The reason why I want to use the merge_entities pipe is that I need to extract the dependency tag of entities (and perhaps the POS tag too); I know both are linguistically imprecise for multiword entities, but I thought this would be helpful information for later Relation Extraction.

I have not found a discussion forum specifically for Scispacy - might try StackOverflow.

polm Apr 19, 2022

OK, it looks like the issue is that you're putting the abbreviation detector before entity merging. The abbreviation detector finds some token indices and saves them, and then merge_entities retokenizes the doc, so those indices aren't valid any more. If you remove the after arg from where you add the abbreviation detector, it gets added at the end of the pipeline (after merge_entities) and your code just works.
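
To make the stale-index problem concrete, here is a minimal pure-Python sketch. It is a toy model, not spaCy's or scispacy's actual implementation: ToyDoc and ToySpan are hypothetical names made up for illustration. It only assumes what is described above, namely that a span stores start/end token indices and looks the tokens up when its text is requested. That would also explain why the error shows up at str(abbrev) rather than when the for loop starts.

```python
# Toy model (hypothetical classes, not spaCy's real ones): a "doc" is a list
# of tokens and a "span" remembers only start/end indices into that list.
class ToyDoc:
    def __init__(self, words):
        self.tokens = list(words)

    def merge(self, start, end):
        """Replace tokens[start:end] with one merged token, shrinking the doc."""
        self.tokens[start:end] = [" ".join(self.tokens[start:end])]


class ToySpan:
    def __init__(self, doc, start, end):
        self.doc, self.start, self.end = doc, start, end  # indices only

    def text(self):
        # Tokens are looked up lazily, so stale indices fail only here.
        return " ".join(self.doc.tokens[i] for i in range(self.start, self.end))


doc = ToyDoc("We study Rift Valley Fever Virus ( RVFV )".split())

# "Abbreviation detection" runs first and records token indices 7..8.
abbrev = ToySpan(doc, 7, 8)
text_before = abbrev.text()

# "Entity merging" runs afterwards and collapses tokens 2..5 into one token.
doc.merge(2, 6)

# Iterating over stored spans is still fine; reading the text is what fails.
error = None
try:
    abbrev.text()
except IndexError as err:
    error = err

# A span created after the merge uses indices that match the new tokenization.
abbrev_after = ToySpan(doc, 4, 5)
text_after = abbrev_after.text()
```

In this toy version the doc shrinks from 9 tokens to 6, so the stored index 7 is out of range, which mirrors the E026 error. Creating the span after the merge corresponds to running the abbreviation detector after merge_entities.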

sofdog-gh Apr 20, 2022
Author

@polm Silly mistake, thank you so much!
