Getting start and end indices of entities with respect to the sentence within a spacy doc #10703

shrinidhin · 2022-04-25T12:13:37Z

Hi! I am trying to implement an entity linker for my custom-trained NER. I am using the benepar parser to split the sentences in my document and get the indices of each sentence. Now I need the indices of the entities relative to the sentence, so I used the following code:

current_span = d[index_start:index_end]
for entity in current_span.ents:
    entity_start = entity.start_char - entity.sent.start_char
    entity_end = entity.end_char - entity.sent.start_char
    ent_info['ent_indices'] = (entity_start, entity_end)

where current_span is the slice of the document (a sentence or part of one) whose indices are obtained from the benepar parser.
It seems to work correctly for a few instances, but for others the indices are incorrect. Is this the correct approach, or am I missing something?

Answered by ljvmiranda921

ljvmiranda921 · 2022-04-26T10:46:37Z

Hi @shrinidhin, can you provide a more complete example so we can see what's going on? It's difficult to pinpoint the problem from the sample code alone. One thing you can try is calling span.as_doc() on the sentence spans and then reading the entities from the resulting doc to get the offsets you want. That might give you the correct indices.

5 replies

shrinidhin Apr 26, 2022
Author

Hi! This is the exact code. I basically want to store the entity span data to prepare the training data for my Entity Linker component.

data = []
split_indices = benepar_split(d)
for index_start, index_end in split_indices:
    current_span = d[index_start:index_end]
    if len(current_span.ents) != 0:
        record = dict()
        record['text'] = str(d[index_start:index_end])
        record['ent_info'] = []
        for entity in current_span.ents:
            ent_info = dict()
            ent_info['ent_name'] = entity.text
            entity_start = entity.start_char - entity.sent.start_char
            entity_end = entity.end_char - entity.sent.start_char
            ent_info['ent_indices'] = (entity_start, entity_end)
            ent_info['ent_label'] = entity.label_
            ent_info['KB_ID'] = ''
            record['ent_info'].append(ent_info)
        data.append(record)


from typing import List, Tuple

from spacy.tokens import Doc


def benepar_split(doc: Doc) -> List[Tuple]:
    """Split a doc into individual clauses
    doc (Doc): Input doc containing one or more sentences
    RETURNS (List[Tuple]): List of extracted clauses, defined by their start-end offsets
    """
    split_indices = []
    for sentence in doc.sents:
        can_split = False
        for constituent in sentence._.constituents:
            # Store start/end token indices of clauses labeled "S" (Sentence) if their parent is the original sentence
            if "S" in constituent._.labels and constituent._.parent == sentence:
                split_indices.append((constituent.start, constituent.end))
                can_split = True

        # If no clause found, append the start/end token indices of the whole sentence
        if not can_split:
            split_indices.append((sentence.start, sentence.end))

    return split_indices

The issue is that when I try to create the .spacy data files for training the entity linker, as in the create_corpus.py script, doc.char_span returns None. This is the code:

for entities in example['ent_info']:
    if entities['KB_ID'] != "":
        entity = doc.char_span(
            entities['ent_indices'][0],
            entities['ent_indices'][1],
            label=entities['ent_label'],
            kb_id=entities['KB_ID'],
        )

ljvmiranda921 Apr 27, 2022

char_span usually returns None when the character span doesn't map exactly onto token boundaries. You can start by debugging the output of the benepar_split function and comparing the returned indices against the original doc you want to assign entities to. Also, the constituents extension already gives you an iterator of Spans, so perhaps you can use those directly instead (i.e., assigning them to doc.ents)? Lastly, check what kind of indices you are getting: are they token indices or character indices?
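To make the None behaviour concrete, here is a minimal sketch. It assumes spaCy v3 and uses a blank English pipeline with a made-up sentence, a made-up label ("INDEX") and a made-up KB ID ("Q0"); none of these come from the original code.

```python
import spacy

nlp = spacy.blank("en")  # tokenizer only, no trained components needed
doc = nlp("Trading on the Nifty50 stayed flat.")

start = doc.text.index("Nifty50")
end = start + len("Nifty50")

# Offsets that line up with token boundaries give back a Span.
good = doc.char_span(start, end, label="INDEX", kb_id="Q0")

# Offsets that begin in the middle of a token give back None
# under the default alignment_mode="strict".
bad = doc.char_span(start + 2, end, label="INDEX", kb_id="Q0")

# alignment_mode="expand" snaps outwards to the enclosing tokens instead.
expanded = doc.char_span(start + 2, end, label="INDEX", alignment_mode="expand")

print(good, bad, expanded)
```

Snapping with "expand" can hide an offset bug rather than fix it, so while debugging it is probably safer to keep the strict default and inspect every None.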

shrinidhin Apr 27, 2022
Author

So, I think the indices I am getting are relative to the entire document and not the sentence. split_indices gives the indices of each constituent, i.e. each sentence. The problem, I think, is that the stored entity indices are relative to the original document that the sentence is part of. Please check the following output:
This is the sentence constituent:

After hitting record highs in mid - February , the market has been trading in a broad range of 14,300 - 15,300 levels on the Nifty50 , despite strong momentum among global peers .

Entity picked: Nifty50

print(entity.start, entity.end)   # (69 , 70)
print('Char indices , ', entity.start_char, entity.end_char) # Char indices 370,377

How do I obtain the entity indices relative to the sentence? One approach I can think of is creating a new doc object with nlp('After hitting record highs in mid - February , the market has been trading in a broad range of 14,300 - 15,300 levels on the Nifty50 , despite strong momentum among global peers .'). But is it sensible to do this every time when there can be 1000 sentences? Is there a better approach?
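One way to re-base document-level offsets without re-running the pipeline is to subtract the start_char of the very span the text was taken from. Note that benepar_split can return a clause that is narrower than entity.sent, so subtracting entity.sent.start_char may be off for clauses that do not begin the sentence. The sketch below assumes spaCy v3, a blank pipeline, a hand-set entity with the made-up label "INDEX", and a hand-picked span standing in for one result of benepar_split.

```python
import spacy

nlp = spacy.blank("en")  # stands in for the custom NER pipeline
text = "Markets fell on Monday. The Nifty50 index later recovered."
doc = nlp(text)

# Hand-set one entity so the example runs without a trained model.
ent_start = text.index("Nifty50")
doc.ents = [doc.char_span(ent_start, ent_start + len("Nifty50"), label="INDEX")]

# Stand-in for d[index_start:index_end] built from benepar_split output.
clause = doc.char_span(text.index("The"), len(text))

records = []
for ent in clause.ents:
    # Offsets relative to the clause text, not the whole document.
    start = ent.start_char - clause.start_char
    end = ent.end_char - clause.start_char
    # Sanity check: the slice of the clause text must reproduce the entity.
    assert clause.text[start:end] == ent.text
    records.append(
        {"ent_name": ent.text, "ent_indices": (start, end), "ent_label": ent.label_}
    )

print(records)
```

The inline assert is cheap to keep in a data-preparation script, since it catches misaligned offsets before they reach doc.char_span.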

ljvmiranda921 Apr 28, 2022

You're on the right track. If I remember correctly, the sentence is returned as a Span, right? If that's the case, you can use Span.as_doc() instead, and the entities within it should be relative to that particular sentence, not the entire document.
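A minimal sketch of the Span.as_doc() suggestion, assuming spaCy v3. The blank pipeline, the example text, the hand-set entity and its "INDEX" label are all illustrative stand-ins, not part of the original code.

```python
import spacy

nlp = spacy.blank("en")  # stands in for the custom NER pipeline
text = "Markets fell on Monday. The Nifty50 index later recovered."
doc = nlp(text)

# Hand-set one entity so the example runs without a trained model.
ent_start = text.index("Nifty50")
doc.ents = [doc.char_span(ent_start, ent_start + len("Nifty50"), label="INDEX")]

# Stand-in for a sentence or clause Span obtained from the parser.
sentence = doc.char_span(text.index("The"), len(text))

# as_doc() copies the span into a standalone Doc, so character offsets
# of its entities start from the beginning of the sentence text.
sent_doc = sentence.as_doc()
offsets = [(e.text, e.start_char, e.end_char, e.label_) for e in sent_doc.ents]

for name, start, end, _ in offsets:
    # The offsets index directly into the new doc's text.
    assert sent_doc.text[start:end] == name

print(offsets)
```

Since as_doc() copies existing annotations rather than re-running the pipeline, it avoids calling nlp() again on each sentence.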

shrinidhin May 1, 2022
Author

This absolutely worked. Thank you so much!
