Train Discontinuous Entity Spans - BRAT generated Annotations #10879

AashishTiwari · 2022-05-30T14:57:54Z

AashishTiwari
May 30, 2022

Hi, I am trying to train a Named Entity Recognition (NER) model on a BRAT-annotated dataset. Judging by the NER training data format that spaCy supports, it seems to recognize only continuous entity spans.
BRAT supports discontinuous annotations (see https://github.com/nlplab/brat/issues/362).
An example line of text and its annotation are shown below:

Text Line:

Recipients of previous non-renal solid organ and/or islet cell transplantation.

Annotation:

T5	Procedure 126 147;166 181	non-renal solid organ transplantation

Description: The file is tab-separated (TSV). The first column is the annotation ID, whose leading letter gives the row type (entry type, i.e. whether it is a term or a relation). The second column holds the entity type followed by the character offsets of the span, and the third column is the covered text. In the example, the first part of the entity span starts at index 126 and ends at 147, and the second part starts at 166 and ends at 181.
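To make the layout concrete, here is a minimal sketch of reading such a line in plain Python. The names `TextBound` and `parse_text_bound` are made up for illustration and are not part of BRAT or spaCy; the sketch only handles text-bound (`T`) rows and assumes the end offset is exclusive, which matches the lengths in the example above.

```python
from typing import List, NamedTuple, Tuple


class TextBound(NamedTuple):
    """One text-bound annotation row (hypothetical helper type)."""
    ann_id: str
    label: str
    fragments: List[Tuple[int, int]]  # (start, end) pairs, end exclusive
    text: str


def parse_text_bound(line: str) -> TextBound:
    # Three tab-separated fields: ID, "Type start end[;start end...]", text.
    ann_id, type_and_offsets, text = line.rstrip("\n").split("\t", 2)
    label, offsets = type_and_offsets.split(" ", 1)
    fragments = []
    for part in offsets.split(";"):  # ";" separates discontinuous fragments
        start, end = part.split()
        fragments.append((int(start), int(end)))
    return TextBound(ann_id, label, fragments, text)


example = "T5\tProcedure 126 147;166 181\tnon-renal solid organ transplantation"
ann = parse_text_bound(example)
print(ann.label, ann.fragments)
```

A continuous annotation simply yields a single fragment, so the same function covers both cases.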

My question (suggestions welcome): Is this possible to achieve with spaCy? If yes, is there any reference pseudo-code or implementation where this has been done before? If not, are there plans to add this capability in the future?

Thanks!

Answered by ljvmiranda921


ljvmiranda921 · 2022-05-31T05:40:24Z

ljvmiranda921
May 31, 2022

Hi, @AashishTiwari

Unfortunately, we don't support discontinuous spans, and we're not planning to. One way to approach this problem is to treat the parts as separate entities, perform NER / span categorization, and then do the post-processing afterward (i.e., recombine the tokens that belong to the same entity using some rule or logic). Another option is to combine the parts into a single entity, perform NER, and split them in post-processing.
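As a rough sketch of what the two options mean for the training data, the snippet below prepares the example from the question both ways. Everything here is an assumption for illustration: the function names are hypothetical, the offsets are shifted by 103 so they are relative to the example line (inferred from the quoted offsets), and the recombination rule (join all same-label fragments in a sentence) is deliberately naive and would need replacing with logic that fits your data.

```python
from typing import Dict, List, Tuple

Span = Tuple[int, int, str]  # (start, end, label), end exclusive

text = "Recipients of previous non-renal solid organ and/or islet cell transplantation."
label = "Procedure"
# BRAT offsets 126-147 and 166-181, made line-relative (assumed line offset: 103).
fragments = [(126 - 103, 147 - 103), (166 - 103, 181 - 103)]


def split_fragments(label: str, fragments: List[Tuple[int, int]]) -> List[Span]:
    """Option 1: one contiguous span per fragment, recombined after prediction."""
    return [(start, end, label) for start, end in fragments]


def merge_fragments(label: str, fragments: List[Tuple[int, int]]) -> Span:
    """Option 2: one span from the first start to the last end, split after prediction."""
    return (min(s for s, _ in fragments), max(e for _, e in fragments), label)


def recombine(text: str, spans: List[Span]) -> Dict[str, str]:
    """Hypothetical rule: join same-label fragments of a sentence in reading order."""
    joined: Dict[str, List[str]] = {}
    for start, end, span_label in sorted(spans):
        joined.setdefault(span_label, []).append(text[start:end])
    return {span_label: " ".join(parts) for span_label, parts in joined.items()}


separate = split_fragments(label, fragments)
merged = merge_fragments(label, fragments)
print(recombine(text, separate))
print(text[merged[0]:merged[1]])
```

With the first option the model never sees the gap, so the recombination rule carries all the burden. With the second, the model learns a longer span that includes the unrelated words in between ("and/or islet cell" here), and the splitting step has to remove them.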

In addition, if you want to use spaCy for training, you can first convert your BRAT files into the CoNLL format and then use the `spacy convert` command to turn them into `.spacy` files.
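For the conversion step, a minimal sketch of turning character spans into token-per-line BIO tags might look like the following. It assumes each fragment is kept as its own entity, uses a simple regex as a stand-in for a real tokenizer, and writes a two-column "token tag" layout. `to_bio_lines` is a hypothetical helper, and the exact layout the converter expects should be checked against the `spacy convert` documentation before relying on it.

```python
import re
from typing import List, Tuple

Span = Tuple[int, int, str]  # (start, end, label), end exclusive


def to_bio_lines(text: str, spans: List[Span]) -> List[str]:
    """Tag each token B-/I-/O depending on which span (if any) contains it."""
    lines = []
    prev_index = None
    # Simplistic tokenizer (assumption): word-like runs, or single punctuation marks.
    for match in re.finditer(r"[\w/-]+|[^\w\s]", text):
        index = next(
            (i for i, (start, end, _) in enumerate(spans)
             if start <= match.start() and match.end() <= end),
            None,
        )
        if index is None:
            tag = "O"
        elif index == prev_index:
            tag = "I-" + spans[index][2]  # continues the same span
        else:
            tag = "B-" + spans[index][2]  # first token of a span
        lines.append(f"{match.group()} {tag}")
        prev_index = index
    return lines


text = "Recipients of previous non-renal solid organ and/or islet cell transplantation."
spans = [(23, 44, "Procedure"), (63, 78, "Procedure")]  # line-relative fragments
bio_lines = to_bio_lines(text, spans)
print("\n".join(bio_lines))
```

The resulting lines could be written to a file, with a blank line between sentences, and then passed to `python -m spacy convert`.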
