Train Discontinuous Entity Spans - BRAT generated Annotations #10879
Hi, I am trying to train a Named Entity Recognition (NER) model on a BRAT-annotated dataset. Looking at the NER training data format supported by spaCy, it looks like it recognizes only continuous entity spans.
Text Line:
Annotation:
Description: The annotation file uses tab-separated values (TSV). The first column is the row type (entry type, i.e. whether it is a Term or a Relation), the second is the actual entity type, and then comes the span of text. In the example, the first part of the entity span starts at index 126 and ends at 147, and the second part starts at 166 and ends at 181.
My question (suggestions welcome): Is this possible to achieve with spaCy? If yes, is there any reference pseudo-code or implementation where this has been done before? If no, are there plans to add this capability in the future? Thanks!
Replies: 1 comment
Hi, @AashishTiwari
Unfortunately, we don't support discontinuous spans, and we're not planning to. One way you can approach this problem is to treat the parts as separate entities, perform NER / Span Categorization, and then do the post-processing afterward (i.e., recombine the tokens that belong to the same entity using some rule or logic). Another option is to combine the entities, perform NER, and split them in post-processing.
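As a rough illustration of the first option, here is a minimal sketch in plain Python. It assumes BRAT's standoff layout, where a discontinuous annotation lists its fragments separated by `;` (for example `126 147;166 181`). The label `Finding`, the surface text, the helper names `parse_brat_entity` and `recombine`, and the `max_gap` proximity rule are all hypothetical choices made for this example; they are not part of spaCy or BRAT, and a real recombination rule would depend on your data.

```python
from typing import List, Tuple

Fragment = Tuple[int, int, str]  # (start_char, end_char, label)


def parse_brat_entity(line: str) -> List[Fragment]:
    """Split one BRAT text-bound line into one contiguous fragment per span."""
    _ann_id, type_and_offsets, _surface = line.rstrip("\n").split("\t", 2)
    label, offsets = type_and_offsets.split(" ", 1)
    fragments = []
    for chunk in offsets.split(";"):  # ';' separates discontinuous parts
        start, end = chunk.split()
        fragments.append((int(start), int(end), label))
    return fragments


def recombine(
    fragments: List[Fragment], max_gap: int = 20
) -> List[Tuple[str, List[Tuple[int, int]]]]:
    """Group same-label fragments whose gap is at most max_gap characters.

    This proximity rule is only an assumed heuristic for the
    post-processing step; swap in whatever logic fits your annotations.
    """
    groups: List[Tuple[str, List[Tuple[int, int]]]] = []
    for start, end, label in sorted(fragments):
        if groups:
            last_label, last_spans = groups[-1]
            if last_label == label and start - last_spans[-1][1] <= max_gap:
                last_spans.append((start, end))
                continue
        groups.append((label, [(start, end)]))
    return groups


# Offsets taken from the question; label and surface text are placeholders.
line = "T1\tFinding 126 147;166 181\tplaceholder surface text"
parts = parse_brat_entity(line)   # two contiguous entities for training
merged = recombine(parts)         # one logical entity after prediction
```

Each tuple in `parts` can be trained as an ordinary contiguous entity, and `recombine` would then be applied to the model's predicted spans rather than to the gold annotations.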
In addition, if you plan to use spaCy for training, you can first convert your BRAT files into the CoNLL format and then use the convert command to turn them into spaCy files.
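To make the conversion step a bit more concrete, the sketch below turns character-offset fragments into token-per-line output with IOB2 tags, which is the general shape of CoNLL-style NER data. It assumes whitespace tokenization and that every token lies fully inside or fully outside a fragment; the sample sentence, the `BodyPart` label, and the helper `to_bio_lines` are invented for illustration. Check the exact column layout expected by `spacy convert` against the spaCy documentation before relying on it.

```python
import re
from typing import List, Optional, Tuple

Fragment = Tuple[int, int, str]  # (start_char, end_char, label)


def to_bio_lines(text: str, fragments: List[Fragment]) -> List[str]:
    """Emit 'token TAG' lines, tagging each fragment as its own entity."""
    lines = []
    open_fragment: Optional[Fragment] = None  # fragment of the previous token
    for match in re.finditer(r"\S+", text):  # naive whitespace tokenizer
        tag = "O"
        current = None
        for frag in fragments:
            start, end, label = frag
            if match.start() >= start and match.end() <= end:
                current = frag
                # Continue the entity only if the previous token was in it too.
                tag = ("I-" if frag == open_fragment else "B-") + label
                break
        open_fragment = current
        lines.append(f"{match.group()} {tag}")
    return lines


# Made-up example: the discontinuous entity "left ... knee joint".
text = "swelling of the left and right knee joint"
fragments = [(16, 20, "BodyPart"), (31, 41, "BodyPart")]
bio = to_bio_lines(text, fragments)
```

Writing `"\n".join(bio)` to a file, with a blank line between sentences, gives input that can then be passed to something like `python -m spacy convert train.conll ./corpus --converter ner`; the file and directory names here are placeholders.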