Issue with Relation Extraction Model Implementation using Doccano Dataset #12777
I am trying to run the relation extraction example of spaCy. I have a dataset that I created with Doccano, which has a different format than Prodigy. For example, Doccano does not generate a whitespace (`ws`) key. So I customized `parse_data.py` and generated a `.spacy` file. After that, training started without any warnings or errors. At the end of training, I realized that all the evaluation scores were equal to zero.

After some debugging, I found that `rel_component` requires `token_start` and `token_end` to locate the entities. However, Doccano doesn't produce them. To be more precise, here is an example annotation:

Example sentence: Inventor of electrical age Nikola Tesla was born in 10 July 1856.

What I can get from Prodigy:

Doccano generates:

Because of that, the evaluation script is looking for an entity with an id of 32, whereas I have something like 319827. I couldn't find any way to change my dataset's unique id to `start_offset`, so I am looking for a way to edit `rel_component`. How can I change the scoring script of `rel_component` so that it searches for `start_offset` instead of `id`?
Replies: 2 comments 2 replies
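Hi @wallybeamm!

I think the key to solving your problem is here:

This `.spacy` file should, for each of your documents, contain the entities in `doc.ents` and the relations in `doc._.rel`. Your approach to modifying `parse_data.py` is the right way to go about this: basically you need to adjust this script so it parses your Doccano files instead of the Prodigy format.

I understand that in the Doccano format, you don't immediately have access to the token indices (`token_start` and `token_end`). However, if you're creating the `Doc` object and setting the entities in `doc.ents` via their character indices `start_offset` and `end_offset`, then afterwards you can obtain the token indices by querying the resulting entity spans. So basically it's OK that Doccano doesn't provide those: you can define the entity on spaCy's `Doc` from the character offsets.

Hopefully that works :-)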
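The suggestion in this reply can be sketched roughly as follows. This is a minimal illustration, not the actual `parse_data.py` from the example project. The record layout is an assumption about what a Doccano export might look like: the field names `entities`, `relations`, `from_id`, `to_id` and `type`, the second entity id, and the labels are all hypothetical. `Doc.char_span` and the `start` attribute on the returned span are regular spaCy API. The idea is to build the entity spans from character offsets, then translate Doccano's entity ids into the token start indices that the relation component keys on.

```python
import spacy

# Hypothetical Doccano-style record. The field names, the second id and the
# labels are assumptions for illustration, not taken from the original post.
record = {
    "text": "Inventor of electrical age Nikola Tesla was born in 10 July 1856.",
    "entities": [
        {"id": 319827, "label": "PERSON", "start_offset": 27, "end_offset": 39},
        {"id": 319828, "label": "DATE", "start_offset": 52, "end_offset": 64},
    ],
    "relations": [{"from_id": 319827, "to_id": 319828, "type": "BORN_ON"}],
}

nlp = spacy.blank("en")  # tokenizer only, no trained pipeline required
doc = nlp(record["text"])

ents = []
id_to_token_start = {}  # Doccano entity id -> index of the entity's first token
for ent in record["entities"]:
    # char_span returns None when the offsets do not align with token boundaries
    span = doc.char_span(ent["start_offset"], ent["end_offset"], label=ent["label"])
    if span is None:
        continue
    ents.append(span)
    id_to_token_start[ent["id"]] = span.start
doc.ents = ents

# Re-key each relation by token start indices instead of Doccano ids,
# skipping relations whose entities could not be aligned.
rels = [
    (id_to_token_start[r["from_id"]], id_to_token_start[r["to_id"]], r["type"])
    for r in record["relations"]
    if r["from_id"] in id_to_token_start and r["to_id"] in id_to_token_start
]

for ent in doc.ents:
    print(ent.text, ent.label_, ent.start, ent.end)
print(rels)
```

Entities that fail to align are skipped here for brevity. In a real conversion script it is probably worth logging them, since a silently dropped entity also drops every relation that refers to it.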
If anyone else wants to use Doccano in their project, here is my final
Hi @wallybeamm!

I think the key to solving your problem is here:

This `.spacy` file should, for each of your documents, contain the entities in `doc.ents` and the relations in `doc._.rel`. Your approach to modifying `parse_data.py` is the right way to go about this: basically you need to adjust this script so it parses your Doccano files instead of the Prodigy format.

I understand that in the Doccano format, you don't immediately have access to the token indices (`token_start` and `token_end`). However, if you're creating the `Doc` object and setting the entities in `doc.ents` via their character indices `start_offset` and `end_offset`