Issue with Relation Extraction Model Implementation using Doccano Dataset #12777

wallybeamm · 2023-07-01T20:22:49Z

I am trying to run spaCy's relation extraction example. I created my dataset with Doccano, whose export format differs from Prodigy's. For example, Doccano does not generate a whitespace ('ws') key. So, I customized the parse_data.py and generated a '.spacy' file. Training then ran without any warnings or errors, but at the end I realized that all the evaluation scores were equal to zero.

After some debugging, I realized that rel_component requires token_start and token_end to locate the entities. However, Doccano doesn't produce them. To be more precise, here is an example annotation:

Example sentence: Inventor of electrical age Nikola Tesla was born in 10 July 1856.

What I can get from Prodigy:

Entity Name: Nikola Tesla
_start_ (index of the first letter of the entity): 32
_end_ (index of the last letter of the entity): 44
_token_start_ (id of the first token of the entity): 5
_token_end_ (id of the last token of the entity): 6

Entity Date: 10 July 1856
_start_ (index of the first letter of the entity): 32
_end_ (index of the last letter of the entity): 44
_token_start_ (id of the first token of the entity): 10
_token_end_ (id of the last token of the entity): 12

_Relations_ | Head → 5 | Child → 10

Doccano Generates:

Entity Name: Nikola Tesla
_start_offset_ (index of the first letter of the entity): 32
_end_offset_ (index of the last letter of the entity): 44
_id_ (a unique id for the entity): 319827

Entity Date: 10 July 1856
_start_offset_ (index of the first letter of the entity): 32
_end_offset_ (index of the last letter of the entity): 44
_id_ (a unique id for the entity): 320612

_Relations_ | Head → 319827 | Child → 320612
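
For illustration, a record of this shape can be sketched as a plain Python dictionary. The field names (`text`, `entities`, `relations`, `start_offset`, `end_offset`, `from_id`, `to_id`, `type`) are the ones read by the parsing script later in this thread; the relation label `bornOn` is a made-up placeholder, and the offsets are computed from the text rather than copied from the listing above. The point is only that relations refer to entity ids, so an id lookup is needed before anything else:

```python
import json

text = "Inventor of electrical age Nikola Tesla was born in 10 July 1856."


def make_entity(ent_id, label, phrase):
    """Build a Doccano-style entity record with offsets derived from the text."""
    start = text.index(phrase)
    return {
        "id": ent_id,
        "label": label,
        "start_offset": start,
        "end_offset": start + len(phrase),
    }


record = {
    "text": text,
    "entities": [
        make_entity(319827, "Name", "Nikola Tesla"),
        make_entity(320612, "Date", "10 July 1856"),
    ],
    # "bornOn" is a hypothetical relation label used only for this sketch.
    "relations": [{"from_id": 319827, "to_id": 320612, "type": "bornOn"}],
}

# Relations point at entity ids, so build an id -> entity lookup first.
by_id = {ent["id"]: ent for ent in record["entities"]}


def covered_text(ent):
    """Return the substring of the record text that an entity covers."""
    return record["text"][ent["start_offset"]:ent["end_offset"]]


resolved = [
    (covered_text(by_id[rel["from_id"]]), rel["type"], covered_text(by_id[rel["to_id"]]))
    for rel in record["relations"]
]

print(json.dumps(record))  # one line of a JSONL export would look like this
print(resolved)
```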

Because of that, the evaluation script is looking for an entity with an id of 32; however, I have something like 319827. I couldn't find any way to change my dataset's unique id to start_offset. So I am searching for a way to edit rel_component.

How can I change the scoring script of rel_component so that it searches for start_offset instead of id?

Answered by svlandeg

svlandeg · 2023-07-04T10:05:00Z

Hi @wallybeamm!

I think the key to solving your problem is here:

> So, I customized the parse_data.py and generated a '.spacy' file.

This .spacy file should, for each of your documents, contain the entities in doc.ents and the relations in doc._.rel. Your approach to modifying parse_data.py is the right way to go about this: basically you need to adjust this script so it parses your Doccano files instead of the Prodigy format.

I understand that in the Doccano format, you don't immediately have access to the token indices (token_start and token_end). However, if you're creating the Doc object and setting the entities in doc.ents via their character indices start_offset and end_offset:

entity = doc.char_span(span["start_offset"], span["end_offset"], label=span["label"])

then afterwards you can obtain the token indices by querying entity.start and entity.end.

So basically it's OK that Doccano doesn't provide those - you can define the entity on spaCy's Doc object first, then use spaCy's built-in methods to retrieve the token indices and use those to define the relations accordingly.
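
As a minimal sketch of that idea (not the project's actual parsing code), the snippet below builds two entities from character offsets and reads back their token indices. It assumes a blank English pipeline, since only the tokenizer is needed; the labels `PERSON` and `DATE` and the helper `make_span` are placeholders for this example, and the offsets are computed from the text so that they line up with it.

```python
import spacy

# A blank English pipeline is enough here: only the tokenizer is needed.
nlp = spacy.blank("en")
text = "Inventor of electrical age Nikola Tesla was born in 10 July 1856."
doc = nlp(text)


def make_span(ent_id, label, phrase):
    """Hypothetical helper: a Doccano-style entity record for a phrase."""
    start = text.index(phrase)
    return {
        "id": ent_id,
        "label": label,
        "start_offset": start,
        "end_offset": start + len(phrase),
    }


spans = [
    make_span(319827, "PERSON", "Nikola Tesla"),
    make_span(320612, "DATE", "10 July 1856"),
]

entities = []
id_to_token_start = {}
for span in spans:
    entity = doc.char_span(
        span["start_offset"], span["end_offset"], label=span["label"]
    )
    # char_span returns None when the offsets do not fall on token boundaries.
    if entity is None:
        continue
    entities.append(entity)
    # entity.start is a token index; entity.start_char is a character index.
    id_to_token_start[span["id"]] = entity.start
doc.ents = entities

for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.start, ent.end)
```

With `id_to_token_start` in hand, a relation given as a pair of Doccano ids can be rewritten as a pair of token indices, which is the form used as keys in doc._.rel.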

Hopefully that works :-)

1 reply

wallybeamm Jul 16, 2023
Author

I hadn't noticed the entity.start and entity.end attributes before; I thought we only had entity.start_char and entity.end_char. Thanks for the help and for recording the tutorials 🥇.

wallybeamm · 2023-07-16T11:24:40Z

If anyone else wants to use Doccano in their project, here is my final custom_parse.py script. Note that the original rel_component project has a parse_data.py script that takes four arguments (json_loc: Path, train_file: Path, dev_file: Path, test_file: Path) and produces three .spacy files. This one takes a single input file and produces one .spacy file per run.

```python
# This script was derived from parse_data.py but made more generic as a template for various REL parsing needs.
# It reads Doccano JSONL annotations and writes a single .spacy file per run.

import json
from pathlib import Path

import spacy
import typer
from spacy.tokens import Doc, DocBin
from spacy.util import filter_spans
from wasabi import Printer

msg = Printer()

# TODO: define your labels used for annotation either as "symmetrical" or "directed"
SYMM_LABELS = []
DIRECTED_LABELS = ["itemSoldAtDate", ...]  # the "..." stands for your remaining labels

# TODO: set this bool to False if you didn't annotate all relations in all sentences.
# If it's true, entities that were not annotated as related will be used as negative examples.
is_complete = True


def main(json_loc: Path, train_file: Path):
    """Create the corpus from the Doccano annotations."""
    nlp = spacy.load("en_core_web_sm")

    Doc.set_extension("rel", default={})

    docs = []
    count_all = 0
    count_pos = 0
    with open(json_loc, "r", encoding="utf-8") as jsonfile:
        for line in jsonfile:
            example = json.loads(line)
            neg = 0
            pos = 0

            # Tokenize the raw text
            doc = nlp(example["text"])

            # Parse the entities
            entities = []
            span_starts = set()
            span_lookup = {}  # Doccano entity id -> index of the entity's first token
            for span in example["entities"]:
                entity = doc.char_span(
                    span["start_offset"],
                    span["end_offset"],
                    label=span["label"],
                    alignment_mode="expand",
                )
                if entity is None:
                    msg.warn(f"Could not align entity {span['id']} to tokens - skipping")
                    continue
                entities.append(entity)
                span_lookup[str(span["id"])] = entity.start
                span_starts.add(entity.start)
            if not entities:
                msg.warn("Could not parse any entities from the JSON file.")
            # doc.ents cannot hold overlapping spans, so filter them first
            doc.ents = filter_spans(entities)  # THIS DOES THE TRICK

            # Parse the relations
            rels = {}
            for x1 in span_starts:
                for x2 in span_starts:
                    rels[(x1, x2)] = {}
            for relation in example["relations"]:
                # Doccano's 'from_id' and 'to_id' refer to entity ids,
                # but rel_component keys relations by first-token index
                from_id = str(relation["from_id"])
                to_id = str(relation["to_id"])
                if from_id not in span_lookup or to_id not in span_lookup:
                    msg.warn("Relation refers to an entity that was skipped - skipping")
                    continue
                start = span_lookup[from_id]
                end = span_lookup[to_id]

                label = relation["type"]
                if label not in SYMM_LABELS + DIRECTED_LABELS:
                    msg.warn(f"Found label '{label}' not defined in SYMM_LABELS or DIRECTED_LABELS - skipping")
                    continue
                if label not in rels[(start, end)]:
                    rels[(start, end)][label] = 1.0
                    pos += 1
                if label in SYMM_LABELS:
                    if label not in rels[(end, start)]:
                        rels[(end, start)][label] = 1.0
                        pos += 1

            # If the annotation is complete, fill in zeros where the data is missing
            if is_complete:
                for x1 in span_starts:
                    for x2 in span_starts:
                        for label in SYMM_LABELS + DIRECTED_LABELS:
                            if label not in rels[(x1, x2)]:
                                neg += 1
                                rels[(x1, x2)][label] = 0.0
            doc._.rel = rels

            # Only keep documents with at least 1 positive case.
            # No train/dev/test split is made here: every kept document goes
            # into the single output file, so run the script once per split.
            if pos > 0:
                docs.append(doc)
                count_pos += pos
                count_all += pos + neg

    docbin = DocBin(docs=docs, store_user_data=True)
    docbin.to_disk(train_file)
    msg.info(
        f"{len(docs)} training sentences, "
        f"{count_pos}/{count_all} pos instances."
    )


if __name__ == "__main__":
    typer.run(main)
```
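
Since the script writes one .spacy file per run, the Doccano export has to be split into train/dev/test files beforehand. The helper below is one possible way to do that and is not part of the original project: the function name `split_jsonl`, the output file names, and the 0.3/0.2 default portions are all assumptions for this sketch. It shuffles individual lines, so if several lines come from the same article you may prefer to group them before splitting.

```python
import random
from pathlib import Path


def split_jsonl(json_loc, out_dir, dev_portion=0.3, test_portion=0.2, seed=0):
    """Shuffle the lines of a JSONL file and write train/dev/test JSONL files.

    Hypothetical helper: whatever is not assigned to test or dev goes to train.
    Returns a dict mapping the split name to the path that was written.
    """
    raw = Path(json_loc).read_text(encoding="utf-8")
    lines = [line for line in raw.splitlines() if line.strip()]
    random.Random(seed).shuffle(lines)  # seeded so the split is reproducible

    n_test = int(len(lines) * test_portion)
    n_dev = int(len(lines) * dev_portion)
    splits = {
        "test": lines[:n_test],
        "dev": lines[n_test:n_test + n_dev],
        "train": lines[n_test + n_dev:],
    }

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = {}
    for name, chunk in splits.items():
        path = out_dir / f"{name}.jsonl"
        path.write_text("".join(line + "\n" for line in chunk), encoding="utf-8")
        paths[name] = path
    return paths
```

Each resulting file can then be passed to the script separately, for example `python custom_parse.py train.jsonl train.spacy`, and likewise for dev and test (the file names here are placeholders).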

1 reply

svlandeg Jul 17, 2023

Thanks for sharing! 🙏
