How to fine-tune spacy-experimental "en-coreference-web-trf" model on my own custom domain dataset #12711
-
I have a custom dataset of conversational data specific to the farming domain. The spacy-experimental coreference model (en-coreference-web-trf) performs okay-ish at coreference resolution but does not give the required accuracy, so I need to further fine-tune it on my domain-specific data. I have found this repo that allows you to train a spaCy coref model, but I am having trouble following the instructions provided. I have my own custom data, but I do not know what format it should be in to send it for training. The repo says the project is about training the model on the OntoNotes dataset, so I thought I could convert my dataset into the OntoNotes format, but OntoNotes itself is not a public dataset, so I do not know its file structure. I couldn't find any other resources related to my task apart from the specified repo. Please provide me with instructions on how to fine-tune the model. Thank you.
-
You can obtain the CoNLL-2012 version of OntoNotes from https://huggingface.co/datasets/conll2012_ontonotesv5. Preparing your own data in this format is a non-trivial undertaking, and I do not think that all the fields are used in training the coref model. You probably need to check through the preparation scripts in the repo to trace back what is actually used in the training of the coreference model.
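If it helps to see what that format contains, here is a minimal sketch of inspecting the Hugging Face copy with the `datasets` library. The `english_v4` config name and the `sentences` / `words` / `coref_spans` field names are taken from my reading of the dataset card, so double-check them against the actual schema before relying on this:

```python
# Sketch: peek at the CoNLL-2012 OntoNotes layout on the Hugging Face Hub.
# Assumes the "english_v4" config; newer versions of `datasets` may also
# require trust_remote_code=True since this dataset uses a loading script.
from datasets import load_dataset

ds = load_dataset("conll2012_ontonotesv5", "english_v4", split="train")

doc = ds[0]  # one OntoNotes document
for sent in doc["sentences"][:2]:
    # coref_spans are (cluster_id, start_token, end_token) triples per sentence
    print(sent["words"])
    print(sent["coref_spans"])
```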
-
Hey there,

The OntoNotes data itself is quite hard to work with in many ways unfortunately, and as you say it's not publicly available. However, to understand how you need to format your data to work with the `coref` project, you do not need to deal with OntoNotes necessarily. Running

```
python -m spacy project assets --extra
```

should download the LitBank dataset into the directory `assets/litbank`. The project's preprocessing step then preprocesses a single file, `assets/litbank/95_the_prisoner_of_zenda_brat.conll`, using the `scripts/preprocess.py` script. You can take a look at the `.conll`-formatted files in `assets/litbank` to see what kind of format `scripts/preprocess.py` expects, and inspect the resulting training docs with:

```python
import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")
docs = list(DocBin().from_disk("corpus/train.spacy").get_docs(nlp.vocab))
```

I hope this unblocks you with formatting your data!
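As a concrete (and hypothetical) illustration of what those processed docs end up containing, below is a minimal sketch of packing your own pre-tokenized, annotated sentences into a `DocBin`. The `coref_clusters_N` span-group keys reflect the experimental coref component's default span prefix as I understand it, so verify the exact key names against the project's config and `scripts/preprocess.py` before training:

```python
# Hypothetical sketch: pack custom coref annotations into a training DocBin.
# The key naming ("coref_clusters_1", "coref_clusters_2", ...) is an assumption
# based on the experimental coref component's default span prefix -- confirm it
# against the project's config before training on this.
import spacy
from spacy.tokens import Doc, DocBin

nlp = spacy.blank("en")

words = ["The", "farmer", "checked", "the", "soil", "because", "he",
         "wanted", "to", "irrigate", "it", "."]
doc = Doc(nlp.vocab, words=words)

# Each cluster is a list of (start_token, end_token) spans, end exclusive.
clusters = [
    [(0, 2), (6, 7)],    # "The farmer" ... "he"
    [(3, 5), (10, 11)],  # "the soil" ... "it"
]
for i, cluster in enumerate(clusters, start=1):
    doc.spans[f"coref_clusters_{i}"] = [doc[start:end] for start, end in cluster]

db = DocBin(store_user_data=True)
db.add(doc)
db.to_disk("corpus/train.spacy")
```

If the project's preprocess script expects `.conll` input instead, mirroring one of the LitBank files is probably the safer route; the sketch above only shows what the training corpus ultimately looks like on the spaCy side.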