How to fine-tune spacy-experimental "en-coreference-web-trf" model on my own custom domain dataset #12711
-
I have a custom dataset of conversational data specific to the farming domain. The spacy-experimental coreference model (en-coreference-web-trf) performs okay-ish at coreference resolution but does not give the required accuracy, so I need to further fine-tune it on my domain-specific data. I have found this repo that allows you to train a spaCy coref model, but I am having trouble following the instructions provided. I have my own custom data, but I do not know what format it should be in to send it for training. The repo says the project is about training the model on the OntoNotes dataset, so I thought I could convert my dataset into the OntoNotes format, but OntoNotes itself is not a public dataset, so I do not know its file structure. I couldn't find any other resources related to my task apart from the specified repo. Please provide me with instructions on how to fine-tune the model. Thank you.
-
You can obtain the CoNLL-2012 version of OntoNotes from https://huggingface.co/datasets/conll2012_ontonotesv5. Preparing your own data in this format is a non-trivial undertaking, and I do not think that all the fields are used in training the coref model. You probably need to check through the preparation scripts in the repo to trace back what is actually used in the training of the coreference model.
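If it helps to see what that format contains, here is a minimal sketch of inspecting the Hugging Face copy with the `datasets` library. The `english_v4` config name and the `sentences` / `words` / `coref_spans` field names are taken from my reading of the dataset card, so double-check them against the actual schema before relying on this:

```python
# Sketch: peek at the CoNLL-2012 OntoNotes layout on the Hugging Face Hub.
# Assumes the "english_v4" config; newer versions of `datasets` may also
# require trust_remote_code=True since this dataset uses a loading script.
from datasets import load_dataset

ds = load_dataset("conll2012_ontonotesv5", "english_v4", split="train")

doc = ds[0]  # one OntoNotes document
for sent in doc["sentences"][:2]:
    # coref_spans are (cluster_id, start_token, end_token) triples per sentence
    print(sent["words"])
    print(sent["coref_spans"])
```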
-
Hey there,

The OntoNotes data itself is quite hard to work with in many ways unfortunately, and as you say it's not publicly available. However, to understand how you need to format your data to work with the `coref` project, you do not need to deal with OntoNotes necessarily. Running

```
python -m spacy project assets --extra
```

should download the LitBank dataset into the directory `assets/litbank`. The project's preprocessing step then preprocesses a single file, `assets/litbank/95_the_prisoner_of_zenda_brat.conll`, using the `scripts/preprocess.py` script. You can take a look at the `.conll`-formatted files in `assets/litbank` to see what kind of format `scripts/preprocess.py` expects, and inspect the resulting training docs with:

```python
import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")
docs = list(DocBin().from_disk("corpus/train.spacy").get_docs(nlp.vocab))
```

I hope this unblocks you with formatting your data!
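As a concrete (and hypothetical) illustration of what those processed docs end up containing, below is a minimal sketch of packing your own pre-tokenized, annotated sentences into a `DocBin`. The `coref_clusters_N` span-group keys reflect the experimental coref component's default span prefix as I understand it, so verify the exact key names against the project's config and `scripts/preprocess.py` before training:

```python
# Hypothetical sketch: pack custom coref annotations into a training DocBin.
# The key naming ("coref_clusters_1", "coref_clusters_2", ...) is an assumption
# based on the experimental coref component's default span prefix -- confirm it
# against the project's config before training on this.
import spacy
from spacy.tokens import Doc, DocBin

nlp = spacy.blank("en")

words = ["The", "farmer", "checked", "the", "soil", "because", "he",
         "wanted", "to", "irrigate", "it", "."]
doc = Doc(nlp.vocab, words=words)

# Each cluster is a list of (start_token, end_token) spans, end exclusive.
clusters = [
    [(0, 2), (6, 7)],    # "The farmer" ... "he"
    [(3, 5), (10, 11)],  # "the soil" ... "it"
]
for i, cluster in enumerate(clusters, start=1):
    doc.spans[f"coref_clusters_{i}"] = [doc[start:end] for start, end in cluster]

db = DocBin(store_user_data=True)
db.add(doc)
db.to_disk("corpus/train.spacy")
```

If the project's preprocess script expects `.conll` input instead, mirroring one of the LitBank files is probably the safer route; the sketch above only shows what the training corpus ultimately looks like on the spaCy side.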