Newbie question about NER and training dataset #11288

MagiCsito · 2022-08-09T22:39:22Z

Hello all,

I have searched for an answer to this many times and never found one.
The official documentation says that the better way to train a model is with a .spacy file rather than JSON, but the examples in the documentation always use a "TRAINING DATA" variable like this one [https://spacy.io/usage/training#training-data].

Is there a way to "translate" a .spacy dataset into the format used in that example? Do you have a function that does this? (I am trying to update an NER model and am beginning to understand it, but all the examples I have seen use a variable like "Training data" instead of a .spacy file.)

Thank you

Answered by polm

polm · 2022-08-10T03:28:14Z

We don't have a built-in function to go from a DocBin (.spacy file) to the simple TRAINING_DATA format, but it only takes a short loop:

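import spacy
from spacy.tokens import DocBin

# A blank pipeline is enough; only its vocab is needed to deserialize the docs.
nlp = spacy.blank("en")
training_data = []

# Load the serialized docs and rebuild (text, annotations) tuples from their entities.
db = DocBin().from_disk("train.spacy")
for doc in db.get_docs(nlp.vocab):
    annotations = [(ent.start_char, ent.end_char, ent.label_) for ent in doc.ents]
    training_data.append((doc.text, annotations))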

There's nothing special about the simple training format; the example uses it because you probably don't have .spacy files already and you'll need to convert whatever other annotations you have, so it's just an example of how to do that with relatively simple input data.
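
If you need this conversion in more than one place, the loop can be wrapped in a small helper. This is only a sketch: the names `docs_to_training_data` and `load_training_data` are made up for illustration and are not part of spaCy, and the `lang` parameter assumes a blank pipeline for that language is sufficient to read the file.

```python
def docs_to_training_data(docs):
    """Convert an iterable of spaCy Doc objects to (text, annotations) tuples."""
    training_data = []
    for doc in docs:
        # Character offsets plus label for every entity in the doc.
        annotations = [(ent.start_char, ent.end_char, ent.label_) for ent in doc.ents]
        training_data.append((doc.text, annotations))
    return training_data


def load_training_data(path, lang="en"):
    """Read a .spacy file (hypothetical helper) and return the simple format."""
    # Imported here so docs_to_training_data can be used without loading spaCy.
    import spacy
    from spacy.tokens import DocBin

    nlp = spacy.blank(lang)
    doc_bin = DocBin().from_disk(path)
    return docs_to_training_data(doc_bin.get_docs(nlp.vocab))
```

With that in place, `training_data = load_training_data("train.spacy")` should give the same list as the loop above.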

1 reply

MagiCsito Aug 10, 2022
Author

Thank you very much, polm!
