dataset preprocess: convert old json format to spaCy binary format #6435
I have NER training data in the old spaCy format, following the NER training documentation. My question is: how do I preprocess this into the spaCy binary dataset format when my data doesn't have the "spans" key that I found in the ner_drugs tutorial for spaCy v3 nightly? Thanks.
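For context (this sketch is not part of the original post, and the values are illustrative): the "old" v2-style training format is a list of `(text, annotations)` tuples with an `"entities"` key of character offsets, while the ner_drugs data is JSONL where each record carries a `"spans"` key:

```python
# Illustrative only: the two data shapes being discussed (values are made up).

# v2 "simple training style": (text, annotations) tuples with character offsets
v2_example = ("I like London.", {"entities": [(7, 13, "LOC")]})

# ner_drugs-style JSONL record (Prodigy export): one dict per line with a "spans" key
jsonl_example = {
    "text": "I like London.",
    "spans": [{"start": 7, "end": 13, "label": "LOC"}],
}
```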
Replies: 2 comments
The ner_drugs data JSONL format is from Prodigy and isn't the same as the simple training format used in the v2 examples. Here's how it could look for v3:

```python
import random

import spacy
from spacy.training import Example
from spacy.util import minibatch

nlp = spacy.blank("en")
nlp.add_pipe("ner")

TRAIN_DATA = [
    ("Who is Shaka Khan?", {"entities": [(7, 17, "PERSON")]}),
    ("I like London.", {"entities": [(7, 13, "LOC")]}),
]

examples = []
for text, annots in TRAIN_DATA:
    examples.append(Example.from_dict(nlp.make_doc(text), annots))

nlp.initialize(lambda: examples)
for i in range(20):
    random.shuffle(examples)
    for batch in minibatch(examples, size=2):
        print(nlp.update(batch))
```

The example in the docs will be updated to look more like this when #6438 is merged: https://nightly.spacy.io/usage/v3#migrating-training-python. Some of the other migration notes in this section might be helpful for you, too. In general, we'd strongly suggest moving away from the simple training scripts and using the `spacy train` CLI with a config file. To save the examples above in the binary format that the CLI expects:

```python
from spacy.tokens import DocBin

db = DocBin(docs=[ex.reference for ex in examples])
db.to_disk("/path/to/train.spacy")
```
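A related sketch (my addition, not part of the original reply): if you only want to convert v2-style tuples to a `.spacy` file without running a training loop, you can build the `Doc` objects yourself with `Doc.char_span` and add them to a `DocBin`. The output path is illustrative.

```python
import spacy
from spacy.tokens import DocBin

TRAIN_DATA = [
    ("Who is Shaka Khan?", {"entities": [(7, 17, "PERSON")]}),
    ("I like London.", {"entities": [(7, 13, "LOC")]}),
]

nlp = spacy.blank("en")  # tokenizer only; no trained components needed
db = DocBin()
for text, annots in TRAIN_DATA:
    doc = nlp.make_doc(text)
    ents = []
    for start, end, label in annots["entities"]:
        span = doc.char_span(start, end, label=label)
        if span is not None:  # skip offsets that don't align to token boundaries
            ents.append(span)
    doc.ents = ents
    db.add(doc)
db.to_disk("./train.spacy")  # illustrative output path
```

The resulting file can then be passed to `python -m spacy train` via `--paths.train`.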
I see. @adrianeboyd thank you for the tutorial!