dataset preprocess: convert old json format to spaCy binary format #6435
I have NER training data in the old spaCy format, following the NER training documentation. My question is: how do I preprocess this into the spaCy binary dataset format when my data doesn't have the "spans" key that I found in the ner_drugs tutorial for spaCy v3 nightly? Thanks.
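For context (this sketch is not part of the original post, and the values are illustrative): the "old" v2-style training format is a list of `(text, annotations)` tuples with an `"entities"` key of character offsets, while the ner_drugs data is JSONL where each record carries a `"spans"` key:

```python
# Illustrative only: the two data shapes being discussed (values are made up).

# v2 "simple training style": (text, annotations) tuples with character offsets
v2_example = ("I like London.", {"entities": [(7, 13, "LOC")]})

# ner_drugs-style JSONL record (Prodigy export): one dict per line with a "spans" key
jsonl_example = {
    "text": "I like London.",
    "spans": [{"start": 7, "end": 13, "label": "LOC"}],
}
```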
Replies: 2 comments
The ner_drugs data JSONL format is from Prodigy and isn't the same as the simple training format used in the v2 examples. Here's how it could look for v3:

```python
import random

import spacy
from spacy.training import Example
from spacy.util import minibatch

nlp = spacy.blank("en")
nlp.add_pipe("ner")

TRAIN_DATA = [
    ("Who is Shaka Khan?", {"entities": [(7, 17, "PERSON")]}),
    ("I like London.", {"entities": [(7, 13, "LOC")]}),
]

examples = []
for text, annots in TRAIN_DATA:
    examples.append(Example.from_dict(nlp.make_doc(text), annots))

nlp.initialize(lambda: examples)
for i in range(20):
    random.shuffle(examples)
    for batch in minibatch(examples, size=2):
        print(nlp.update(batch))
```

The example in the docs will be updated to look more like this when #6438 is merged: https://nightly.spacy.io/usage/v3#migrating-training-python. Some of the other migration notes in this section might be helpful for you, too. In general, we'd strongly suggest moving away from the simple training scripts and using the `spacy train` CLI with a config file. To save the examples above in the binary format that the CLI expects:

```python
from spacy.tokens import DocBin

db = DocBin(docs=[ex.reference for ex in examples])
db.to_disk("/path/to/train.spacy")
```
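A related sketch (my addition, not part of the original reply): if you only want to convert v2-style tuples to a `.spacy` file without running a training loop, you can build the `Doc` objects yourself with `Doc.char_span` and add them to a `DocBin`. The output path is illustrative.

```python
import spacy
from spacy.tokens import DocBin

TRAIN_DATA = [
    ("Who is Shaka Khan?", {"entities": [(7, 17, "PERSON")]}),
    ("I like London.", {"entities": [(7, 13, "LOC")]}),
]

nlp = spacy.blank("en")  # tokenizer only; no trained components needed
db = DocBin()
for text, annots in TRAIN_DATA:
    doc = nlp.make_doc(text)
    ents = []
    for start, end, label in annots["entities"]:
        span = doc.char_span(start, end, label=label)
        if span is not None:  # skip offsets that don't align to token boundaries
            ents.append(span)
    doc.ents = ents
    db.add(doc)
db.to_disk("./train.spacy")  # illustrative output path
```

The resulting file can then be passed to `python -m spacy train` via `--paths.train`.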
I see. @adrianeboyd thank you for the tutorial!