Using Doc objects as an API between systems #7511
-
Hi there,

I maintain and develop a system where SpaCy is only one model in a big graph of models, some ML-based and some not. However, when actually trying to use it that way, a few questions appear.

First, it seems the … Is there something I'm missing? I'm looking for a way to get a complete and independent object.

Second, I wondered if there is some slimmed-down version of either SpaCy or the `Doc` object, so that systems that only need to read it (and perhaps modify it slightly) can get away without all the dependencies that come with SpaCy. Some of my models are based on other Python libraries that sometimes conflict with SpaCy's own dependencies, or run in very slim Kubernetes images whose size I don't want to increase. I wondered if someone is aware of a "Doc-as-a-dataclass" implementation or something similar that enables using the …

Thanks,
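As a side note on the "Doc-as-a-dataclass" idea: a minimal, dependency-free stand-in could look something like the sketch below. `SlimDoc`, `SlimToken`, and their fields are invented here for illustration; they are not part of SpaCy, and the field names only mirror common token-level annotation.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class SlimToken:
    # Hypothetical token record; names loosely mirror spaCy token attributes.
    text: str
    lemma: str = ""
    pos: str = ""
    dep: str = ""
    head: int = 0  # index of the syntactic head within the doc
    ent_type: str = ""

@dataclass
class SlimDoc:
    text: str
    tokens: List[SlimToken] = field(default_factory=list)

    def to_dict(self) -> dict:
        # Plain-dict form, e.g. for JSON transport between services
        # that don't have spaCy installed.
        return {
            "text": self.text,
            "tokens": [vars(t) for t in self.tokens],
        }

doc = SlimDoc(
    text="SpaCy is great",
    tokens=[
        SlimToken("SpaCy", pos="PROPN"),
        SlimToken("is", pos="AUX"),
        SlimToken("great", pos="ADJ"),
    ],
)
```

A consumer service would only need the standard library to deserialize `doc.to_dict()` back into these dataclasses.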
-
If you're storing groups of docs, we'd recommend the `DocBin` class: https://spacy.io/api/docbin. It's very similar to `Doc.to_dict()` but collapses the shared strings.

If you're using `Doc.to_dict()` for single docs you may want to exclude `tensor`, which you usually don't need after the pipeline has run.

If you only want a table of token-level `Doc` annotation without any cats/extensions/spans/tensors, then use `Doc.to_array()`: https://spacy.io/api/doc#to_array. Here you do need to keep track of the vocab from the original pipeline, because this doesn't save the strings.

I wouldn't recommend `Doc.to_json`, which outputs the v2 JSON training format. It's not particularly compact and doesn't include a…

For example:

```python
import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")
db = DocBin().from_disk("docs.spacy")
docs = list(db.get_docs(nlp.vocab))
```
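To make the `Doc.to_array()` trade-off concrete, here is a dependency-free sketch of the underlying idea: annotation stored as rows of integer IDs plus a separate string table that plays the role of the pipeline's vocab. The `StringTable` class and its methods are invented for illustration, not SpaCy API.

```python
class StringTable:
    """Toy stand-in for a Vocab's string store: maps strings <-> ints."""

    def __init__(self):
        self._to_id = {}
        self._to_str = []

    def add(self, s: str) -> int:
        # Intern the string, returning a stable integer ID.
        if s not in self._to_id:
            self._to_id[s] = len(self._to_str)
            self._to_str.append(s)
        return self._to_id[s]

    def lookup(self, i: int) -> str:
        return self._to_str[i]

table = StringTable()
tokens = [("SpaCy", "PROPN"), ("is", "AUX"), ("great", "ADJ")]

# Encode: each row is (text_id, pos_id). The rows alone are compact
# but meaningless without the table -- just as to_array() output is
# unreadable without the original pipeline's vocab.
rows = [(table.add(text), table.add(pos)) for text, pos in tokens]

# Decode requires the same table, mirroring the need to keep the vocab.
decoded = [(table.lookup(t), table.lookup(p)) for t, p in rows]
```

This is why the reply stresses keeping track of the vocab: ship the rows without the strings and the receiving side cannot recover the annotation.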