Using Doc objects as an API between systems #7511
-
Hi there,

I maintain and develop a system where SpaCy is only one model in a big graph of models, some ML-based and some not. However, when actually trying to use it that way, a few questions appear.

First, it seems the … Is there something I'm missing? I'm looking for a way to get a complete and independent object.

Second, I wondered if there is some slimmed-down version of either SpaCy or the `Doc` object, so that systems that only need to read it (and perhaps modify it slightly) can get away without all the dependencies that come with SpaCy. Some of my models are based on other Python libraries that sometimes conflict with SpaCy's own dependencies, or run in very slim Kubernetes images whose size I don't want to increase. I wondered if someone is aware of a "Doc-as-a-dataclass" implementation or something similar that enables using the …

Thanks,
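As a side note on the "Doc-as-a-dataclass" idea: a minimal, dependency-free stand-in could look something like the sketch below. `SlimDoc`, `SlimToken`, and their fields are invented here for illustration; they are not part of SpaCy, and the field names only mirror common token-level annotation.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class SlimToken:
    # Hypothetical token record; names loosely mirror spaCy token attributes.
    text: str
    lemma: str = ""
    pos: str = ""
    dep: str = ""
    head: int = 0  # index of the syntactic head within the doc
    ent_type: str = ""

@dataclass
class SlimDoc:
    text: str
    tokens: List[SlimToken] = field(default_factory=list)

    def to_dict(self) -> dict:
        # Plain-dict form, e.g. for JSON transport between services
        # that don't have spaCy installed.
        return {
            "text": self.text,
            "tokens": [vars(t) for t in self.tokens],
        }

doc = SlimDoc(
    text="SpaCy is great",
    tokens=[
        SlimToken("SpaCy", pos="PROPN"),
        SlimToken("is", pos="AUX"),
        SlimToken("great", pos="ADJ"),
    ],
)
```

A consumer service would only need the standard library to deserialize `doc.to_dict()` back into these dataclasses.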
-
If you're storing groups of docs, we'd recommend the `DocBin` class: https://spacy.io/api/docbin. It's very similar to `Doc.to_dict()` but collapses the shared strings.

If you're using `Doc.to_dict()` for single docs you may want to exclude `tensor`, which you usually don't need after the pipeline has run.

If you only want a table of token-level `Doc` annotation without any cats/extensions/spans/tensors, then use `Doc.to_array()`: https://spacy.io/api/doc#to_array. Here you do need to keep track of the vocab from the original pipeline, because this doesn't save the strings.

I wouldn't recommend `Doc.to_json`, which outputs the v2 JSON training format. It's not particularly compact and doesn't include a…

For example:

```python
import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")
db = DocBin().from_disk("docs.spacy")
docs = list(db.get_docs(nlp.vocab))
```
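To make the `Doc.to_array()` trade-off concrete, here is a dependency-free sketch of the underlying idea: annotation stored as rows of integer IDs plus a separate string table that plays the role of the pipeline's vocab. The `StringTable` class and its methods are invented for illustration, not SpaCy API.

```python
class StringTable:
    """Toy stand-in for a Vocab's string store: maps strings <-> ints."""

    def __init__(self):
        self._to_id = {}
        self._to_str = []

    def add(self, s: str) -> int:
        # Intern the string, returning a stable integer ID.
        if s not in self._to_id:
            self._to_id[s] = len(self._to_str)
            self._to_str.append(s)
        return self._to_id[s]

    def lookup(self, i: int) -> str:
        return self._to_str[i]

table = StringTable()
tokens = [("SpaCy", "PROPN"), ("is", "AUX"), ("great", "ADJ")]

# Encode: each row is (text_id, pos_id). The rows alone are compact
# but meaningless without the table -- just as to_array() output is
# unreadable without the original pipeline's vocab.
rows = [(table.add(text), table.add(pos)) for text, pos in tokens]

# Decode requires the same table, mirroring the need to keep the vocab.
decoded = [(table.lookup(t), table.lookup(p)) for t, p in rows]
```

This is why the reply stresses keeping track of the vocab: ship the rows without the strings and the receiving side cannot recover the annotation.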