If you're storing groups of docs, we'd recommend the DocBin class: https://spacy.io/api/docbin. It's very similar to Doc.to_dict(), but it stores the strings shared across the docs only once instead of repeating them for every doc.
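As a rough sketch of the DocBin round trip (the blank English pipeline and the two sample texts are assumptions for illustration, not something from the original question):

```python
import spacy
from spacy.tokens import DocBin

# Assumption: a blank English pipeline, so no trained model has to be installed.
nlp = spacy.blank("en")
texts = ["This is a sentence.", "Here is another one."]

# Collect several docs into one DocBin and serialize them together.
doc_bin = DocBin()
for doc in nlp.pipe(texts):
    doc_bin.add(doc)
data = doc_bin.to_bytes()

# Restoring needs a vocab to attach the docs to.
restored = DocBin().from_bytes(data)
docs = list(restored.get_docs(nlp.vocab))
print([d.text for d in docs])
```

If you'd rather write to disk, DocBin also has to_disk() and from_disk().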

If you're using Doc.to_dict() for single docs, you may want to exclude the tensor field, which you usually don't need after the pipeline has run.
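Something along these lines should work for a single doc; again, the blank pipeline and the sample text are just assumptions to keep the sketch self-contained:

```python
import spacy
from spacy.tokens import Doc

# Assumption: a blank English pipeline stands in for your real pipeline.
nlp = spacy.blank("en")
doc = nlp("Serialize a single document.")

# Skip the tensor when serializing; the other fields are kept.
data = doc.to_dict(exclude=["tensor"])

# Load the dict back into an empty Doc that shares the same vocab.
restored = Doc(nlp.vocab).from_dict(data)
print(restored.text)
```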

If you only want a table of token-level Doc annotation without any cats/extensions/spans/tensors, then use Doc.to_array(): https://spacy.io/api/doc#to_array. Here you do need to keep track of the vocab from the original pipeline, because the array holds only hash IDs for string attributes and doesn't save the strings themselves.
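A small sketch of what that looks like; the choice of attributes (ORTH, LOWER, IS_ALPHA) and the sample text are assumptions picked for illustration:

```python
import spacy
from spacy.attrs import ORTH, LOWER, IS_ALPHA

# Assumption: a blank English pipeline; a trained one would also fill in tags, deps, etc.
nlp = spacy.blank("en")
doc = nlp("Hello big world!")

# One row per token, one column per requested attribute.
table = doc.to_array([ORTH, LOWER, IS_ALPHA])

# String attributes come back as hash IDs, so the original vocab
# is needed to turn them back into text.
words = [nlp.vocab.strings[int(h)] for h in table[:, 0]]
print(table.shape, words)
```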

I wouldn't recommend Doc.to_json, which outputs the v2 JSON training format. It's not particularly compact and doesn't include a…

Answer selected by amitbeka