Questions on DocBin #8149
I am curious about the design decisions behind DocBin.
I assume it is probably to save space and prevent duplication of data.
Replies: 1 comment 1 reply
Rather than saving the whole vocab, the `DocBin` saves the strings used in each doc so it can reconstruct all tokens / annotation.

A `Doc` is always created with an associated `Vocab`, so when you provide a vocab to `DocBin.get_docs`, this is the vocab associated with the returned doc and it's where all the strings are added again.

To save space, the strings are saved as one set for all docs, so you can't easily extract just the strings associated with one doc, and you couldn't really remove a doc cleanly. It could be possible to have a slightly unsatisfactory `remove` that didn't clean up the strings.

It's totally fine to use a different vocab with `DocBin.get_docs`. In order to end up with the same …
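
To make the shared-strings point concrete, here is a toy model in plain Python. It is only a sketch of the idea described in this reply, not spaCy's actual implementation: `ToyVocab`, `ToyDocBin` and `string_id` are hypothetical names, and the `blake2b` hash is a stand-in for whatever hash spaCy uses internally.

```python
import hashlib


def string_id(text: str) -> int:
    """Stand-in 64-bit string hash (an assumption, not spaCy's hash)."""
    digest = hashlib.blake2b(text.encode("utf8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


class ToyVocab:
    """Hypothetical vocab: maps hash IDs back to their strings."""

    def __init__(self):
        self.strings = {}

    def add(self, text: str) -> int:
        key = string_id(text)
        self.strings[key] = text
        return key


class ToyDocBin:
    """Hypothetical container: per-doc ID lists plus ONE shared string set."""

    def __init__(self):
        self.docs = []        # each doc is stored as a list of hash IDs
        self.strings = set()  # shared by all docs, with no per-doc ownership

    def add(self, words):
        self.docs.append([string_id(w) for w in words])
        self.strings.update(words)

    def get_docs(self, vocab: ToyVocab):
        # Re-add every stored string to the vocab the caller supplies,
        # then resolve each doc's IDs through that vocab.
        for s in self.strings:
            vocab.add(s)
        for ids in self.docs:
            yield [vocab.strings[i] for i in ids]


doc_bin = ToyDocBin()
doc_bin.add(["hello", "world"])
doc_bin.add(["hello", "again"])

# Loading into a completely fresh vocab still works, because the
# strings travel with the container rather than with the vocab.
fresh_vocab = ToyVocab()
restored = list(doc_bin.get_docs(fresh_vocab))
```

In this model, "hello" appears once in the shared set even though two docs use it, so deleting the first doc could not safely drop "hello" without scanning every remaining doc. That is the sense in which a `remove` would be unable to clean up the strings.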