Rather than saving the whole vocab, the DocBin saves the strings used in each doc so it can reconstruct all the tokens and annotations.

A Doc is always created with an associated Vocab, so when you provide a vocab to DocBin.get_docs, this is the vocab associated with the returned doc and it's where all the strings are added again.
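
As a minimal sketch of the round trip described above (assuming spaCy v3 and a blank English pipeline; the example texts and variable names are only illustrative), the docs can be serialized and then restored against a completely fresh vocab:

```python
import spacy
from spacy.tokens import DocBin
from spacy.vocab import Vocab

# Assumption: a blank pipeline is enough here, since only tokenization is needed.
nlp = spacy.blank("en")

doc_bin = DocBin()
for text in ["Hello world", "DocBin stores strings"]:
    doc_bin.add(nlp(text))

# Serialize: this includes the strings the docs use, not the whole vocab.
data = doc_bin.to_bytes()

# Restore with a different, initially empty vocab.
fresh_vocab = Vocab()
restored = list(DocBin().from_bytes(data).get_docs(fresh_vocab))

for doc in restored:
    # Each returned doc is tied to the vocab passed to get_docs.
    print(doc.text, doc.vocab is fresh_vocab)

# The strings needed by the docs were added to the new vocab's StringStore.
print("Hello" in fresh_vocab.strings)
```

The vocab passed to `get_docs` does not have to be the one the docs were created with; it only needs to be the vocab you want the restored docs to share.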

To save space, the strings are saved as one set for all docs. Because of that, you can't easily extract just the strings associated with one doc, so you couldn't really remove a doc cleanly. It might be possible to have a slightly unsatisfactory remove that doesn't clean up the strings.
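
To illustrate the shared string set, here is a small sketch that inspects the `strings` attribute of a `DocBin`. This assumes spaCy v3, where `DocBin.strings` is a plain Python set; it is closer to an implementation detail than a documented contract, so treat it as inspection only:

```python
import spacy
from spacy.tokens import DocBin

# Assumption: blank English pipeline, illustrative texts.
nlp = spacy.blank("en")

doc_bin = DocBin()
doc_bin.add(nlp("Hello world"))
doc_bin.add(nlp("Hello again"))

# One pooled set for the whole DocBin: there is no per-doc record of
# which strings came from which doc.
print(len(doc_bin))             # number of docs stored
print(sorted(doc_bin.strings))  # strings pooled from both docs
```

Since "Hello" occurs in both docs but is stored only once, dropping one doc would not tell you whether "Hello" is still needed by another, which is why a clean remove is awkward.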

It's totally fine to use a different vocab with DocBin.get_docs. In order to end up with the same …

Answer selected by polm