Memory usage when using Doc extensions #13566
Unanswered
makp asked this question in Help: Coding & Implementations
Replies: 0 comments
Issue
Before preprocessing my data with spaCy, I typically have my data stored in a Pandas Series. Since I'd like to preserve the index for each document before serializing my Docs, I decided to use an extension attribute. However, I noticed that memory usage increases dramatically until my system runs out of memory. I'm not sure what I might be doing wrong.
Here is how I added the extension: after initializing the Language class, I register it with Doc.set_extension("idx", default=None). I then run nlp.pipe on my text and set the extension idx on each Doc. When saving my data as a DocBin, I create the DocBin with store_user_data=True in order to save my extension.
Question: Am I implementing the extension feature incorrectly? Any thoughts on how I might proceed? Any suggestions are more than welcome!
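For reference, the workflow described above might look roughly like the sketch below. This is not the original code from the post, only a minimal reconstruction under some assumptions: a blank English pipeline (spacy.blank("en")) stands in for whatever pipeline was actually used, and a small hypothetical list of (index, text) pairs named records stands in for the Pandas Series (Series.items() would yield pairs of the same shape).

```python
import spacy
from spacy.tokens import Doc, DocBin

# Register the custom attribute once, before it is set on any Doc.
if not Doc.has_extension("idx"):
    Doc.set_extension("idx", default=None)

# Assumption: a blank pipeline, so no trained model has to be downloaded.
nlp = spacy.blank("en")

# Hypothetical stand-in for the Series: (index label, text) pairs.
records = [(101, "First document."), (205, "Second document.")]

# With as_tuples=True, nlp.pipe takes (text, context) pairs and
# yields (doc, context) pairs, so each index stays with its text.
doc_bin = DocBin(store_user_data=True)
pairs = ((text, idx) for idx, text in records)
for doc, idx in nlp.pipe(pairs, as_tuples=True):
    doc._.idx = idx  # stored in doc.user_data under the hood
    doc_bin.add(doc)

# Serialize, then load again; store_user_data=True is passed on the
# loading side as well so that the extension values are restored.
data = doc_bin.to_bytes()
loaded = DocBin(store_user_data=True).from_bytes(data)
restored = list(loaded.get_docs(nlp.vocab))
restored_idx = [doc._.idx for doc in restored]
```

Here the index travels alongside each text through nlp.pipe instead of being matched up afterwards, which keeps the pairing correct even if the loop is later changed to process the texts in batches.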
Further details