Speed up initialization of pipeline #7760
By default, the labels are read from the training data during initialization. To skip that step, you can provide them in a plain JSON file, e.g.:

```json
["A", "B"]
```

The relevant part of the config looks like this:

```ini
[initialize.components.textcat]

[initialize.components.textcat.labels]
@readers = "spacy.read_labels.v1"
path = "corpus/labels/tagger.json"
require = false
```

Be aware that the training loop reads the whole train corpus into memory by default, which can be a problem for really large train corpora. We're working on making this more flexible (#7208), which should land in the upcoming v3.0.6.
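As a rough sketch of what the `spacy.read_labels.v1` reader expects, the labels file is just a JSON list of label strings, so you can write one yourself (the label set `["A", "B"]` here is only an illustration, and the path simply mirrors the config snippet above):

```python
import json
from pathlib import Path

# Hypothetical label set for illustration; use your own textcat labels.
labels = ["A", "B"]

# Write the labels to the path that the config's
# [initialize.components.textcat.labels] block points at.
path = Path("corpus/labels/tagger.json")
path.parent.mkdir(parents=True, exist_ok=True)
path.write_text(json.dumps(labels))

# The reader expects a plain JSON list of label strings, so loading
# the file back should yield the same list.
loaded = json.loads(path.read_text())
print(loaded)
```

spaCy also provides `python -m spacy init labels` to generate these files from a config and your corpus ahead of time, so the training run itself doesn't have to infer them.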
Whenever I run `python -m spacy train` with GPU enabled and with a `train` and `dev` corpus of 1M and 300K docs respectively, startup is very slow. The log is stuck for 5-15 minutes after the `Finished initializing nlp object` line during the initialization part of training. I am wondering what actually happens behind the scenes and what factors could help speed up this part.

My theory is that there is a large memory transfer to the GPU, which is why there are minutes of waiting when initializing the pipeline.

EDIT: I am running spaCy on a Windows 10 x64 laptop with an Intel i7-8550U, 20 GB RAM, and an MX150 GPU with 4 GB VRAM.