Speed up initialization of pipeline #7760
By default, the labels are read from the training data during initialization. To skip that step, you can provide them in a plain JSON file, e.g.:

```json
["A", "B"]
```

The relevant part of the config looks like this:

```ini
[initialize.components.textcat]

[initialize.components.textcat.labels]
@readers = "spacy.read_labels.v1"
path = "corpus/labels/tagger.json"
require = false
```

Be aware that the training loop reads the whole train corpus into memory by default, which can be a problem for really large train corpora. We're working on making this more flexible (#7208), which should land in the upcoming v3.0.6.
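As a rough sketch of what the `spacy.read_labels.v1` reader expects, the labels file is just a JSON list of label strings, so you can write one yourself (the label set `["A", "B"]` here is only an illustration, and the path simply mirrors the config snippet above):

```python
import json
from pathlib import Path

# Hypothetical label set for illustration; use your own textcat labels.
labels = ["A", "B"]

# Write the labels to the path that the config's
# [initialize.components.textcat.labels] block points at.
path = Path("corpus/labels/tagger.json")
path.parent.mkdir(parents=True, exist_ok=True)
path.write_text(json.dumps(labels))

# The reader expects a plain JSON list of label strings, so loading
# the file back should yield the same list.
loaded = json.loads(path.read_text())
print(loaded)
```

spaCy also provides `python -m spacy init labels` to generate these files from a config and your corpus ahead of time, so the training run itself doesn't have to infer them.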
Whenever I run `python -m spacy train` with GPU enabled and with a `train` and `dev` corpus of 1M and 300K docs respectively, startup is very slow. The log is stuck for 5-15 minutes after the `Finished initializing nlp object` line during the initialization part of training. I am wondering what actually happens behind the scenes and what factors could help speed up this part.

My theory is that there is a large memory transfer to the GPU, which is why there are minutes of waiting when initializing the pipeline.

EDIT: I am running spaCy on a Windows 10 x64 laptop with an Intel i7-8550U, 20 GB RAM, and an MX150 GPU with 4 GB VRAM.