Training Speed compared to Scikit-Learn on BOW #6880
-
I'm trying out spaCy v3 with a textcat model. I figured it would be fun to compare scikit-learn to spaCy on an intent classification task. I got a new project working from scratch on a silly dataset locally, and training showed incremental progress.
Given that I had a basic example running, I started wondering what was happening under the hood and what kind of model it was using. It was running somewhat slowly on my machine, so I looked at the model configuration:

```ini
[components.textcat.model]
@architectures = "spacy.TextCatBOW.v1"
exclusive_classes = true
ngram_size = 1
no_output_layer = false
nO = null
```

This surprised me. I've also got a bag-of-words pipeline in scikit-learn (count vectors + logistic regression), but scikit-learn runs for about 1 minute while spaCy easily seems to take tens of minutes. So I'm wondering: am I doing something wrong? Or should we expect that spaCy uses a more general optimizer and will be slower than scikit-learn's logistic regression implementation?
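For reference, the scikit-learn baseline I'm comparing against looks roughly like this. The texts and labels below are illustrative stand-ins, not my actual dataset:

```python
# Minimal sketch of a count-vectors + logistic-regression intent classifier.
# The example texts/labels are hypothetical placeholders.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = [
    "book a flight to paris",
    "what's the weather today",
    "reserve a table for two",
    "will it rain tomorrow",
]
labels = ["travel", "weather", "travel", "weather"]

# ngram_range=(1, 1) mirrors ngram_size = 1 in the spaCy config above.
clf = make_pipeline(CountVectorizer(ngram_range=(1, 1)), LogisticRegression())
clf.fit(texts, labels)
print(clf.predict(["weather in london"])[0])
```

On my real dataset, a pipeline like this finishes in about a minute.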
Replies: 1 comment 1 reply
-
In general I'd recommend the py-spy profiler to get a quick sense of where all the time is going. You can attach it to a running Python process without altering any code.

My guess is that all the time will be spent in tokenization and optimization, probably optimization in particular. The Adam solver is kind of slow on problems like this, and when you have a task that otherwise runs really quickly, the slowdown becomes more apparent. It could also be the model, though.
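A quick sketch of how you might use py-spy here; the PID and config filename below are placeholders for your actual training process:

```shell
# Attach to an already-running training process (12345 is a placeholder PID)
# and show a live top-like view of where time is spent:
py-spy top --pid 12345

# Or record a flame graph for an entire training run:
py-spy record -o profile.svg -- python -m spacy train config.cfg
```

If most of the samples land in the optimizer rather than the model's forward/backward passes, that would support the Adam hypothesis above.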
Btw an even better toolkit for bag-of-words text classification is Vowpal Wabbit. I've always wanted a spaCy integration for that, because it's really awesome, and it's the right thing to use for a lot of problems.