Training Speed compared to Scikit-Learn on BOW #6880
-
I'm trying out spaCy v3 with a textcat model. I figured it would be fun to compare scikit-learn to spaCy on an intent classification task. I got a new project working from scratch on a silly dataset locally, and training showed incremental progress.
Given that I had a basic example running, I started wondering what was happening under the hood and what kind of model it was using. It was running somewhat slowly on my machine, so I looked at the model configuration:

```ini
[components.textcat.model]
@architectures = "spacy.TextCatBOW.v1"
exclusive_classes = true
ngram_size = 1
no_output_layer = false
nO = null
```

This surprised me. I've also got a bag-of-words pipeline in scikit-learn (count vectors + logistic regression), but scikit-learn runs for about 1 minute while spaCy easily seems to take tens of minutes. So I'm wondering: am I doing something wrong? Or should we expect that spaCy uses a more general optimizer and will be slower than scikit-learn's logistic regression implementation?
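For reference, the scikit-learn baseline I'm comparing against looks roughly like this. The texts and labels below are illustrative stand-ins, not my actual dataset:

```python
# Minimal sketch of a count-vectors + logistic-regression intent classifier.
# The example texts/labels are hypothetical placeholders.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = [
    "book a flight to paris",
    "what's the weather today",
    "reserve a table for two",
    "will it rain tomorrow",
]
labels = ["travel", "weather", "travel", "weather"]

# ngram_range=(1, 1) mirrors ngram_size = 1 in the spaCy config above.
clf = make_pipeline(CountVectorizer(ngram_range=(1, 1)), LogisticRegression())
clf.fit(texts, labels)
print(clf.predict(["weather in london"])[0])
```

On my real dataset, a pipeline like this finishes in about a minute.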
Replies: 1 comment 1 reply
-
In general I'd recommend the py-spy profiler to get a quick sense of where all the time is going. You can attach it to a running Python process without altering any code.

My guess is that all the time will be spent in tokenization and optimization, probably optimization in particular. The Adam solver is kind of slow on problems like this, and when you have a task that otherwise runs really quickly, the slowdown becomes more apparent. It could also be the model, though.
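A quick sketch of how you might use py-spy here; the PID and config filename below are placeholders for your actual training process:

```shell
# Attach to an already-running training process (12345 is a placeholder PID)
# and show a live top-like view of where time is spent:
py-spy top --pid 12345

# Or record a flame graph for an entire training run:
py-spy record -o profile.svg -- python -m spacy train config.cfg
```

If most of the samples land in the optimizer rather than the model's forward/backward passes, that would support the Adam hypothesis above.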
Btw an even better toolkit for bag-of-words text classification is Vowpal Wabbit. I've always wanted a spaCy integration for that, because it's really awesome, and it's the right thing to use for a lot of problems.