Training the textcat component: what else to include in the training config? #11394
I'm training the `textcat` component. Our pipeline involves several other components that run before and after it in a fixed order, and a few weeks ago I had posted a question about this setup. I'm still unsure about a couple of points.

First, when I prepare my corpus for training, do the DocBins need the annotations produced by the rest of the pipeline, or only the gold `textcat` labels? My configuration to train `textcat` currently lists just that component. As I'm training only `textcat`, that seemed sufficient, but my colleague has said that the rest of the pipeline is needed as well, because of how spaCy trains the component. With the configuration above I've successfully completed a training run, and my trained component appears to work. Last, what attributes can the `textcat` model actually make use of, for example entity annotations set by `ner`?

Further to the above, I've since updated my training configuration file to include the full pipeline, starting with a weights section that covers every component. After filling in the rest of the config, I'm still not sure what is actually required: as I'm only interested in training the `textcat` component, does everything else need to be listed at all?
Replies: 1 comment 3 replies
It is basically correct that the components in `nlp.pipeline` should only be the ones you are interested in training. However, there is a wrinkle to this.

When you train a statistical model, it needs a source of features. In spaCy pipelines that's going to be a tok2vec or Transformer (with one exception, covered below). When you train a model, it's usually better to train the feature source with it at the same time, so in your case it would make sense to train a tok2vec and a textcat together. This does have a side effect: because the tok2vec has changed, any components you weren't training at the same time no longer work with it, since it is now speaking a different language. One way to work around that is to train everything together, but it's easier, and you often get the same performance, to just include multiple tok2vecs in a pipeline. See the docs on sharing embedding layers for more information about that.

Also, you absolutely do not need to run your training data through the whole pipeline before preparing your DocBins, since the annotations you'd be applying will basically be ignored.

The one exception to statistical pipelines needing a tok2vec: textcat can run with just a bag-of-words architecture, in which case you don't need a tok2vec. That architecture is very fast but typically has relatively low accuracy.

Also, when in doubt, I strongly recommend trying the default settings from the training quickstart. The defaults are pretty good!
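Two of these points are easy to show in code. First, on preparing your corpus: below is a minimal sketch (the texts, labels, and file path are made up for illustration) of building a training DocBin with nothing but a blank tokenizer and the gold `textcat` labels, without running any other components:

```python
import spacy
from spacy.tokens import DocBin

# A blank pipeline only tokenizes; no tagger/parser/ner has to run here.
nlp = spacy.blank("en")

# Hypothetical (text, label) pairs standing in for a real corpus.
train_data = [
    ("The service was excellent.", "POSITIVE"),
    ("I will never order from them again.", "NEGATIVE"),
]

doc_bin = DocBin()
for text, label in train_data:
    doc = nlp.make_doc(text)
    # Gold category scores: every label appears on every doc,
    # with the correct one set to 1.0.
    doc.cats = {"POSITIVE": 0.0, "NEGATIVE": 0.0}
    doc.cats[label] = 1.0
    doc_bin.add(doc)

doc_bin.to_disk("./train.spacy")  # hypothetical output path
```

Second, on the bag-of-words exception: a sketch (assuming a recent spaCy v3.x with the `spacy.TextCatBOW.v2` architecture) of a pipeline whose only trainable component is a bag-of-words textcat, with no tok2vec at all:

```python
import spacy

nlp = spacy.blank("en")
# Override the default ensemble model with the bag-of-words architecture,
# which does not need a tok2vec or transformer as a feature source.
nlp.add_pipe(
    "textcat",
    config={
        "model": {
            "@architectures": "spacy.TextCatBOW.v2",
            "exclusive_classes": True,
            "ngram_size": 1,
            "no_output_layer": False,
        }
    },
)
```

The same override can of course live in the `[components.textcat.model]` block of a training config instead of being passed to `nlp.add_pipe`.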
If you're using textcat with a tok2vec, you can customize the attributes the tok2vec uses; see the embedding layer architectures in the docs. You can use any of the token attributes, but note that entity-related attributes are among those that exist on the tokens yet won't actually be set when the tok2vec runs normally. There have also been questions about using NER features in textcat before, see #10470, but basically it probably won't be very effective.
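To make the attributes point concrete: the token attributes are selected through the `attrs` setting of the embedding layer, e.g. `MultiHashEmbed`. Below is a sketch (assuming a recent spaCy v3.x; the widths, row counts, and attribute list are example values, not a recommendation) of a tok2vec configured with explicit attributes:

```python
import spacy

nlp = spacy.blank("en")
nlp.add_pipe(
    "tok2vec",
    config={
        "model": {
            "@architectures": "spacy.Tok2Vec.v2",
            "embed": {
                "@architectures": "spacy.MultiHashEmbed.v2",
                "width": 96,
                # Token attributes used as features. These lexical attributes
                # are available straight from the tokenizer; entity attributes
                # would not be filled in at this point, so they aren't useful here.
                "attrs": ["NORM", "PREFIX", "SUFFIX", "SHAPE"],
                "rows": [5000, 2500, 2500, 2500],
                "include_static_vectors": False,
            },
            "encode": {
                "@architectures": "spacy.MaxoutWindowEncoder.v2",
                "width": 96,
                "depth": 4,
                "window_size": 1,
                "maxout_pieces": 3,
            },
        }
    },
)
```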