What is your actual train command? If the GPU isn't being used, it could be that training is running a transformer on CPU, which is possible but extremely slow. That would be consistent with your nvidia-smi output.
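For reference, assuming this is spaCy's v3 CLI (the actual command isn't shown above, so the paths here are placeholders), a training run that uses the GPU would look something like:

```sh
# Hypothetical invocation; config.cfg and ./output are placeholders.
# --gpu-id 0 selects the first GPU; omitting it trains on CPU.
python -m spacy train config.cfg --output ./output --gpu-id 0
```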

As you suspect, the whole training set is loaded into memory by default. If that's an issue, you can use a custom loader to stream the corpus instead; see here.
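The library-specific plumbing for a custom reader lives in the training config, but the core streaming idea is just a generator that yields one example at a time instead of materializing the whole corpus. A minimal, library-free sketch (the JSONL format and function name are illustrative assumptions, not the project's actual API):

```python
import json
import tempfile
from pathlib import Path

def stream_corpus(path):
    """Yield parsed examples one at a time so the corpus never sits fully in memory."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)

# Demo with a tiny throwaway JSONL corpus.
corpus = Path(tempfile.mkdtemp()) / "corpus.jsonl"
corpus.write_text('{"text": "first doc"}\n{"text": "second doc"}\n', encoding="utf-8")

reader = stream_corpus(corpus)
first = next(reader)  # only one line has been read and parsed at this point
```

Because the reader is lazy, peak memory stays proportional to one example rather than the full training set.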

Also, just to rule out other issues, you might try training with a non-transformer base model.
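Again assuming spaCy v3 (an assumption about the setup, not the poster's actual commands), one way to get a non-transformer baseline is to generate a CPU-optimized config and train from that:

```sh
# "--optimize efficiency" produces a tok2vec-based (non-transformer) pipeline.
# Language, pipeline components, and output paths are placeholders.
python -m spacy init config cpu_config.cfg --lang en --pipeline ner --optimize efficiency
python -m spacy train cpu_config.cfg --output ./out_cpu
```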

It sounds like you are having out of memory issues, though I wouldn't normally expect that to crash the VM.

Answer selected by polm
Labels: perf / memory (Performance: memory use)