What is your actual train command? If the GPU isn't being used, it could be that training is running a transformer on CPU, which is possible but extremely slow. That would be consistent with your nvidia-smi output.
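For reference, assuming this is spaCy's v3 CLI (the actual command isn't shown above, so the paths here are placeholders), a training run that uses the GPU would look something like:

```sh
# Hypothetical invocation; config.cfg and ./output are placeholders.
# --gpu-id 0 selects the first GPU; omitting it trains on CPU.
python -m spacy train config.cfg --output ./output --gpu-id 0
```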

As you suspect, the whole training set is loaded into memory by default. If that's an issue, you can use a custom loader to stream the corpus instead; see here.
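The library-specific plumbing for a custom reader lives in the training config, but the core streaming idea is just a generator that yields one example at a time instead of materializing the whole corpus. A minimal, library-free sketch (the JSONL format and function name are illustrative assumptions, not the project's actual API):

```python
import json
import tempfile
from pathlib import Path

def stream_corpus(path):
    """Yield parsed examples one at a time so the corpus never sits fully in memory."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)

# Demo with a tiny throwaway JSONL corpus.
corpus = Path(tempfile.mkdtemp()) / "corpus.jsonl"
corpus.write_text('{"text": "first doc"}\n{"text": "second doc"}\n', encoding="utf-8")

reader = stream_corpus(corpus)
first = next(reader)  # only one line has been read and parsed at this point
```

Because the reader is lazy, peak memory stays proportional to one example rather than the full training set.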

Also, just to rule out other issues, you might try training with a non-transformer base model.
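Again assuming spaCy v3 (an assumption about the setup, not the poster's actual commands), one way to get a non-transformer baseline is to generate a CPU-optimized config and train from that:

```sh
# "--optimize efficiency" produces a tok2vec-based (non-transformer) pipeline.
# Language, pipeline components, and output paths are placeholders.
python -m spacy init config cpu_config.cfg --lang en --pipeline ner --optimize efficiency
python -m spacy train cpu_config.cfg --output ./out_cpu
```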

It sounds like you are having out of memory issues, though I wouldn't normally expect that to crash the VM.

Answer selected by polm
Labels: perf / memory (Performance: memory use)