Problems with Language.evaluate() using transformer/GPU #9602
-
Given the limited GPU memory, you might have an easier time starting with …
-
Since there is no more activity, I suggest closing this with 'no answer'.

The 'final' implementation of my evaluate() enhancement uses an 'aggregator' class (which aggregates the per-batch scores) together with slightly different modifications to the scoring methods in scorer.py. By honoring the batch_size, it eliminates the GPU OOM.

I gave up on investigating the 'discrepancies', mainly because I stopped using [corpus.xxx] max_length (which did not help with the GPU OOM problem) - and the discrepancies disappeared.

I upgraded my hardware to a 12 GB GPU, assuming this would solve my problems. It did NOT. With unchanged spaCy 3.2 code, using transformer NER, the training-time evaluate() runs out of the 12 GB of GPU memory once the 'dev' sample exceeds roughly 400 documents averaging 1000 words. A standalone 'spacy evaluate' fares better, but still severely limits the corpus size.

Since I consider the evaluate() data-size limitation a serious spaCy problem, I will submit an enhancement request.
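To illustrate the idea, here is a rough sketch (not the actual aggregator/scorer.py changes); it assumes that clearing the Doc._.trf_data extension set by spacy-transformers is enough to let the per-doc GPU buffers go:

```python
from spacy.scorer import Scorer
from spacy.tokens import Doc
from spacy.training import Example
from spacy.util import minibatch

def evaluate_batched(nlp, gold_docs, batch_size=8):
    """Score a 'dev' corpus batch by batch without keeping GPU tensors alive.

    gold_docs: annotated Doc objects, e.g. from DocBin.get_docs(nlp.vocab).
    """
    examples = []
    for batch in minibatch(gold_docs, size=batch_size):
        texts = [doc.text for doc in batch]
        for pred, gold in zip(nlp.pipe(texts, batch_size=batch_size), batch):
            # Drop the transformer output attached by spacy-transformers so the
            # predicted Doc no longer references GPU-allocated arrays.
            if Doc.has_extension("trf_data"):
                pred._.trf_data = None
            examples.append(Example(pred, gold))
    # The kept predictions now only carry CPU annotations (ents, tags, ...),
    # so a single final scoring pass works without the per-doc GPU growth.
    return Scorer(nlp).score(examples)
```

Processing the gold docs in batch_size-sized chunks and stripping the transformer output before accumulating the Examples keeps the evaluation loop's GPU footprint flat.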
-
I am having two problems with Language.evaluate() running against a ["transformer","ner"] model:
I apologize for being verbose. I LOVE Spacy3 ... just keep driving myself into a ditch.
Using spaCy version 3.1.3, Windows, Python 3.9.7
Problem #1: Language.evaluate() in GPU mode keeps growing allocated GPU memory.
As the evaluate() iterates over the evaluated data set, the allocated GPU memory keeps linearly growing (~7 MB/doc).
The GPU memory gets released at the end of evaluation (when the evaluate() returns).
This is a major problem, because it limits the size of the 'dev' corpus (otherwise you run out of GPU memory).
The linear growth manifests itself both during training AND when running 'spacy evaluate'.
The easiest way to reproduce the problem is to replicate the same DocBin many times and run 'spacy evaluate' against it (see the sketch below). Using the same DocBin rules out issues such as vocabulary growth or other (transformer) model 'entropy' - which must be minimal anyway, since the GPU memory does get released, just not until evaluate() returns.
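For example (the paths and copy count below are placeholders, not the ones I actually used):

```python
import spacy
from spacy.tokens import DocBin

SRC = "corpus/dev.spacy"        # hypothetical input path
DST = "corpus/dev_x20.spacy"    # hypothetical output path
COPIES = 20

nlp = spacy.blank("en")         # only the vocab is needed to deserialize the docs
docs = list(DocBin().from_disk(SRC).get_docs(nlp.vocab))

inflated = DocBin(store_user_data=True)
for _ in range(COPIES):
    for doc in docs:
        inflated.add(doc)
inflated.to_disk(DST)

# Then watch GPU memory while running, e.g.:
#   python -m spacy evaluate ./training/model-best corpus/dev_x20.spacy --gpu-id 0
```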
Looking at the code, evaluate() iterates over a set of Example objects (each holding two copies of the 'dev' document) and keeps all of these (cloned) objects in memory.
It seems that the pipe invocation attaches some GPU-allocated data to each Example object and does not release it until all Example objects are freed/deleted.
I doubt that the final scorer.score(examples) needs any GPU data.
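One way to observe this directly, outside of evaluate(), is to mimic what it does (accumulate Examples) while polling CuPy's memory pool. A sketch, assuming a CuPy-backed GPU setup and hypothetical model/corpus paths:

```python
import cupy
import spacy
from spacy.tokens import DocBin
from spacy.training import Example

spacy.require_gpu()
nlp = spacy.load("./model-best")            # hypothetical model path
gold_docs = DocBin().from_disk("corpus/dev.spacy").get_docs(nlp.vocab)  # hypothetical path

pool = cupy.get_default_memory_pool()
examples = []
for i, gold in enumerate(gold_docs, start=1):
    pred = nlp(gold.text)
    examples.append(Example(pred, gold))    # keep them around, as evaluate() does
    if i % 50 == 0:
        print(f"{i:5d} docs kept, {pool.used_bytes() / 1e6:9.1f} MB of GPU memory in use")
```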
Problem #2: The evaluate() discrepancy:
Looking at the code, I assume Language.evaluate() is the same method used both during model training and by 'spacy evaluate', and that it (by default) uses the same config.cfg as the training.
Yet the reported scores (f, p, r) are significantly different despite using the same model and the same 'dev' corpus - see details below.
The f-score difference is 0.965 vs 0.893, roughly 8% (at iteration 24900).
At the end of training (30000 iterations, model-best) it is 0.970 vs 0.916 = 6%.
My training data corpus is 9738 documents averaging 954 words and 3.34 NAME_FROM / 2.60 NAME_TO annotated NER entities per document. The training and dev data are converted from 'test format' to Doc objects using this pipeline:
and
The 'dev' data corpus (created the same way) has been trimmed down to 500 random documents (down from my desired 2000 because of the problem #1).
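Purely for illustration, a generic text-to-DocBin conversion (not my actual pipeline) might look like this, assuming records of (text, [(start_char, end_char, label), ...]):

```python
import spacy
from spacy.tokens import DocBin

def records_to_docbin(records, out_path):
    """records: iterable of (text, [(start_char, end_char, label), ...]) tuples."""
    nlp = spacy.blank("en")                 # tokenizer only; entities are added by hand
    db = DocBin(store_user_data=True)
    for text, spans in records:
        doc = nlp(text)
        ents = []
        for start, end, label in spans:
            span = doc.char_span(start, end, label=label, alignment_mode="expand")
            if span is not None:
                ents.append(span)
        doc.ents = ents
        db.add(doc)
    db.to_disk(out_path)
```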
Below I am showing the results at iteration 24900, where the training is already stable. I noticed the problem at the completion of the previous run. That previous run had yet another problem: despite the completed model training, I could not run 'spacy evaluate' against the 'dev' corpus used in training - it ran out of GPU memory (until I reduced the 'dev' corpus size by about 10%).
At iteration 24900:
from model-last/meta.json at iteration 24900:
Running 'spacy evaluate' against a copy of model-last (cloned at iteration 24900), CPU-only:
My config.cfg :
Note that several parameters are severely 'tweaked' to fit the training into the available 6 GB of GPU memory.
Specifically, I am using corpora.dev max_length = 208 and a very small batch_size = 8.