Problems with Language.evaluate() using transformer/GPU #9602
-
Given the limited GPU memory, you might have an easier time starting with …
-
Since there is no more activity, I suggest closing this with 'no answer'.

The 'final' implementation of my evaluate() enhancement uses an 'aggregator' class (which aggregates the per-batch scores) together with slightly different modifications to the scoring methods in scorer.py. By honoring the batch_size, it eliminates the GPU OOM.

I gave up on investigating the 'discrepancies', mainly because I stopped using [corpus.xxx] max_length (which did not help with the GPU OOM problem) - and the discrepancies disappeared.

I upgraded my hardware to a 12 GB GPU, assuming this would solve my problems. It did NOT. With unchanged spaCy 3.2 code, using transformer NER, the training-time evaluate() runs out of the 12 GB of GPU memory once the 'dev' sample exceeds roughly 400 documents averaging 1000 words. A standalone 'spacy evaluate' fares better, but still severely limits the corpus size.

Since I consider the evaluate() data-size limitation a serious spaCy problem, I will submit an enhancement request.
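To illustrate the idea, here is a rough sketch (not the actual aggregator/scorer.py changes); it assumes that clearing the Doc._.trf_data extension set by spacy-transformers is enough to let the per-doc GPU buffers go:

```python
from spacy.scorer import Scorer
from spacy.tokens import Doc
from spacy.training import Example
from spacy.util import minibatch

def evaluate_batched(nlp, gold_docs, batch_size=8):
    """Score a 'dev' corpus batch by batch without keeping GPU tensors alive.

    gold_docs: annotated Doc objects, e.g. from DocBin.get_docs(nlp.vocab).
    """
    examples = []
    for batch in minibatch(gold_docs, size=batch_size):
        texts = [doc.text for doc in batch]
        for pred, gold in zip(nlp.pipe(texts, batch_size=batch_size), batch):
            # Drop the transformer output attached by spacy-transformers so the
            # predicted Doc no longer references GPU-allocated arrays.
            if Doc.has_extension("trf_data"):
                pred._.trf_data = None
            examples.append(Example(pred, gold))
    # The kept predictions now only carry CPU annotations (ents, tags, ...),
    # so a single final scoring pass works without the per-doc GPU growth.
    return Scorer(nlp).score(examples)
```

Processing the gold docs in batch_size-sized chunks and stripping the transformer output before accumulating the Examples keeps the evaluation loop's GPU footprint flat.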
-
I am having two problems with Language.evaluate() running against a ["transformer","ner"] model:
I apologize for being verbose. I LOVE Spacy3 ... just keep driving myself into a ditch.
Using spaCy version 3.1.3, Windows, Python 3.9.7
Problem #1: Language.evaluate() in GPU mode keeps growing allocated GPU memory.
As the evaluate() iterates over the evaluated data set, the allocated GPU memory keeps linearly growing (~7 MB/doc).
The GPU memory gets released at the end of evaluation (when the evaluate() returns).
This is a major problem, because it limits the size of the 'dev' corpus (otherwise you run out of GPU memory).
The linear growth manifests itself both during training AND when running 'spacy evaluate'.
The easiest way to reproduce the problem is to replicate the same DocBin many times and run 'spacy evaluate' against it (see the sketch below). Using the same DocBin rules out issues such as vocabulary growth or other (transformer) model 'entropy' - which must be minimal anyway, since the GPU memory does get released, just not until evaluate() returns.
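For example (the paths and copy count below are placeholders, not the ones I actually used):

```python
import spacy
from spacy.tokens import DocBin

SRC = "corpus/dev.spacy"        # hypothetical input path
DST = "corpus/dev_x20.spacy"    # hypothetical output path
COPIES = 20

nlp = spacy.blank("en")         # only the vocab is needed to deserialize the docs
docs = list(DocBin().from_disk(SRC).get_docs(nlp.vocab))

inflated = DocBin(store_user_data=True)
for _ in range(COPIES):
    for doc in docs:
        inflated.add(doc)
inflated.to_disk(DST)

# Then watch GPU memory while running, e.g.:
#   python -m spacy evaluate ./training/model-best corpus/dev_x20.spacy --gpu-id 0
```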
Looking at the code, evaluate() iterates over a set of Example objects (each holding two copies of the 'dev' document) and keeps all of these (cloned) objects in memory.
It seems that the pipe invocation attaches some GPU-allocated data to each Example object and does not release it until all Example objects are freed/deleted.
I doubt that the final scorer.score(examples) needs any GPU data.
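One way to observe this directly, outside of evaluate(), is to mimic what it does (accumulate Examples) while polling CuPy's memory pool. A sketch, assuming a CuPy-backed GPU setup and hypothetical model/corpus paths:

```python
import cupy
import spacy
from spacy.tokens import DocBin
from spacy.training import Example

spacy.require_gpu()
nlp = spacy.load("./model-best")            # hypothetical model path
gold_docs = DocBin().from_disk("corpus/dev.spacy").get_docs(nlp.vocab)  # hypothetical path

pool = cupy.get_default_memory_pool()
examples = []
for i, gold in enumerate(gold_docs, start=1):
    pred = nlp(gold.text)
    examples.append(Example(pred, gold))    # keep them around, as evaluate() does
    if i % 50 == 0:
        print(f"{i:5d} docs kept, {pool.used_bytes() / 1e6:9.1f} MB of GPU memory in use")
```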
Problem #2: The evaluate() discrepancy:
Looking at the code, I assume Language.evaluate() is the same method used both during model training and by 'spacy evaluate', and that it (by default) uses the same config.cfg as the training.
Yet the reported scores (f, p, r) are significantly different despite using the same model and the same 'dev' corpus - see details below.
The f-score difference is 0.965 vs 0.893, roughly 8% (at iteration 24900).
At the end of training (30000 iterations, model-best) it is 0.970 vs 0.916 = 6%.
My training data corpus is 9738 documents averaging 954 words and 3.34 NAME_FROM / 2.60 NAME_TO annotated NER entities per document. The training and dev data are converted from 'test format' to Doc objects using this pipeline:
and
The 'dev' data corpus (created the same way) has been trimmed down to 500 random documents (down from my desired 2000 because of the problem #1).
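Purely for illustration, a generic text-to-DocBin conversion (not my actual pipeline) might look like this, assuming records of (text, [(start_char, end_char, label), ...]):

```python
import spacy
from spacy.tokens import DocBin

def records_to_docbin(records, out_path):
    """records: iterable of (text, [(start_char, end_char, label), ...]) tuples."""
    nlp = spacy.blank("en")                 # tokenizer only; entities are added by hand
    db = DocBin(store_user_data=True)
    for text, spans in records:
        doc = nlp(text)
        ents = []
        for start, end, label in spans:
            span = doc.char_span(start, end, label=label, alignment_mode="expand")
            if span is not None:
                ents.append(span)
        doc.ents = ents
        db.add(doc)
    db.to_disk(out_path)
```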
Below I am showing the results at iteration 24900, where the training is already stable. I noticed the problem at the completion of the previous run. That previous run had yet another problem: despite the completed model training, I could not run 'spacy evaluate' against the 'dev' corpus used in training - it ran out of GPU memory (until I reduced the 'dev' corpus size by about 10%).
At iteration 24900:
from model-last/meta.json at iteration 24900:
Running 'spacy evaluate' against a copy of model-last (cloned at iteration 24900), CPU-only:
My config.cfg :
Note that several parameters are severely 'tweaked' to fit the training into the available 6 GB of GPU memory.
Specifically, I am using corpora.dev max_length = 208 and a very small batch_size = 8.