Sizing and controlling GPU memory for training #9451
Replies: 6 comments 16 replies
-
There are two batch sizes to adjust for GPU RAM: the pipeline batch size ([nlp] batch_size, used when the pipeline processes texts, e.g. during evaluation) and the training batch size ([training.batcher]).
The training batch size may have some minor effects on performance. The main corpus parameter to consider is text length, which is why the corpus reader also has the option (max_length) to break texts up into individual sentences if they are beyond a specified token length. We use a lower …
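As a concrete illustration (not from the reply above): both knobs, plus the corpus reader's max_length, can be overridden from Python when kicking off training. The sketch below assumes a default-style config that uses the batch_by_words batcher with a compounding size schedule; the paths and values are placeholders.

```python
from spacy.cli.train import train  # Python entry point for `spacy train` (spaCy 3.2+)

# Hypothetical paths and illustrative values - adjust to your own project.
train(
    "config.cfg",
    output_path="./output",
    use_gpu=0,
    overrides={
        # Pipeline batch size: used by nlp.pipe(), i.e. mainly during evaluation.
        "nlp.batch_size": 64,
        # Training batch size: cap (in words) of the compounding batch-by-words schedule.
        "training.batcher.size.stop": 500,
        # Corpus reader: split training docs longer than 200 tokens into sentences.
        "corpora.train.max_length": 200,
    },
)
```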
-
Well, I have gained a bit more experience since I wrote the above. I bought a 12GB RTX 3060 ... and my struggle with GPU out-of-memory errors did not end there. The first question to ask is 'in which training phase do you run out of memory'.
I kept running out of GPU memory in the evaluate() phase, and learned (see #9602) that evaluate() is designed to keep the entire 'dev' corpus in memory (twice, plus tensor data). Initially I tried using [corpora.dev] max_length = 200, but (after working through language.py) I learned that it is useless: regardless of how you break up your 'dev' corpus, it will ALL be loaded into memory at once, and the tensors are kept on the GPU until the last document is scored. Note that when using the spacy evaluate command, your 'eval' data sample may be bigger than 'dev', but not by much (same code).
IF you are running out of memory during training (update), the main parameters that do seem to matter are: ...
That said, perhaps I would try running only the NER first, and only after you succeed there would I look at what "relation_extractor" adds.
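Given that evaluate() holds everything at once, one possible workaround is to score the 'dev' set in chunks yourself after training. This is a rough sketch, not a built-in spaCy feature, and naively averaging per-chunk scores only approximates a single evaluate() over the whole corpus; the paths and chunk size are hypothetical.

```python
import spacy
from spacy.tokens import DocBin
from spacy.training import Example

# Score the 'dev' set chunk by chunk so the GPU never holds all documents at once.
spacy.require_gpu()
nlp = spacy.load("./output/model-best")

docs = list(DocBin().from_disk("./corpus/dev.spacy").get_docs(nlp.vocab))

chunk_size = 200
ner_f_scores = []
for start in range(0, len(docs), chunk_size):
    chunk = docs[start:start + chunk_size]
    # Reference docs carry the gold annotations; predictions start from blank docs.
    examples = [Example(nlp.make_doc(doc.text), doc) for doc in chunk]
    scores = nlp.evaluate(examples)
    ner_f_scores.append(scores["ents_f"])

print("approx. NER F-score:", sum(ner_f_scores) / len(ner_f_scores))
```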
-
Regarding max_length: your [corpora.train] has it commented out.
As I wrote above, using max_length for [corpora.dev] (i.e. for evaluate()) is pointless, because the code does its best to load all the data into memory regardless of the batch_size (which is only used by the pipeline components). There were some comments that using a max_length other than zero may lead to 'inconsistent results', especially when there are no good sentence markers. I use the sentencizer (in my data preparation pipeline) and I have 'good' sentence markers, and I was still getting inconsistent results. Using max_length = 200 for [corpora.train] did not help with the GPU OOM, so I abandoned it.
Below is the config.cfg which handles my data set of 9738 docs averaging 956 words (spaCy tokens) per document (with the max about 1.5 times the average). The 'dev' set uses only 400 such docs; above that I get GPU OOM in evaluate() during training. Windows 10, the GPU is an RTX 3060 12GB with NVIDIA CUDA 11.5 installed, using spaCy 3.2. config.cfg:
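For sizing the 'dev' sample along these lines, here is a small sketch (with hypothetical paths and an illustrative 400-doc cap, not the configuration described above) of inspecting a DocBin corpus and keeping only the first N docs for evaluation:

```python
import spacy
from spacy.tokens import DocBin

# Hypothetical paths; report corpus statistics and cap the 'dev' sample size.
nlp = spacy.blank("en")

docs = list(DocBin().from_disk("./corpus/dev_full.spacy").get_docs(nlp.vocab))
lengths = [len(doc) for doc in docs]
print(f"{len(docs)} docs, avg {sum(lengths) / len(lengths):.0f} tokens, max {max(lengths)}")

# Keep only the first 400 docs for evaluation to stay within GPU memory.
DocBin(docs=docs[:400], store_user_data=True).to_disk("./corpus/dev.spacy")
```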
-
@mbrunecky Btw, I am training an NER + relation-extraction model together, with a 5000-document training dataset and a 500-document dev dataset. The following image shows GPU usage while training. Any suggestions on increasing the dev data would also be helpful, since as you can see there is still some unused memory left on the GPU:
https://user-images.githubusercontent.com/49562460/143767020-13beee38-0bd1-4a2b-ae8b-477e512bbb21.png
-
On Windows, the Task Manager GPU performance tab provides nice graphs showing GPU memory, CUDA, Copy and Copy2 usage – I use the 'slow' refresh.
GPU memory usage with torch is tricky, because its memory allocator keeps the GPU memory even when it is not in use, and then seems to release it at some (perhaps predictable) points.
I found that the maximum GPU memory usage can go up at any time (often late) during the first epoch – there may be some bigger document 'later' in the data. I relax only after the first epoch has completed. And my GPU memory usage eventually grows to 11.5 GB, because if I see less usage, I just add more DocBins to my 'dev' sample for the next run.
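To see what torch's caching allocator is actually holding (this only covers the PyTorch/transformer side; spaCy's own GPU ops allocate through cupy), a small helper sketch:

```python
import torch

def report_gpu_memory(label: str) -> None:
    """Print how much GPU memory torch has allocated vs. reserved (cached)."""
    mb = 1024 ** 2
    print(
        f"{label}: "
        f"allocated={torch.cuda.memory_allocated() / mb:.0f}MB "
        f"reserved={torch.cuda.memory_reserved() / mb:.0f}MB "
        f"peak={torch.cuda.max_memory_allocated() / mb:.0f}MB"
    )

report_gpu_memory("before")
# ... run a training / evaluation step here ...
torch.cuda.empty_cache()  # hand cached-but-unused blocks back to the driver
report_gpu_memory("after empty_cache()")
```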
-
@mbrunecky Hi, any idea why each character of the word "UNMARRIED" is being predicted as PARTY_IDCODE, rather than the whole word "UNMARRIED" being predicted as PARTY_IDCODE? Please help me understand.
-
Background:
In spaCy 2.3 I was able to use an 8GB GPU for all my NER training, getting about 3x better performance. With spaCy 3, the documentation suggests 'at least 10GB', and sure enough, the same models run out of 8GB of GPU memory (unless I tweak down the parameters - but then my prediction accuracy suffers).
With the addition of transformers (and they are great), the GPU memory needs go even higher (at least I am unable to 'tweak down' my configuration to fit into 8GB).
That said, GPU memory usage is significantly lower in production (using trained models to get predictions). I can easily use my 8GB GPU for my (CPU-only trained) transformer models or en_core_web_trf - and get the speed.
This raises two questions: how much GPU memory do I need to plan for (sizing), and how can I control GPU memory usage during training?
I do not want to buy another GPU card (especially at today's prices) and later find out that (for example) 12GB is not enough and I need another upgrade. On the other hand, training a transformer on CPU only (even with 20 cores) takes too many days...
On the subject of 'controlling GPU memory usage':
It seems that batch_size has only a limited impact (it is used mainly during validation), so reducing it adds some overhead and only saves memory there. For tok2vec training, the significant parameters seem to be the encoder width and depth (and there are probably more).
Reducing width/depth saves memory - but (in my case) goes against accuracy.
And (despite lots of trying) I have not found anything that would reduce the transformer memory greed :-).
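For reference, a sketch of the kinds of settings discussed here, expressed as training overrides. The key paths assume the default MaxoutWindowEncoder tok2vec architecture or, for transformer pipelines, the strided_spans span getter; the values are only illustrative.

```python
from spacy.cli.train import train  # Python entry point for `spacy train` (spaCy 3.2+)

# Illustrative values only; each dict applies only to configs that actually
# contain those sections.
tok2vec_overrides = {
    # Smaller encoder width/depth reduces memory (and, as noted, may cost accuracy).
    "components.tok2vec.model.encode.width": 96,
    "components.tok2vec.model.encode.depth": 4,
}
transformer_overrides = {
    # Shorter, less-overlapping spans mean smaller padded batches on the GPU.
    "components.transformer.model.get_spans.window": 96,
    "components.transformer.model.get_spans.stride": 80,
    # Trade memory for time: a smaller pipeline batch plus gradient accumulation.
    "nlp.batch_size": 32,
    "training.accumulate_gradient": 3,
}

# Pick the dict that matches your pipeline (hypothetical paths).
train("config.cfg", output_path="./output", use_gpu=0, overrides=tok2vec_overrides)
```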