CUDA Out Of Memory error when running inference - NER transformer model #11970
Hello 👋 I've been dealing with CUDA Out Of Memory (OOM) errors when running inference with a fine-tuned NER transformer model on Google Compute Engine (GCE). I'm aware this is not a new issue and there are a lot of discussions here about this problem, but most deal with the error at the training stage, and nothing that I learned from those discussions and implemented in my code has worked so far. Thus, I am hoping for some additional advice.

GCE Virtual Machine:
Docker configuration:
Packages installed via pip:
The model:
A fine-tuned NER model, based on spaCy's …

Inference code:
Inference is done via a python module (…)

What I attempted so far to address the CUDA OOM error:
(1) used the GPU, with memory allocations directed via PyTorch (see the sketch after this list). In …
(2) added the …
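For reference, a minimal sketch of the kind of setup described in attempt (1), assuming spaCy's thinc helpers `set_gpu_allocator` and `require_gpu` are what routes allocations through PyTorch; the model name and texts are placeholders, and the batch size of 2000 is the value mentioned later in the discussion:

```python
# Sketch of attempt (1): run inference on the GPU with memory allocations
# routed through PyTorch (model name and texts are placeholders).
import spacy
from thinc.api import require_gpu, set_gpu_allocator

set_gpu_allocator("pytorch")  # let PyTorch manage the GPU memory pool
require_gpu()                 # fail early if no GPU is available

nlp = spacy.load("my_finetuned_ner_model")  # hypothetical model path

texts = ["Some document text ...", "Another document ..."]
for doc in nlp.pipe(texts, batch_size=2000):  # batch size of 2000, as in the script
    entities = [(ent.text, ent.label_) for ent in doc.ents]
```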
Replies: 1 comment 1 reply
The main setting to adjust in inference is the batch size, either by modifying `nlp.batch_size` or `nlp.pipe(batch_size=)`. See also: #8600

The batch size of `2000` in your script is a lot higher than the default of `64` in `en_core_web_trf`. Our usual default recommendations for `trf` pipelines are `64` or `128`, so I would recommend starting in that range while testing and monitoring the maximum memory usage. If there is still lots of free memory, you can raise the batch size.

The maximum batch size that can run without OOM errors depends a lot on the document lengths, so you may need to take a look at the distribution of text lengths in your input data, because one extremely long text can push an individual batch over the limit. If your text lengths vary a lot, you may want to split long texts for processing to keep the memory usage similar across batches.

You shouldn't need to add manual cache-clearing calls. You should usually let PyTorch handle the memory management automatically. If you're emptying the cache manually, it may slow down processing and it probably isn't addressing the underlying issue that's leading to the OOM error.
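As an illustration, a minimal sketch of the suggested change, assuming the same hypothetical model name as above; it drops the batch size into the recommended 64–128 range and records the peak GPU memory while testing:

```python
import spacy
import torch
from thinc.api import require_gpu, set_gpu_allocator

set_gpu_allocator("pytorch")
require_gpu()

nlp = spacy.load("my_finetuned_ner_model")  # hypothetical model path
texts = ["Some document text ...", "Another document ..."]

# Option 1: set the default batch size on the pipeline object
nlp.batch_size = 64

# Option 2: override the batch size for a single nlp.pipe() call
for doc in nlp.pipe(texts, batch_size=64):
    entities = [(ent.text, ent.label_) for ent in doc.ents]

# Monitor the peak GPU memory while testing; if there is plenty of
# headroom, the batch size can be raised.
print(f"peak GPU memory: {torch.cuda.max_memory_allocated() / 1e9:.2f} GB")
```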
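And a minimal sketch of the text-splitting idea, assuming a simple character-based chunker; the 5000-character limit is an illustrative choice, not a spaCy default, and in practice you would likely split on sentence or paragraph boundaries so entities are not cut in half:

```python
import spacy

nlp = spacy.load("my_finetuned_ner_model")  # hypothetical model path
texts = ["A very long document ...", "A short one."]

def split_long_text(text, max_chars=5000):
    """Break a long text into fixed-size chunks so that no single batch
    is dominated by one extremely long document (chunk size is illustrative)."""
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

chunks = [chunk for text in texts for chunk in split_long_text(text)]
for doc in nlp.pipe(chunks, batch_size=64):
    entities = [(ent.text, ent.label_) for ent in doc.ents]
```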