Spacy extension with en_core_web_trf, GPU, and multiprocessing isn't working #10017
How to reproduce the behaviour

```python
from multiprocessing import set_start_method

import spacy
import torch
from spacy.tokens import Span

set_start_method("spawn")
torch.set_num_threads(1)

def _tokens(sent):
    return [token.text for token in sent]

Span.set_extension("tokens", getter=_tokens, force=True)
spacy.prefer_gpu()
nlp = spacy.load('en_core_web_trf')

texts = ['This is a sentence.'] * 100
docs = nlp.pipe(texts, n_process=2)
for doc in docs:
    print([sent._.tokens for sent in doc.sents])
```

I got
Your Environment
Replies: 1 comment 3 replies
This isn't related to `trf` or GPU, just how extensions are set globally. With fork it would work, but with spawn the child processes don't get the global context with the `Span.set_extension` line. You would need to check that the extension is set somehow in a custom component.

You can see an example from the `transformer` component, where we always check for the extension in `__call__` so that spawn works: https://github.com/explosion/spacy-transformers/blob/ecf619d9df645dbbd3648481afd547bc9f84a5e8/spacy_transformers/pipeline_component.py#L181-L194
I also suspect you've simplified your example for the report above? I can't replicate this directly. This works fine, aside from some warnings (multiprocessing on GPU is not recommended in general):