Spacy extension with en_core_web_trf, GPU, and multiprocessing isn't working #10017
How to reproduce the behaviour

```python
from multiprocessing import set_start_method

import spacy
import torch
from spacy.tokens import Span

set_start_method("spawn")
torch.set_num_threads(1)

def _tokens(sent):
    return [token.text for token in sent]

Span.set_extension("tokens", getter=_tokens, force=True)
spacy.prefer_gpu()
nlp = spacy.load('en_core_web_trf')

texts = ['This is a sentence.'] * 100
docs = nlp.pipe(texts, n_process=2)
for doc in docs:
    print([sent._.tokens for sent in doc.sents])
```

I got
Your Environment
Replies: 1 comment 3 replies
This isn't related to `trf` or GPU, just how extensions are set globally. With fork it would work, but with spawn the child processes don't get the global context with the `Span.set_extension` line. You would need to check that the extension is set somehow in a custom component.

You can see an example from the `transformer` component, where we always check for the extension in `__call__` so that spawn works: https://github.com/explosion/spacy-transformers/blob/ecf619d9df645dbbd3648481afd547bc9f84a5e8/spacy_transformers/pipeline_component.py#L181-L194
I also suspect you've simplified your example for the report above? I can't replicate this directly. This works fine, aside from some warnings (multiprocessing on GPU is not recommended in general):