Parallel processing of pre-tokenized texts via nlp.pipe() #8969
-
This is something that's not as easy to do as it should be. We've thought about letting you pass in `Doc` objects directly. Probably the simplest option is to call:

```python
for name, proc in nlp.pipeline:
    docs = proc.pipe(docs)
```

Whether this works easily depends a bit on what's in your pipeline, since not all components implement `pipe`:

```python
for name, proc in nlp.pipeline:
    if hasattr(proc, "pipe"):
        docs = proc.pipe(docs)
    else:
        docs = (proc(doc) for doc in docs)
for doc in docs:
    ...
```

If you'd like to call …
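A minimal end-to-end sketch of the pattern above, assuming spaCy v3. The `sentencizer` component and the sample token lists are placeholders standing in for a real pipeline and real corpus:

```python
# Sketch: stream pre-tokenized input through a pipeline without re-tokenizing.
import spacy
from spacy.tokens import Doc

nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")  # stand-in for whatever components you actually use

# Placeholder corpus: each document is already a list of tokens.
pretokenized = [
    ["Hello", "world", "!"],
    ["This", "is", "pre-tokenized", "."],
]

# Build Doc objects directly from the external token lists, bypassing the tokenizer.
docs = (Doc(nlp.vocab, words=words) for words in pretokenized)

# Run each component, preferring its batched .pipe() when it has one.
for name, proc in nlp.pipeline:
    if hasattr(proc, "pipe"):
        docs = proc.pipe(docs)
    else:
        docs = (proc(doc) for doc in docs)

for doc in docs:
    print([t.text for t in doc])
```

Note that newer spaCy versions (v3.1+) also accept `Doc` objects as input to `nlp.pipe()` directly, which skips the tokenizer and may be the simpler route if you can upgrade.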
-
We built a tokenizer in Perl that meets our needs better than spaCy's internal tokenizer.
We pass the tokens as a list via:

```python
doc = Doc(nlp.vocab, words=tokens, spaces=spaces, sent_starts=sent_starts)
for name, proc in nlp.pipeline:
    doc = proc(doc)
```
But now we need to process a large corpus of documents and would like to speed things up. We came across `nlp.pipe()`, but I can't find a way to pass tokens into `nlp.pipe()`. Calling `nlp.tokenizer = nlp.tokenizer.tokens_from_list` throws errors in spaCy 3.
What would be the best way to process a large number of pre-tokenized documents?
Thanks in advance,
Felix