Parallel processing of pre-tokenized texts via nlp.pipe() #8969
-
This is something that's not as easy to do as it should be. We've thought about letting you pass in `Doc` objects directly. Probably the simplest option is to call:

```python
for name, proc in nlp.pipeline:
    docs = proc.pipe(docs)
```

Whether this works easily depends a bit on what's in your pipeline, since not all components implement `pipe`:

```python
for name, proc in nlp.pipeline:
    if hasattr(proc, "pipe"):
        docs = proc.pipe(docs)
    else:
        docs = (proc(doc) for doc in docs)
for doc in docs:
    ...
```

If you'd like to call …
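A minimal end-to-end sketch of the pattern above, assuming spaCy v3. The `sentencizer` component and the sample token lists are placeholders standing in for a real pipeline and real corpus:

```python
# Sketch: stream pre-tokenized input through a pipeline without re-tokenizing.
import spacy
from spacy.tokens import Doc

nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")  # stand-in for whatever components you actually use

# Placeholder corpus: each document is already a list of tokens.
pretokenized = [
    ["Hello", "world", "!"],
    ["This", "is", "pre-tokenized", "."],
]

# Build Doc objects directly from the external token lists, bypassing the tokenizer.
docs = (Doc(nlp.vocab, words=words) for words in pretokenized)

# Run each component, preferring its batched .pipe() when it has one.
for name, proc in nlp.pipeline:
    if hasattr(proc, "pipe"):
        docs = proc.pipe(docs)
    else:
        docs = (proc(doc) for doc in docs)

for doc in docs:
    print([t.text for t in doc])
```

Note that newer spaCy versions (v3.1+) also accept `Doc` objects as input to `nlp.pipe()` directly, which skips the tokenizer and may be the simpler route if you can upgrade.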
-
We built a tokenizer in Perl that meets our needs better than spaCy's internal tokenizer.
We pass the tokens as a list via:

```python
doc = Doc(nlp.vocab, words=tokens, spaces=spaces, sent_starts=sent_starts)
for name, proc in nlp.pipeline:
    doc = proc(doc)
```
But now we need to process a large corpus of documents and would like to speed things up. We came across `nlp.pipe()`, but I can't find a way to pass tokens into `nlp.pipe()`. Calling `nlp.tokenizer = nlp.tokenizer.tokens_from_list` throws errors in spaCy 3.
What would be the best way to process a large number of pre-tokenized documents?
Thanks in advance,
Felix