Side-effects of as_tuples in multiprocessing #10354

svonava · 2022-02-22T14:55:36Z

Hey everybody - first of all, spacy is awesome!

Secondly: I'm seeing some weird interplay between as_tuples and multiprocessing (n_process).

What I did:

I used en_core_web_lg and ran nlp.pipe() over 1k docs with batch_size=100 and n_process=10.
I assumed that A and B can come back in a different order in "for A in nlp.pipe(B)", so I turned on as_tuples to align A and B after the pipe is done.
Setting as_tuples=True made my code use an order of magnitude more RAM, as if 10 separate copies of en_core_web_lg are only created in memory once this option is turned on.

Is this expected? Does this mean that when as_tuples=False, the multiprocessing doesn't separate the processes properly?

Thank you for any elaboration on this!

Originally posted by @svonava in #9597 (comment)

Answered by adrianeboyd

adrianeboyd · 2022-02-23T14:36:13Z

nlp.pipe() returns the documents in the same order as the input texts.
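One way to check the ordering on your own data is to zip the inputs with the outputs and compare the text. The following is a minimal sketch, not anything from the spaCy test suite: pipe_in_order and demo are hypothetical names, and spacy.blank("en") is used only so that no model download is needed.

```python
def pipe_in_order(nlp, texts, n_process=1, batch_size=1000):
    """Pair each input text with the Doc that nlp.pipe() yields at the same position."""
    texts = list(texts)  # materialize so the inputs can be iterated twice
    docs = nlp.pipe(texts, n_process=n_process, batch_size=batch_size)
    return list(zip(texts, docs))


def demo():
    """Hypothetical check: every Doc lines up with the text at the same index."""
    import spacy  # imported here so the helper above stays importable on its own

    nlp = spacy.blank("en")  # tokenizer-only pipeline, an assumption for the demo
    texts = [f"This is document number {i}." for i in range(1000)]
    pairs = pipe_in_order(nlp, texts, n_process=2, batch_size=100)
    # spaCy tokenization is non-destructive, so doc.text reproduces the input.
    assert all(doc.text == text for text, doc in pairs)
```

When n_process is greater than 1, call demo() from under an `if __name__ == "__main__":` guard, since platforms that start worker processes with "spawn" re-import the main module.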

as_tuples is just to pair some external context with each returned doc, so if you only have input texts, you shouldn't need to use it. (The as_tuples option isn't really needed as of v3.2 because you can pass docs with custom attributes to the pipeline instead.)
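For reference, this is roughly what pairing external context with as_tuples looks like. It is a sketch under assumptions: pipe_with_context and demo are hypothetical helpers, and the context dictionaries are made-up example data.

```python
def pipe_with_context(nlp, texts, contexts, batch_size=1000):
    """Run texts through nlp.pipe() and return a list of (doc, context) pairs."""
    # With as_tuples=True, nlp.pipe() expects (text, context) tuples as input
    # and yields (doc, context) tuples as output.
    pairs = zip(texts, contexts)
    return list(nlp.pipe(pairs, as_tuples=True, batch_size=batch_size))


def demo():
    """Hypothetical usage with a tokenizer-only pipeline."""
    import spacy  # imported here so the helper above stays importable on its own

    nlp = spacy.blank("en")
    texts = ["First text.", "Second text."]
    contexts = [{"id": 101}, {"id": 102}]  # made-up external metadata
    for doc, context in pipe_with_context(nlp, texts, contexts):
        print(context["id"], doc.text)
```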

Unless you have really limited RAM or really long texts, 100 is a pretty small batch size for en_core_web_lg. It might well be faster to have larger batch sizes and fewer processes.
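Whether larger batches and fewer processes are actually faster depends on the machine and the texts, so it is worth measuring. The sketch below times a few combinations; time_pipe, compare_settings and the example settings are hypothetical, and the numbers it produces say nothing beyond the data you run it on.

```python
import time


def time_pipe(nlp, texts, batch_size, n_process):
    """Return wall-clock seconds for one full pass of nlp.pipe() over texts."""
    start = time.perf_counter()
    # nlp.pipe() is a generator, so it has to be consumed for the work to happen.
    for _ in nlp.pipe(texts, batch_size=batch_size, n_process=n_process):
        pass
    return time.perf_counter() - start


def compare_settings(nlp, texts, settings):
    """Time each (batch_size, n_process) pair and return {pair: seconds}."""
    texts = list(texts)  # reuse the same inputs for every run
    return {
        (batch_size, n_process): time_pipe(nlp, texts, batch_size, n_process)
        for batch_size, n_process in settings
    }


def demo(texts):
    """Hypothetical comparison; assumes en_core_web_lg is installed."""
    import spacy  # imported here so the helpers above stay importable on their own

    nlp = spacy.load("en_core_web_lg")
    settings = [(100, 10), (1000, 4), (2000, 2)]  # example values, not recommendations
    for pair, seconds in compare_settings(nlp, texts, settings).items():
        print(pair, f"{seconds:.1f}s")
```

As with any multiprocessing run, call demo() from under an `if __name__ == "__main__":` guard. Process start-up and model loading are included in each timing, which is part of the cost being compared.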

3 replies

svonava Feb 23, 2022
Author

> nlp.pipe() returns the documents in the same order as the input texts

It would be useful to emphasise this in the docs, because it's not obvious that this is guaranteed in the n_process > 1 mode.

> as_tuples not needed, use custom attributes instead

Custom attributes live in Doc/Span/Token objects, whereas I needed to pass context information into the nlp.pipe() call - based on [1] I'm not sure how I'd do that. But I don't need to do it anymore, if the order is guaranteed.

[1] https://spacy.io/usage/processing-pipelines#custom-components-attributes
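For anyone with the same question, one possible way to carry context into nlp.pipe() through custom attributes is to build the Doc objects up front, set the attribute on each, and pass the Docs instead of strings. This is a sketch, assuming spaCy v3.2 or later (where nlp.pipe() accepts Doc objects); docs_with_context, demo and the attribute name "context" are hypothetical choices.

```python
def docs_with_context(nlp, texts, contexts, attr="context"):
    """Yield tokenized Docs with each context stored in a custom extension attribute."""
    from spacy.tokens import Doc  # imported lazily so this module loads without spaCy

    # Register the extension once; registering an existing name again raises an error.
    if not Doc.has_extension(attr):
        Doc.set_extension(attr, default=None)
    for text, context in zip(texts, contexts):
        doc = nlp.make_doc(text)  # tokenize only, no pipeline components yet
        doc._.set(attr, context)
        yield doc


def demo():
    """Hypothetical usage with a tokenizer-only pipeline."""
    import spacy

    nlp = spacy.blank("en")
    texts = ["First text.", "Second text."]
    contexts = [{"id": 101}, {"id": 102}]  # made-up external metadata
    # The Docs go through the pipeline components and keep their attribute.
    for doc in nlp.pipe(docs_with_context(nlp, texts, contexts)):
        print(doc._.context, doc.text)
```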

adrianeboyd Feb 25, 2022

No need to worry about custom attributes for your task, then. The API docs do specify that the docs are returned in order for Language.pipe: https://spacy.io/api/language#pipe. Was there somewhere in the docs where you felt this was unclear?

svonava Feb 26, 2022
Author

I saw that, but was not sure if it's also true in the multiprocessing scenario. Adding language to reaffirm this in the docs would increase clutter - so I'd only do it if more people get confused :)

Thank you for your patience here @adrianeboyd !
