On the performance of spaCy tokenization, POS tagging, and named entity recognition #12917
-
In a comparative experiment with multiple tools, I noticed an interesting phenomenon. The test language is Chinese and the model is zh_core_web_sm-3.5.0. The experiment compares running with and without `.pipe()`. Without `.pipe()`, spaCy's POS tagging and NER are very fast, but with `.pipe()` the overall speed becomes very slow. Why is this?
-
Hi @PythonCancer, could you provide a minimal reproducible example? We'd like to look into this.
-
Okay, I've retracted the comment above. The code can run directly; you only need to modify the paths for the model and the txt file.
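
For reference, a minimal sketch of the kind of comparison described (this is not the original script; the model name, file path, and batch size are placeholders):

```python
import time

import spacy

# Placeholder model and file locations; adjust to your local directories.
nlp = spacy.load("zh_core_web_sm")
with open("sentences.txt", encoding="utf-8") as f:
    sentences = [line.strip() for line in f if line.strip()]

def timed(label, run):
    start = time.perf_counter()
    run()
    elapsed = time.perf_counter() - start
    print(f"{label}: {len(sentences) / elapsed:.1f} sentences/s")

# One nlp() call per sentence: tokenizer, tagger, and NER run on each call.
timed("nlp() per sentence", lambda: [nlp(s) for s in sentences])

# Batched processing with .pipe(): the same components, batched internally.
timed("nlp.pipe()", lambda: list(nlp.pipe(sentences, batch_size=256)))
```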
Thanks for providing this! I think this boils down to a misunderstanding - as you cited in your original post, the tokenization is by far the slowest part:
Now, the pipeline can only process documents as quickly as its slowest component. It doesn't matter how fast POS and NER are; the documents still need to be tokenized. I've run only the tokenizer in your setup, which ran at 135 sentences/s, compared to 45 sentences/s without `.pipe()`.

Summarized: You compared the speed of indiv…
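
For illustration, a tokenizer-only timing along the lines described above could look like the sketch below (the model name, file path, and batch size are assumptions; `nlp.make_doc()` runs only the tokenizer, with no tagging or NER):

```python
import time

import spacy

nlp = spacy.load("zh_core_web_sm")  # placeholder model name
with open("sentences.txt", encoding="utf-8") as f:
    sentences = [line.strip() for line in f if line.strip()]

# Tokenizer only: no POS tagging, no NER.
start = time.perf_counter()
docs = [nlp.make_doc(s) for s in sentences]
print(f"tokenizer only: {len(sentences) / (time.perf_counter() - start):.1f} sentences/s")

# Full pipeline, batched with .pipe(): tokenizer + tagger + NER.
start = time.perf_counter()
docs = list(nlp.pipe(sentences, batch_size=256))
print(f"full pipeline: {len(sentences) / (time.perf_counter() - start):.1f} sentences/s")
```

If the tokenizer-only throughput is close to the full-pipeline throughput, tokenization is the bottleneck, and making the statistical components faster will not change the overall speed much.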