On the performance of spaCy tokenization, POS tagging, and named entity recognition #12917
-
In a comparative experiment with multiple tools, I noticed an interesting phenomenon. The test language is Chinese and the model is zh_core_web_sm-3.5.0. The experiment compares running with and without `.pipe()`. Without `.pipe()`, spaCy's POS tagging and NER are very fast, but with `.pipe()` the overall speed becomes very slow. Why is this?
-
Hi @PythonCancer, could you provide a minimal reproducible example? We'd like to look into this.
-
Okay, I've retracted the comment above. The code can run directly; you only need to modify the paths for the model and the txt file.
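
For reference, a minimal sketch of the kind of comparison described (this is not the original script; the model name, file path, and batch size are placeholders):

```python
import time

import spacy

# Placeholder model and file locations; adjust to your local directories.
nlp = spacy.load("zh_core_web_sm")
with open("sentences.txt", encoding="utf-8") as f:
    sentences = [line.strip() for line in f if line.strip()]

def timed(label, run):
    start = time.perf_counter()
    run()
    elapsed = time.perf_counter() - start
    print(f"{label}: {len(sentences) / elapsed:.1f} sentences/s")

# One nlp() call per sentence: tokenizer, tagger, and NER run on each call.
timed("nlp() per sentence", lambda: [nlp(s) for s in sentences])

# Batched processing with .pipe(): the same components, batched internally.
timed("nlp.pipe()", lambda: list(nlp.pipe(sentences, batch_size=256)))
```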
Thanks for providing this! I think this boils down to a misunderstanding - as you cited in your original post, the tokenization is by far the slowest part:
Now, the pipeline can only process documents as quickly as its slowest component. It doesn't matter how fast POS and NER are; the documents still need to be tokenized. I've run only the tokenizer in your setup, which ran at 135 sentences/s, compared to 45 sentences/s without `.pipe()`.

Summarized: You compared the speed of indiv…
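
For illustration, a tokenizer-only timing along the lines described above could look like the sketch below (the model name, file path, and batch size are assumptions; `nlp.make_doc()` runs only the tokenizer, with no tagging or NER):

```python
import time

import spacy

nlp = spacy.load("zh_core_web_sm")  # placeholder model name
with open("sentences.txt", encoding="utf-8") as f:
    sentences = [line.strip() for line in f if line.strip()]

# Tokenizer only: no POS tagging, no NER.
start = time.perf_counter()
docs = [nlp.make_doc(s) for s in sentences]
print(f"tokenizer only: {len(sentences) / (time.perf_counter() - start):.1f} sentences/s")

# Full pipeline, batched with .pipe(): tokenizer + tagger + NER.
start = time.perf_counter()
docs = list(nlp.pipe(sentences, batch_size=256))
print(f"full pipeline: {len(sentences) / (time.perf_counter() - start):.1f} sentences/s")
```

If the tokenizer-only throughput is close to the full-pipeline throughput, tokenization is the bottleneck, and making the statistical components faster will not change the overall speed much.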