Thanks for providing this! I think this boils down to a misunderstanding. As you cited in your original post, tokenization is by far the slowest part:

- Spacy Tokenization Speed: 24.01 sentences/s
- Spacy POS Speed: 15787.84 sentences/s
- Spacy NER Speed: 26013.88 sentences/s

If you use `.pipe()`, you measure:

- Spacy Processing Speed (with pipeline): 45.68 sentences/s

Now, the pipeline can only process documents as quickly as its slowest component. It doesn't matter how fast POS and NER are, every document still needs to be tokenized. I've run only the tokenizer in your setup, and it reached about 135 sentences/s, compared to 45 sentences/s for the full pipeline with `.pipe()`.
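A back-of-the-envelope check makes the "slowest component" point concrete: in a serial pipeline, the time per sentence is the sum of the per-stage times, so overall throughput is pinned just below the tokenizer's. Plain Python, using the per-component numbers quoted above:

```python
# Per-component speeds quoted above (sentences/s).
speeds = {"tokenizer": 24.01, "pos": 15787.84, "ner": 26013.88}

# Time per sentence in a serial pipeline is the sum of per-stage times,
# so the combined throughput is the harmonic combination of the speeds.
time_per_sentence = sum(1 / s for s in speeds.values())
pipeline_speed = 1 / time_per_sentence

print(round(pipeline_speed, 2))  # → 23.95, just below the tokenizer's 24.01
```

That the measured `.pipe()` figure (45.68 sentences/s) comes out *higher* than this naive estimate is down to batching inside `.pipe()`, which the per-sentence measurements above don't capture.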

Summarized: You compared the speed of indiv…
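For reference, the tokenizer-only vs. `.pipe()` comparison can be reproduced with a sketch along these lines. A blank English pipeline with a sentencizer stands in for the original Chinese model with POS and NER; the sample text, repetition count, and batch size are assumptions, not taken from the original setup:

```python
import time

import spacy

# Assumption: a blank English pipeline + sentencizer as a lightweight
# stand-in for the Chinese model discussed above.
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")
texts = ["This is a short test sentence."] * 500

# Tokenizer only: calling nlp.tokenizer skips every downstream component.
t0 = time.perf_counter()
token_docs = [nlp.tokenizer(text) for text in texts]
tokenizer_rate = len(texts) / (time.perf_counter() - t0)

# Full pipeline via .pipe(): batches documents through every component.
t0 = time.perf_counter()
pipe_docs = list(nlp.pipe(texts, batch_size=64))
pipeline_rate = len(texts) / (time.perf_counter() - t0)

print(f"tokenizer only: {tokenizer_rate:.0f} sentences/s")
print(f"full pipeline:  {pipeline_rate:.0f} sentences/s")
```

With a real model the gap is much larger, but the shape of the result is the same: the full pipeline can never be faster than tokenization alone for the same batching strategy.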


Answer selected by rmitsch
Labels: lang / zh (Chinese language data and models), feat / pipeline (Feature: Processing pipeline and components), perf / speed (Performance: speed)