You are correct that spaCy can have better results working with paragraphs than individual sentences, but it doesn't benefit from context much longer than a paragraph.

The question is how to split the text into reasonable parts – can spaCy help me here? For example, split the text by a number of characters that are still safe to process and the last sentence is not split?

There are some tools along those lines for the transformer models (see span_getters), but nothing equivalent for the CPU models. For the CPU models in particular, the amount of text that is safe to process is much longer than the longest segment that is useful to process, so the safe size limit is not the thing to split on.

Can you just split your input into paragraphs, perhaps by splitting on do…
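As a rough illustration of the paragraph-splitting idea, here is a minimal sketch. The suggestion above is cut off, so the delimiter is an assumption: this treats one or more blank lines as a paragraph break, which may not match your data. `split_paragraphs` and `iter_paragraph_docs` are hypothetical helper names, not part of spaCy; the only spaCy call used is `nlp.pipe`, which processes an iterable of texts in batches.

```python
import re
from typing import Iterator, List


def split_paragraphs(text: str) -> List[str]:
    """Split text on blank lines (assumed delimiter) and drop empty pieces."""
    parts = re.split(r"\n\s*\n", text)
    return [part.strip() for part in parts if part.strip()]


def iter_paragraph_docs(nlp, text: str, batch_size: int = 64) -> Iterator:
    """Yield one processed Doc per paragraph.

    `nlp` is any loaded spaCy pipeline; it is passed in so this module
    does not need a particular model installed.
    """
    yield from nlp.pipe(split_paragraphs(text), batch_size=batch_size)
```

With a pipeline loaded, for example via `nlp = spacy.load("en_core_web_sm")` if that model is installed, you would loop over `iter_paragraph_docs(nlp, long_text)` and handle each `Doc` separately. Because every paragraph becomes its own `Doc`, no sentence is cut in the middle, as long as paragraphs in your input do not themselves break mid-sentence.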

Answer selected by dsenkyr
Labels
perf / memory Performance: memory use
2 participants
This discussion was converted from issue #9508 on October 20, 2021 04:40.