Spacy hanging for badly formatted texts #13600
Unanswered
morbidCode asked this question in Help: Coding & Implementations
Hello all!
I am using the latest version of spaCy, installed from pip, with the "en_core_web_sm" model. I am chunking a large number of documents using sentence detection (the senter component). The documents have no consistent structure or formatting, and a few of them are extremely badly formatted.
Most of the time, my code works well. However, in rare cases spaCy encounters a badly formatted document and hangs indefinitely. For example,
The problem is that the sentence chunking happens inside a loop, so if one document hangs, the rest are never processed.
Is there a way to make spaCy raise an error when it encounters a text it can't handle, so that I can skip it gracefully and move on to the next document? If not, what would be the ideal approach?
Thanks!
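As far as I know, spaCy has no built-in per-document timeout, so one possible workaround is to run the sentence splitting in a separate Python process and kill that process when it takes too long. The sketch below is only an illustration under stated assumptions, not a confirmed fix: the names `SENTENCE_SCRIPT`, `split_with_timeout`, and `chunk_documents`, and the 30-second limit, are hypothetical. A child process is used because a hang inside compiled code usually cannot be interrupted from within the same interpreter.

```python
import json
import os
import subprocess
import sys

# Hypothetical child script: reads one document from stdin and prints its
# sentences as a JSON list. It assumes spaCy and en_core_web_sm are installed.
SENTENCE_SCRIPT = """
import json, sys
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp(sys.stdin.read())
json.dump([sent.text for sent in doc.sents], sys.stdout)
"""


def split_with_timeout(text, timeout_s, script=SENTENCE_SCRIPT):
    """Run `script` in a child interpreter and return its JSON output.

    Raises subprocess.TimeoutExpired if the child exceeds `timeout_s`
    (subprocess.run kills the child first) and
    subprocess.CalledProcessError if the child exits with an error.
    """
    # Force UTF-8 on the child's stdin/stdout so odd characters survive.
    env = {**os.environ, "PYTHONIOENCODING": "utf-8"}
    proc = subprocess.run(
        [sys.executable, "-c", script],
        input=text,
        capture_output=True,
        text=True,
        encoding="utf-8",
        timeout=timeout_s,
        check=True,
        env=env,
    )
    return json.loads(proc.stdout)


def chunk_documents(texts, timeout_s=30.0, script=SENTENCE_SCRIPT):
    """Split every document, skipping those that hang or crash.

    Returns (results, skipped): a dict mapping document index to its list
    of sentences, and a list of the indices that were skipped.
    """
    results, skipped = {}, []
    for index, text in enumerate(texts):
        try:
            results[index] = split_with_timeout(text, timeout_s, script)
        except (subprocess.TimeoutExpired, subprocess.CalledProcessError):
            skipped.append(index)  # record the failure and move on
    return results, skipped
```

Calling `chunk_documents(list_of_texts)` would then return the sentences for the documents that finished, plus the indices of the ones that were skipped. The trade-off is that this sketch reloads the model for every document, which is slow; a longer-lived worker process that gets restarted only after a timeout would avoid that cost at the price of more code.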