Since it sounds like you want sentences as output, you might want to look at training a SentenceRecognizer. You can provide documents with the sentence boundaries marked as training data, and it should be able to learn that relatively well.
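
As a rough sketch of what "sentence boundaries marked" can look like in spaCy v3, the snippet below builds one training document by hand and packs it into a `DocBin`. The words and the boundary flags are made-up illustration data, and treating the title as its own sentence is an assumption about how you would annotate.

```python
import spacy
from spacy.tokens import Doc, DocBin

nlp = spacy.blank("en")

# Made-up example: a two-word title followed by one ordinary sentence.
words = ["Annual", "Report", "Revenue", "grew", "this", "year", "."]
# True marks the first token of each sentence; the title counts as a sentence.
sent_starts = [True, False, True, False, False, False, False]

doc = Doc(nlp.vocab, words=words, sent_starts=sent_starts)
sentences = [sent.text for sent in doc.sents]

# Collect the annotated docs; DocBin is spaCy's binary training format.
doc_bin = DocBin(docs=[doc])
# doc_bin.to_disk("train.spacy")  # hypothetical path, uncomment to save
```

A file saved this way could then serve as the training corpus for a pipeline that includes the `senter` component.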

Normally I would suggest you start with a rule-based Sentencizer, but for detecting titles that can be a little difficult. If the titles use capitalization consistently, you may be able to turn that into a few rules and get a rule-based model to use as a baseline.
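
A minimal sketch of such a baseline, assuming titles are written in all caps: the built-in `sentencizer` splits on punctuation, and a small custom component then treats any run of two or more uppercase tokens as a title. The component name `caps_title_boundaries` and the example text are hypothetical, and the two-token threshold is only a guess to avoid matching single words like "I".

```python
import spacy
from spacy.language import Language


@Language.component("caps_title_boundaries")  # hypothetical component name
def caps_title_boundaries(doc):
    """Mark runs of 2+ all-caps tokens as their own sentence."""
    i = 0
    while i < len(doc):
        if doc[i].is_upper:
            j = i
            # Extend j to the end of the uppercase run.
            while j < len(doc) and doc[j].is_upper:
                j += 1
            if j - i >= 2:
                doc[i].is_sent_start = True      # title starts a sentence
                if j < len(doc):
                    doc[j].is_sent_start = True  # so does the text after it
            i = j
        else:
            i += 1
    return doc


nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")            # punctuation-based baseline
nlp.add_pipe("caps_title_boundaries")  # adds title boundaries afterwards

doc = nlp("QUARTERLY RESULTS Revenue grew. Costs fell.")
sentences = [sent.text for sent in doc.sents]
```

The custom component only adds boundaries on top of what the sentencizer found, so it is easy to compare the output with and without the capitalization rule.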

One thing you could do if titles are positioned consistently (like at the start of the document) is use a custom component that works on that information. But that depends a lot on your a…
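
Here is one way a position-based component might look, under the assumption that the title is always the first line of the document and is followed by a newline. The component name `first_line_title` and the sample text are hypothetical; a real document layout may need a different cue.

```python
import spacy
from spacy.language import Language


@Language.component("first_line_title")  # hypothetical component name
def first_line_title(doc):
    """Start a new sentence right after the first line break."""
    for token in doc:
        # The tokenizer keeps a line break as its own whitespace token.
        if "\n" in token.text:
            if token.i + 1 < len(doc):
                doc[token.i + 1].is_sent_start = True
            break  # only the first line is assumed to be the title
    return doc


nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")       # punctuation-based splitting for the body
nlp.add_pipe("first_line_title")  # separates the title from the body

doc = nlp("Annual Report\nRevenue grew. Costs fell.")
# Strip the line break that stays attached to the title sentence.
sentences = [sent.text.strip() for sent in doc.sents]
```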

Answer selected by polm
Labels
feat / senter Feature: Sentence Recognizer