Advice on custom segmentation #9819
-
I’d like to segment web articles, primarily ones containing lots of code (i.e. documentation). What would be the best way to segment them with the highest possible accuracy? Some very simple rule-based segmenter? I think I read that the difficulty with rule-based segmentation is that you need a list of abbreviations on hand. Does spaCy offer a method that covers all English abbreviations? If that approach is too finicky and prone to mistakes when exceptions come up, I assume a machine-learning approach would be better. I’ve been considering training the segmenter myself; maybe with Prodigy I can tune a model to segment in all the ways I want it to. Does anyone have any advice for me?

The key idea is separating sentences, or self-contained text units that are broken by newlines (like article titles), and keeping lines of code separate. So it should segment on a newline if and only if the newline is not in the middle of a sentence. That’s the challenge. Thanks!
-
Since it sounds like you want sentences as output, you might want to look at training a SentenceRecognizer. You can provide documents with the sentence boundaries marked as training data, and it should be able to learn that relatively well.

Normally I would suggest you start with a rule-based Sentencizer first, but detecting titles with rules can be a little difficult. Maybe if they use capitalization consistently you can use that to develop some rules and get a rule-based model you can use as a baseline.

One thing you could do if titles are positioned consistently (like at the start of the document) is use a custom component that works on that information. But that depends a lot on your article layout.
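As a very rough illustration of that rule-based baseline, you could register a custom component that decides around each newline token whether to mark a sentence start. This is only a sketch: the component name and the capitalization/punctuation heuristics are placeholders you'd adapt to your own articles, and it assumes you're running it before (or instead of) the parser so that `is_sent_start` can still be set.

```python
import spacy
from spacy.language import Language


@Language.component("newline_segmenter")
def newline_segmenter(doc):
    # Placeholder heuristic: treat a newline as a sentence boundary if the
    # previous line ends in terminal punctuation or the next token is capitalized.
    for token in doc[:-1]:
        if "\n" in token.text:
            prev_tok = doc[token.i - 1] if token.i > 0 else None
            next_tok = doc[token.i + 1]
            ends_line = prev_tok is not None and prev_tok.text in {".", "!", "?", ":"}
            next_tok.is_sent_start = ends_line or next_tok.text[:1].isupper()
    return doc


nlp = spacy.load("en_core_web_sm", exclude=["parser"])
nlp.add_pipe("newline_segmenter", first=True)

doc = nlp("Getting Started\nThis guide shows how to\ninstall the package.")
print([sent.text for sent in doc.sents])
```

With heuristics like these the title line ends up as its own "sentence" while the wrapped line stays attached, which is roughly the behavior you described.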
I think there is a list of abbreviations in the tokenizer exceptions somewhere, but it's not intended to be comprehensive, and by the time you're trying to assemble a comprehensive list of abbreviations you're better off using a statistical model.
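If you're curious, you can peek at those special cases yourself. Assuming one of the stock English pipelines, something like this lists the entries that look like abbreviations:

```python
import spacy

nlp = spacy.load("en_core_web_sm")
# The tokenizer's special cases cover contractions and common abbreviations
# like "e.g." and "etc.", but the list is deliberately not exhaustive.
abbreviations = sorted(text for text in nlp.tokenizer.rules if text.endswith("."))
print(len(abbreviations), abbreviations[:10])
```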
One question about this: do you actually have newlines in the middle of sentences? You say you're using web articles, and I would not normally expect web articles to include newlines in the middle of paragraphs. So you might want to be sure that you actually have to deal with that first.

One other approach you could take is to use a SpanCategorizer on each line (or newline) to detect whether it's a title, code, or ordinary prose. I haven't heard of anyone using a spancat this way, but it is within the bounds of how it's intended to be used, and it should be able to produce those annotations. After you have classified the lines you can recreate the document, removing newlines that aren't sentence boundaries (there's a rough sketch of that step below).

Basically there are a lot of ways you could approach this. The thing about sentence segmentation is that usually it's easy to get good-enough segmentation and incredibly hard to be perfect all the time - especially since most data will contain a few cases where the right answer is unclear - so for many applications it's just kind of an afterthought. I'm not super up to date on the literature on this, but I think there is more attention to this problem in processing legal documents, where the document structure is more complicated, or where a modified sentence definition is used to deal with the unusual grammar of legal texts.
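To make the classify-then-recreate idea concrete, here's a plain-Python sketch of that last step. The `classify_line` heuristic is just a stand-in for wherever your trained per-line predictions (spancat, textcat, etc.) would come from; the labels and rules are made up for illustration.

```python
import re


def classify_line(line: str) -> str:
    # Stand-in heuristic; in practice this is where the predictions from a
    # trained per-line classifier would go.
    if re.match(r"\s{4}|\t|.*[=;{}()]\s*$", line):
        return "code"
    if line and not line.rstrip().endswith((".", "!", "?", ":")) and line.istitle():
        return "title"
    return "prose"


def recreate(text: str) -> str:
    # Merge consecutive prose lines so mid-sentence newlines disappear,
    # while titles and code lines keep their own lines.
    merged = []
    for line in text.split("\n"):
        label = classify_line(line)
        if label == "prose" and merged and merged[-1][0] == "prose":
            merged[-1] = ("prose", merged[-1][1] + " " + line.strip())
        else:
            merged.append((label, line))
    return "\n".join(line for _, line in merged)


print(recreate("Getting Started\nThis guide shows how to\ninstall the package.\n    pip install example"))
```

Once the document has been recreated this way, the remaining newlines really are boundaries, and a standard sentence segmenter has a much easier job.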