Advice on custom segmentation #9819
-
I’d like to segment web articles, primarily ones containing lots of code (i.e. documentation). What would be the best way to segment them with the highest possible accuracy? Some very simple rule-based segmenter? I think I read that the difficulty with rule-based segmentation is that you need a list of abbreviations on hand. Does spaCy offer a method that covers all English abbreviations? If that approach is too finicky and prone to mistakes when exceptions come up, I assume a machine-learning approach would be better. I’ve been considering training the segmenter myself; maybe with Prodigy I can tune a model to segment in all the ways I want it to. Does anyone have any advice for me?

The key idea is separating sentences, or self-contained text units that are broken by newlines (like article titles), and keeping lines of code separate. So it should segment on a newline if and only if the newline is not in the middle of a sentence. That’s the challenge. Thanks!
-
Since it sounds like you want sentences as output, you might want to look at training a SentenceRecognizer. You can provide documents with the sentence boundaries marked as training data, and it should be able to learn that relatively well.

Normally I would suggest you start with a rule-based Sentencizer first, but detecting titles with rules can be a little difficult. Maybe if they use capitalization consistently you can use that to develop some rules and get a rule-based model you can use as a baseline.

One thing you could do if titles are positioned consistently (like at the start of the document) is use a custom component that works on that information. But that depends a lot on your article layout.
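As a very rough illustration of that rule-based baseline, you could register a custom component that decides around each newline token whether to mark a sentence start. This is only a sketch: the component name and the capitalization/punctuation heuristics are placeholders you'd adapt to your own articles, and it assumes you're running it before (or instead of) the parser so that `is_sent_start` can still be set.

```python
import spacy
from spacy.language import Language


@Language.component("newline_segmenter")
def newline_segmenter(doc):
    # Placeholder heuristic: treat a newline as a sentence boundary if the
    # previous line ends in terminal punctuation or the next token is capitalized.
    for token in doc[:-1]:
        if "\n" in token.text:
            prev_tok = doc[token.i - 1] if token.i > 0 else None
            next_tok = doc[token.i + 1]
            ends_line = prev_tok is not None and prev_tok.text in {".", "!", "?", ":"}
            next_tok.is_sent_start = ends_line or next_tok.text[:1].isupper()
    return doc


nlp = spacy.load("en_core_web_sm", exclude=["parser"])
nlp.add_pipe("newline_segmenter", first=True)

doc = nlp("Getting Started\nThis guide shows how to\ninstall the package.")
print([sent.text for sent in doc.sents])
```

With heuristics like these the title line ends up as its own "sentence" while the wrapped line stays attached, which is roughly the behavior you described.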
I think there is a list of abbreviations in the tokenizer exceptions somewhere, but it's not intended to be comprehensive, and by the time you're trying to assemble a comprehensive list of abbreviations you're better off using a statistical model.
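If you're curious, you can peek at those special cases yourself. Assuming one of the stock English pipelines, something like this lists the entries that look like abbreviations:

```python
import spacy

nlp = spacy.load("en_core_web_sm")
# The tokenizer's special cases cover contractions and common abbreviations
# like "e.g." and "etc.", but the list is deliberately not exhaustive.
abbreviations = sorted(text for text in nlp.tokenizer.rules if text.endswith("."))
print(len(abbreviations), abbreviations[:10])
```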
One question about this: do you actually have newlines in the middle of sentences? You say you're using web articles, and I would not normally expect web articles to include newlines in the middle of paragraphs. So you might want to be sure that you actually have to deal with that first.

One other approach you could take is to use a SpanCategorizer on each line (or newline) to detect whether it's a title, code, or ordinary prose. I haven't heard of anyone using a spancat this way, but it is within the bounds of how it's intended to be used, and it should be able to produce those annotations. After you have classified the lines you can recreate the document, removing newlines that aren't sentence boundaries (there's a rough sketch of that step below).

Basically there are a lot of ways you could approach this. The thing about sentence segmentation is that usually it's easy to get good-enough segmentation and incredibly hard to be perfect all the time - especially since most data will contain a few cases where the right answer is unclear - so for many applications it's just kind of an afterthought. I'm not super up to date on the literature on this, but I think there is more attention to this problem in processing legal documents, where the document structure is more complicated, or where a modified sentence definition is used to deal with the unusual grammar of legal texts.
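To make the classify-then-recreate idea concrete, here's a plain-Python sketch of that last step. The `classify_line` heuristic is just a stand-in for wherever your trained per-line predictions (spancat, textcat, etc.) would come from; the labels and rules are made up for illustration.

```python
import re


def classify_line(line: str) -> str:
    # Stand-in heuristic; in practice this is where the predictions from a
    # trained per-line classifier would go.
    if re.match(r"\s{4}|\t|.*[=;{}()]\s*$", line):
        return "code"
    if line and not line.rstrip().endswith((".", "!", "?", ":")) and line.istitle():
        return "title"
    return "prose"


def recreate(text: str) -> str:
    # Merge consecutive prose lines so mid-sentence newlines disappear,
    # while titles and code lines keep their own lines.
    merged = []
    for line in text.split("\n"):
        label = classify_line(line)
        if label == "prose" and merged and merged[-1][0] == "prose":
            merged[-1] = ("prose", merged[-1][1] + " " + line.strip())
        else:
            merged.append((label, line))
    return "\n".join(line for _, line in merged)


print(recreate("Getting Started\nThis guide shows how to\ninstall the package.\n    pip install example"))
```

Once the document has been recreated this way, the remaining newlines really are boundaries, and a standard sentence segmenter has a much easier job.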