Categorise elements of list #10275

hmltn-0 · 2022-02-14T02:12:16Z

I understand spaCy has a text categoriser.

After segmenting a programming technical manual, I want it to decide which text should be split on newlines and which should instead be joined across newlines.

After segmentation, some of the text looks like this:

HEADER

list element

etc

And some are sentences with newlines in them:

This is a sentence with
a newline in it.

So the categoriser should decide whether a piece of text looks like a broken sentence or not.

Can anyone run me through how to do that in spaCy?

Thank you very much

Answered by polm

polm · 2022-02-14T05:08:07Z

Have you tried just taking a rule-based approach to this? Based on your example, the list elements have a blank line between each one, so there's no ambiguity. You could also check the length of a line plus the first word of the next line to see if a line was wrapped: if that word would not have fit within the wrap width, the break was probably inserted by wrapping rather than by the author.
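A minimal sketch of that rule-based check, in plain Python. The function names and the `width` parameter are made up for illustration; the wrap width is an assumption about how the manual was laid out, and when it is not given the sketch falls back to the longest line in the text, which is only a rough guess.

```python
import re


def looks_wrapped(line, next_line, width):
    """True if the first word of next_line would not have fit on line.

    `width` is the assumed wrap column; it is not known from the text itself.
    """
    first_word = next_line.split()[0]
    return len(line) + 1 + len(first_word) > width


def join_wrapped_lines(text, width=None):
    """Join lines that look wrapped; never join across blank lines."""
    if width is None:
        # Rough guess: treat the longest line as the wrap width.
        width = max((len(ln) for ln in text.splitlines()), default=0)

    chunks = re.split(r"\n\s*\n", text.strip())
    out = []
    for chunk in chunks:
        lines = [ln.strip() for ln in chunk.splitlines() if ln.strip()]
        if not lines:
            continue
        merged = [lines[0]]
        for line in lines[1:]:
            if looks_wrapped(merged[-1], line, width):
                merged[-1] = merged[-1] + " " + line
            else:
                merged.append(line)
        out.append("\n".join(merged))
    return "\n\n".join(out)


sample = (
    "HEADER\n\nlist element\n\netc\n\n"
    "This is a sentence with\na newline in it."
)
print(join_wrapped_lines(sample, width=24))
```

Blank-line-separated chunks such as the header and list elements pass through unchanged, and only lines inside a chunk are candidates for joining. The guessed width is unreliable on short inputs, so passing an explicit `width` is safer when the layout is known.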

You could use the textcat in spaCy for this, but it's designed for classification based on content - like sorting newspaper articles or product descriptions into categories. Your problem here is more about whether joining the lines is grammatical or how the layout should be interpreted. You might have more luck training a sentence recognizer (spaCy's senter component).

If you want to train a text classifier in spaCy anyway, you would just need to label your data. You could either label each blank-line-separated chunk (maybe as header/list/paragraph), or you could label each line (single/starter/continuous).
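As a rough illustration of the per-line labelling option, here is a sketch in plain Python. It assumes you have already grouped the physical lines into logical units by hand; the `label_lines` helper and the record layout are hypothetical. The `cats` dictionary mirrors the label-to-score mapping that spaCy's text categoriser reads from `doc.cats`, but converting these records into spaCy's training format is left out.

```python
LABELS = ("single", "starter", "continuous")


def label_lines(units):
    """Turn hand-grouped lines into per-line classification records.

    `units` is a list of logical units; each unit is the list of physical
    lines that belong together (a one-line unit is a header or list item).
    """
    records = []
    for unit in units:
        for i, line in enumerate(unit):
            if len(unit) == 1:
                gold = "single"
            elif i == 0:
                gold = "starter"
            else:
                gold = "continuous"
            # One score per label, 1.0 for the gold label and 0.0 otherwise.
            cats = {label: float(label == gold) for label in LABELS}
            records.append({"text": line, "cats": cats})
    return records


units = [
    ["HEADER"],
    ["list element"],
    ["This is a sentence with", "a newline in it."],
]
for record in label_lines(units):
    print(record)
```

Note that a classifier trained on single lines sees no context from the neighbouring lines, which is part of why this may work less well than the rule-based check or a sentence recognizer.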
