Categorise elements of list #10275
I understand spaCy has a text categoriser. After segmenting a programming technical manual, I want it to decide which text should be split on newlines and which should instead be joined across newlines.

After segmentation, some of the text looks like this:

HEADER

list element

list element

etc

And some of it is sentences with newlines in them:

This is a sentence with

So the categoriser should decide whether a piece of text looks like a broken sentence or not. Can anyone walk me through how to do that in spaCy? Thank you very much.
Replies: 1 comment
Have you tried just taking a rule-based approach to this? Based on your example, the list elements have a blank line between each one, so there's no ambiguity. You could also check the length of a line plus the first word of the next line to see if a line was wrapped.

You could use the textcat in spaCy for this, but it's designed for classification based on content - like splitting newspaper articles or product descriptions into categories. Your problem here is more about whether joining the lines is grammatical, or how the layout should be interpreted. You might have more luck training a sentence recognizer.

If you want to train a text classifier in spaCy anyway, you would just need to label your data. You could either label each blank-line-separated chunk (maybe as header/list/paragraph), or you could label each line (single/starter/continuous).
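To make the rule-based idea concrete, here is a minimal sketch in plain Python (no spaCy involved). It treats blank lines as hard separators and joins two adjacent lines only when the first word of the next line would not have fit on the current one. The function names `looks_wrapped` and `rejoin`, the `width` parameter, and the rule that a line ending in `.`, `!`, `?` or `:` is never joined are all my own assumptions for illustration, so you would need to tune them for your manual.

```python
from typing import List, Optional


def looks_wrapped(line: str, next_line: str, width: int) -> bool:
    """Guess whether `next_line` continues the text started on `line`."""
    line, next_line = line.rstrip(), next_line.strip()
    if not line.strip() or not next_line:
        return False  # blank lines always separate chunks
    if line.endswith((".", "!", "?", ":")):
        return False  # assumption: terminal punctuation ends the unit
    first_word = next_line.split()[0]
    # If the next word would not have fit within `width`, the line break
    # was probably inserted by wrapping rather than by the author.
    return len(line) + 1 + len(first_word) > width


def rejoin(text: str, width: Optional[int] = None) -> List[str]:
    """Split `text` into units, joining lines that look wrapped."""
    lines = text.splitlines()
    if width is None:
        # Assumption: the longest line approximates the wrap width.
        # This is only reasonable when run on a whole document.
        width = max((len(l.rstrip()) for l in lines), default=0)
    units: List[str] = []
    current, previous = "", ""
    for line in lines:
        stripped = line.strip()
        if not stripped:  # blank line: close the current unit
            if current:
                units.append(current)
            current = ""
            continue
        if current and looks_wrapped(previous, line, width):
            current += " " + stripped
        else:
            if current:
                units.append(current)
            current = stripped
        previous = line
    if current:
        units.append(current)
    return units


sample = (
    "HEADER\n\nlist element\n\nlist element\n\n"
    "This is a sentence with\nnewlines in it."
)
print(rejoin(sample, width=30))
```

With `width=30` the header and list elements stay separate and the last two lines are joined into one sentence. If the output looks wrong on real data, the wrap width is the first thing to adjust.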