Hey @strumdude, thanks for the question! I've had to solve a similar problem with docx files in the past, so hopefully this is of some help.

First, you'll need to associate the structure of the document with its content. For example, if you had a document with a heading and a paragraph, you might use a list of (element, content) tuples like this:

document_sections = [
    ("heading", "Introduction to Physics"),
    ("paragraph", "The main goal of physics is to explain how things move in space and time.")
]

This assumes you can somehow extract that structural information from the document. If you're using python-docx, I think this should be possible given the input document …
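
To make that extraction step a bit more concrete, here is a rough sketch of how the (element, content) tuples might be built with python-docx. It assumes the document uses Word's built-in "Title" and "Heading N" styles; the helper names (label_for_style, to_sections, read_docx_sections) are made up for illustration and are not part of python-docx or spaCy.

```python
from typing import Iterable, List, Optional, Tuple

Section = Tuple[str, str]


def label_for_style(style_name: Optional[str]) -> str:
    """Map a Word style name to a coarse structural label."""
    # Assumption: headings use Word's built-in "Title" / "Heading N" styles.
    # Documents with custom style names would need their own mapping.
    style_name = style_name or ""
    if style_name == "Title" or style_name.startswith("Heading"):
        return "heading"
    return "paragraph"


def to_sections(styled_paragraphs: Iterable[Tuple[Optional[str], str]]) -> List[Section]:
    """Turn (style_name, text) pairs into (element, content) tuples."""
    sections: List[Section] = []
    for style_name, text in styled_paragraphs:
        text = text.strip()
        if text:  # skip empty paragraphs, which Word files often contain
            sections.append((label_for_style(style_name), text))
    return sections


def read_docx_sections(path: str) -> List[Section]:
    """Read a .docx file with python-docx and return (element, content) tuples."""
    from docx import Document  # third-party: pip install python-docx

    document = Document(path)
    return to_sections((p.style.name, p.text) for p in document.paragraphs)
```

Calling read_docx_sections("my_file.docx") (a hypothetical path) should then give a list shaped like document_sections above. Tables, headers, and footers are not covered by document.paragraphs, so they would need separate handling.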

Answer selected by adrianeboyd
Labels: usage (General spaCy usage)