How to retain original text structure #11559

strumdude · 2022-09-29T13:10:45Z

I am fairly new to NLP, so please be gentle ;)
I am processing a document (from PDF or docx) through a spaCy pipeline and identifying important words for a child to learn using various inputs. This works fine with plain text, but ultimately the document needs to be rendered as HTML with the important words highlighted and linked to other data. I therefore need solutions to two problems:

How to retain the original structure (headings, paragraphs, etc.). I appreciate that this is outside of spaCy's normal scope.
Our front-end devs need to know which words to highlight and link to external data. To do that, I assume I will need to add custom spans to the important words, then render the entire document as HTML with every word having an id set to its spaCy token index (token.i)?

Any pointers would be most welcome!
Thanks

Answered by pmbaumgartner


pmbaumgartner · 2022-09-29T13:57:12Z

Hey @strumdude, thanks for the question! I've had to solve a similar problem with docx files in the past, so hopefully this is of some help.

First, you'll need to make some association between the structure and the content of the document. For example, if you had a document with a heading and a paragraph, you might use a tuple of (element, content) like this:

document_sections = [
    ("heading", "Introduction to Physics"),
    ("paragraph", "The main goal of physics is to explain how things move in space and time.")
]

This assumes you can somehow extract that structural information from the document. If you're using python-docx, I think this should be possible provided the input document uses styles correctly. If the input documents only use formatting to indicate specific elements (e.g. headings that are just bolded text), you might have a hard time with this.
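As a rough sketch of that extraction step: python-docx paragraphs expose `.text` and `.style.name`, so the style name can be mapped to an element type. The helper name `sections_from_paragraphs` and the rule "styles starting with Heading, or Title, are headings" are assumptions for illustration, not something from the answer above. Stand-in objects are used so the sketch runs without a .docx file.

```python
from types import SimpleNamespace


def sections_from_paragraphs(paragraphs):
    """Map paragraph-like objects to (element, content) tuples.

    Each item needs `.text` and `.style.name`, which python-docx
    Paragraph objects provide.
    """
    sections = []
    for para in paragraphs:
        text = para.text.strip()
        if not text:
            continue  # skip empty paragraphs
        style_name = para.style.name
        # Assumption: headings use the built-in "Heading N" or "Title" styles.
        is_heading = style_name.startswith("Heading") or style_name == "Title"
        sections.append(("heading" if is_heading else "paragraph", text))
    return sections


# With python-docx installed, the real call would look something like:
#   from docx import Document
#   document_sections = sections_from_paragraphs(Document("input.docx").paragraphs)

# Stand-in paragraphs so the sketch is runnable on its own.
fake_paragraphs = [
    SimpleNamespace(
        text="Introduction to Physics",
        style=SimpleNamespace(name="Heading 1"),
    ),
    SimpleNamespace(text="", style=SimpleNamespace(name="Normal")),
    SimpleNamespace(
        text="The main goal of physics is to explain how things move in space and time.",
        style=SimpleNamespace(name="Normal"),
    ),
]
document_sections = sections_from_paragraphs(fake_paragraphs)
```

Documents that mark headings only with bold text would all come out as "paragraph" here, which is the limitation mentioned above.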

Given that type of structure, you can then create Doc objects from that content and structural data using a custom attribute. In this example, we'll also use a small English model, but you might want to use a larger pre-trained model depending on your use case.

from spacy.tokens import Doc
import spacy

nlp = spacy.load("en_core_web_sm")

# Register extensions to hold metadata.
Doc.set_extension("element", default=None)

sections = []
for (element, text) in document_sections:
    doc = nlp.make_doc(text)
    doc._.element = element
    sections.append(doc)

processed_sections = list(nlp.pipe(sections))

# additional processing logic here

Now processed_sections contains documents that have been run through the pipeline. However, since I don't know the logic of what constitutes an "important word" for your use case, I've skipped over that step. My general suggestion is to use the doc.spans container with SpanGroups, saving the important words under a key such as doc.spans['important_words']. A span group contains Span objects, which have start and end attributes that give the token indices of that span within the doc. To make things consistent and modular, you could wrap this logic in a custom component and add it to the pipeline, so that all of this is performed when the doc is run through nlp.pipe.
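A minimal sketch of such a component, assuming "important" simply means membership in a fixed word list. The `IMPORTANT_WORDS` set, the component name, and the "IMPORTANT" label are placeholders; the real selection logic would replace the set lookup. A blank English pipeline is used so no trained model is needed.

```python
import spacy
from spacy.language import Language
from spacy.tokens import Span

# Hypothetical word list; stands in for the real "important word" logic.
IMPORTANT_WORDS = {"physics", "space", "time"}


@Language.component("mark_important_words")
def mark_important_words(doc):
    # One single-token span per matching word, stored in a span group.
    doc.spans["important_words"] = [
        Span(doc, token.i, token.i + 1, label="IMPORTANT")
        for token in doc
        if token.lower_ in IMPORTANT_WORDS
    ]
    return doc


nlp = spacy.blank("en")  # tokenizer only
nlp.add_pipe("mark_important_words")

doc = nlp("The main goal of physics is to explain how things move in space and time.")
found = [(span.text, span.start, span.end) for span in doc.spans["important_words"]]
print(found)
```

Because the component is part of the pipeline, the same span group is populated for every section passed through nlp.pipe.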

Finally, you would use these span groups combined with your own processing logic with the element types to render your HTML correctly. And remember, you've retained the original structure on the element attribute, which you can access with doc._.element.
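One way this rendering step might look, as a sketch only: the tag mapping (`h1`, `p`), the use of `<mark>` for highlighting, and the function name `render_section` are all assumptions, and a real front end would likely want ids or links on each highlighted word.

```python
import html

import spacy
from spacy.tokens import Doc, Span

# force=True avoids an error if the extension was already registered.
Doc.set_extension("element", default=None, force=True)

# Assumed mapping from element type to HTML tag.
ELEMENT_TAGS = {"heading": "h1", "paragraph": "p"}


def render_section(doc, spans_key="important_words"):
    """Render one section, wrapping tokens from the span group in <mark>."""
    group = doc.spans[spans_key] if spans_key in doc.spans else []
    important = {token.i for span in group for token in span}
    pieces = []
    for token in doc:
        text = html.escape(token.text)
        if token.i in important:
            text = f"<mark>{text}</mark>"
        # whitespace_ restores the original spacing after each token.
        pieces.append(text + token.whitespace_)
    body = "".join(pieces)
    tag = ELEMENT_TAGS.get(doc._.element, "p")
    return f"<{tag}>{body}</{tag}>"


nlp = spacy.blank("en")
doc = nlp.make_doc("Introduction to Physics")
doc._.element = "heading"
doc.spans["important_words"] = [Span(doc, 2, 3)]

rendered = render_section(doc)
print(rendered)
```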

Also note that there is similar functionality available in displaCy, so if you're prototyping it might be beneficial to use that output for now and then your front-end team could modify the look and feel of that. It's a little more complex, but you can also take a look at the code for displaCy to understand how it renders spans to HTML as well.
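For prototyping, a sketch along these lines should produce HTML for a span group using displaCy's span visualizer. This assumes a spaCy version that includes the "span" style (3.3 or later, as far as I know); the span key and label are placeholders.

```python
import spacy
from spacy import displacy
from spacy.tokens import Span

nlp = spacy.blank("en")
doc = nlp("The main goal of physics is to explain how things move in space and time.")
doc.spans["important_words"] = [Span(doc, 4, 5, label="IMPORTANT")]

# jupyter=False makes render() return the markup as a string.
html_out = displacy.render(
    doc,
    style="span",
    options={"spans_key": "important_words"},
    jupyter=False,
)
```

The returned string can be written to a file and opened in a browser to check the highlighting before building a custom renderer.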

9 replies

strumdude Sep 30, 2022
Author

@pmbaumgartner
This process creates multiple docs each with a separate token index. I need a unique token id for each word. Is that possible whilst also retaining the relationship with the structural element?
Thanks

pmbaumgartner Sep 30, 2022

You could use the cumulative length of the preceding sections plus the token index within the section to do that. len(doc) gives the number of tokens, so if you track that per section, you can identify each token by accumulating the section lengths and adding the running total to token.i for each token.
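A small sketch of that accumulation. The function name `section_offsets` is made up for illustration; it only relies on len(), so it works on spaCy Doc objects as well as on the plain token lists used here as stand-ins.

```python
def section_offsets(sections):
    """Return, for each section, the number of tokens that precede it."""
    offsets = []
    total = 0
    for section in sections:
        offsets.append(total)
        total += len(section)
    return offsets


# Stand-ins for processed_sections; with spaCy docs, use token.i
# in place of the list index below.
sections = [
    ["Introduction", "to", "Physics"],
    ["The", "main", "goal"],
]
offsets = section_offsets(sections)

# Document-wide id = section offset + index within the section.
global_ids = [
    [offset + i for i in range(len(section))]
    for offset, section in zip(offsets, sections)
]
print(offsets, global_ids)
```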

strumdude Oct 3, 2022
Author

@pmbaumgartner Hi Peter - I dropped you an email on Friday. Just checking that you received it and it's not in spam! Thanks

strumdude Oct 7, 2022
Author

I have created a new custom attribute on tokens called nid that loops through all tokens and numbers them sequentially. That seems to work for my use case, but can I make use of that when identifying spans? .start and .end are using the token.i. Can they use token.nid? Thanks

pmbaumgartner Oct 11, 2022

@strumdude That shouldn't be a problem. Each span contains the set of tokens you can iterate over, so you should just need to do something like nids = [token._.nid for token in span]
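Putting the two replies together, a sketch might look like this. It assumes nid is registered as a Token extension and filled by a simple loop across sections, as described in the question; the example texts and span are placeholders. Note that span.start and span.end stay per-doc token indices; the document-wide ids come from the tokens inside the span.

```python
import spacy
from spacy.tokens import Span, Token

# force=True avoids an error if the extension was already registered.
Token.set_extension("nid", default=None, force=True)

nlp = spacy.blank("en")
sections = list(nlp.pipe(["Introduction to Physics", "Physics explains motion."]))

# Number every token sequentially across all sections.
next_id = 0
for doc in sections:
    for token in doc:
        token._.nid = next_id
        next_id += 1

second = sections[1]
span = Span(second, 0, 2)  # covers "Physics explains"
nids = [token._.nid for token in span]
print(span.start, span.end, nids)
```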
