How to retain original text structure #11559
I am fairly new to NLP, so please be gentle ;)
Any pointers would be most welcome!
Replies: 1 comment 9 replies
Hey @strumdude, thanks for the question! I've had to solve a similar problem with docx files in the past, so hopefully this is of some help.

First, you'll need to make some association between the structure and the content of the document. For example, if you had a document with a heading and a paragraph, you might use a tuple of `(element, content)` for each one.

This assumes you can somehow extract that structural information from the document. If you're using `python-docx`, I think this should be possible given the input document …

Given that type of structure, you can then create span groups that tie each piece of content back to its element type.

Finally, you would use these span groups, combined with your own processing logic for the element types, to render your HTML correctly. And remember, you've retained the original structure throughout.

Also note that there is similar functionality available in displaCy, so if you're prototyping it might be beneficial to use that output for now, and your front-end team could then modify its look and feel. It's a little more complex, but you can also take a look at the code for displaCy to understand how it renders spans to HTML.
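To make the `(element, content)` idea a bit more concrete, here is a minimal, library-free sketch. It is not the code from the reply above, just one way the approach could look. The sample elements (`h1`, `p`) and the helper names `build_text_and_offsets` and `render_html` are made up for illustration. In spaCy, character ranges like these could be turned into spans with `Doc.char_span` and stored under a key in `doc.spans`, but that step is left out here so the example runs with the standard library alone.

```python
from html import escape

# Hypothetical input: (element, content) tuples in document order.
elements = [
    ("h1", "Quarterly Report"),
    ("p", "Revenue grew in the third quarter."),
]


def build_text_and_offsets(elements, sep="\n"):
    """Join the contents into one string and record each element's character range."""
    parts, offsets, cursor = [], [], 0
    for element, content in elements:
        # Remember which element type covers which slice of the joined text.
        offsets.append((element, cursor, cursor + len(content)))
        parts.append(content)
        cursor += len(content) + len(sep)
    return sep.join(parts), offsets


def render_html(text, offsets):
    """Wrap each recorded character range in a tag named after its element type."""
    return "\n".join(
        f"<{element}>{escape(text[start:end])}</{element}>"
        for element, start, end in offsets
    )


text, offsets = build_text_and_offsets(elements)
html_output = render_html(text, offsets)
print(html_output)
```

The point of keeping the offsets next to the joined text is that any NLP processing can run over `text` as a whole, while the element types remain available when it is time to render.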