0.17.2

ajjimeno released this 20 Mar 16:52

0fa5174

Enhancements

Add image_url of images in html partitioner: <img> tags with non-data content now include a new image_url metadata field with the content of the src attribute.
Use lxml instead of bs4 to parse hOCR data. lxml is much faster than bs4 given that the hOCR data format is regular (guaranteed, because it is programmatically generated).
Bump numpy to >2, and upgrade paddlepaddle, unstructured-paddleocr, and onnx so they are compatible with numpy>2.
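
As a rough illustration of the image_url rule above, the following standard-library sketch collects the src of every <img> tag whose src is not an inline data: URI. It is not the partitioner's implementation. The names ImageUrlCollector and collect_image_urls are hypothetical, and reading "non-data content" as "src does not start with data:" is an assumption.

```python
from html.parser import HTMLParser


class ImageUrlCollector(HTMLParser):
    """Hypothetical helper: collect src values of <img> tags that are not data URIs."""

    def __init__(self):
        super().__init__()
        self.image_urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        src = dict(attrs).get("src")
        # Assumption: "non-data content" means the src is not an inline "data:" URI.
        if src and not src.strip().lower().startswith("data:"):
            self.image_urls.append(src)


def collect_image_urls(html):
    """Return the src of each <img> in `html` that points at a URL, in document order."""
    parser = ImageUrlCollector()
    parser.feed(html)
    parser.close()
    return parser.image_urls


html = (
    '<div><img src="https://example.com/cat.png" alt="cat">'
    '<img src="data:image/png;base64,iVBORw0KGgo=" alt="inline"></div>'
)
print(collect_image_urls(html))
```

Only the first image is reported here, since the second carries its content inline rather than as a URL.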

Fixes

Fix Image in a <div> or <span> tag is "UncategorizedText" with no .text. An image inside such a tag with no text is now annotated as Image.

What's Changed

feat: support extracting image url in html by @ryannikolaidis in #3955
feat: use lxml instead of bs4 to parse hOCR data by @badGarnet in #3960
Feat/bump numpy to 2 by @badGarnet in #3961
Image within div or span with no text is annotated as Image by @ajjimeno in #3962
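
To give a sense of why a regular, machine-generated format suits an XML parser such as lxml (#3960), here is a minimal sketch that reads word boxes from a tiny hOCR fragment. It is not the code from the PR. The function name parse_hocr_words and the sample document are hypothetical, and the sample omits the DOCTYPE and XHTML namespace that real OCR output usually carries. The import falls back to the standard library so the sketch runs without lxml installed.

```python
try:
    from lxml import etree  # C-backed parser, used when available
except ImportError:  # fallback so the sketch still runs without lxml
    import xml.etree.ElementTree as etree

# Hypothetical, simplified hOCR sample (no DOCTYPE, no XHTML namespace).
HOCR = """<html><body>
<div class='ocr_page' title='bbox 0 0 200 100'>
  <span class='ocrx_word' title='bbox 10 20 60 40; x_wconf 96'>Hello</span>
  <span class='ocrx_word' title='bbox 70 20 130 40; x_wconf 91'>world</span>
</div>
</body></html>"""


def parse_hocr_words(hocr):
    """Return (text, bbox, confidence) for each ocrx_word span in the document."""
    root = etree.fromstring(hocr)
    words = []
    for span in root.iter("span"):
        if span.get("class") != "ocrx_word":
            continue
        # The title attribute holds ';'-separated properties such as
        # "bbox x0 y0 x1 y1" and "x_wconf NN".
        props = {}
        for part in span.get("title", "").split(";"):
            key, _, value = part.strip().partition(" ")
            props[key] = value
        bbox = tuple(int(v) for v in props["bbox"].split())
        words.append((span.text, bbox, float(props["x_wconf"])))
    return words


for word in parse_hocr_words(HOCR):
    print(word)
```

Because every word span follows the same shape, a strict XML parse plus simple string splitting is enough, with no need for a lenient HTML parser.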

Full Changelog: 0.17.0...0.17.2

Contributors

badGarnet, ryannikolaidis, and ajjimeno
