0.17.2
Enhancements
- Add image_url of images in html partitioner. <img> tags with non-data content include a new image_url metadata field with the content of the src attribute.
- Use lxml instead of bs4 to parse hOCR data. lxml is much faster than bs4 given the hOCR data format is regular (guaranteed because it is programmatically generated).
- Bump numpy to >2, and upgrade paddlepaddle, unstructured-paddleocr, and onnx so they are compatible with numpy>2.
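
As a rough illustration of the image_url enhancement above, the sketch below collects the src of every <img> tag whose source is not an inline data: URI. It uses only Python's standard html.parser and is not the partitioner's actual implementation; the names ImageUrlCollector and collect_image_urls are hypothetical and exist only for this example.

```python
from html.parser import HTMLParser


class ImageUrlCollector(HTMLParser):
    """Hypothetical helper: collect src values of <img> tags, skipping data: URIs."""

    def __init__(self):
        super().__init__()
        self.image_urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        src = dict(attrs).get("src")
        # Assumption for this sketch: "non-data content" means the src
        # does not start with the "data:" scheme.
        if src and not src.strip().lower().startswith("data:"):
            self.image_urls.append(src)


def collect_image_urls(html):
    """Return the src of each <img> tag with a non-data source, in document order."""
    parser = ImageUrlCollector()
    parser.feed(html)
    parser.close()
    return parser.image_urls


sample = (
    '<div><img src="https://example.com/cat.png" alt="cat">'
    '<img src="data:image/png;base64,AAAA" alt="inline"></div>'
)
print(collect_image_urls(sample))  # ['https://example.com/cat.png']
```

In the library itself the value is exposed as the image_url metadata field on the element, as described in the entry above; this sketch only shows which src values would qualify under the stated assumption.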
Fixes
- Fix an image inside a div or span tag with no text being classified as "UncategorizedText" with no .text; it is now annotated as Image.
What's Changed
- feat: support extracting image url in html by @ryannikolaidis in #3955
- feat: use lxml instead of bs4 to parse hOCR data by @badGarnet in #3960
- Feat/bump numpy to 2 by @badGarnet in #3961
- Image within div or span with no text is annotated as Image by @ajjimeno in #3962
Full Changelog: 0.17.0...0.17.2