Releases · Unstructured-IO/unstructured-inference · GitHub

06 Oct 20:49

cragwolfe

0.7.2

Sort elements extracted by pdfminer to get consistent results from aggregate_by_block()
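The release note does not say which sort key is used. As a minimal sketch of why a deterministic order makes order-sensitive aggregation reproducible, the example below assumes a hypothetical `TextBox` tuple (not the library's own element class), image-style coordinates with y increasing downward, and a top-to-bottom, left-to-right ordering.

```python
from typing import List, NamedTuple


class TextBox(NamedTuple):
    """Hypothetical stand-in for an element extracted by pdfminer."""
    x1: float
    y1: float
    x2: float
    y2: float
    text: str


def sort_boxes(boxes: List[TextBox]) -> List[TextBox]:
    """Return boxes in a deterministic order: top-to-bottom, then left-to-right.

    The remaining fields act as tie-breakers so that boxes sharing a
    top-left corner still come out in a stable order.
    """
    return sorted(boxes, key=lambda b: (b.y1, b.x1, b.y2, b.x2, b.text))


def aggregate(boxes: List[TextBox]) -> str:
    """Toy order-sensitive aggregation: join the text in the order given."""
    return " ".join(b.text for b in boxes)
```

With this ordering, `aggregate(sort_boxes(boxes))` yields the same string regardless of the order in which the extractor happened to return the boxes.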

06 Oct 19:57

benjats07

0.7.1

Download the already-quantized yolox model from Hugging Face
Pin onnxruntime<1.16

05 Oct 18:25

cragwolfe

0.7.0

Remove all OCR-related code except the table OCR code

27 Sep 23:52

yuming-long

0.6.6

Stop passing the ocr_languages parameter to paddle to avoid an invalid paddle language code error. This workaround stays in place until
we have a mapping from standard language codes to paddle language codes.

27 Sep 00:51

cragwolfe

0.6.5

Add functionality to keep extracted image elements while merging inferred layout with extracted layout
Fix source property for elements generated by pdfminer.
Add 'OCR-tesseract' and 'OCR-paddle' as sources for elements generated by OCR.

27 Sep 00:42

cragwolfe

0.6.4

Add a function that automatically scales table crop images based on text height so that the text height is optimal for the tesseract OCR task
Add the new image auto-scaling parameters to config.py
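The scaling idea can be illustrated with a small calculation. This is a minimal sketch, not the library's function: the names `compute_scale_factor` and `scaled_size`, and the default values for the target height, tolerance, and cap, are all hypothetical and are not taken from config.py.

```python
from typing import Tuple


def compute_scale_factor(
    measured_text_height: float,
    target_text_height: float = 30.0,  # hypothetical default, in pixels
    tolerance: float = 0.1,            # hypothetical: skip rescaling when already close
    max_scale: float = 4.0,            # hypothetical cap to avoid huge images
) -> float:
    """Return the factor that brings the measured text height to the target."""
    if measured_text_height <= 0:
        raise ValueError("measured_text_height must be positive")
    scale = target_text_height / measured_text_height
    # Leave the image alone when the text is already near the target height.
    if abs(scale - 1.0) <= tolerance:
        return 1.0
    return min(scale, max_scale)


def scaled_size(width: int, height: int, scale: float) -> Tuple[int, int]:
    """Compute the new (width, height) of a crop, never smaller than 1 pixel."""
    return max(1, round(width * scale)), max(1, round(height * scale))
```

For example, text measured at 15 pixels with a 30 pixel target gives a factor of 2.0, so a 100x50 crop would be resized to 200x100 before being passed to OCR.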

25 Sep 20:20

badGarnet

0.6.3

What's Changed

feat: make table transformer parameters configurable by @badGarnet in #224
feat: add pre commit hook by @badGarnet in #220

Bug fixes

fix: padded boxes are not rescaled/shifted correctly by @badGarnet in #229

Full Changelog: 0.6.1...0.6.3

Contributors

badGarnet

21 Sep 16:38

badGarnet

0.6.1

What's Changed

feat: add config class by @badGarnet in #218. This change allows a user to specify inference parameters via environment variables.
Fix/overlapping of bboxes by @benjats07 in #201. This change makes yolox the default model for element detection and removes duplicate or near-duplicate bounding boxes from the results to reduce noise in the final elements.
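The release note does not state how near-duplicate boxes are detected. One common approach, shown here as a sketch under that assumption, is to drop any box whose intersection-over-union (IoU) with an already kept box exceeds a threshold. The function names and the 0.9 threshold are hypothetical.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def drop_near_duplicates(boxes: List[Box], threshold: float = 0.9) -> List[Box]:
    """Keep the first box of each group whose mutual IoU reaches the threshold."""
    kept: List[Box] = []
    for box in boxes:
        if all(iou(box, k) < threshold for k in kept):
            kept.append(box)
    return kept
```

Because the first box in each group wins, sorting the input by detection confidence beforehand would make the highest-scoring box the one that survives.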

Full Changelog: 0.5.31...0.6.1

Contributors

badGarnet and benjats07

21 Sep 05:10

cragwolfe

0.5.31

Add functionality to extract and save images from the page
Add functionality to get only "true" embedded images when extracting elements from PDF pages
Update the layout visualization script to be able to show only image elements if needed
Add an evaluation metric for table comparison based on token similarity
Fix paddle unit tests, where make test fails because paddle doesn't work locally on M1/M2 chips
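The release note gives no formula for the token-similarity metric. As a rough sketch of what such a metric could look like, the example below flattens each table into lowercase whitespace tokens and compares the two token sequences with the standard library's `difflib.SequenceMatcher`. The table representation (a list of rows of cell strings) and the function names are hypothetical.

```python
from difflib import SequenceMatcher
from typing import List

Table = List[List[str]]  # rows of cell strings (hypothetical representation)


def tokenize_table(table: Table) -> List[str]:
    """Flatten a table row by row into lowercase whitespace-separated tokens."""
    return [tok.lower() for row in table for cell in row for tok in cell.split()]


def token_similarity(predicted: Table, expected: Table) -> float:
    """Return a score in [0, 1]; 1.0 means the token sequences are identical."""
    a, b = tokenize_table(predicted), tokenize_table(expected)
    if not a and not b:
        return 1.0  # two empty tables are treated as a perfect match
    # autojunk=False disables the heuristic that ignores very frequent tokens.
    return SequenceMatcher(None, a, b, autojunk=False).ratio()
```

In this sketch, two four-token tables that differ in a single token score 0.75, since the ratio is twice the number of matched tokens divided by the total token count.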

14 Sep 23:13

yuming-long

0.5.28

Add the ENTIRE_PAGE_OCR environment variable to select paddle or tesseract for entire-page OCR
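A minimal sketch of how such a switch might be read, assuming the accepted values are the strings "tesseract" and "paddle" and that tesseract is the fallback when the variable is unset. The helper name `entire_page_ocr_engine`, the default, and the validation are assumptions, not the library's code.

```python
import os

SUPPORTED_ENGINES = ("tesseract", "paddle")


def entire_page_ocr_engine(default: str = "tesseract") -> str:
    """Read ENTIRE_PAGE_OCR from the environment and validate its value."""
    value = os.environ.get("ENTIRE_PAGE_OCR", default).strip().lower()
    if value not in SUPPORTED_ENGINES:
        raise ValueError(
            f"ENTIRE_PAGE_OCR must be one of {SUPPORTED_ENGINES}, got {value!r}"
        )
    return value
```

Setting `ENTIRE_PAGE_OCR=paddle` in the shell before starting the process would then make this helper return "paddle".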
