Releases: Unstructured-IO/unstructured-inference

0.7.2

06 Oct 20:49
fd0cc54

  • Sort elements extracted by pdfminer to get consistent results from aggregate_by_block()
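
The idea behind this change can be illustrated with a minimal sketch. The release note does not say which sort key the library uses, so the `Box` type, the `sort_reading_order` helper, and the top-to-bottom, left-to-right key below are all assumptions made for illustration; they are not the library's API.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Box:
    """Hypothetical stand-in for an extracted text element."""
    x1: float
    y1: float
    x2: float
    y2: float
    text: str = ""


def sort_reading_order(boxes: Iterable[Box]) -> List[Box]:
    """Return boxes in a deterministic order, whatever the input order was.

    Assumed key: top edge first, then left edge. The remaining fields
    break ties, so two different extraction orders give the same list.
    """
    return sorted(boxes, key=lambda b: (b.y1, b.x1, b.y2, b.x2, b.text))


boxes = [
    Box(50, 10, 90, 20, "b"),
    Box(0, 10, 40, 20, "a"),
    Box(0, 30, 40, 40, "c"),
]
ordered = sort_reading_order(boxes)
```

Any downstream grouping step that depends on iteration order then sees the same sequence on every run.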

0.7.1

06 Oct 19:57
66fb179

  • Download the already-quantized yolox model from HF
  • Pin onnxruntime<1.16

0.7.0

05 Oct 18:25
ffb1f0b

  • Remove all OCR-related code except the table OCR code

0.6.6

27 Sep 23:52
cf15726

  • Stop passing the ocr_languages parameter into paddle to avoid an invalid paddle language code error. This is a temporary measure until
    we have a mapping from standard language codes to paddle language codes.

0.6.5

27 Sep 00:51
12ca9d9

  • Add functionality to keep extracted image elements while merging inferred layout with extracted layout
  • Fix source property for elements generated by pdfminer.
  • Add 'OCR-tesseract' and 'OCR-paddle' as sources for elements generated by OCR.

0.6.4

27 Sep 00:42
c4d3e8b

  • add a function to automatically scale table crop images based on text height so that the text height is optimal for the tesseract OCR task
  • add the new image auto scaling parameters to config.py
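
As a rough illustration of scaling by text height, the sketch below computes a resize factor that brings the median text height of a table crop to a target value. The function name, the target height, and the clamping bounds are hypothetical; the actual parameter names and defaults live in config.py and are not given in this release note.

```python
from statistics import median
from typing import Sequence


def crop_scale_factor(
    text_heights: Sequence[float],
    target_height: float = 32.0,  # assumed target, in pixels
    min_scale: float = 0.5,       # assumed lower bound
    max_scale: float = 4.0,       # assumed upper bound
) -> float:
    """Return the factor by which to resize a table crop before OCR.

    text_heights holds the pixel heights of text lines detected in the
    crop. With no usable measurement the crop is left unscaled.
    """
    usable = [h for h in text_heights if h > 0]
    if not usable:
        return 1.0
    scale = target_height / median(usable)
    # Clamp so that a single noisy measurement cannot blow up the image.
    return max(min_scale, min(max_scale, scale))
```

The resulting factor would then be applied to both image dimensions with whatever resize routine the pipeline already uses.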

0.6.3

25 Sep 20:20
cb2aff2

What's Changed

Bug fixes

  • fix: padded boxes are not rescaled/shifted correctly by @badGarnet in #229

Full Changelog: 0.6.1...0.6.3

0.6.1

21 Sep 16:38
eaa8d65

What's Changed

  • feat: add config class by @badGarnet in #218. This change allows a user to specify inference parameters via environment variables.
  • Fix/overlapping of bboxes by @benjats07 in #201. This change makes yolox the default model for element detection and removes duplicate or near-duplicate bounding boxes from the results to reduce noise in the final elements.
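
Removing near-duplicate bounding boxes is commonly done with an intersection-over-union (IoU) threshold. The sketch below shows that general technique; it is not taken from #201, and the `iou` and `drop_near_duplicates` helpers, the tuple layout, and the 0.9 threshold are all assumptions.

```python
from typing import List, Sequence, Tuple

BBox = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: BBox, b: BBox) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def drop_near_duplicates(boxes: Sequence[BBox], threshold: float = 0.9) -> List[BBox]:
    """Keep a box only if it overlaps every already-kept box by less than threshold.

    Earlier boxes win, so callers that have confidence scores should
    sort by score (highest first) before calling this.
    """
    kept: List[BBox] = []
    for box in boxes:
        if all(iou(box, other) < threshold for other in kept):
            kept.append(box)
    return kept
```

This greedy pass is quadratic in the number of boxes, which is usually acceptable for the few dozen elements found on a single page.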

Full Changelog: 0.5.31...0.6.1

0.5.31

21 Sep 05:10
b9f032c

  • Add functionality to extract and save images from the page
  • Add functionality to get only "true" embedded images when extracting elements from PDF pages
  • Update the layout visualization script to be able to show only image elements if needed
  • add an evaluation metric for table comparison based on token similarity
  • fix paddle unit tests so that make test no longer fails locally on M1/M2 chips, where paddle doesn't work
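
The release note does not describe how the token-similarity metric for tables is defined. As one possible reading, the sketch below flattens each table into a token sequence and compares the two sequences with the standard library's `difflib.SequenceMatcher`. The function name and the row-major, whitespace-based tokenization are assumptions, not the library's implementation.

```python
from difflib import SequenceMatcher
from typing import List, Sequence


def _tokens(table: Sequence[Sequence[str]]) -> List[str]:
    """Flatten a table (rows of cell strings) into whitespace-separated tokens."""
    return [tok for row in table for cell in row for tok in cell.split()]


def table_token_similarity(
    predicted: Sequence[Sequence[str]],
    expected: Sequence[Sequence[str]],
) -> float:
    """Return a similarity score in [0, 1] between two tables.

    1.0 means the token sequences are identical; 0.0 means no tokens match.
    """
    pred, exp = _tokens(predicted), _tokens(expected)
    if not pred and not exp:
        return 1.0  # two empty tables are treated as identical
    # autojunk=False disables the heuristic that ignores very frequent tokens.
    return SequenceMatcher(None, pred, exp, autojunk=False).ratio()
```

Because the tokens are compared in row-major order, the score is sensitive to cells that were detected but placed in the wrong row or column.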

0.5.28

14 Sep 23:13
173f633

  • add env variable ENTIRE_PAGE_OCR to specify whether paddle or tesseract is used for entire-page OCR
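
A setting like this is typically read once and validated before use. The sketch below shows one way to do that. Only the variable name ENTIRE_PAGE_OCR and the two engine names come from the release note; the helper name, the case-insensitive matching, and the "tesseract" default are assumptions.

```python
import os
from typing import Mapping, Optional

SUPPORTED_ENGINES = ("tesseract", "paddle")


def entire_page_ocr_engine(
    environ: Optional[Mapping[str, str]] = None,
    default: str = "tesseract",  # assumed default; not stated in the release note
) -> str:
    """Return the OCR engine selected by the ENTIRE_PAGE_OCR variable."""
    env = os.environ if environ is None else environ
    value = env.get("ENTIRE_PAGE_OCR", default).strip().lower()
    if value not in SUPPORTED_ENGINES:
        raise ValueError(
            f"ENTIRE_PAGE_OCR must be one of {SUPPORTED_ENGINES}, got {value!r}"
        )
    return value
```

From a shell, the variable would be set before launching the process, for example `ENTIRE_PAGE_OCR=paddle python my_script.py`, where `my_script.py` is a placeholder.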