Commit fd0cc54

Fix: sort elements extracted by pdfminer (#244)
### Summary

- sort elements extracted by `pdfminer` to get consistent results from `aggregate_by_block()`

### Testing

PDF: [recalibrating-risk-report_4-4.pdf](https://github.com/Unstructured-IO/unstructured-inference/files/12835342/recalibrating-risk-report_4-4.pdf)

```
from unstructured_inference.inference.layout import process_file_with_model

f_path = "dist/docs/recalibrating-risk-report_4-4.pdf"
# model_name=None selects the default layout model
layout = process_file_with_model(
    filename=f_path,
    model_name=None,
)
# Print the elements of the first page and their count
elements = layout.pages[0].elements
print("\n\n".join([str(el) for el in elements]))
print(len(elements))
```
1 parent 66fb179 commit fd0cc54

3 files changed: 7 additions and 1 deletion

CHANGELOG.md

Lines changed: 4 additions & 0 deletions
@@ -1,3 +1,7 @@
+## 0.7.2
+
+* Sort elements extracted by `pdfminer` to get consistent result from `aggregate_by_block()`
+
 ## 0.7.1
 
 * Download yolox_quantized from HF
Lines changed: 1 addition & 1 deletion
@@ -1 +1 @@
-__version__ = "0.7.1" # pragma: no cover
+__version__ = "0.7.2" # pragma: no cover

unstructured_inference/inference/layout.py

Lines changed: 2 additions & 0 deletions
@@ -570,6 +570,8 @@ def load_pdf(
 
                 if text_region.area > 0:
                     layout.append(text_region)
+
+            layout = order_layout(layout)
             layouts.append(layout)
 
     if path_only and not output_folder:
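
The diff does not show how `order_layout` sorts, so the following is only a minimal sketch of why sorting the extracted regions before further processing makes downstream results reproducible. `Region`, `order_regions`, and the sort key (top edge, then left edge, with `y` growing downward) are hypothetical stand-ins chosen for illustration, not the library's implementation.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    """Hypothetical stand-in for an extracted text region (not the library class)."""

    x1: float
    y1: float
    x2: float
    y2: float
    text: str


def order_regions(regions):
    """Return regions sorted top-to-bottom, then left-to-right.

    The key is an assumption made for this sketch; the real order_layout
    may use a different rule.
    """
    return sorted(regions, key=lambda r: (r.y1, r.x1))


regions = [
    Region(50, 10, 200, 30, "Title"),
    Region(50, 40, 120, 60, "Left column"),
    Region(130, 40, 200, 60, "Right column"),
    Region(50, 70, 200, 90, "Footer"),
]

# Simulate an extractor that yields the same regions in a different order.
shuffled = regions[:]
random.Random(0).shuffle(shuffled)

# After sorting, both extraction orders collapse to the same sequence.
same_order = order_regions(regions) == order_regions(shuffled)
reading_order = [r.text for r in order_regions(shuffled)]
print(same_order)
print(reading_order)
```

Because every later step sees the regions in one canonical order, any order-sensitive grouping (such as the `aggregate_by_block()` call named in the summary) receives identical input regardless of how the extractor happened to emit them.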
