Commit d471949

Chore: fix bug caused by page break has no page number (#196)
* fix page break
1 parent 438819d commit d471949

5 files changed, with 14 additions and 6 deletions

CHANGELOG.md

Lines changed: 4 additions & 0 deletions
@@ -1,3 +1,7 @@
+## 0.0.38-dev0
+
+* Fix page break has None page number bug
+
 ## 0.0.37
 
 * Bump unstructured to 0.10.4

README.md

Lines changed: 2 additions & 2 deletions
@@ -115,7 +115,7 @@ To extract the table structure from PDF files using the `hi_res` strategy, ensur
 #### Skip Table Extraction
 
 Currently, we provide support for enabling and disabling table extraction for file types other than PDF files. Set parameter `skip_infer_table_types` to specify the document types that you want to skip table extraction with. By default, we skip table extraction
-for PDFs and Images, which are `pdf`, `jpg` and `png`. Again, please note that table extraction only works with `hi_res` strategy. For example, if you don't want to skip table extraction for images, you can pass an empty value to `skip_infer_table_types`with:
+for PDFs and Images, which are `pdf`, `jpg` and `png`. Again, please note that table extraction only works with `hi_res` strategy. For example, if you don't want to skip table extraction for images, you can pass an empty value to `skip_infer_table_types` with:
 
 ```
 curl -X 'POST' \
@@ -124,7 +124,7 @@ for PDFs and Images, which are `pdf`, `jpg` and `png`. Again, please note that t
 -H 'Content-Type: multipart/form-data' \
 -F 'files=@sample-docs/layout-parser-paper-with-table.jpg' \
 -F 'strategy=hi_res' \
--F 'skip_infer_table_types=' \
+-F 'skip_infer_table_types=[]' \
 | jq -C . | less -R
 ```
 

pipeline-notebooks/pipeline-general.ipynb

Lines changed: 3 additions & 1 deletion
@@ -673,7 +673,9 @@
 "\n",
 "    # We need to account for the original page numbers\n",
 "    for element in elements:\n",
-"        element.metadata.page_number += page_offset\n",
+"        if element.metadata.page_number:\n",
+"            # Page number could be None if we include page breaks\n",
+"            element.metadata.page_number += page_offset\n",
 "\n",
 "    return elements\n",
 "\n",

prepline_general/api/general.py

Lines changed: 4 additions & 2 deletions
@@ -144,7 +144,9 @@ def partition_file_via_api(file_tuple, request, filename, content_type, **partit
 
     # We need to account for the original page numbers
     for element in elements:
-        element.metadata.page_number += page_offset
+        if element.metadata.page_number:
+            # Page number could be None if we include page breaks
+            element.metadata.page_number += page_offset
 
     return elements
 
@@ -484,7 +486,7 @@ def return_content_type(filename):
 
 
 @router.post("/general/v0/general")
-@router.post("/general/v0.0.37/general")
+@router.post("/general/v0.0.38/general")
 def pipeline_1(
     request: Request,
     gz_uncompressed_content_type: Optional[str] = Form(default=None),

preprocessing-pipeline-family.yaml

Lines changed: 1 addition & 1 deletion
@@ -1,2 +1,2 @@
 name: general
-version: 0.0.37
+version: 0.0.38
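
For reviewers who want to reproduce the failure that the `page_number` guard addresses, here is a minimal, self-contained sketch. `Metadata`, `Element`, and `apply_page_offset` are hypothetical stand-ins written for this illustration; they are not the classes or helpers used in `prepline_general/api/general.py`. The sketch only assumes, as the new code comment states, that a page break element carries `page_number=None`.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Metadata:
    # Hypothetical stand-in for the element metadata object
    page_number: Optional[int] = None


@dataclass
class Element:
    # Hypothetical stand-in for a partitioned document element
    text: str
    metadata: Metadata = field(default_factory=Metadata)


def apply_page_offset(elements: List[Element], page_offset: int) -> List[Element]:
    """Shift page numbers by page_offset, skipping elements without one."""
    for element in elements:
        # Same guard as in the diff: without it, `None += int`
        # raises TypeError on the first page break.
        if element.metadata.page_number:
            element.metadata.page_number += page_offset
    return elements


elements = [
    Element("Title", Metadata(page_number=1)),
    Element("", Metadata(page_number=None)),  # page break, no page number
    Element("Body", Metadata(page_number=2)),
]
shifted = apply_page_offset(elements, page_offset=3)
print([e.metadata.page_number for e in shifted])  # [4, None, 5]
```

Note that the truthiness check also skips a page number of `0`. If zero-based page numbers were ever possible, `is not None` would be the stricter comparison; whether that case can occur here is not stated in the commit.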
