Commit bdfd975

Authored by ds-filipknefel, Filip Knefel, ryannikolaidis, and ron-unstructured
chore: change table extraction defaults (#2588)
Change default values for table extraction. Works in pair with [this](Unstructured-IO/unstructured-api#370) `unstructured-api` PR.

We want to move away from the `pdf_infer_table_structure` parameter. In this PR:

- We change how it is treated with respect to the `skip_infer_table_types` parameter. Whether to extract tables from PDFs now follows from the rule: `pdf_infer_table_structure && "pdf" not in skip_infer_table_types`
- We set `pdf_infer_table_structure=True` and `skip_infer_table_types=[]` by default
- We remove it from the examples in the documentation
- We describe it as deprecated in favor of `skip_infer_table_types` in the documentation

More detailed description of how we want the parameters to interact:

- if `pdf_infer_table_structure` is False, tables will never be extracted from PDFs
- if `pdf_infer_table_structure` is True, tables will be extracted from PDFs unless PDF is skipped via `skip_infer_table_types`
- by default, `pdf_infer_table_structure=True` and `skip_infer_table_types=[]`

---------

Co-authored-by: Filip Knefel <[email protected]>
Co-authored-by: ryannikolaidis <[email protected]>
Co-authored-by: ds-filipknefel <[email protected]>
Co-authored-by: Ronny H <[email protected]>
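The interaction rule described in this commit message can be sketched as a small predicate. This is a minimal illustration and not the library's code: the helper name `should_extract_pdf_tables` is hypothetical, and only the two parameter names and their new defaults are taken from the commit message.

```python
from typing import Iterable, Optional


def should_extract_pdf_tables(
    pdf_infer_table_structure: bool = True,
    skip_infer_table_types: Optional[Iterable[str]] = None,
) -> bool:
    """Hypothetical helper mirroring the rule stated in the commit message.

    Rule: pdf_infer_table_structure && "pdf" not in skip_infer_table_types
    """
    # None stands in for the new default, skip_infer_table_types=[]
    skipped = list(skip_infer_table_types) if skip_infer_table_types is not None else []
    return bool(pdf_infer_table_structure) and "pdf" not in skipped


if __name__ == "__main__":
    # Walk through the combinations listed in the commit message.
    cases = [
        (True, []),        # new defaults
        (True, ["pdf"]),   # PDF explicitly skipped
        (False, []),       # deprecated flag switched off
        (False, ["pdf"]),  # both opt-outs at once
    ]
    for flag, skipped in cases:
        print(flag, skipped, should_extract_pdf_tables(flag, skipped))
```

Under this reading, the deprecated flag acts only as a hard "off" switch for PDFs, while `skip_infer_table_types` is the general opt-out for any file type.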
1 parent 4ff6a5b commit bdfd975

16 files changed: +55 -41 lines changed

CHANGELOG.md

Lines changed: 4 additions & 2 deletions
@@ -1,6 +1,6 @@
-## 0.12.7-dev9
+## 0.13.0-dev10
 
-### Enhancements
+### Enhancements
 
 * **Add `.metadata.is_continuation` to text-split chunks.** `.metadata.is_continuation=True` is added to second-and-later chunks formed by text-splitting an oversized `Table` element but not to their counterpart `Text` element splits. Add this indicator for `CompositeElement` to allow text-split continuation chunks to be identified for downstream processes that may wish to skip intentionally redundant metadata values in continuation chunks.
 * **Add `compound_structure_acc` metric to table eval.** Add a new property to `unstructured.metrics.table_eval.TableEvaluation`: `composite_structure_acc`, which is computed from the element level row and column index and content accuracy scores
@@ -13,6 +13,7 @@
 ### Fixes
 
 * **Clarify IAM Role Requirement for GCS Platform Connectors**. The GCS Source Connector requires Storage Object Viewer and GCS Destination Connector requires Storage Object Creator IAM roles.
+* **Change table extraction defaults** Change table extraction defaults in favor of using `skip_infer_table_types` parameter and reflect these changes in documentation.
 * **Fix OneDrive dates with inconsistent formatting** Adds logic to conditionally support dates returned by office365 that may vary in date formatting or may be a datetime rather than a string. See previous fix for SharePoint
 * **Adds tracking for AstraDB** Adds tracking info so AstraDB can see what source called their api.
 * **Support AWS Bedrock Embeddings in ingest CLI** The configs required to instantiate the bedrock embedding class are now exposed in the api and the version of boto being used meets the minimum requirement to introduce the bedrock runtime required to hit the service.
@@ -66,6 +67,7 @@
 * **Rename `OpenAiEmbeddingConfig` to `OpenAIEmbeddingConfig`.**
 * **Fix partition_json() doesn't chunk.** The `@add_chunking_strategy` decorator was missing from `partition_json()` such that pre-partitioned documents serialized to JSON did not chunk when a chunking-strategy was specified.
 
+
 ## 0.12.4
 
 ### Enhancements

docs/source/apis/api_parameters.rst

Lines changed: 1 addition & 1 deletion
@@ -60,7 +60,7 @@ languages
 pdf_infer_table_structure
 -------------------------
 - **Type**: boolean
-- **Description**: If True and strategy=hi_res, any Table Elements extracted from a PDF will include an additional metadata field, 'text_as_html'.
+- **Description**: Deprecated! Use skip_infer_table_types to opt out of table extraction for any file type. If False and strategy=hi_res, no Table Elements will be extracted from pdf files regardless of skip_infer_table_types contents.
 
 skip_infer_table_types
 ----------------------

docs/source/apis/usage_methods.rst

Lines changed: 1 addition & 1 deletion
@@ -24,7 +24,7 @@ Method 1: Partition via API (``partition_via_api``)
 
 filename = "example-docs/DA-1p.pdf"
 elements = partition_via_api(
-filename=filename, api_key="MY_API_KEY", strategy="auto", pdf_infer_table_structure="true"
+filename=filename, api_key="MY_API_KEY", strategy="auto"
 )
 
 - **Self-Hosting or Local API**::

docs/source/best_practices/table_extraction_pdf.rst

Lines changed: 1 addition & 7 deletions
@@ -33,7 +33,7 @@ To extract the tables from PDF files using the `partition_pdf <https://unstructu
 Method 2: Using Auto Partition or Unstructured API
 --------------------------------------------------
 
-By default, table extraction from ``pdf``, ``jpg``, ``png``, ``xls``, and ``xlsx`` file types is disabled. To enable table extraction from PDFs and other file types using `Auto Partition <https://unstructured-io.github.io/unstructured/core/partition.html#partition>`__ or `Unstructured API parameters <https://unstructured-io.github.io/unstructured/apis/api_parameters.html>`__ , you can set the ``skip_infer_table_types`` parameter to ``'[]'`` and ``strategy`` parameter to ``hi_res``.
+By default, table extraction from all file types is enabled. To extract tables from PDFs and images using `Auto Partition <https://unstructured-io.github.io/unstructured/core/partition.html#partition>`__ or `Unstructured API parameters <https://unstructured-io.github.io/unstructured/apis/api_parameters.html>`__ simply set ``strategy`` parameter to ``hi_res``.
 
 
 **Usage: Auto Partition**
@@ -46,7 +46,6 @@ By default, table extraction from ``pdf``, ``jpg``, ``png``, ``xls``, and ``xlsx
 
 elements = partition(filename=filename,
 strategy='hi_res',
-skip_infer_table_types='[]', # don't forget to include apostrophe around the square bracket
 )
 
 tables = [el for el in elements if el.category == "Table"]
@@ -65,9 +64,4 @@ By default, table extraction from ``pdf``, ``jpg``, ``png``, ``xls``, and ``xlsx
 -H 'Content-Type: multipart/form-data' \
 -F 'files=@sample-docs/layout-parser-paper-with-table.jpg' \
 -F 'strategy=hi_res' \
--F 'skip_infer_table_types=[]' \
 | jq -C . | less -R
-
-.. warning::
-
-You may get a warning when the ``pdf_infer_table_structure`` parameter is set to **True** AND **pdf** is included in the list of ``skip_infer_table_types`` parameter. However, this function will still extract the tables from PDF despite the conflict.

docs/source/core/partition.rst

Lines changed: 1 addition & 1 deletion
@@ -872,7 +872,7 @@ settings supported by the API.
 filename = "example-docs/DA-1p.pdf"
 
 elements = partition_via_api(
-filename=filename, api_key=api_key, strategy="auto", pdf_infer_table_structure="true"
+filename=filename, api_key=api_key, strategy="auto"
 )
 
 If you are using the `Unstructured SaaS API <https://unstructured-io.github.io/unstructured/apis/saas_api.html>`__, you can use the ``api_url`` kwarg to point the ``partition_via_api`` function at your Unstructured SaaS API URL.

docs/source/examples/databricks.rst

Lines changed: 0 additions & 1 deletion
@@ -47,7 +47,6 @@ Extracting PDF Using Unstructured Python SDK
 ),
 # Other partition params
 strategy="hi_res",
-pdf_infer_table_structure=True,
 chunking_strategy="by_title",
 )
 

docs/source/examples/dict_to_elements.rst

Lines changed: 0 additions & 1 deletion
@@ -74,7 +74,6 @@ Configure and run the S3Runner for processing the data.
 api_key=UNSTRUCTURED_API_KEY,
 strategy="hi_res",
 hi_res_model_name="yolox",
-pdf_infer_table_structure=True,
 ),
 fsspec_config=FsspecConfig(
 remote_url=S3_URL,

docs/source/ingest/configs/partition_config.rst

Lines changed: 1 addition & 1 deletion
@@ -9,7 +9,7 @@ responsible for coordinating data after processing, including the dynamic metada
 Configs for Partitioning
 -------------------------
 
-* ``pdf_infer_table_structure``: If True and strategy=hi_res, any Table Elements extracted from a PDF will include an additional metadata field, "text_as_html," where the value (string) is a just a transformation of the data into an HTML <table>. The "text" field for a partitioned Table Element is always present, whether True or False.
+* ``pdf_infer_table_structure``: Deprecated! Use skip_infer_table_types to opt out of table extraction for any file type. If False and strategy=hi_res, no Table Elements will be extracted from pdf files regardless of skip_infer_table_types contents.
 * ``skip_infer_table_types``: List of document types that you want to skip table extraction with.
 * ``strategy (default auto)``: The strategy to use for partitioning PDF/image. Uses a layout detection model if set to 'hi_res', otherwise partition simply extracts the text from the document and processes it.
 * ``ocr_languages``: The languages present in the document, for use in partitioning and/or OCR. For partitioning image or pdf documents with Tesseract, you'll first need to install the appropriate Tesseract language pack if running via local unstructured library. For other partitions, language is detected using naive Bayesian filter via `langdetect`. Multiple languages indicates text could be in either language.

test_unstructured/partition/test_auto.py

Lines changed: 1 addition & 1 deletion
@@ -356,7 +356,7 @@ def test_auto_partition_pdf_with_fast_strategy(monkeypatch):
 languages=None,
 metadata_filename=None,
 include_page_breaks=False,
-infer_table_structure=False,
+infer_table_structure=True,
 extract_images_in_pdf=False,
 extract_image_block_types=None,
 extract_image_block_output_dir=None,

test_unstructured_ingest/expected-structured-output/Sharepoint-with-permissions/Shared Documents/stanley-cups.xlsx.json

Lines changed: 8 additions & 4 deletions
@@ -18,7 +18,8 @@
 "eng"
 ],
 "page_name": "Stanley Cups",
-"page_number": 1
+"page_number": 1,
+"text_as_html": "<table border=\"1\" class=\"dataframe\">\n <tbody>\n <tr>\n <td>Team</td>\n <td>Location</td>\n <td>Stanley Cups</td>\n </tr>\n <tr>\n <td>Blues</td>\n <td>STL</td>\n <td>1</td>\n </tr>\n <tr>\n <td>Flyers</td>\n <td>PHI</td>\n <td>2</td>\n </tr>\n <tr>\n <td>Maple Leafs</td>\n <td>TOR</td>\n <td>13</td>\n </tr>\n </tbody>\n</table>"
 },
 "text": "Stanley Cups",
 "type": "Title"
@@ -42,7 +43,8 @@
 "eng"
 ],
 "page_name": "Stanley Cups",
-"page_number": 1
+"page_number": 1,
+"text_as_html": "<table border=\"1\" class=\"dataframe\">\n <tbody>\n <tr>\n <td>Team</td>\n <td>Location</td>\n <td>Stanley Cups</td>\n </tr>\n <tr>\n <td>Blues</td>\n <td>STL</td>\n <td>1</td>\n </tr>\n <tr>\n <td>Flyers</td>\n <td>PHI</td>\n <td>2</td>\n </tr>\n <tr>\n <td>Maple Leafs</td>\n <td>TOR</td>\n <td>13</td>\n </tr>\n </tbody>\n</table>"
 },
 "text": "\n\n\nTeam\nLocation\nStanley Cups\n\n\nBlues\nSTL\n1\n\n\nFlyers\nPHI\n2\n\n\nMaple Leafs\nTOR\n13\n\n\n",
 "type": "Table"
@@ -66,7 +68,8 @@
 "eng"
 ],
 "page_name": "Stanley Cups Since 67",
-"page_number": 2
+"page_number": 2,
+"text_as_html": "<table border=\"1\" class=\"dataframe\">\n <tbody>\n <tr>\n <td>Team</td>\n <td>Location</td>\n <td>Stanley Cups</td>\n </tr>\n <tr>\n <td>Blues</td>\n <td>STL</td>\n <td>1</td>\n </tr>\n <tr>\n <td>Flyers</td>\n <td>PHI</td>\n <td>2</td>\n </tr>\n <tr>\n <td>Maple Leafs</td>\n <td>TOR</td>\n <td>0</td>\n </tr>\n </tbody>\n</table>"
 },
 "text": "Stanley Cups Since 67",
 "type": "Title"
@@ -90,7 +93,8 @@
 "eng"
 ],
 "page_name": "Stanley Cups Since 67",
-"page_number": 2
+"page_number": 2,
+"text_as_html": "<table border=\"1\" class=\"dataframe\">\n <tbody>\n <tr>\n <td>Team</td>\n <td>Location</td>\n <td>Stanley Cups</td>\n </tr>\n <tr>\n <td>Blues</td>\n <td>STL</td>\n <td>1</td>\n </tr>\n <tr>\n <td>Flyers</td>\n <td>PHI</td>\n <td>2</td>\n </tr>\n <tr>\n <td>Maple Leafs</td>\n <td>TOR</td>\n <td>0</td>\n </tr>\n </tbody>\n</table>"
 },
 "text": "\n\n\nTeam\nLocation\nStanley Cups\n\n\nBlues\nSTL\n1\n\n\nFlyers\nPHI\n2\n\n\nMaple Leafs\nTOR\n0\n\n\n",
 "type": "Table"

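For downstream consumers, the `text_as_html` values that now appear in the expected output can be turned back into rows using only the Python standard library. The following is a sketch under assumptions: `TableRowParser` and `html_table_to_rows` are hypothetical names and not part of `unstructured`, and the sample string is a shortened version of the fixture value in the diff above.

```python
from html.parser import HTMLParser
from typing import List


class TableRowParser(HTMLParser):
    """Collects the text of each <td>/<th> cell, grouped by <tr> row."""

    def __init__(self) -> None:
        super().__init__()
        self.rows: List[List[str]] = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th"):
            if not self.rows:
                # Tolerate a cell that appears without an enclosing <tr>.
                self.rows.append([])
            self._in_cell = True
            self.rows[-1].append("")

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        # Whitespace between tags is ignored because it falls outside a cell.
        if self._in_cell:
            self.rows[-1][-1] += data


def html_table_to_rows(text_as_html: str) -> List[List[str]]:
    """Hypothetical helper: parse a `text_as_html` string into rows of cell text."""
    parser = TableRowParser()
    parser.feed(text_as_html)
    parser.close()
    return [[cell.strip() for cell in row] for row in parser.rows]


if __name__ == "__main__":
    sample = (
        '<table border="1" class="dataframe">\n <tbody>\n'
        " <tr>\n <td>Team</td>\n <td>Location</td>\n <td>Stanley Cups</td>\n </tr>\n"
        " <tr>\n <td>Blues</td>\n <td>STL</td>\n <td>1</td>\n </tr>\n"
        " </tbody>\n</table>"
    )
    for row in html_table_to_rows(sample):
        print(row)
```

All cell values come back as strings, so numeric columns such as the cup counts would still need an explicit conversion.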