chore: change table extraction defaults (#2588)

ds-filipknefel · Filip Knefel · ryannikolaidis · web-flow · commit bdfd975115c2 · 2024-03-22T10:08:49.000Z
Change default values for table extraction. Works in pair with [this](Unstructured-IO/unstructured-api#370) `unstructured-api` PR.

We want to move away from the `pdf_infer_table_structure` parameter. In this PR:
- We change how it's treated with respect to the `skip_infer_table_types` parameter. Whether to extract tables from PDFs now follows from the rule: `pdf_infer_table_structure && "pdf" not in skip_infer_table_types`
- We set `pdf_infer_table_structure=True` and `skip_infer_table_types=[]` by default
- We remove it from the examples in documentation
- We describe it as deprecated in favor of `skip_infer_table_types` in documentation

More detailed description of how we want the parameters to interact:
- if `pdf_infer_table_structure` is False, tables will never be extracted from PDFs
- if `pdf_infer_table_structure` is True, tables will be extracted from PDFs unless skipped via `skip_infer_table_types`
- by default, `pdf_infer_table_structure=True` and `skip_infer_table_types=[]`

---------

Co-authored-by: Filip Knefel <filip@unstructured.io>
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: ds-filipknefel <ds-filipknefel@users.noreply.github.com>
Co-authored-by: Ronny H <138828701+ron-unstructured@users.noreply.github.com>
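
The interaction rule stated in the commit message can be sketched as a small predicate. This is only an illustration of the rule `pdf_infer_table_structure && "pdf" not in skip_infer_table_types`. The function name `should_extract_pdf_tables` is hypothetical and is not taken from the `unstructured` codebase; the default values mirror the new defaults described above.

```python
from typing import Sequence


def should_extract_pdf_tables(
    pdf_infer_table_structure: bool = True,
    skip_infer_table_types: Sequence[str] = (),
) -> bool:
    """Hypothetical helper illustrating the rule from the commit message.

    Tables are extracted from PDFs only when the deprecated flag is True
    and "pdf" is not listed among the file types to skip.
    """
    return pdf_infer_table_structure and "pdf" not in skip_infer_table_types


if __name__ == "__main__":
    # New defaults: the flag is True and nothing is skipped.
    print(should_extract_pdf_tables())
    # The deprecated flag set to False disables extraction regardless of the skip list.
    print(should_extract_pdf_tables(pdf_infer_table_structure=False))
    # Opting out through the preferred parameter.
    print(should_extract_pdf_tables(skip_infer_table_types=["pdf"]))
```

Under this sketch, the only way to opt out going forward is to add `"pdf"` to `skip_infer_table_types`, which matches the stated intent of deprecating `pdf_infer_table_structure`.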
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -1,6 +1,6 @@
-## 0.12.7-dev9
+## 0.13.0-dev10
 
-### Enhancements 
+### Enhancements
 
 * **Add `.metadata.is_continuation` to text-split chunks.** `.metadata.is_continuation=True` is added to second-and-later chunks formed by text-splitting an oversized `Table` element but not to their counterpart `Text` element splits. Add this indicator for `CompositeElement` to allow text-split continuation chunks to be identified for downstream processes that may wish to skip intentionally redundant metadata values in continuation chunks.
 * **Add `compound_structure_acc` metric to table eval.** Add a new property to `unstructured.metrics.table_eval.TableEvaluation`: `composite_structure_acc`, which is computed from the element level row and column index and content accuracy scores
@@ -13,6 +13,7 @@
 ### Fixes
 
 * **Clarify IAM Role Requirement for GCS Platform Connectors**. The GCS Source Connector requires Storage Object Viewer and GCS Destination Connector requires Storage Object Creator IAM roles.
+* **Change table extraction defaults** Change table extraction defaults in favor of using `skip_infer_table_types` parameter and reflect these changes in documentation.
 * **Fix OneDrive dates with inconsistent formatting** Adds logic to conditionally support dates returned by office365 that may vary in date formatting or may be a datetime rather than a string. See previous fix for SharePoint
 * **Adds tracking for AstraDB** Adds tracking info so AstraDB can see what source called their api.
 * **Support AWS Bedrock Embeddings in ingest CLI** The configs required to instantiate the bedrock embedding class are now exposed in the api and the version of boto being used meets the minimum requirement to introduce the bedrock runtime required to hit the service.
@@ -66,6 +67,7 @@
 * **Rename `OpenAiEmbeddingConfig` to `OpenAIEmbeddingConfig`.**
 * **Fix partition_json() doesn't chunk.** The `@add_chunking_strategy` decorator was missing from `partition_json()` such that pre-partitioned documents serialized to JSON did not chunk when a chunking-strategy was specified.
 
+
 ## 0.12.4
 
 ### Enhancements
diff --git a/docs/source/apis/api_parameters.rst b/docs/source/apis/api_parameters.rst
@@ -60,7 +60,7 @@ languages
 pdf_infer_table_structure
 -------------------------
 - **Type**: boolean
-- **Description**: If True and strategy=hi_res, any Table Elements extracted from a PDF will include an additional metadata field, 'text_as_html'.
+- **Description**: Deprecated! Use skip_infer_table_types to opt out of table extraction for any file type. If False and strategy=hi_res, no Table Elements will be extracted from pdf files regardless of skip_infer_table_types contents.
 
 skip_infer_table_types
 ----------------------
diff --git a/docs/source/apis/usage_methods.rst b/docs/source/apis/usage_methods.rst
@@ -24,7 +24,7 @@ Method 1: Partition via API (``partition_via_api``)
 
       filename = "example-docs/DA-1p.pdf"
       elements = partition_via_api(
-        filename=filename, api_key="MY_API_KEY", strategy="auto", pdf_infer_table_structure="true"
+        filename=filename, api_key="MY_API_KEY", strategy="auto"
       )
 
   - **Self-Hosting or Local API**::
diff --git a/docs/source/best_practices/table_extraction_pdf.rst b/docs/source/best_practices/table_extraction_pdf.rst
@@ -33,7 +33,7 @@ To extract the tables from PDF files using the `partition_pdf <https://unstructu
 Method 2: Using Auto Partition or Unstructured API
 --------------------------------------------------
 
-By default, table extraction from ``pdf``, ``jpg``, ``png``, ``xls``, and ``xlsx`` file types is disabled. To enable table extraction from PDFs and other file types using `Auto Partition <https://unstructured-io.github.io/unstructured/core/partition.html#partition>`__ or `Unstructured API parameters <https://unstructured-io.github.io/unstructured/apis/api_parameters.html>`__ , you can set the ``skip_infer_table_types`` parameter to ``'[]'`` and ``strategy`` parameter to ``hi_res``.
+By default, table extraction from all file types is enabled. To extract tables from PDFs and images using `Auto Partition <https://unstructured-io.github.io/unstructured/core/partition.html#partition>`__ or `Unstructured API parameters <https://unstructured-io.github.io/unstructured/apis/api_parameters.html>`__ simply set ``strategy`` parameter to ``hi_res``.
 
 
 **Usage: Auto Partition**
@@ -46,7 +46,6 @@ By default, table extraction from ``pdf``, ``jpg``, ``png``, ``xls``, and ``xlsx
 
     elements = partition(filename=filename,
                          strategy='hi_res',
-                         skip_infer_table_types='[]', # don't forget to include apostrophe around the square bracket
                )
 
     tables = [el for el in elements if el.category == "Table"]
@@ -65,9 +64,4 @@ By default, table extraction from ``pdf``, ``jpg``, ``png``, ``xls``, and ``xlsx
           -H 'Content-Type: multipart/form-data' \
           -F 'files=@sample-docs/layout-parser-paper-with-table.jpg' \
           -F 'strategy=hi_res' \
-          -F 'skip_infer_table_types=[]' \
           | jq -C . | less -R
-
-.. warning::
-
-    You may get a warning when the ``pdf_infer_table_structure`` parameter is set to **True** AND **pdf** is included in the list of ``skip_infer_table_types`` parameter. However, this function will still extract the tables from PDF despite the conflict.
diff --git a/docs/source/core/partition.rst b/docs/source/core/partition.rst
@@ -872,7 +872,7 @@ settings supported by the API.
   filename = "example-docs/DA-1p.pdf"
 
   elements = partition_via_api(
-    filename=filename, api_key=api_key, strategy="auto", pdf_infer_table_structure="true"
+    filename=filename, api_key=api_key, strategy="auto"
   )
 
 If you are using the `Unstructured SaaS API <https://unstructured-io.github.io/unstructured/apis/saas_api.html>`__, you can use the ``api_url`` kwarg to point the ``partition_via_api`` function at your Unstructured SaaS API URL.
diff --git a/docs/source/examples/databricks.rst b/docs/source/examples/databricks.rst
@@ -47,7 +47,6 @@ Extracting PDF Using Unstructured Python SDK
        ),
        # Other partition params
        strategy="hi_res",
-       pdf_infer_table_structure=True,
        chunking_strategy="by_title",
    )
 
diff --git a/docs/source/examples/dict_to_elements.rst b/docs/source/examples/dict_to_elements.rst
@@ -74,7 +74,6 @@ Configure and run the S3Runner for processing the data.
             api_key=UNSTRUCTURED_API_KEY,
             strategy="hi_res",
             hi_res_model_name="yolox",
-            pdf_infer_table_structure=True,
         ),
         fsspec_config=FsspecConfig(
             remote_url=S3_URL,
diff --git a/docs/source/ingest/configs/partition_config.rst b/docs/source/ingest/configs/partition_config.rst
@@ -9,7 +9,7 @@ responsible for coordinating data after processing, including the dynamic metada
 Configs for Partitioning
 -------------------------
 
-* ``pdf_infer_table_structure``: If True and strategy=hi_res, any Table Elements extracted from a PDF will include an additional metadata field, "text_as_html," where the value (string) is a just a transformation of the data into an HTML <table>. The "text" field for a partitioned Table Element is always present, whether True or False.
+* ``pdf_infer_table_structure``: Deprecated! Use skip_infer_table_types to opt out of table extraction for any file type. If False and strategy=hi_res, no Table Elements will be extracted from pdf files regardless of skip_infer_table_types contents.
 * ``skip_infer_table_types``: List of document types that you want to skip table extraction with.
 * ``strategy (default auto)``: The strategy to use for partitioning PDF/image. Uses a layout detection model if set to 'hi_res', otherwise partition simply extracts the text from the document and processes it.
 * ``ocr_languages``: The languages present in the document, for use in partitioning and/or OCR. For partitioning image or pdf documents with Tesseract, you'll first need to install the appropriate Tesseract language pack if running via local unstructured library. For other partitions, language is detected using naive Bayesian filter via `langdetect`. Multiple languages indicates text could be in either language.
diff --git a/test_unstructured/partition/test_auto.py b/test_unstructured/partition/test_auto.py
@@ -356,7 +356,7 @@ def test_auto_partition_pdf_with_fast_strategy(monkeypatch):
         languages=None,
         metadata_filename=None,
         include_page_breaks=False,
-        infer_table_structure=False,
+        infer_table_structure=True,
         extract_images_in_pdf=False,
         extract_image_block_types=None,
         extract_image_block_output_dir=None,
diff --git a/test_unstructured_ingest/expected-structured-output/Sharepoint-with-permissions/Shared Documents/stanley-cups.xlsx.json b/test_unstructured_ingest/expected-structured-output/Sharepoint-with-permissions/Shared Documents/stanley-cups.xlsx.json
@@ -18,7 +18,8 @@
         "eng"
       ],
       "page_name": "Stanley Cups",
-      "page_number": 1
+      "page_number": 1,
+      "text_as_html": "<table border=\"1\" class=\"dataframe\">\n  <tbody>\n    <tr>\n      <td>Team</td>\n      <td>Location</td>\n      <td>Stanley Cups</td>\n    </tr>\n    <tr>\n      <td>Blues</td>\n      <td>STL</td>\n      <td>1</td>\n    </tr>\n    <tr>\n      <td>Flyers</td>\n      <td>PHI</td>\n      <td>2</td>\n    </tr>\n    <tr>\n      <td>Maple Leafs</td>\n      <td>TOR</td>\n      <td>13</td>\n    </tr>\n  </tbody>\n</table>"
     },
     "text": "Stanley Cups",
     "type": "Title"
@@ -42,7 +43,8 @@
         "eng"
       ],
       "page_name": "Stanley Cups",
-      "page_number": 1
+      "page_number": 1,
+      "text_as_html": "<table border=\"1\" class=\"dataframe\">\n  <tbody>\n    <tr>\n      <td>Team</td>\n      <td>Location</td>\n      <td>Stanley Cups</td>\n    </tr>\n    <tr>\n      <td>Blues</td>\n      <td>STL</td>\n      <td>1</td>\n    </tr>\n    <tr>\n      <td>Flyers</td>\n      <td>PHI</td>\n      <td>2</td>\n    </tr>\n    <tr>\n      <td>Maple Leafs</td>\n      <td>TOR</td>\n      <td>13</td>\n    </tr>\n  </tbody>\n</table>"
     },
     "text": "\n\n\nTeam\nLocation\nStanley Cups\n\n\nBlues\nSTL\n1\n\n\nFlyers\nPHI\n2\n\n\nMaple Leafs\nTOR\n13\n\n\n",
     "type": "Table"
@@ -66,7 +68,8 @@
         "eng"
       ],
       "page_name": "Stanley Cups Since 67",
-      "page_number": 2
+      "page_number": 2,
+      "text_as_html": "<table border=\"1\" class=\"dataframe\">\n  <tbody>\n    <tr>\n      <td>Team</td>\n      <td>Location</td>\n      <td>Stanley Cups</td>\n    </tr>\n    <tr>\n      <td>Blues</td>\n      <td>STL</td>\n      <td>1</td>\n    </tr>\n    <tr>\n      <td>Flyers</td>\n      <td>PHI</td>\n      <td>2</td>\n    </tr>\n    <tr>\n      <td>Maple Leafs</td>\n      <td>TOR</td>\n      <td>0</td>\n    </tr>\n  </tbody>\n</table>"
     },
     "text": "Stanley Cups Since 67",
     "type": "Title"
@@ -90,7 +93,8 @@
         "eng"
       ],
       "page_name": "Stanley Cups Since 67",
-      "page_number": 2
+      "page_number": 2,
+      "text_as_html": "<table border=\"1\" class=\"dataframe\">\n  <tbody>\n    <tr>\n      <td>Team</td>\n      <td>Location</td>\n      <td>Stanley Cups</td>\n    </tr>\n    <tr>\n      <td>Blues</td>\n      <td>STL</td>\n      <td>1</td>\n    </tr>\n    <tr>\n      <td>Flyers</td>\n      <td>PHI</td>\n      <td>2</td>\n    </tr>\n    <tr>\n      <td>Maple Leafs</td>\n      <td>TOR</td>\n      <td>0</td>\n    </tr>\n  </tbody>\n</table>"
     },
     "text": "\n\n\nTeam\nLocation\nStanley Cups\n\n\nBlues\nSTL\n1\n\n\nFlyers\nPHI\n2\n\n\nMaple Leafs\nTOR\n0\n\n\n",
     "type": "Table"
diff --git a/test_unstructured_ingest/expected-structured-output/Sharepoint/Shared Documents/stanley-cups.xlsx.json b/test_unstructured_ingest/expected-structured-output/Sharepoint/Shared Documents/stanley-cups.xlsx.json
@@ -18,7 +18,8 @@
         "eng"
       ],
       "page_name": "Stanley Cups",
-      "page_number": 1
+      "page_number": 1,
+      "text_as_html": "<table border=\"1\" class=\"dataframe\">\n  <tbody>\n    <tr>\n      <td>Team</td>\n      <td>Location</td>\n      <td>Stanley Cups</td>\n    </tr>\n    <tr>\n      <td>Blues</td>\n      <td>STL</td>\n      <td>1</td>\n    </tr>\n    <tr>\n      <td>Flyers</td>\n      <td>PHI</td>\n      <td>2</td>\n    </tr>\n    <tr>\n      <td>Maple Leafs</td>\n      <td>TOR</td>\n      <td>13</td>\n    </tr>\n  </tbody>\n</table>"
     },
     "text": "Stanley Cups",
     "type": "Title"
@@ -42,7 +43,8 @@
         "eng"
       ],
       "page_name": "Stanley Cups",
-      "page_number": 1
+      "page_number": 1,
+      "text_as_html": "<table border=\"1\" class=\"dataframe\">\n  <tbody>\n    <tr>\n      <td>Team</td>\n      <td>Location</td>\n      <td>Stanley Cups</td>\n    </tr>\n    <tr>\n      <td>Blues</td>\n      <td>STL</td>\n      <td>1</td>\n    </tr>\n    <tr>\n      <td>Flyers</td>\n      <td>PHI</td>\n      <td>2</td>\n    </tr>\n    <tr>\n      <td>Maple Leafs</td>\n      <td>TOR</td>\n      <td>13</td>\n    </tr>\n  </tbody>\n</table>"
     },
     "text": "\n\n\nTeam\nLocation\nStanley Cups\n\n\nBlues\nSTL\n1\n\n\nFlyers\nPHI\n2\n\n\nMaple Leafs\nTOR\n13\n\n\n",
     "type": "Table"
@@ -66,7 +68,8 @@
         "eng"
       ],
       "page_name": "Stanley Cups Since 67",
-      "page_number": 2
+      "page_number": 2,
+      "text_as_html": "<table border=\"1\" class=\"dataframe\">\n  <tbody>\n    <tr>\n      <td>Team</td>\n      <td>Location</td>\n      <td>Stanley Cups</td>\n    </tr>\n    <tr>\n      <td>Blues</td>\n      <td>STL</td>\n      <td>1</td>\n    </tr>\n    <tr>\n      <td>Flyers</td>\n      <td>PHI</td>\n      <td>2</td>\n    </tr>\n    <tr>\n      <td>Maple Leafs</td>\n      <td>TOR</td>\n      <td>0</td>\n    </tr>\n  </tbody>\n</table>"
     },
     "text": "Stanley Cups Since 67",
     "type": "Title"
@@ -90,7 +93,8 @@
         "eng"
       ],
       "page_name": "Stanley Cups Since 67",
-      "page_number": 2
+      "page_number": 2,
+      "text_as_html": "<table border=\"1\" class=\"dataframe\">\n  <tbody>\n    <tr>\n      <td>Team</td>\n      <td>Location</td>\n      <td>Stanley Cups</td>\n    </tr>\n    <tr>\n      <td>Blues</td>\n      <td>STL</td>\n      <td>1</td>\n    </tr>\n    <tr>\n      <td>Flyers</td>\n      <td>PHI</td>\n      <td>2</td>\n    </tr>\n    <tr>\n      <td>Maple Leafs</td>\n      <td>TOR</td>\n      <td>0</td>\n    </tr>\n  </tbody>\n</table>"
     },
     "text": "\n\n\nTeam\nLocation\nStanley Cups\n\n\nBlues\nSTL\n1\n\n\nFlyers\nPHI\n2\n\n\nMaple Leafs\nTOR\n0\n\n\n",
     "type": "Table"
diff --git a/test_unstructured_ingest/expected-structured-output/gcs/nested-2/stanley-cups.xlsx.json b/test_unstructured_ingest/expected-structured-output/gcs/nested-2/stanley-cups.xlsx.json
@@ -17,7 +17,8 @@
         "eng"
       ],
       "page_name": "Stanley Cups",
-      "page_number": 1
+      "page_number": 1,
+      "text_as_html": "<table border=\"1\" class=\"dataframe\">\n  <tbody>\n    <tr>\n      <td>Team</td>\n      <td>Location</td>\n      <td>Stanley Cups</td>\n    </tr>\n    <tr>\n      <td>Blues</td>\n      <td>STL</td>\n      <td>1</td>\n    </tr>\n    <tr>\n      <td>Flyers</td>\n      <td>PHI</td>\n      <td>2</td>\n    </tr>\n    <tr>\n      <td>Maple Leafs</td>\n      <td>TOR</td>\n      <td>13</td>\n    </tr>\n  </tbody>\n</table>"
     },
     "text": "Stanley Cups",
     "type": "Title"
@@ -40,7 +41,8 @@
         "eng"
       ],
       "page_name": "Stanley Cups",
-      "page_number": 1
+      "page_number": 1,
+      "text_as_html": "<table border=\"1\" class=\"dataframe\">\n  <tbody>\n    <tr>\n      <td>Team</td>\n      <td>Location</td>\n      <td>Stanley Cups</td>\n    </tr>\n    <tr>\n      <td>Blues</td>\n      <td>STL</td>\n      <td>1</td>\n    </tr>\n    <tr>\n      <td>Flyers</td>\n      <td>PHI</td>\n      <td>2</td>\n    </tr>\n    <tr>\n      <td>Maple Leafs</td>\n      <td>TOR</td>\n      <td>13</td>\n    </tr>\n  </tbody>\n</table>"
     },
     "text": "\n\n\nTeam\nLocation\nStanley Cups\n\n\nBlues\nSTL\n1\n\n\nFlyers\nPHI\n2\n\n\nMaple Leafs\nTOR\n13\n\n\n",
     "type": "Table"
@@ -63,7 +65,8 @@
         "eng"
       ],
       "page_name": "Stanley Cups Since 67",
-      "page_number": 2
+      "page_number": 2,
+      "text_as_html": "<table border=\"1\" class=\"dataframe\">\n  <tbody>\n    <tr>\n      <td>Team</td>\n      <td>Location</td>\n      <td>Stanley Cups</td>\n    </tr>\n    <tr>\n      <td>Blues</td>\n      <td>STL</td>\n      <td>1</td>\n    </tr>\n    <tr>\n      <td>Flyers</td>\n      <td>PHI</td>\n      <td>2</td>\n    </tr>\n    <tr>\n      <td>Maple Leafs</td>\n      <td>TOR</td>\n      <td>0</td>\n    </tr>\n  </tbody>\n</table>"
     },
     "text": "Stanley Cups Since 67",
     "type": "Title"
@@ -86,7 +89,8 @@
         "eng"
       ],
       "page_name": "Stanley Cups Since 67",
-      "page_number": 2
+      "page_number": 2,
+      "text_as_html": "<table border=\"1\" class=\"dataframe\">\n  <tbody>\n    <tr>\n      <td>Team</td>\n      <td>Location</td>\n      <td>Stanley Cups</td>\n    </tr>\n    <tr>\n      <td>Blues</td>\n      <td>STL</td>\n      <td>1</td>\n    </tr>\n    <tr>\n      <td>Flyers</td>\n      <td>PHI</td>\n      <td>2</td>\n    </tr>\n    <tr>\n      <td>Maple Leafs</td>\n      <td>TOR</td>\n      <td>0</td>\n    </tr>\n  </tbody>\n</table>"
     },
     "text": "\n\n\nTeam\nLocation\nStanley Cups\n\n\nBlues\nSTL\n1\n\n\nFlyers\nPHI\n2\n\n\nMaple Leafs\nTOR\n0\n\n\n",
     "type": "Table"
diff --git a/test_unstructured_ingest/expected-structured-output/onedrive/utic-test-ingest-fixtures/tests-example.xls.json b/test_unstructured_ingest/expected-structured-output/onedrive/utic-test-ingest-fixtures/tests-example.xls.json
diff --git a/unstructured/__version__.py b/unstructured/__version__.py
diff --git a/unstructured/ingest/interfaces.py b/unstructured/ingest/interfaces.py
diff --git a/unstructured/partition/auto.py b/unstructured/partition/auto.py
