Commit 1b160fe

Deprecate infer_table_structure and pdf_infer_table_structure in favor of skip_infer_table_types (#539)
1 parent abcce8e commit 1b160fe

4 files changed, +16 -20 lines changed

api-reference/partition/api-parameters.mdx

Lines changed: 8 additions & 8 deletions
@@ -12,20 +12,20 @@ The only required parameter is `files` - the file you wish to process.
 | POST, Python | JavaScript/TypeScript | Description |
 |---|---|---|
 | `files` (_shared.Files_) | `files` (_File_, _Blob_, _shared.Files_) | The file to process. |
-| `chunking_strategy` (_str_) | `chunkingStrategy` (_string_) | Use one of the supported strategies to chunk the returned elements after partitioning. When no chunking strategy is specified, no chunking is performed and any other chunking parameters provided are ignored. Supported strategies: `basic`, `by_title`, `by_page`, and `by_similarity`. [Learn more](/api-reference/partition/chunking). |
+| `chunking_strategy` (_str_) | `chunkingStrategy` (_string_) | Use one of the supported strategies to chunk the returned elements after partitioning. When no chunking strategy is specified, no chunking is performed and any other chunking parameters provided are ignored. Supported strategies: `basic`, `by_title`, `by_page`, and `by_similarity`. [Learn more](/api-reference/partition/chunking). |
 | `content_type` (_str_) | `contentType` (_string_) | A hint to Unstructured about the content type to use (such as `text/markdown`), when there are problems processing a specific file. This value is a MIME type in the format `type/subtype`. For available MIME types, see [model.py](https://github.com/Unstructured-IO/unstructured/blob/main/unstructured/file_utils/model.py). |
-| `coordinates` (_bool_) | `coordinates` (_boolean_) | True to return bounding box coordinates for each element extracted with OCR. Default: false. [Learn more](/api-reference/partition/examples#saving-bounding-box-coordinates). |
+| `coordinates` (_bool_) | `coordinates` (_boolean_) | True to return bounding box coordinates for each element extracted with OCR. Default: false. [Learn more](/api-reference/partition/examples#saving-bounding-box-coordinates). |
 | `encoding` (_str_) | `encoding` (_string_) | The encoding method used to decode the text input. Default: `utf-8`. |
-| `extract_image_block_types` (_List[str]_) | `extractImageBlockTypes` (_string[]_) | The types of elements to extract, for use in extracting image blocks as Base64 encoded data stored in element metadata fields, for example: `["Image","Table"]`. Supported filetypes are image and PDF. [Learn more](/api-reference/partition/extract-image-block-types). |
+| `extract_image_block_types` (_List[str]_) | `extractImageBlockTypes` (_string[]_) | The types of elements to extract, for use in extracting image blocks as Base64 encoded data stored in element metadata fields, for example: `["Image","Table"]`. Supported filetypes are image and PDF. [Learn more](/api-reference/partition/extract-image-block-types). |
 | `gz_uncompressed_content_type` (_str_) | `gzUncompressedContentType` (_string_) | If the file is gzipped, use this content type after unzipping. Example: `application/pdf`. |
-| `hi_res_model_name` (_str_) | `hiResModelName` (_string_) | The name of the inference model used when strategy is `hi_res`. Options are `layout_v1.1.0` and `yolox`. Default: `layout_v1.1.0`. [Learn more](/api-reference/partition/examples#changing-partition-strategy-for-a-pdf). |
+| `hi_res_model_name` (_str_) | `hiResModelName` (_string_) | The name of the inference model used when strategy is `hi_res`. Options are `layout_v1.1.0` and `yolox`. Default: `layout_v1.1.0`. [Learn more](/api-reference/partition/examples#changing-partition-strategy-for-a-pdf). |
 | `include_page_breaks` (_bool_) | `includePageBreaks` (_boolean_) | True for the output to include page breaks if the filetype supports it. Default: false. |
-| `languages` (_List[str]_) | `languages` (_string[]_) | The languages present in the document, for use in partitioning and OCR. [View the list of available languages](https://github.com/tesseract-ocr/tessdata). [Learn more](/api-reference/partition/examples#specifying-the-language-of-a-document-for-better-ocr-results). |
+| `languages` (_List[str]_) | `languages` (_string[]_) | The languages present in the document, for use in partitioning and OCR. [View the list of available languages](https://github.com/tesseract-ocr/tessdata). [Learn more](/api-reference/partition/examples#specifying-the-language-of-a-document-for-better-ocr-results). |
 | `output_format` (_str_) | `outputFormat` (_string_) | The format of the response. Supported formats are `application/json` and `text/csv`. Default: `application/json`. |
-| `pdf_infer_table_structure` (_bool_) | `pdfInferTableStructure` (_boolean_) | **Deprecated!** If true and `strategy` is `hi_res`, any `Table` elements extracted from a PDF will include an additional metadata field, `text_as_html`, where the value (string) is just a transformation of the data into an HTML table. |
+| `pdf_infer_table_structure` (_bool_) | `pdfInferTableStructure` (_boolean_) | **Deprecated!** Use `skip_infer_table_types` instead. If true and `strategy` is `hi_res`, any `Table` elements extracted from a PDF will include an additional metadata field, `text_as_html`, where the value (string) is just a transformation of the data into an HTML table. |
 | `skip_infer_table_types` (_List[str]_) | `skipInferTableTypes` (_string[]_) | The document types that you want to skip table extraction for. Default: `[]`. |
 | `starting_page_number` (_int_) | `startingPageNumber` (_number_) | The page number to be assigned to the first page in the document. This information will be included in elements' metadata and can be especially useful when partitioning a document that is part of a larger document. |
-| `strategy` (_str_) | `strategy` (_string_) | The strategy to use for partitioning PDF and image files. Options are `auto`, `vlm`, `hi_res`, `fast`, and `ocr_only`. Default: `auto`. [Learn more](/api-reference/partition/partitioning). |
+| `strategy` (_str_) | `strategy` (_string_) | The strategy to use for partitioning PDF and image files. Options are `auto`, `vlm`, `hi_res`, `fast`, and `ocr_only`. Default: `auto`. [Learn more](/api-reference/partition/partitioning). |
 | `unique_element_ids` (_bool_) | `uniqueElementIds` (_boolean_) | True to assign UUIDs to element IDs, which guarantees their uniqueness (useful when using them as primary keys in a database). Otherwise a SHA-256 of the element's text is used. Default: false. |
 | `vlm_model` (_str_) | (Not yet available) | Applies only when `strategy` is `vlm`. The name of the vision language model (VLM) to use for partitioning. `vlm_model_provider` must also be specified. For a list of allowed values, see the end of this article. |
 | `vlm_model_provider` (_str_) | (Not yet available) | Applies only when `strategy` is `vlm`. The name of the vision language model (VLM) provider to use for partitioning. `vlm_model` must also be specified. For a list of allowed values, see the end of this article. |
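Callers that still send the deprecated `pdf_infer_table_structure` flag will need to move to `skip_infer_table_types`, which the table describes as a list of document types rather than a boolean. A minimal sketch of that migration follows, assuming request parameters are kept in a plain dict before being sent. The helper name `migrate_table_params` is hypothetical, and the mapping of `False` to a list containing `"pdf"` is an assumption drawn from the parameter descriptions, not something this commit confirms.

```python
from typing import Any, Dict, List


def migrate_table_params(params: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of params without the deprecated boolean flag.

    Hypothetical helper, not part of any Unstructured SDK.
    Assumption: pdf_infer_table_structure=False corresponds to listing
    "pdf" in skip_infer_table_types; True adds nothing to the list.
    """
    migrated = dict(params)  # leave the caller's dict untouched
    legacy = migrated.pop("pdf_infer_table_structure", None)
    skip: List[str] = list(migrated.get("skip_infer_table_types", []))

    if legacy is False and "pdf" not in skip:
        skip.append("pdf")

    migrated["skip_infer_table_types"] = skip
    return migrated


old_request = {"strategy": "hi_res", "pdf_infer_table_structure": False}
new_request = migrate_table_params(old_request)
print(new_request)
```

The value is kept as a list in every branch so that it matches the `List[str]` type given in the table.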
@@ -48,7 +48,7 @@ The following parameters are specific to the Python and JavaScript/TypeScript cl

 | POST, Python | JavaScript/TypeScript | Description |
 |---|---|---|
-| `split_pdf_page` (_bool_) | `splitPdfPage` (_boolean_) | True to split the PDF file client-side. [Learn more](/api-reference/partition/sdk-python#page-splitting). |
+| `split_pdf_page` (_bool_) | `splitPdfPage` (_boolean_) | True to split the PDF file client-side. [Learn more](/api-reference/partition/sdk-python#page-splitting). |
 | `split_pdf_allow_failed` (_bool_) | `splitPdfAllowFailed` (_boolean_) | When `true`, a failed split request will not stop the processing of the rest of the document. The affected page range will be ignored in the results. When `false`, a failed split request will cause the entire document to fail. Default: `false`. |
 | `split_pdf_concurrency_level` (_int_) | `splitPdfConcurrencyLevel` (_number_) | The number of split files to be sent concurrently. Default: 5. Maximum: 15. |
 | `split_pdf_page_range` (_List[int]_) | `splitPdfPageRange` (_number[]_) | A list of 2 integers within the range `[1, length_of_pdf]`. When PDF splitting is enabled, this will send only the specified page range to the API. |
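The limits stated for the splitting parameters (a two-integer page range within `[1, length_of_pdf]`, and a concurrency level of at most 15) can be checked before a request is sent. The sketch below is one way to do that. The function name `check_split_options` is hypothetical, and two of its rules are assumptions not stated in the table: that the range is ordered as first page then last page, and that the concurrency level must be at least 1.

```python
from typing import List

MAX_CONCURRENCY = 15  # maximum stated in the parameter table


def check_split_options(page_range: List[int], pdf_length: int, concurrency: int = 5) -> None:
    """Raise ValueError if the options fall outside the documented limits.

    Hypothetical client-side check, not part of any Unstructured SDK.
    """
    if len(page_range) != 2:
        raise ValueError("split_pdf_page_range must contain exactly 2 integers")
    first, last = page_range
    if not (1 <= first <= pdf_length and 1 <= last <= pdf_length):
        raise ValueError(f"pages must lie within [1, {pdf_length}]")
    if first > last:
        # Assumption: the range is given as [first_page, last_page].
        raise ValueError("first page must not exceed last page")
    if not (1 <= concurrency <= MAX_CONCURRENCY):
        # Assumption: at least one request must be in flight.
        raise ValueError(f"split_pdf_concurrency_level must be between 1 and {MAX_CONCURRENCY}")


check_split_options([2, 5], pdf_length=10)  # valid, returns None
```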

api-reference/workflow/workflows.mdx

Lines changed: 4 additions & 8 deletions
@@ -708,7 +708,6 @@ Allowed values for `provider` and `model` include:
 settings={
 "strategy": "hi_res",
 "include_page_breaks": <True|False>,
-"pdf_infer_table_structure": <True|False>,
 "exclude_elements": [
 "<element-name>",
 "<element-name>"
@@ -723,7 +722,7 @@ Allowed values for `provider` and `model` include:
 "image",
 "table"
 ],
-"infer_table_structure": <True|False>
+"skip_infer_table_types": <True|False>
 }
 )
 ```
@@ -737,7 +736,6 @@ Allowed values for `provider` and `model` include:
 "settings": {
 "strategy": "hi_res",
 "include_page_breaks": <true|false>,
-"pdf_infer_table_structure": <true|false>,
 "exclude_elements": [
 "<element-name>",
 "<element-name>"
@@ -752,7 +750,7 @@ Allowed values for `provider` and `model` include:
 "image",
 "table"
 ],
-"infer_table_structure": <true|false>
+"skip_infer_table_types": <true|false>
 }
 }
 ```
@@ -771,7 +769,6 @@ Allowed values for `provider` and `model` include:
 settings={
 "strategy": "fast",
 "include_page_breaks": <True|False>,
-"pdf_infer_table_structure": <True|False>,
 "exclude_elements": [
 "<element-name>",
 "<element-name>"
@@ -786,7 +783,7 @@ Allowed values for `provider` and `model` include:
 "image",
 "table"
 ],
-"infer_table_structure": <True|False>
+"skip_infer_table_types": <True|False>
 }
 )
 ```
@@ -800,7 +797,6 @@ Allowed values for `provider` and `model` include:
 "settings": {
 "strategy": "fast",
 "include_page_breaks": <true|false>,
-"pdf_infer_table_structure": <true|false>,
 "exclude_elements": [
 "<element-name>",
 "<element-name>"
@@ -815,7 +811,7 @@ Allowed values for `provider` and `model` include:
 "image",
 "table"
 ],
-"infer_table_structure": <true|false>
+"skip_infer_table_types": <true|false>
 }
 }
 ```

examplecode/codesamples/apioss/table-extraction-from-pdf.mdx

Lines changed: 2 additions & 2 deletions
@@ -9,7 +9,7 @@ This sample code utilizes the [Unstructured Open Source](/open-source/introducti

 ## Method 1: Using partition\_pdf

-To extract the tables from PDF files using the [partition\_pdf](/open-source/core-functionality/partitioning#partition-pdf) function, set the `infer_table_structure` parameter to `True` and the `strategy` parameter to `hi_res`.
+To extract the tables from PDF files using the [partition\_pdf](/open-source/core-functionality/partitioning#partition-pdf) function, set the `skip_infer_table_types` parameter to `False` and the `strategy` parameter to `hi_res`.

 **Usage**

@@ -19,7 +19,7 @@ from unstructured.partition.pdf import partition_pdf
 fname = "example-docs/pdf/layout-parser-paper.pdf"

 elements = partition_pdf(filename=fname,
-infer_table_structure=True,
+skip_infer_table_types=False,
 strategy='hi_res',
 )
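When table inference is enabled with the `hi_res` strategy, each `Table` element is described as carrying a `text_as_html` metadata field. A minimal sketch of collecting those fields follows, assuming the elements have been serialized to dicts with `type` and `metadata` keys, as in the API's JSON output. The helper name `table_html` and the sample data are made up for illustration.

```python
from typing import Any, Dict, List


def table_html(elements: List[Dict[str, Any]]) -> List[str]:
    """Collect the text_as_html value of every Table element that has one."""
    html_tables = []
    for element in elements:
        if element.get("type") != "Table":
            continue
        html = element.get("metadata", {}).get("text_as_html")
        if html:  # absent when table inference was skipped for this file type
            html_tables.append(html)
    return html_tables


# Illustrative sample only; real output contains more fields per element.
sample_elements = [
    {"type": "Title", "text": "Results", "metadata": {}},
    {
        "type": "Table",
        "text": "Model Score A 0.9",
        "metadata": {"text_as_html": "<table><tr><td>A</td><td>0.9</td></tr></table>"},
    },
    {"type": "Table", "text": "Model Score B 0.8", "metadata": {}},
]

for html in table_html(sample_elements):
    print(html)
```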