
Commit af24905

chore(deps): Bump unstructured[local-inference] from 0.10.14 to 0.10.15 in /requirements (#242)
Bumps [unstructured[local-inference]](https://github.com/Unstructured-IO/unstructured) from 0.10.14 to 0.10.15. <details> <summary>Release notes</summary> <p><em>Sourced from <a href="https://github.com/Unstructured-IO/unstructured/releases">unstructured[local-inference]'s releases</a>.</em></p> <blockquote> <h2>0.10.15</h2> <h3>Enhancements</h3> <ul> <li><strong>Support for better element categories from the next-generation image-to-text model (&quot;chipper&quot;)</strong>. Previously, not all of the classifications from Chipper were being mapped to proper <code>unstructured</code> element categories, so consumers of the library would see many <code>UncategorizedText</code> elements. This fixes the issue, improving the granularity of the element category outputs for better downstream processing and chunking. The mapping update is: <ul> <li>&quot;Threading&quot;: <code>NarrativeText</code></li> <li>&quot;Form&quot;: <code>NarrativeText</code></li> <li>&quot;Field-Name&quot;: <code>Title</code></li> <li>&quot;Value&quot;: <code>NarrativeText</code></li> <li>&quot;Link&quot;: <code>NarrativeText</code></li> <li>&quot;Headline&quot;: <code>Title</code> (with <code>category_depth=1</code>)</li> <li>&quot;Subheadline&quot;: <code>Title</code> (with <code>category_depth=2</code>)</li> <li>&quot;Abstract&quot;: <code>NarrativeText</code></li> </ul> </li> <li><strong>Better ListItem grouping for PDFs (fast strategy).</strong> <code>partition_pdf</code> with the <code>fast</code> strategy previously broke some numbered list item lines into separate elements. This enhancement leverages the x,y coordinates and bbox sizes to decide whether a chunk of text is a continuation of the immediately preceding ListItem element, rather than detecting it as a separate non-ListItem element.</li> <li><strong>Fall back to text-based classification for uncategorized Layout elements for Images and PDFs</strong>. 
Improves element classification by running existing text-based rules on previously UncategorizedText elements.</li> <li><strong>Adds table partitioning for many doc types, including: .html, .epub, .md, .rst, .odt, and .msg.</strong> At the core of this change is the .html partition functionality, which is leveraged by the other affected doc types. This impacts many scenarios where <code>Table</code> elements are now properly extracted.</li> <li><strong>Create and add <code>add_chunking_strategy</code> decorator to partition functions.</strong> Previously, users were responsible for their own chunking after partitioning elements, which is often required for downstream applications. Now, individual elements may be combined into right-sized chunks, with min and max character sizes specifiable when <code>chunking_strategy=by_title</code>. Relevant elements are grouped together for better downstream results. This enables users to immediately use partitioned results in downstream applications (e.g. RAG architecture apps) without any additional post-processing.</li> <li><strong>Adds <code>languages</code> as an input parameter and marks the <code>ocr_languages</code> kwarg for deprecation in pdf, image, and auto partitioning functions.</strong> Previously, language information was only used for Tesseract OCR on image-based documents and was in a Tesseract-specific string format; by refactoring into a list of standard language codes independent of Tesseract, the <code>unstructured</code> library can better support <code>languages</code> for other non-image pipelines and/or other OCR engines.</li> <li><strong>Removes <code>UNSTRUCTURED_LANGUAGE</code> env var usage and replaces <code>language</code> with <code>languages</code> as an input parameter to unstructured-partition-text_type functions.</strong> The previous parameter/input setup was not user-friendly or scalable to the variety of elements being processed. 
By refactoring the inputted language information into a list of standard language codes, we can support future applications of the element language such as detection, metadata, and multi-language elements. Now, to skip English-specific checks, set the <code>languages</code> parameter to any non-English language(s).</li> <li><strong>Adds <code>xlsx</code> and <code>xls</code> filetype extensions to the <code>skip_infer_table_types</code> default list in <code>partition</code>.</strong> With these file types in the default list, such files no longer go through table extraction. Users can still extract tables from these filetypes, but will have to set <code>skip_infer_table_types</code> so it excludes the desired filetype extension. This avoids misrepresenting complex spreadsheets where there may be multiple sub-tables and other content.</li> <li><strong>Better debug output related to sentence counting internals</strong>. Clarifies the message when a sentence is not counted toward the sentence count because it doesn't have enough words; relevant for developers focused on <code>unstructured</code>'s NLP internals.</li> <li><strong>Faster ocr_only speed for partitioning PDFs and images.</strong> Uses the <code>unstructured_pytesseract.run_and_get_multiple_output</code> function to reduce the number of calls to <code>tesseract</code> by half when partitioning a pdf or image with <code>tesseract</code>.</li> <li><strong>Adds data source properties to fsspec connectors.</strong> These properties (date_created, date_modified, version, source_url, record_locator) are written to element metadata during ingest, mapping elements to information about the document source from which they derive. This enables downstream applications to link back to the source document, e.g. a GDrive doc, Salesforce record, etc.</li> <li><strong>Add delta table destination connector.</strong> A new delta table destination connector has been added to the ingest CLI. 
Users may now use <code>unstructured-ingest</code> to write partitioned data from over 20 data sources (so far) to a Delta Table.</li> <li><strong>Rename to Source and Destination Connectors in the documentation.</strong> Maintains naming consistency between the Connectors codebase and the documentation, with the first addition of a destination connector.</li> <li><strong>Non-HTML text files now return unstructured-elements as opposed to HTML-elements.</strong> Previously, text-based files that went through <code>partition_html</code> would return HTML-elements, but now the input format is preserved via the <code>source_format</code> argument in the partition call.</li> <li><strong>Adds <code>PaddleOCR</code> as an optional alternative to <code>Tesseract</code></strong> for OCR when processing PDF or Image files; it is installable via the <code>makefile</code> command <code>install-paddleocr</code>. For experimental purposes only.</li> <li><strong>Bump unstructured-inference</strong> to 0.5.28. This version bump markedly improves the output of table data, rendered as <code>metadata.text_as_html</code> in an element. These changes include: <ul> <li>add env variable <code>ENTIRE_PAGE_OCR</code> to specify using paddle or tesseract for entire-page OCR</li> <li>table structure detection now pads the input image by 25 pixels in all 4 directions to improve its recall (0.5.27)</li> <li>support paddle with both cpu and gpu; it is assumed to be pre-installed (0.5.26)</li> <li>fix a bug where <code>cells_to_html</code> doesn't handle cells spanning multiple rows properly (0.5.25)</li> <li>remove the <code>cv2</code> preprocessing step before the OCR step in table transformer (0.5.24)</li> </ul> </li> </ul> <h3>Features</h3> <ul> <li><strong>Adds element metadata via <code>category_depth</code> with default value None</strong>. 
<ul> <li>This additional metadata is useful for vectordb/LLM chunking strategies and retrieval applications.</li> </ul> </li> <li><strong>Adds a naive hierarchy for elements via a <code>parent_id</code> on the element's metadata</strong> <ul> <li>Users will now have more metadata for implementing vectordb/LLM chunking strategies. For example, text elements could be queried by their preceding title element.</li> <li>Title elements created from HTML headings will properly nest.</li> </ul> </li> </ul> <h3>Fixes</h3> <ul> <li><strong><code>add_pytesseract_bboxes_to_elements</code> no longer returns <code>nan</code> values</strong>. The function logic is now broken into the new methods <code>_get_element_box</code> and <code>convert_multiple_coordinates_to_new_system</code>.</li> <li><strong>Selecting a different model wasn't being respected when calling <code>partition_image</code>.</strong> Problem: <code>partition_pdf</code> allows passing a <code>model_name</code> parameter. Given the similarity between the image and PDF pipelines, the expected behavior is that <code>partition_image</code> supports the same parameter, but <code>partition_image</code> was unintentionally not passing along its <code>kwargs</code>. This was corrected by adding the kwargs to the downstream call.</li> <li><strong>Fixes a chunking issue by dropping the field &quot;coordinates&quot;.</strong> Problem: the chunk_by_title function was chunking each element into its own individual chunk when it needed to group elements into fewer chunks. This happens because of the metadata-matching logic in chunk_by_title: elements with different metadata can't be put into the same chunk. At the same time, any element with &quot;coordinates&quot; effectively had different metadata from other elements, since each element is located in a different place and has different coordinates. 
Fix: the key &quot;coordinates&quot; is now included in the list of metadata keys excluded from the &quot;metadata_matches&quot; comparison. Importance: this change is crucial for being able to chunk by title for documents whose elements include &quot;coordinates&quot; metadata.</li> </ul> </blockquote> </details> <details> <summary>Changelog</summary> <p><em>Sourced from <a href="https://github.com/Unstructured-IO/unstructured/blob/main/CHANGELOG.md">unstructured[local-inference]'s changelog</a>.</em></p> <blockquote> <p><em>The 0.10.15 changelog entries are identical to the release notes above.</em></p> </blockquote> </details> <details> <summary>Commits</summary> <ul> <li><a href="https://github.com/Unstructured-IO/unstructured/commit/b534b2a6cdd91f2a6380b8e1097de28e141a0d8e"><code>b534b2a</code></a> Chore: bump inference package version to 0.5.28 and new release (<a href="https://redirect.github.com/Unstructured-IO/unstructured/issues/1355">#1355</a>)</li> <li><a href="https://github.com/Unstructured-IO/unstructured/commit/09a0958f900a6748c217c9f022ca90b4ab01b3a5"><code>09a0958</code></a> Feat: CORE-1269 - Install paddlepaddle wheel dependent on arch, supporting aa...</li> <li><a href="https://github.com/Unstructured-IO/unstructured/commit/36d026cb1bb009a275c3afb6a79bd7237c762027"><code>36d026c</code></a> chore: update CHANGELOG.md bullets (<a href="https://redirect.github.com/Unstructured-IO/unstructured/issues/1436">#1436</a>)</li> <li><a href="https://github.com/Unstructured-IO/unstructured/commit/6187dc09768df825920dca0e323005712aad05d2"><code>6187dc0</code></a> update links in integrations.rst (<a href="https://redirect.github.com/Unstructured-IO/unstructured/issues/1418">#1418</a>)</li> <li><a href="https://github.com/Unstructured-IO/unstructured/commit/333558494e6695717b73421cf9fea1a3285925ef"><code>3335584</code></a> roman/delta lake dest connector (<a href="https://redirect.github.com/Unstructured-IO/unstructured/issues/1385">#1385</a>)</li> <li><a href="https://github.com/Unstructured-IO/unstructured/commit/98d3541909f64290b5efb65a226fc3ee8a7cc5ee"><code>98d3541</code></a> Update CHANGELOG.md (<a href="https://redirect.github.com/Unstructured-IO/unstructured/issues/1435">#1435</a>)</li> <li><a href="https://github.com/Unstructured-IO/unstructured/commit/de4d496fcf64cfadfcdc4ab065c106287eb48637"><code>de4d496</code></a> Fix bbox coordinates for ocr_only strategy (<a 
href="https://redirect.github.com/Unstructured-IO/unstructured/issues/1325">#1325</a>)</li> <li><a href="https://github.com/Unstructured-IO/unstructured/commit/0d61c9848170b8db090b121b97e3822dbeff4eab"><code>0d61c98</code></a> fix: Pass partition_image kwargs downstream (<a href="https://redirect.github.com/Unstructured-IO/unstructured/issues/1426">#1426</a>)</li> <li><a href="https://github.com/Unstructured-IO/unstructured/commit/fe11ab4235ad2b2bc8328a036b4da33b7392f8fb"><code>fe11ab4</code></a> feat: improved mapping for missing chipper elements (<a href="https://redirect.github.com/Unstructured-IO/unstructured/issues/1431">#1431</a>)</li> <li><a href="https://github.com/Unstructured-IO/unstructured/commit/50db2abd9f6f0eadd456a4b5026b4ff0dbdc5d75"><code>50db2ab</code></a> fix: updating element types (<a href="https://redirect.github.com/Unstructured-IO/unstructured/issues/1394">#1394</a>)</li> <li>Additional commits viewable in <a href="https://github.com/Unstructured-IO/unstructured/compare/0.10.14...0.10.15">compare view</a></li> </ul> </details> <br /> [![Dependabot compatibility score](https://dependabot-badges.githubapp.com/badges/compatibility_score?dependency-name=unstructured[local-inference]&package-manager=pip&previous-version=0.10.14&new-version=0.10.15)](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores) Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting `@dependabot rebase`. 
[//]: # (dependabot-automerge-start) [//]: # (dependabot-automerge-end) --- <details> <summary>Dependabot commands and options</summary> <br /> You can trigger Dependabot actions by commenting on this PR: - `@dependabot rebase` will rebase this PR - `@dependabot recreate` will recreate this PR, overwriting any edits that have been made to it - `@dependabot merge` will merge this PR after your CI passes on it - `@dependabot squash and merge` will squash and merge this PR after your CI passes on it - `@dependabot cancel merge` will cancel a previously requested merge and block automerging - `@dependabot reopen` will reopen this PR if it is closed - `@dependabot close` will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually - `@dependabot show <dependency name> ignore conditions` will show all of the ignore conditions of the specified dependency - `@dependabot ignore this major version` will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself) - `@dependabot ignore this minor version` will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself) - `@dependabot ignore this dependency` will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself) </details> --------- Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: dependabot[bot] <dependabot[bot]@users.noreply.github.com> Co-authored-by: Austin Walker <[email protected]>
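The `chunking_strategy=by_title` enhancement in the notes above can be illustrated with a small, dependency-free sketch: a Title element opens a new chunk and the elements that follow join it. The names here (`chunk_elements_by_title`, the tuple representation of elements) are illustrative only, not the library's API, and the real `add_chunking_strategy` decorator additionally enforces min/max chunk sizes and metadata compatibility:

```python
# Simplified sketch of "by_title" chunking: each Title starts a new chunk;
# every other element appends to the chunk that is currently open.
# Illustration only -- not unstructured's actual implementation.

def chunk_elements_by_title(elements):
    """elements: list of (category, text) pairs, e.g. ("Title", "Intro")."""
    chunks = []
    for category, text in elements:
        if category == "Title" or not chunks:
            chunks.append([text])      # a Title (or the very first element) opens a chunk
        else:
            chunks[-1].append(text)    # everything else joins the open chunk
    return ["\n".join(chunk) for chunk in chunks]

elements = [
    ("Title", "Introduction"),
    ("NarrativeText", "First paragraph."),
    ("Title", "Methods"),
    ("NarrativeText", "Second paragraph."),
]
print(chunk_elements_by_title(elements))
# → ['Introduction\nFirst paragraph.', 'Methods\nSecond paragraph.']
```

Grouping on titles keeps semantically related elements together, which is why the release notes call the chunks "right-sized" for retrieval applications.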
1 parent 6923a24 commit af24905

File tree

4 files changed: +54 -43 lines changed

CHANGELOG.md

Lines changed: 2 additions & 1 deletion

@@ -1,6 +1,7 @@
-## 0.0.45-dev0
+## 0.0.45
 
 * Drop `detection_class_prob` from the element metadata. This broke backwards compatibility when library users called `partition_via_api`.
+* Bump unstructured to 0.10.15
 
 ## 0.0.44
 

requirements/base.txt

Lines changed: 19 additions & 16 deletions

@@ -31,7 +31,7 @@ click==8.1.3
 # uvicorn
 coloredlogs==15.0.1
 # via onnxruntime
-contourpy==1.1.0
+contourpy==1.1.1
 # via matplotlib
 cryptography==41.0.3
 # via pdfminer-six
@@ -51,7 +51,7 @@ exceptiongroup==1.1.3
 # via anyio
 fastapi==0.103.1
 # via -r requirements/base.in
-filelock==3.12.3
+filelock==3.12.4
 # via
 # huggingface-hub
 # torch
@@ -62,11 +62,11 @@ flatbuffers==23.5.26
 # via onnxruntime
 fonttools==4.42.1
 # via matplotlib
-fsspec==2023.9.0
+fsspec==2023.9.1
 # via huggingface-hub
 h11==0.14.0
 # via uvicorn
-huggingface-hub==0.17.1
+huggingface-hub==0.17.2
 # via
 # timm
 # transformers
@@ -99,7 +99,7 @@ markupsafe==2.1.3
 # via jinja2
 marshmallow==3.20.1
 # via dataclasses-json
-matplotlib==3.7.3
+matplotlib==3.8.0
 # via pycocotools
 mpmath==1.3.0
 # via sympy
@@ -111,7 +111,7 @@ networkx==3.1
 # via torch
 nltk==3.8.1
 # via unstructured
-numpy==1.25.2
+numpy==1.26.0
 # via
 # contourpy
 # layoutparser
@@ -146,6 +146,7 @@ packaging==23.1
 # onnxruntime
 # pytesseract
 # transformers
+# unstructured-pytesseract
 pandas==2.1.0
 # via
 # layoutparser
@@ -160,7 +161,7 @@ pdfminer-six==20221105
 # unstructured
 pdfplumber==0.10.2
 # via layoutparser
-pillow==10.0.0
+pillow==10.0.1
 # via
 # layoutparser
 # matplotlib
@@ -169,7 +170,8 @@ pillow==10.0.0
 # pytesseract
 # python-pptx
 # torchvision
-portalocker==2.7.0
+# unstructured-pytesseract
+portalocker==2.8.2
 # via iopath
 protobuf==4.24.3
 # via
@@ -181,7 +183,7 @@ pycocotools==2.0.7
 # via effdet
 pycparser==2.21
 # via cffi
-pycryptodome==3.18.0
+pycryptodome==3.19.0
 # via -r requirements/base.in
 pydantic==1.10.12
 # via
@@ -191,7 +193,7 @@ pypandoc==1.11
 # via unstructured
 pyparsing==3.1.1
 # via matplotlib
-pypdf==3.16.0
+pypdf==3.16.1
 # via -r requirements/base.in
 pypdfium2==4.20.0
 # via pdfplumber
@@ -275,12 +277,11 @@ tqdm==4.66.1
 # iopath
 # nltk
 # transformers
-transformers==4.33.1
+transformers==4.33.2
 # via unstructured-inference
-typing-extensions==4.7.1
+typing-extensions==4.8.0
 # via
 # fastapi
-# filelock
 # huggingface-hub
 # iopath
 # onnx
@@ -293,15 +294,17 @@ typing-inspect==0.9.0
 # via dataclasses-json
 tzdata==2023.3
 # via pandas
-unstructured[local-inference]==0.10.14
+unstructured[local-inference]==0.10.15
 # via -r requirements/base.in
-unstructured-inference==0.5.25
+unstructured-inference==0.5.28
+# via unstructured
+unstructured-pytesseract==0.3.12
 # via unstructured
 urllib3==2.0.4
 # via requests
 uvicorn==0.23.2
 # via -r requirements/base.in
 xlrd==2.0.1
 # via unstructured
-xlsxwriter==3.1.3
+xlsxwriter==3.1.4
 # via python-pptx

requirements/test.txt

Lines changed: 29 additions & 24 deletions

@@ -87,7 +87,7 @@ comm==0.1.4
 # via
 # ipykernel
 # ipywidgets
-contourpy==1.1.0
+contourpy==1.1.1
 # via
 # -r requirements/base.txt
 # matplotlib
@@ -105,7 +105,7 @@ dataclasses-json==0.6.0
 # via
 # -r requirements/base.txt
 # unstructured
-debugpy==1.7.0
+debugpy==1.8.0
 # via ipykernel
 decorator==5.1.1
 # via ipython
@@ -146,7 +146,7 @@ fastcore==1.5.29
 # nbdev
 fastjsonschema==2.18.0
 # via nbformat
-filelock==3.12.3
+filelock==3.12.4
 # via
 # -r requirements/base.txt
 # huggingface-hub
@@ -168,7 +168,7 @@ fonttools==4.42.1
 # matplotlib
 fqdn==1.5.1
 # via jsonschema
-fsspec==2023.9.0
+fsspec==2023.9.1
 # via
 # -r requirements/base.txt
 # huggingface-hub
@@ -183,7 +183,7 @@ httpcore==0.18.0
 # via httpx
 httpx==0.25.0
 # via -r requirements/test.in
-huggingface-hub==0.17.1
+huggingface-hub==0.17.2
 # via
 # -r requirements/base.txt
 # timm
@@ -220,7 +220,7 @@ ipython==8.15.0
 # jupyter-console
 ipython-genutils==0.2.0
 # via qtconsole
-ipywidgets==8.1.0
+ipywidgets==8.1.1
 # via jupyter
 isoduration==20.11.0
 # via jsonschema
@@ -284,15 +284,15 @@ jupyter-server==2.7.3
 # notebook-shim
 jupyter-server-terminals==0.4.4
 # via jupyter-server
-jupyterlab==4.0.5
+jupyterlab==4.0.6
 # via notebook
 jupyterlab-pygments==0.2.2
 # via nbconvert
 jupyterlab-server==2.25.0
 # via
 # jupyterlab
 # notebook
-jupyterlab-widgets==3.0.8
+jupyterlab-widgets==3.0.9
 # via ipywidgets
 kiwisolver==1.4.5
 # via
@@ -322,7 +322,7 @@ marshmallow==3.20.1
 # via
 # -r requirements/base.txt
 # dataclasses-json
-matplotlib==3.7.3
+matplotlib==3.8.0
 # via
 # -r requirements/base.txt
 # pycocotools
@@ -363,7 +363,7 @@ nbformat==5.9.2
 # jupyter-server
 # nbclient
 # nbconvert
-nest-asyncio==1.5.7
+nest-asyncio==1.5.8
 # via ipykernel
 networkx==3.1
 # via
@@ -379,7 +379,7 @@ notebook-shim==0.2.3
 # via
 # jupyterlab
 # notebook
-numpy==1.25.2
+numpy==1.26.0
 # via
 # -r requirements/base.txt
 # contourpy
@@ -440,6 +440,7 @@ packaging==23.1
 # qtconsole
 # qtpy
 # transformers
+# unstructured-pytesseract
 pandas==2.1.0
 # via
 # -r requirements/base.txt
@@ -469,7 +470,7 @@ pexpect==4.8.0
 # via ipython
 pickleshare==0.7.5
 # via ipython
-pillow==10.0.0
+pillow==10.0.1
 # via
 # -r requirements/base.txt
 # layoutparser
@@ -479,13 +480,14 @@ pillow==10.0.0
 # pytesseract
 # python-pptx
 # torchvision
+# unstructured-pytesseract
 platformdirs==3.10.0
 # via
 # black
 # jupyter-core
 pluggy==1.3.0
 # via pytest
-portalocker==2.7.0
+portalocker==2.8.2
 # via
 # -r requirements/base.txt
 # iopath
@@ -520,7 +522,7 @@ pycparser==2.21
 # via
 # -r requirements/base.txt
 # cffi
-pycryptodome==3.18.0
+pycryptodome==3.19.0
 # via -r requirements/base.txt
 pydantic==1.10.12
 # via
@@ -542,7 +544,7 @@ pyparsing==3.1.1
 # via
 # -r requirements/base.txt
 # matplotlib
-pypdf==3.16.0
+pypdf==3.16.1
 # via -r requirements/base.txt
 pypdfium2==4.20.0
 # via
@@ -638,7 +640,7 @@ rfc3986-validator==0.1.1
 # via
 # jsonschema
 # jupyter-events
-rpds-py==0.10.2
+rpds-py==0.10.3
 # via
 # jsonschema
 # referencing
@@ -737,7 +739,7 @@ tqdm==4.66.1
 # iopath
 # nltk
 # transformers
-traitlets==5.9.0
+traitlets==5.10.0
 # via
 # comm
 # ipykernel
@@ -754,17 +756,16 @@ traitlets==5.9.0
 # nbconvert
 # nbformat
 # qtconsole
-transformers==4.33.1
+transformers==4.33.2
 # via
 # -r requirements/base.txt
 # unstructured-inference
-typing-extensions==4.7.1
+typing-extensions==4.8.0
 # via
 # -r requirements/base.txt
 # async-lru
 # black
 # fastapi
-# filelock
 # huggingface-hub
 # iopath
 # mypy
@@ -781,9 +782,13 @@ tzdata==2023.3
 # via
 # -r requirements/base.txt
 # pandas
-unstructured[local-inference]==0.10.14
+unstructured[local-inference]==0.10.15
 # via -r requirements/base.txt
-unstructured-inference==0.5.25
+unstructured-inference==0.5.28
+# via
+# -r requirements/base.txt
+# unstructured
+unstructured-pytesseract==0.3.12
 # via
 # -r requirements/base.txt
 # unstructured
@@ -809,13 +814,13 @@ websocket-client==1.6.3
 # via jupyter-server
 wheel==0.41.2
 # via astunparse
-widgetsnbextension==4.0.8
+widgetsnbextension==4.0.9
 # via ipywidgets
 xlrd==2.0.1
 # via
 # -r requirements/base.txt
 # unstructured
-xlsxwriter==3.1.3
+xlsxwriter==3.1.4
 # via
 # -r requirements/base.txt
 # python-pptx

scripts/parallel-mode-test.sh

Lines changed: 4 additions & 2 deletions

@@ -27,7 +27,9 @@ do
 echo Testing: "$curl_command"
 
 # Run in single mode
-$curl_command 2> /dev/null | jq -S > output.json
+# Note(austin): Parallel mode screws up hierarchy! While we deal with that,
+# let's ignore parent_id fields in the results
+$curl_command 2> /dev/null | jq -S 'del(..|.parent_id?)' > output.json
 
 # Stop if curl didn't work
 if [ ! -s output.json ]; then
@@ -38,7 +40,7 @@ do
 
 # Run in parallel mode
 curl_command="curl $base_url_2/general/v0/general $params"
-$curl_command 2> /dev/null | jq -S > parallel_output.json
+$curl_command 2> /dev/null | jq -S 'del(..|.parent_id?)' > parallel_output.json
 
 # Stop if curl didn't work
 if [ ! -s parallel_output.json ]; then
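The jq filter the script adds can be exercised standalone: `del(..|.parent_id?)` walks every value recursively (`..`), looks up `parent_id` with `?` so non-objects don't error, and deletes each match. The sample JSON below is a made-up stand-in for partitioned output:

```shell
# Hypothetical sample standing in for partitioned output; the filter removes
# every parent_id key, however deeply nested, just as the test script does.
sample='[{"text":"Title","parent_id":null},{"text":"Body","metadata":{"parent_id":"abc","page":1}}]'
echo "$sample" | jq -cS 'del(..|.parent_id?)'
# → [{"text":"Title"},{"metadata":{"page":1},"text":"Body"}]
```

With `-S` sorting keys on both sides, the single-mode and parallel-mode outputs diff cleanly even though `parent_id` values differ between runs.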
