Releases · Unstructured-IO/unstructured

27 Aug 15:55

MthwRobinson

0.15.8

4194a07

0.15.8

Enhancements

Bump unstructured.paddleocr to 2.8.1.0.

Features

Add MixedbreadAI embedder. Adds MixedbreadAI embeddings to support embedding via Mixedbread AI.

Fixes

Replace pillow-heif with pi-heif. Replaces pillow-heif with pi-heif due to more permissive licensing on the wheel for pi-heif.
Minify text_as_html from DOCX. Previously .metadata.text_as_html for DOCX tables was "bloated" with whitespace and noise elements introduced by tabulate that produced over-chunking and lower "semantic density" of elements. Reduce HTML to minimum character count while preserving all text.
Fall back to filename extension-based file-type detection for unidentified OLE files. Resolves a problem where a DOC file that could not be detected as such by filetype was incorrectly identified as a MSG file.
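
The OLE fallback described above can be illustrated with a minimal sketch. This is not the library's implementation: the function name, the `sniffed` parameter, and the extension table are hypothetical, and only the 8-byte OLE2 signature is a fixed fact.

```python
from pathlib import Path
from typing import Optional

# Signature shared by all OLE2 compound files (DOC, PPT, XLS, MSG).
OLE_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")

# Hypothetical extension table, for illustration only.
OLE_EXTENSIONS = {".doc": "DOC", ".ppt": "PPT", ".xls": "XLS", ".msg": "MSG"}


def detect_ole_type(
    head: bytes, filename: str, sniffed: Optional[str] = None
) -> Optional[str]:
    """Return a file-type label for an OLE container, or None.

    `sniffed` stands in for the result of content-based detection and is
    None when that step could not identify the file.
    """
    if not head.startswith(OLE_MAGIC):
        return None  # not an OLE file at all
    if sniffed is not None:
        return sniffed  # content-based detection succeeded
    # Content-based detection failed: trust the extension instead of
    # assuming a default such as MSG.
    return OLE_EXTENSIONS.get(Path(filename).suffix.lower())
```

With this shape, an unidentified OLE file named report.doc resolves to DOC, while a file with an unknown extension stays unidentified rather than being labelled MSG.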

20 Aug 19:53

christinestraub

0.15.7

01dbc7b

0.15.7

Enhancements

Features

Fixes

Fix NLTK data download path to prevent nested directories. Resolved an issue where a nested "nltk_data" directory was created within the parent "nltk_data" directory when it already existed. This fix prevents errors in checking for existing downloads and loading models from NLTK data.
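
One way to avoid the nesting described above is to check whether the target path already ends in the data directory name before appending it. The helper below is a hypothetical sketch of that idea, not the code used in the release.

```python
import os


def resolve_download_dir(base_dir: str, leaf: str = "nltk_data") -> str:
    """Return the directory to download into without nesting `leaf` twice."""
    normalized = os.path.normpath(base_dir)
    if os.path.basename(normalized) == leaf:
        # The caller already pointed at .../nltk_data, so use it as-is.
        return normalized
    return os.path.join(normalized, leaf)
```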

20 Aug 12:47

MthwRobinson

0.15.6

1f8030d

0.15.6

Enhancements

Features

Fixes

Bump to NLTK 3.9.x. Bumps to the latest nltk version to resolve a CVE.
Update CI for ingest-test-fixture-update-pr to resolve NLTK model download errors.
Synchronized text and html on TableChunk splits. When a Table element is divided during chunking to fit the chunking window, TableChunk.text corresponds exactly with the table text in TableChunk.metadata.text_as_html, .text_as_html is always parseable HTML, and the table is split on even row boundaries whenever possible.
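
The row-boundary splitting described for TableChunk can be sketched as follows. The row representation and the function names are assumptions made for illustration; the actual chunker works on the library's own table model. Each chunk's text and HTML are rendered from the same batch of rows, which is what keeps them synchronized.

```python
from html import escape
from typing import Iterator, List, Tuple

Row = List[str]  # one table row as a list of cell strings (assumed shape)


def rows_to_text(rows: List[Row]) -> str:
    """Render rows as plain text, one row per line."""
    return "\n".join(" ".join(cells) for cells in rows)


def rows_to_html(rows: List[Row]) -> str:
    """Render rows as a minimal, always well-formed HTML table."""
    body = "".join(
        "<tr>" + "".join(f"<td>{escape(cell)}</td>" for cell in cells) + "</tr>"
        for cells in rows
    )
    return f"<table>{body}</table>"


def split_table(rows: List[Row], max_chars: int) -> Iterator[Tuple[str, str]]:
    """Yield (text, html) pairs, breaking only between rows."""
    batch: List[Row] = []
    for row in rows:
        candidate = batch + [row]
        if batch and len(rows_to_text(candidate)) > max_chars:
            # Adding this row would overflow the window: emit the batch.
            yield rows_to_text(batch), rows_to_html(batch)
            batch = [row]
        else:
            batch = candidate
    if batch:
        yield rows_to_text(batch), rows_to_html(batch)
```

This sketch never splits inside a row, so a single row longer than the window is emitted whole; the release notes only promise row-boundary splits "whenever possible".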

16 Aug 14:35

MthwRobinson

0.15.5

fc26426

0.15.5

Enhancements

Features

Fixes

Revert to using unstructured.pytesseract fork. Due to the unavailability of some recent release versions of pytesseract on PyPI, the project now uses the unstructured.pytesseract fork to ensure stability and continued support.
Bump libreoffice version in image. Bumps the libreoffice version to 25.2.5.2 to address CVEs.
Downgrade NLTK dependency version for compatibility. Due to the unavailability of nltk==3.8.2 on PyPI, the NLTK dependency has been downgraded to <3.8.2. This change ensures continued functionality and compatibility.

14 Aug 21:18

christinestraub

0.15.4

9b778e2

0.15.4

Enhancements

Features

Fixes

Resolve an installation error with pytesseract>=0.3.12 that occurred during pip install unstructured[pdf]==0.15.3.

14 Aug 17:23

christinestraub

0.15.3

d6a84bd

0.15.3

Enhancements

Features

Fixes

Remove the custom index URL from extra-paddleocr.in to resolve the error in the setup.py configuration.

13 Aug 13:40

MthwRobinson

0.15.2

7437f0a

0.15.2

Enhancements

Improve directory handling when extracting image blocks. The figures directory is no longer created when the extract_image_block_to_payload parameter is set to True.

Features

Added per-class Object Detection metrics in the evaluation. The metrics include average precision, precision, recall, and f1-score for each class in the dataset.
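
As a rough illustration of per-class metrics, the sketch below derives precision, recall, and f1-score from per-class true-positive, false-positive, and false-negative counts. The input shape and function name are assumptions, and average precision is omitted because it needs ranked confidence scores that this sketch does not model.

```python
from typing import Dict


def per_class_metrics(
    counts: Dict[str, Dict[str, int]]
) -> Dict[str, Dict[str, float]]:
    """Compute precision, recall and f1 for each class label.

    `counts` maps a label to {"tp": ..., "fp": ..., "fn": ...}.
    """
    results: Dict[str, Dict[str, float]] = {}
    for label, c in counts.items():
        tp, fp, fn = c["tp"], c["fp"], c["fn"]
        # Guard every division so classes with no samples score 0.0.
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (
            2 * precision * recall / (precision + recall)
            if precision + recall
            else 0.0
        )
        results[label] = {"precision": precision, "recall": recall, "f1": f1}
    return results
```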

Fixes

Updates NLTK data file for compatibility with nltk>=3.8.2. The NLTK data file now contains punkt_tab, making it possible to upgrade to nltk>=3.8.2. nltk==3.8.2 patches CVE-2024-39705.
Renames Astra to Astra DB. Conforms with DataStax internal naming conventions.
Accommodate single-column CSV files. Resolves a limitation of partition_csv() where delimiter detection would fail on a single-column CSV file (which naturally has no delimiters).
Accommodate image/jpg in PPTX as alias for image/jpeg. Resolves problem partitioning PPTX files having an invalid image/jpg (should be image/jpeg) MIME-type in the [Content_Types].xml member of the PPTX Zip archive.
Fixes an issue in Object Detection metrics. The issue was in preprocessing/validating the ground truth and predicted data for object detection metrics.
Removes dependency on unstructured.pytesseract. Unstructured forked pytesseract while waiting for code to be upstreamed. Now that the new version has been released, this fork can be removed.
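
The single-column CSV case above can be reproduced with the standard library: `csv.Sniffer` raises `csv.Error` when no candidate delimiter appears in the sample. The sketch below shows one possible fallback. The function name, candidate set, and default are assumptions, and partition_csv() may handle this differently.

```python
import csv


def detect_delimiter(
    sample: str, candidates: str = ",;\t|", default: str = ","
) -> str:
    """Guess the delimiter, falling back to `default` for single-column data."""
    try:
        return csv.Sniffer().sniff(sample, delimiters=candidates).delimiter
    except csv.Error:
        # No candidate delimiter found: treat the file as a single column.
        return default
```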

05 Aug 17:36

christinestraub

0.15.1

7e88744

0.15.1

Enhancements

Improve pdfminer embedded image extraction to exclude text elements and produce more accurate bounding boxes. This results in cleaner, more precise element extraction in pdf partitioning.

Features

Update partition_eml and partition_msg to capture cc, bcc, and message_id fields. Cc, bcc, and message_id information is captured in element metadata for both msg and email partitioning, and Recipient elements are generated for cc and bcc when include_headers=True for email partitioning.
Mark ingest as deprecated. Begin sunset of ingest code in this repo as it's been moved to a dedicated repo.
Add pdf_hi_res_max_pages argument for partitioning, which allows rejecting PDF files that exceed this page limit when the hi_res strategy is chosen. By default, PDF files with any number of pages are parsed.
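
The behaviour of pdf_hi_res_max_pages can be pictured as a guard that runs before the expensive hi_res pass. The helper and exception below are hypothetical and only mirror the rule stated above: the limit applies to the hi_res strategy only, and no limit is enforced by default.

```python
from typing import Optional


class PageLimitExceeded(ValueError):
    """Hypothetical error type for this sketch."""


def check_page_limit(
    page_count: int,
    strategy: str,
    pdf_hi_res_max_pages: Optional[int] = None,
) -> None:
    """Raise if a hi_res request exceeds the configured page limit."""
    if strategy != "hi_res" or pdf_hi_res_max_pages is None:
        return  # limit only applies to hi_res, and is off by default
    if page_count > pdf_hi_res_max_pages:
        raise PageLimitExceeded(
            f"PDF has {page_count} pages; limit is {pdf_hi_res_max_pages}"
        )
```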

Fixes

Update HuggingFaceEmbeddingEncoder to use HuggingFaceEmbeddings from langchain_huggingface package instead of the deprecated version from langchain-community. This resolves the deprecation warning and ensures compatibility with future versions of langchain.
Update OpenAIEmbeddingEncoder to use OpenAIEmbeddings from langchain-openai package instead of the deprecated version from langchain-community. This resolves the deprecation warning and ensures compatibility with future versions of langchain.
Update import of Pinecone exception. Adds compatibility for pinecone-client>=5.0.0.
File-type detection catches non-existent file-path. detect_filetype() no longer silently falls back to detecting a file-type based on the extension when no file exists at the path provided. Instead FileNotFoundError is raised. This provides consistent user notification of a mis-typed path rather than an unpredictable exception from a file-type specific partitioner when the file cannot be opened.
EML files specified as a file-path are detected correctly. Resolved a bug where an EML file submitted to partition() as a file-path was identified as TXT and partitioned using partition_text(). EML files specified by path are now identified and processed correctly, including processing any attachments.
A DOCX, PPTX, or XLSX file specified by path and ambiguously identified as MIME-type "application/octet-stream" is identified correctly. Resolves a shortcoming where a file specified by path immediately fell back to filename-extension based identification when misidentified as "application/octet-stream", either by asserted content type or a mis-guess by libmagic. An MS Office file misidentified in this way is now correctly identified regardless of its filename and whether it is specified by path or file-like object.
Textual content retrieved from a URL with gzip transport compression now partitions correctly. Resolves a bug where a textual file-type (such as Markdown) retrieved by passing a URL to partition() would raise when gzip compression was used for transport by the server.
A DOCX, PPTX, or XLSX content-type asserted on partition is confirmed or fixed. Resolves a bug where calling partition() with a swapped MS-Office content_type would cause the file-type to be misidentified. A DOCX, PPTX, or XLSX MIME-type received by partition() is now checked for accuracy and corrected if the file is for a different MS-Office 2007+ type.
DOC, PPT, XLS, and MSG files are now auto-detected correctly. Resolves a bug where DOC, PPT, and XLS files were auto-detected as MSG files under certain circumstances.
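
The MS-Office 2007+ fixes above rely on inspecting the file rather than trusting an asserted or guessed MIME-type. A minimal sketch of that idea checks which conventional main part the Zip archive contains. The function name is hypothetical, and the part paths are the conventional defaults rather than a guarantee of the format, so this is an approximation of the approach and not the library's detector.

```python
import io
import zipfile
from typing import Optional

DOCX = "application/vnd.openxmlformats-officedocument.wordprocessingml.document"
PPTX = "application/vnd.openxmlformats-officedocument.presentationml.presentation"
XLSX = "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet"

# Conventional location of the main part for each format (assumption).
MAIN_PARTS = {
    "word/document.xml": DOCX,
    "ppt/presentation.xml": PPTX,
    "xl/workbook.xml": XLSX,
}


def confirm_office_mime(data: bytes, asserted: Optional[str] = None) -> Optional[str]:
    """Return the MIME-type implied by the archive, else the asserted one."""
    try:
        with zipfile.ZipFile(io.BytesIO(data)) as archive:
            names = set(archive.namelist())
    except zipfile.BadZipFile:
        return asserted  # not a Zip archive, nothing to confirm
    for part, mime in MAIN_PARTS.items():
        if part in names:
            return mime  # overrides a swapped or generic asserted type
    return asserted
```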

19 Jul 19:21

christinestraub

0.15.0

ec59abf

0.15.0

Enhancements

Improve text clearing process in email partitioning. Updated the email partitioner to remove both =\n and =\r\n characters during the clearing process. Previously, only =\n characters were removed.
Bump unstructured.paddleocr to 2.8.0.1.
Refine HTML parser to accommodate block element nested in phrasing. HTML parser no longer raises on a block element (e.g. <p>, <div>) nested inside a phrasing element (e.g. <strong> or <cite>). Instead it breaks the phrasing run (and therefore element) at the block-item start and begins a new phrasing run after the block-item. This is consistent with how the browser determines element boundaries in this situation.
Install rewritten HTML parser to fix 12 existing bugs and provide headroom for refinement and growth. A rewritten HTML parser resolves a collection of outstanding bugs with HTML partitioning and provides a firm foundation for further elaborating that important partitioner.
CI check for dependency licenses. Adds a CI check to ensure dependencies are appropriately licensed.
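
The email text-clearing change above concerns quoted-printable soft line breaks, where a trailing "=" marks a line that continues on the next one. A minimal sketch of removing both line-ending variants, with a hypothetical function name:

```python
import re

# "=" followed by LF or CRLF is a quoted-printable soft line break.
SOFT_BREAK = re.compile(r"=\r?\n")


def remove_soft_line_breaks(text: str) -> str:
    """Join lines that were wrapped with "=\\n" or "=\\r\\n"."""
    return SOFT_BREAK.sub("", text)
```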

Features

Add support for specifying OCR language to partition_pdf(). Extend language specification capability to PaddleOCR in addition to TesseractOCR. Users can now specify OCR languages for both OCR engines when using partition_pdf().
Add AstraDB source connector. Adds support for ingesting documents from AstraDB.

Fixes

Remedy error on Windows when nltk binaries are downloaded. Work around a quirk in the Windows implementation of tempfile.NamedTemporaryFile where accessing the temporary file by name raises PermissionError.
Move Astra embedded_dimension to write config
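
The Windows quirk mentioned in the nltk fix is that a file created by tempfile.NamedTemporaryFile generally cannot be reopened by name while the original handle is still open. A common workaround, sketched below with a hypothetical function name, is to create the file with delete=False, close it, reopen it by path, and remove it manually. The release may use a different approach.

```python
import os
import tempfile


def write_then_read_by_name(payload: bytes) -> bytes:
    """Write to a temp file, then reopen it by path in a portable way."""
    tmp = tempfile.NamedTemporaryFile(delete=False)
    try:
        tmp.write(payload)
        tmp.close()  # close first so the path can be reopened on Windows
        with open(tmp.name, "rb") as reopened:
            return reopened.read()
    finally:
        tmp.close()  # harmless if already closed
        os.unlink(tmp.name)  # delete=False means cleanup is manual
```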

09 Jul 11:33

MthwRobinson

0.14.10

7b25dfc

0.14.10

Enhancements

Update unstructured-client dependency. Changes the unstructured-client dependency pin back to greater than the minimum version and updates tests that were failing given the update.
.doc files are now supported in the arm64 image. libreoffice24 is added to the arm64 image, meaning .doc files are now supported. We have follow-on work planned to investigate adding .ppt support for arm64 as well.
Add table detection metrics: recall, precision and f1
Remove unused _with_spans metrics

Features

Fixes

Fix counting false negatives and false positives in table structure evaluation
Fix Slack CI test. Change channel that Slack test is pointing to because previous test bot expired.
Remove NLTK download. Removes nltk.download in favor of downloading from an S3 bucket we host to mitigate CVE-2024-39705.
