Releases: Unstructured-IO/unstructured
0.15.8
Enhancements
- Bump unstructured.paddleocr to 2.8.1.0.
Features
- Add MixedbreadAI embedder. Adds MixedbreadAI embeddings to support embedding via Mixedbread AI.
Fixes
- Replace `pillow-heif` with `pi-heif`. Replaces `pillow-heif` with `pi-heif` due to more permissive licensing on the wheel for `pi-heif`.
- Minify text_as_html from DOCX. Previously `.metadata.text_as_html` for DOCX tables was "bloated" with whitespace and noise elements introduced by `tabulate` that produced over-chunking and lower "semantic density" of elements. Reduce HTML to minimum character count while preserving all text.
- Fall back to filename extension-based file-type detection for unidentified OLE files. Resolves a problem where a DOC file that could not be detected as such by `filetype` was incorrectly identified as a MSG file.
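The extension-based fallback for OLE files described above can be pictured with a minimal sketch. It is not the library's implementation: `guess_ole_type` and `OLE_EXTENSIONS` are hypothetical names introduced for illustration. The only fact it relies on is the 8-byte signature that opens every OLE2 (Compound File Binary) container.

```python
from pathlib import Path
from typing import Optional

# Signature that opens every OLE2 / Compound File Binary container.
# Legacy DOC, PPT, XLS, and MSG files all share it, which is why
# content sniffing alone cannot always tell them apart.
OLE_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")

# Hypothetical extension table, used only for this illustration.
OLE_EXTENSIONS = {".doc": "DOC", ".ppt": "PPT", ".xls": "XLS", ".msg": "MSG"}


def guess_ole_type(filename: str, head: bytes) -> Optional[str]:
    """Return a file-type label for an OLE container, or None if not OLE."""
    if not head.startswith(OLE_MAGIC):
        return None
    # The container flavor is unknown at this point, so trust the filename
    # extension rather than assuming MSG.
    return OLE_EXTENSIONS.get(Path(filename).suffix.lower(), "UNKNOWN")
```

The point of the sketch is the ordering: the extension is consulted only after the content has been confirmed to be an OLE container.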
0.15.7
Enhancements
Features
Fixes
- Fix NLTK data download path to prevent nested directories. Resolved an issue where a nested "nltk_data" directory was created within the parent "nltk_data" directory when it already existed. This fix prevents errors in checking for existing downloads and loading models from NLTK data.
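One way to avoid the nested-directory problem is to append `nltk_data` only when the path does not already end with it. The sketch below assumes that approach; `resolve_nltk_data_dir` is a hypothetical helper and not a function from the library.

```python
from pathlib import Path


def resolve_nltk_data_dir(base: str) -> Path:
    """Return the directory NLTK data should live in, without nesting."""
    path = Path(base)
    # If the caller already points at ".../nltk_data", use it as-is
    # instead of creating ".../nltk_data/nltk_data".
    if path.name == "nltk_data":
        return path
    return path / "nltk_data"
```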
0.15.6
Enhancements
Features
Fixes
- Bump to NLTK 3.9.x. Bumps to the latest `nltk` version to resolve CVE.
- Update CI for `ingest-test-fixture-update-pr` to resolve NLTK model download errors.
- Synchronized text and html on `TableChunk` splits. When a `Table` element is divided during chunking to fit the chunking window, `TableChunk.text` corresponds exactly with the table text in `TableChunk.metadata.text_as_html`, `.text_as_html` is always parseable HTML, and the table is split on even row boundaries whenever possible.
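The row-boundary behavior described for `TableChunk` can be illustrated with a small standalone sketch. It is an assumption-laden toy, not the chunker's code: `split_table` and its helpers are hypothetical, a table is modeled as a list of rows of cell strings, and an oversized single row is emitted whole rather than split mid-row.

```python
from html import escape
from typing import Iterator, List, Tuple

Row = List[str]


def _row_html(row: Row) -> str:
    return "<tr>" + "".join(f"<td>{escape(cell)}</td>" for cell in row) + "</tr>"


def _emit(batch: List[Row]) -> Tuple[str, str]:
    # Text and HTML are both derived from the same rows, so they stay in sync.
    text = " ".join(" ".join(row) for row in batch)
    html = "<table>" + "".join(_row_html(row) for row in batch) + "</table>"
    return text, html


def split_table(rows: List[Row], max_chars: int) -> Iterator[Tuple[str, str]]:
    """Yield (text, html) chunks, breaking only between rows."""
    batch: List[Row] = []
    size = 0
    for row in rows:
        row_len = len(" ".join(row))
        added = row_len + (1 if batch else 0)  # +1 for the joining space
        if batch and size + added > max_chars:
            yield _emit(batch)
            batch, size, added = [], 0, row_len
        batch.append(row)
        size += added
    if batch:
        yield _emit(batch)
```

Because each chunk wraps complete `<tr>` elements in its own `<table>`, every HTML fragment parses on its own.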
0.15.5
Enhancements
Features
Fixes
- Revert to using `unstructured.pytesseract` fork. Due to the unavailability of some recent release versions of `pytesseract` on PyPI, the project now uses the `unstructured.pytesseract` fork to ensure stability and continued support.
- Bump `libreoffice` version in image. Bumps the `libreoffice` version to `24.2.5.2` to address CVEs.
- Downgrade NLTK dependency version for compatibility. Due to the unavailability of `nltk==3.8.2` on PyPI, the NLTK dependency has been downgraded to `<3.8.2`. This change ensures continued functionality and compatibility.
0.15.4
Enhancements
Features
Fixes
- Resolve an installation error with `pytesseract>=0.3.12` that occurred during `pip install unstructured[pdf]==0.15.3`.
0.15.3
Enhancements
Features
Fixes
- Remove the custom index URL from `extra-paddleocr.in` to resolve the error in the `setup.py` configuration.
0.15.2
Enhancements
- Improve directory handling when extracting image blocks. The `figures` directory is no longer created when the `extract_image_block_to_payload` parameter is set to `True`.
Features
- Added per-class Object Detection metrics in the evaluation. The metrics include average precision, precision, recall, and f1-score for each class in the dataset.
Fixes
- Updates NLTK data file for compatibility with `nltk>=3.8.2`. The NLTK data file now contains `punkt_tab`, making it possible to upgrade to `nltk>=3.8.2`. The `nltk==3.8.2` release patches CVE-2024-39705.
- Renames Astra to Astra DB. Conforms with DataStax internal naming conventions.
- Accommodate single-column CSV files. Resolves a limitation of `partition_csv()` where delimiter detection would fail on a single-column CSV file (which naturally has no delimiters).
- Accommodate `image/jpg` in PPTX as alias for `image/jpeg`. Resolves a problem partitioning PPTX files having an invalid `image/jpg` (should be `image/jpeg`) MIME-type in the `[Content_Types].xml` member of the PPTX Zip archive.
- Fixes an issue in Object Detection metrics. The issue was in preprocessing/validating the ground truth and predicted data for object detection metrics.
- Removes dependency on unstructured.pytesseract. Unstructured forked pytesseract while waiting for code to be upstreamed. Now that the new version has been released, this fork can be removed.
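The single-column CSV case can be reproduced with the standard library alone. The sketch below uses `csv.Sniffer`, which raises `csv.Error` when none of the candidate delimiters appears in the sample. Whether `partition_csv()` detects delimiters this way is an assumption; `read_csv_rows` and the comma fallback are illustrative choices only.

```python
import csv
import io
from typing import List


def read_csv_rows(text: str) -> List[List[str]]:
    """Parse CSV text, tolerating files that contain a single column."""
    try:
        # Restrict the candidates so the sniffer cannot pick a letter or digit.
        delimiter = csv.Sniffer().sniff(text, delimiters=",;\t|").delimiter
    except csv.Error:
        # No candidate delimiter found: treat the file as single-column.
        # Any delimiter works here because none occurs in the data.
        delimiter = ","
    return list(csv.reader(io.StringIO(text), delimiter=delimiter))
```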
0.15.1
Enhancements
- Improve `pdfminer` embedded `image` extraction to exclude text elements and produce more accurate bounding boxes. This results in cleaner, more precise element extraction in `pdf` partitioning.
Features
- Update partition_eml and partition_msg to capture cc, bcc, and message_id fields. Cc, bcc, and message_id information is captured in element metadata for both msg and email partitioning, and `Recipient` elements are generated for cc and bcc when `include_headers=True` for email partitioning.
- Mark ingest as deprecated. Begin sunset of ingest code in this repo as it's been moved to a dedicated repo.
- Add `pdf_hi_res_max_pages` argument for partitioning, which allows rejecting PDF files that exceed this page number limit when the `hi_res` strategy is chosen. By default, it will allow parsing PDF files with an unlimited number of pages.
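To see which header values the cc, bcc, and message_id feature refers to, the sketch below pulls them from a raw message using only the standard `email` package. It does not call the partitioners. `header_metadata` and its dictionary keys are hypothetical names, not the library's metadata fields.

```python
from email import message_from_string
from email.utils import getaddresses
from typing import Dict, List, Optional, Union


def header_metadata(raw: str) -> Dict[str, Union[List[str], Optional[str]]]:
    """Extract Cc, Bcc, and Message-ID values from a raw RFC 5322 message."""
    msg = message_from_string(raw)

    def addresses(name: str) -> List[str]:
        # get_all() returns every occurrence of the header, or the default.
        return [addr for _, addr in getaddresses(msg.get_all(name, []))]

    message_id = (msg.get("Message-ID") or "").strip().strip("<>") or None
    return {"cc": addresses("Cc"), "bcc": addresses("Bcc"), "message_id": message_id}
```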
Fixes
- Update `HuggingFaceEmbeddingEncoder` to use `HuggingFaceEmbeddings` from the `langchain_huggingface` package instead of the deprecated version from `langchain-community`. This resolves the deprecation warning and ensures compatibility with future versions of langchain.
- Update `OpenAIEmbeddingEncoder` to use `OpenAIEmbeddings` from the `langchain-openai` package instead of the deprecated version from `langchain-community`. This resolves the deprecation warning and ensures compatibility with future versions of langchain.
- Update import of Pinecone exception. Adds compatibility for `pinecone-client>=5.0.0`.
- File-type detection catches non-existent file-path. `detect_filetype()` no longer silently falls back to detecting a file-type based on the extension when no file exists at the path provided. Instead, `FileNotFoundError` is raised. This provides consistent user notification of a mis-typed path rather than an unpredictable exception from a file-type specific partitioner when the file cannot be opened.
- EML files specified as a file-path are detected correctly. Resolved a bug where an EML file submitted to `partition()` as a file-path was identified as TXT and partitioned using `partition_text()`. EML files specified by path are now identified and processed correctly, including processing any attachments.
- A DOCX, PPTX, or XLSX file specified by path and ambiguously identified as MIME-type "application/octet-stream" is identified correctly. Resolves a shortcoming where a file specified by path immediately fell back to filename-extension based identification when misidentified as "application/octet-stream", either by asserted content type or a mis-guess by libmagic. An MS Office file misidentified in this way is now correctly identified regardless of its filename and whether it is specified by path or file-like object.
- Textual content retrieved from a URL with gzip transport compression now partitions correctly. Resolves a bug where a textual file-type (such as Markdown) retrieved by passing a URL to `partition()` would raise when `gzip` compression was used for transport by the server.
- A DOCX, PPTX, or XLSX content-type asserted on partition is confirmed or fixed. Resolves a bug where calling `partition()` with a swapped MS-Office `content_type` would cause the file-type to be misidentified. A DOCX, PPTX, or XLSX MIME-type received by `partition()` is now checked for accuracy and corrected if the file is for a different MS-Office 2007+ type.
- DOC, PPT, XLS, and MSG files are now auto-detected correctly. Resolves a bug where DOC, PPT, and XLS files were auto-detected as MSG files under certain circumstances.
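Confirming or correcting an asserted DOCX, PPTX, or XLSX content type comes down to looking inside the Zip archive rather than trusting the label. The sketch below is a simplified illustration under one stated assumption: that the main document part sits at its conventional location (`word/document.xml`, `ppt/presentation.xml`, or `xl/workbook.xml`). The format does not guarantee those paths, and `sniff_ooxml_type` is a hypothetical helper, not the library's detector.

```python
import io
import zipfile
from typing import Optional

# Conventional (not guaranteed) locations of the main part in each format.
MAIN_PARTS = {
    "word/document.xml": "DOCX",
    "ppt/presentation.xml": "PPTX",
    "xl/workbook.xml": "XLSX",
}


def sniff_ooxml_type(data: bytes) -> Optional[str]:
    """Return DOCX, PPTX, or XLSX based on archive members, else None."""
    if not zipfile.is_zipfile(io.BytesIO(data)):
        return None
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        names = set(archive.namelist())
    for part, label in MAIN_PARTS.items():
        if part in names:
            return label
    return None
```

A caller could compare the returned label against the asserted content type and prefer the sniffed value when the two disagree.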
0.15.0
Enhancements
- Improve text clearing process in email partitioning. Updated the email partitioner to remove both `=\n` and `=\r\n` characters during the clearing process. Previously, only `=\n` characters were removed.
- Bump unstructured.paddleocr to 2.8.0.1.
- Refine HTML parser to accommodate block element nested in phrasing. HTML parser no longer raises on a block element (e.g. `<p>`, `<div>`) nested inside a phrasing element (e.g. `<strong>` or `<cite>`). Instead it breaks the phrasing run (and therefore element) at the block-item start and begins a new phrasing run after the block-item. This is consistent with how the browser determines element boundaries in this situation.
- Install rewritten HTML parser to fix 12 existing bugs and provide headroom for refinement and growth. A rewritten HTML parser resolves a collection of outstanding bugs with HTML partitioning and provides a firm foundation for further elaborating that important partitioner.
- CI check for dependency licenses. Adds a CI check to ensure dependencies are appropriately licensed.
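The email text-clearing change in this release concerns quoted-printable soft line breaks: an `=` immediately followed by a line ending, which may be `\n` or `\r\n`. A minimal sketch of handling both forms follows. `remove_soft_line_breaks` is a hypothetical name, and the regular expression is an assumption about one reasonable way to do it.

```python
import re

# "=" followed by LF or CRLF marks a soft line break in quoted-printable text.
_SOFT_BREAK = re.compile(r"=\r?\n")


def remove_soft_line_breaks(text: str) -> str:
    """Join lines that were wrapped with a trailing '='."""
    return _SOFT_BREAK.sub("", text)
```

An `=` that is not directly followed by a line ending is left untouched, so encoded bytes such as `=20` survive for later decoding.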
Features
- Add support for specifying OCR language to `partition_pdf()`. Extend language specification capability to `PaddleOCR` in addition to `TesseractOCR`. Users can now specify OCR languages for both OCR engines when using `partition_pdf()`.
- Add AstraDB source connector. Adds support for ingesting documents from AstraDB.
Fixes
- Remedy error on Windows when `nltk` binaries are downloaded. Work around a quirk in the Windows implementation of `tempfile.NamedTemporaryFile` where accessing the temporary file by name raises `PermissionError`.
- Move Astra embedded_dimension to write config
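On Windows, a file created by `tempfile.NamedTemporaryFile` cannot normally be reopened by name while the original handle is still open. A common workaround, shown here as a sketch and not necessarily the one adopted in the library, is to create the file with `delete=False`, close it, reopen it by name, and remove it manually. `write_then_reopen` is a hypothetical helper.

```python
import os
import tempfile


def write_then_reopen(payload: bytes) -> bytes:
    """Write to a temp file, then read it back by name on any platform."""
    tmp = tempfile.NamedTemporaryFile(delete=False, suffix=".bin")
    try:
        tmp.write(payload)
        tmp.close()  # release the handle before reopening by name
        with open(tmp.name, "rb") as reopened:
            return reopened.read()
    finally:
        tmp.close()  # harmless if already closed
        os.unlink(tmp.name)  # delete=False means cleanup is manual
```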
0.14.10
Enhancements
- Update unstructured-client dependency. Change unstructured-client dependency pin back to greater than min version and updated tests that were failing given the update.
- `.doc` files are now supported in the `arm64` image. `libreoffice24` is added to the `arm64` image, meaning `.doc` files are now supported. We have follow on work planned to investigate adding `.ppt` support for `arm64` as well.
- Add table detection metrics: recall, precision and f1
- Remove unused _with_spans metrics
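For reference, the recall, precision, and f1 metrics named above follow the standard definitions over true positives, false positives, and false negatives. The sketch below shows those formulas only. How the evaluation code matches predicted tables to ground truth is not covered, and `detection_scores` is a hypothetical helper.

```python
from typing import Tuple


def detection_scores(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Return (precision, recall, f1) from raw detection counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # f1 is the harmonic mean of precision and recall.
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1
```

Returning 0.0 when a denominator is zero is a convention chosen for this sketch; some evaluators report such cases as undefined.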
Features
Fixes
- Fix counting false negatives and false positives in table structure evaluation
- Fix Slack CI test. Change channel that Slack test is pointing to because previous test bot expired.
- Remove NLTK download. Removes `nltk.download` in favor of downloading from an S3 bucket we host to mitigate CVE-2024-39705.