Releases: Unstructured-IO/unstructured

0.8.4

26 Jul 18:09
1e2d531

0.8.4

Enhancements

  • Additional tests and refactor of JSON detection.
  • Update functionality to retrieve image metadata from a page for document_to_element_list
  • Links are now tracked in partition_html output.
  • Set the file's current position to the beginning after reading the file in convert_to_bytes
  • Add min_partition kwarg that combines elements below a specified threshold, and modify splitting of strings longer than max_partition so words are not split.
  • Add slide notes to pptx
  • Add --encoding directive to ingest
  • Improve json detection by detect_filetype
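
The min_partition and max_partition behaviour described above can be pictured with a minimal sketch. This is not the library's implementation; the helper names split_on_words and combine_small are hypothetical, and the sketch only assumes that lengths are measured in characters.

```python
from typing import List


def split_on_words(text: str, max_partition: int) -> List[str]:
    """Split text into chunks of at most max_partition characters.

    Words are never cut; a single word longer than max_partition
    is kept whole in its own chunk.
    """
    chunks: List[str] = []
    current = ""
    for word in text.split():
        candidate = f"{current} {word}" if current else word
        if len(candidate) <= max_partition or not current:
            # The word fits, or it is an over-long word starting a new chunk.
            current = candidate
        else:
            chunks.append(current)
            current = word
    if current:
        chunks.append(current)
    return chunks


def combine_small(pieces: List[str], min_partition: int) -> List[str]:
    """Merge pieces shorter than min_partition into the pieces that follow."""
    combined: List[str] = []
    buffer = ""
    for piece in pieces:
        buffer = f"{buffer} {piece}" if buffer else piece
        if len(buffer) >= min_partition:
            combined.append(buffer)
            buffer = ""
    if buffer:
        # A trailing short piece has nothing left to merge with.
        combined.append(buffer)
    return combined
```

Splitting on whitespace keeps the sketch short; a real partitioner would likely also preserve the original spacing and sentence boundaries.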

Features

  • Adds Outlook connector
  • Add support for dpi parameter in inference library
  • Adds Onedrive connector.
  • Add Confluence connector for ingest cli to pull the body text from all documents from all spaces in a confluence domain.

Fixes

  • Fixes issue with email partitioning where From field was being assigned the To field value.
  • Use the image_metadata property of the PageLayout instance to get the page image info in the document_to_element_list
  • Add functionality to write images to computer storage temporarily instead of keeping them in memory for ocr_only strategy
  • Add functionality to convert a PDF in small chunks of pages at a time for ocr_only strategy
  • Adds .txt, .text, and .tab to list of extensions to check if file
    has a text/plain MIME type.
  • Enables filters to be passed to partition_doc so it doesn't error with LibreOffice 7.
  • Removed old error message that's superseded by requires_dependencies.
  • Removes using hi_res as the default strategy value for partition_via_api and partition_multiple_via_api

0.8.1

11 Jul 14:35

0.8.1

Enhancements

  • Add support for Python 3.11

Features

Fixes

  • Fixed auto strategy detecting a scanned document as having extractable text and using the fast strategy, resulting in no output.
  • Fix list detection in MS Word documents.
  • Don't instantiate an element with a coordinate system when there isn't a way to get its location data.

0.8.0

07 Jul 15:41
5e11501

Enhancements

  • Allow the model used for the hi_res PDF partition strategy to be chosen when called.
  • Updated inference package

Features

  • Add metadata_filename parameter across all partition functions

Fixes

  • Adjust encoding recognition threshold value in detect_file_encoding

  • Fix KeyError when isd_to_elements doesn't find a type

  • Fix _output_filename for local connector, allowing single files to be written correctly to the disk

  • Fix for cases where an invalid encoding is extracted from an email header.

BREAKING CHANGES

  • Information about an element's location is no longer returned as top-level attributes of an element. Instead, it is returned in the coordinates attribute of the element's metadata.
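
For code that has to run against versions on both sides of this change, a small accessor is one way to absorb it. The sketch below uses stand-in objects rather than real library elements. It assumes the old top-level attribute was also named coordinates, which these notes do not state, and the helper name get_coordinates is hypothetical.

```python
from types import SimpleNamespace
from typing import Any, Optional


def get_coordinates(element: Any) -> Optional[Any]:
    """Return location data from element.metadata.coordinates when present,
    otherwise fall back to a top-level attribute."""
    metadata = getattr(element, "metadata", None)
    coords = getattr(metadata, "coordinates", None)
    if coords is not None:
        return coords
    # Assumption: older releases exposed a top-level `coordinates` attribute.
    return getattr(element, "coordinates", None)


# Stand-in objects for illustration only (not real library elements).
new_style = SimpleNamespace(metadata=SimpleNamespace(coordinates=((0, 0), (0, 10))))
old_style = SimpleNamespace(metadata=SimpleNamespace(), coordinates=((1, 1), (1, 5)))
no_location = SimpleNamespace(metadata=SimpleNamespace())
```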

0.7.12

01 Jul 02:32
6249e15

0.7.12

Enhancements

  • Adds include_metadata kwarg to partition_doc, partition_docx, partition_email, partition_epub, partition_json, partition_msg, partition_odt, partition_org, partition_pdf, partition_ppt, partition_pptx, partition_rst, and partition_rtf

Features

  • Adds Dropbox connector

Fixes

  • Fix tests that call unstructured-api by passing through an api-key
  • Fixed page breaks being given (incorrect) page numbers
  • Fix skipping download on ingest when a source document exists locally

0.7.11

30 Jun 01:42
350bb1d

0.7.11

Enhancements

  • More deterministic element ordering when using hi_res PDF parsing strategy (from unstructured-inference bump to 0.5.4)
  • Make large model available (from unstructured-inference bump to 0.5.3)
  • Combine inferred elements with extracted elements (from unstructured-inference bump to 0.5.2)
  • partition_email and partition_msg will now process attachments if process_attachments=True
    and an attachment partitioning function is passed through with attachment_partitioner=partition.

Features

Fixes

  • Fix tests that call unstructured-api by passing through an api-key
  • Fixed page breaks being given (incorrect) page numbers
  • Fix skipping download on ingest when a source document exists locally

0.7.10

28 Jun 19:27
44411ec

0.7.10

Enhancements

  • Adds a max_partition parameter to partition_text, partition_pdf, partition_email,
    partition_msg and partition_xml that sets a limit for the size of individual
    document elements. Defaults to 1500 for everything except partition_xml, which has
    a default value of None.
  • DRY connector refactor

Features

  • hi_res model for pdfs and images is selectable via environment variable.

Fixes

  • CSV check now ignores escaped commas.
  • Fix for filetype exploration util when file content does not have a comma.
  • Adds negative lookahead to bullet pattern to avoid detecting plain text line
    breaks like ------- as list items.
  • Fix pre tag parsing for partition_html
  • Fix lookup error for annotated Arabic and Hebrew encodings
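
The negative-lookahead fix for bullets can be illustrated with a simplified pattern. The regex and the helper looks_like_bullet below are hypothetical and are not the pattern the library ships. They only show how a lookahead stops a run of dashes from being read as a list item.

```python
import re

# A bullet character that is NOT immediately followed by another bullet
# character, then optional whitespace and some item text.
BULLET_RE = re.compile(r"^\s*[-*•](?![-*•])\s*\S")


def looks_like_bullet(line: str) -> bool:
    """Return True when the line starts with a single bullet marker."""
    return BULLET_RE.match(line) is not None
```

Without the `(?![-*•])` lookahead, the first dash of `-------` would be accepted as a bullet and the remaining dashes as item text.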

0.7.9

26 Jun 21:54
95f02f2

0.7.9

Enhancements

  • Improvements to string check for leaves in partition_xml.
  • Adds --partition-ocr-languages to unstructured-ingest.

Features

  • Adds partition_org for processing Org Mode documents.

Fixes

0.7.8

23 Jun 02:23
5f5da65

0.7.8

Enhancements

Features

  • Adds Google Cloud Service connector

Fixes

  • Updates the parse_email for partition_eml so that unstructured-api passes the smoke tests
  • partition_email now works if there is no message content
  • Updates the "fast" strategy for partition_pdf so that it's able to recursively
  • Adds recursive functionality to all fsspec connectors
  • Adds generic --recursive ingest flag

0.7.7

20 Jun 19:13
c53ce11

0.7.7

Enhancements

  • Adds functionality to replace the MIME encodings for eml files with one of the common encodings if a unicode error occurs
  • Adds missed file-like object handling in detect_file_encoding
  • Adds functionality to extract charset info from eml files
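
The charset handling described above can be sketched with the standard library alone. This is an illustration under assumptions, not the library's code: the function name decode_eml_body and the COMMON_ENCODINGS list are hypothetical, and only single-part messages are handled.

```python
from email import message_from_bytes

# Hypothetical fallback list; latin-1 accepts any byte sequence.
COMMON_ENCODINGS = ["utf-8", "latin-1"]


def decode_eml_body(raw: bytes) -> str:
    """Decode a single-part message body, trying the declared charset first."""
    msg = message_from_bytes(raw)
    payload = msg.get_payload(decode=True) or b""
    declared = msg.get_content_charset()  # charset from the Content-Type header
    candidates = ([declared] if declared else []) + COMMON_ENCODINGS
    for encoding in candidates:
        try:
            return payload.decode(encoding)
        except (UnicodeDecodeError, LookupError):
            # Wrong or unknown charset: move on to the next candidate.
            continue
    return payload.decode("utf-8", errors="replace")
```

Trying the declared charset first respects a correct header, while the fallback list covers messages whose header does not match their bytes.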

Features

  • Added coordinate system class to track coordinate types and convert to different coordinate

Fixes

  • Adds an html_assemble_articles kwarg to partition_html to enable users to
    control whether content outside of <article> tags is captured when
    <article> tags are present.
  • Check for the xml attribute on element before looking for pagebreaks in partition_docx.

0.7.6

16 Jun 15:09
a611532

0.7.6

Enhancements

  • Convert fast strategy to ocr_only for images
  • Adds support for page numbers in .docx and .doc when user- or renderer-created
    page breaks are present.
  • Adds retry logic for the unstructured-ingest Biomed connector

Features

  • Provides users with the ability to extract additional metadata via regex.
  • Updates partition_docx to include headers and footers in the output.
  • Create partition_tsv and associated tests. Make additional changes to detect_filetype.
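
As a rough picture of regex-based metadata extraction, the sketch below collects matches and their character offsets for each named pattern. The function name extract_regex_metadata and the shape of the returned dictionary are assumptions made for illustration, not the library's interface.

```python
import re
from typing import Dict, List


def extract_regex_metadata(text: str, patterns: Dict[str, str]) -> Dict[str, List[dict]]:
    """For each named pattern, return the matched text with start/end offsets."""
    results: Dict[str, List[dict]] = {}
    for name, pattern in patterns.items():
        results[name] = [
            {"text": match.group(), "start": match.start(), "end": match.end()}
            for match in re.finditer(pattern, text)
        ]
    return results
```

For example, calling it on a transcript line with a pattern such as `SPEAKER \d` yields one entry per speaker label, each with its position in the text.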

Fixes

  • Remove fake api key in test partition_via_api since we now require valid/empty api keys
  • Page number defaults to None instead of 1 when page number is not present in the metadata.
    A page number of None indicates that page numbers are not being tracked for the document
    or that page numbers do not apply to the element in question.
  • Fixes an issue with some pptx files. Shapes are now assumed to be at the top-left position
    of the slide when the shape.top and shape.left attributes are None.