Releases · Unstructured-IO/unstructured · GitHub

14 Jun 06:06

cragwolfe

0.7.5

Enhancements

Adds functionality to sort elements in partition_pdf for fast strategy
Adds ingest tests with --fast strategy on PDF documents
Adds --api-key to unstructured-ingest

Features

Adds partition_rst for processing reStructuredText documents.

Fixes

Adds handling for emails that do not have a datetime to extract.
Adds pdf2image package as core requirement of unstructured (with no extras)

12 Jun 18:41

yuming-long

0.7.4

Enhancements

Allows passing kwargs to request data field for partition_via_api and partition_multiple_via_api
Enable MIME type detection if libmagic is not available
Adds handling for empty files in detect_filetype and partition.
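
The libmagic fallback noted above can be pictured with a minimal sketch. It assumes the fallback is a lookup by file extension, which the release note does not confirm. The helper names `guess_mime_from_extension` and `guess_mime` are hypothetical and are not part of unstructured; only the standard-library `mimetypes` module and python-magic's `magic.from_file` are real calls.

```python
import mimetypes
from typing import Optional


def guess_mime_from_extension(filename: str) -> Optional[str]:
    """Guess a MIME type from the file extension alone (no libmagic needed)."""
    mime_type, _encoding = mimetypes.guess_type(filename)
    return mime_type


def guess_mime(filename: str) -> Optional[str]:
    """Prefer content-based detection and fall back to the extension.

    Hypothetical helper for illustration only, not the library's code.
    """
    try:
        # python-magic wraps libmagic; the import fails when it is not installed.
        import magic
    except ImportError:
        return guess_mime_from_extension(filename)
    return magic.from_file(filename, mime=True)
```

Extension-based guesses are weaker than content sniffing (a mislabeled file is trusted as-is), so a fallback like this is best treated as a last resort.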

Features

Fixes

Resolve grpcio import issue on weaviate.schema.validate_schema for python 3.9 and 3.10
Remove building detectron2 from source in Dockerfile

09 Jun 18:16

yuming-long

0.7.3

Enhancements

Update IngestDoc abstractions and add data source metadata in ElementMetadata

Features

Fixes

Pass strategy parameter down from partition for partition_image
Fixes filetype detection when a CSV has a text/plain MIME type
convert_office_doc no longer prints file conversion info messages to stdout.
partition_via_api reflects the actual filetype for the file processed in the API.

07 Jun 17:22

MthwRobinson

0.7.2

Enhancements

Adds an optional encoding kwarg to elements_to_json and elements_from_json
Bump version of base image to use new stable version of tesseract

Features

Fixes

Update the read_txt_file utility function to keep using spooled_to_bytes_io_if_needed for xml
Add functionality to the read_txt_file utility function to handle file-like objects from a URL
Remove the unused parameter encoding from partition_pdf
Change auto.py to have a None default for encoding
Add functionality to try other common encodings for html and xml files if an error related to the encoding is raised and the user has not specified an encoding.
Adds benchmark test with test docs in example-docs
Re-enable test_upload_label_studio_data_with_sdk
File detection now detects code files as plain text
Adds tabulate explicitly to dependencies
Fixes an issue in metadata.page_number of pptx files
Shows help if no parameters are passed

01 Jun 20:52

MthwRobinson

0.7.1

Enhancements

Features

Add stage_for_weaviate to stage unstructured outputs for upload to Weaviate, along with a helper function for defining a class to use in Weaviate schemas.
Builds from the Unstructured base image, which is built on Rocky Linux 8.7; this resolves almost all CVEs in the image.

Fixes

31 May 20:13

MthwRobinson

0.7.0

Enhancements

Installing detectron2 from source is no longer required when using the local-inference extra.
Updates .pptx parsing to include text in tables.

Features

Fixes

Fixes an issue in _add_element_metadata that caused all elements to have page_number=1 in the element metadata.
Adds .log as a file extension for TXT files.
Adds functionality to try other common encodings for email (.eml) files if an error related to the encoding is raised and the user has not specified an encoding.
Allow the passed encoding to be used in replace_mime_encodings
Fixes page metadata for partition_html when include_metadata=False
A ValueError is now raised if file_filename is not specified when you use partition_via_api with a file-like object.

30 May 13:47

yuming-long

0.6.11

Enhancements

Supports epub tests since pandoc is updated in base image

Features

Fixes

26 May 08:57

cragwolfe

0.6.10

Enhancements

Adds XLS support to auto partition

Features

Fixes

24 May 22:31

qued

0.6.9

Enhancements

The fast strategy for PDFs now keeps element bounding box data
Refactors setup.py

Features

Fixes

Adds functionality to try other common encodings if an error related to the encoding is raised and the user has not specified an encoding.
Adds additional MIME types for CSV
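
The encoding fallback described in the fix above can be sketched roughly as follows. This is an illustration under stated assumptions, not the library's implementation: the function name `decode_with_fallback` and the candidate list `COMMON_ENCODINGS` are hypothetical, and the encodings unstructured actually tries may differ.

```python
from typing import Optional, Sequence, Tuple

# Assumed candidate list for illustration; the library's actual list may differ.
COMMON_ENCODINGS: Sequence[str] = ("utf-8", "cp1252", "latin-1")


def decode_with_fallback(
    data: bytes,
    encoding: Optional[str] = None,
    candidates: Sequence[str] = COMMON_ENCODINGS,
) -> Tuple[str, str]:
    """Return the decoded text and the encoding that produced it."""
    if encoding is not None:
        # The user chose an encoding: honor it and let any error propagate.
        return data.decode(encoding), encoding

    last_error: Optional[UnicodeDecodeError] = None
    for candidate in candidates:
        try:
            return data.decode(candidate), candidate
        except UnicodeDecodeError as error:
            last_error = error  # remember the failure and try the next one
    if last_error is None:
        raise ValueError("no candidate encodings were provided")
    raise last_error
```

The order of the candidates matters: a permissive single-byte encoding such as latin-1 accepts any byte sequence, so it belongs at the end of the list.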

19 May 19:58

MthwRobinson

0.6.8

Enhancements

Features

Add partition_csv for CSV files.

Fixes
