Releases · Unstructured-IO/unstructured · GitHub

03 Feb 22:15

MthwRobinson

0.4.6

Loosen the default cap threshold to 0.5.
Add a UNSTRUCTURED_NARRATIVE_TEXT_CAP_THRESHOLD environment variable for controlling
the cap ratio threshold.
Unknown text elements are identified as Text for HTML and plain text documents.
Body Text styles no longer default to NarrativeText for Word documents. The style information
is insufficient to determine that the text is narrative.
Uppercase text is lowercased before checking for verbs, which helps avoid some missed verbs.
Adds an Address element for capturing elements that only contain an address.
Suppress the UserWarning when detectron is called.
Checks that titles and narrative text have at least one English word.
Checks that titles and narrative text are at least 50% alpha characters.
Restricts titles to a maximum number of words. Adds a UNSTRUCTURED_TITLE_MAX_WORD_LENGTH
environment variable for controlling that limit.
Updated partition_pptx to order the elements on the page
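
The cap-ratio and alpha-character checks above can be sketched roughly as follows. This is a minimal illustration and not the library's implementation: the function names exceeds_cap_ratio_sketch and is_mostly_alpha, the whitespace tokenization, and the rule for what counts as a capitalized token are all assumptions. Only the environment variable name, the 0.5 default, and the 50% alpha cutoff come from the notes.

```python
import os

DEFAULT_CAP_THRESHOLD = 0.5  # default stated in the 0.4.6 notes


def cap_threshold() -> float:
    """Read the threshold from the environment, falling back to the default."""
    raw = os.environ.get("UNSTRUCTURED_NARRATIVE_TEXT_CAP_THRESHOLD")
    return float(raw) if raw is not None else DEFAULT_CAP_THRESHOLD


def exceeds_cap_ratio_sketch(text: str) -> bool:
    """Return True when the share of capitalized alphabetic tokens exceeds the threshold.

    Hypothetical helper: tokens are split on whitespace, and only purely
    alphabetic tokens are counted.
    """
    tokens = [tok for tok in text.split() if tok.isalpha()]
    if not tokens:  # empty or non-alphabetic text never exceeds the ratio
        return False
    capitalized = sum(1 for tok in tokens if tok[0].isupper())
    return capitalized / len(tokens) > cap_threshold()


def is_mostly_alpha(text: str, threshold: float = 0.5) -> bool:
    """Return True when at least `threshold` of the non-space characters are alphabetic."""
    chars = [ch for ch in text if not ch.isspace()]
    if not chars:
        return False
    return sum(ch.isalpha() for ch in chars) / len(chars) >= threshold
```

Under these assumptions, a heading in all caps exceeds the ratio and would not be treated as narrative text, while an ordinary sentence with a single leading capital does not.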

25 Jan 17:01

MthwRobinson

0.4.4

Updated partition_pdf and partition_image to return unstructured Element objects
Fixed the healthcheck url path when partitioning images and PDFs via API
Adds an optional coordinates attribute to document objects
Adds FigureCaption and CheckBox document elements
Added ability to split lists detected in LayoutElement objects
Adds partition_pptx for partitioning PowerPoint documents
LayoutParser models now download from Hugging Face Hub instead of Dropbox
Fixed file type detection for XML and HTML files on Amazon Linux

18 Jan 17:31

MthwRobinson

0.4.3

Adds requests as a base dependency
Fix in exceeds_cap_ratio so the function doesn't break with empty text
Fix bug in _parse_received_data.
Update detect_filetype to properly handle .doc, .xls, and .ppt.

17 Jan 16:36

MthwRobinson

0.4.2

Added partition_image to process documents in an image format.
Fixed utf-8 encoding error in partition_email with attachments for text/html

13 Jan 22:23

MthwRobinson

0.4.1

Added support for text files in the partition function
Pinned opencv-python for easier installation on Linux

11 Jan 18:05

MthwRobinson

0.4.0

Added generic partition brick that detects the file type and routes a file to the appropriate
partitioning brick.
Added a file type detection module.
Updated partition_html and partition_eml to support file-like objects in 'rb' mode.
Added the clean_ordered_bullets cleaning brick for removing ordered bullets.
Added the extract_ordered_bullets extraction brick for ordered bullets.
Added a test for clean_ordered_bullets.
Added a test for extract_ordered_bullets.
Added partition_docx for pre-processing Word Documents.
Added new regex patterns to extract email header information
Added new functions to extract header information: parse_received_data and partition_header
Added a new function to parse plain text files: partition_text
Added new cleaner functions: extract_ip_address, extract_ip_address_name, extract_mapi_id, extract_datetimetz
Added a new Image element and a function to find embedded images: find_embedded_images
Added get_directory_file_info for summarizing information about source documents
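
The generic partition brick described above detects a file type and dispatches to a type-specific brick. The sketch below shows that routing idea only. The handler functions, the EXTENSION_TO_HANDLER table, and route are hypothetical stand-ins, and detection here is by file extension alone, which may differ from how the library's file type detection module works.

```python
from pathlib import Path
from typing import Callable, Dict


# Hypothetical handlers standing in for partition_html, partition_docx,
# partition_eml, and partition_text. Each returns a label instead of elements.
def handle_html(filename: str) -> str:
    return f"html:{filename}"


def handle_docx(filename: str) -> str:
    return f"docx:{filename}"


def handle_email(filename: str) -> str:
    return f"eml:{filename}"


def handle_text(filename: str) -> str:
    return f"txt:{filename}"


# Assumed mapping from extension to handler (illustrative only).
EXTENSION_TO_HANDLER: Dict[str, Callable[[str], str]] = {
    ".html": handle_html,
    ".docx": handle_docx,
    ".eml": handle_email,
    ".txt": handle_text,
}


def route(filename: str) -> str:
    """Pick a handler from the (case-insensitive) extension and call it."""
    suffix = Path(filename).suffix.lower()
    handler = EXTENSION_TO_HANDLER.get(suffix)
    if handler is None:
        raise ValueError(f"Unsupported file type: {suffix or 'no extension'}")
    return handler(filename)
```

A dispatch table keeps the routing logic in one place, so supporting a new format means adding one entry rather than another branch.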

05 Jan 00:50

qued

0.3.5

Add support for local inference
Add new pattern to recognize plain text dash bullets
Add test for bullet patterns
Fix for partition_html that allows for processing div tags that have both text and child elements
Add ability to extract document metadata from .docx, .xlsx, and .jpg files.
Helper functions for identifying and extracting phone numbers
Add new function extract_attachment_info that extracts and decodes the attachments of an email.
Staging brick to convert a list of Elements to a pandas dataframe.
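
As an illustration of the plain text dash bullet pattern mentioned above, here is a minimal sketch. The regular expression and the helper names is_dash_bullet and strip_dash_bullet are assumptions made for this example and are not taken from the library.

```python
import re

# Hypothetical pattern: optional indentation, a hyphen or en dash,
# then at least one whitespace character before the item text.
DASH_BULLET = re.compile(r"^\s*[-\u2013]\s+")


def is_dash_bullet(line: str) -> bool:
    """Return True when the line starts with a dash bullet."""
    return DASH_BULLET.match(line) is not None


def strip_dash_bullet(line: str) -> str:
    """Remove a leading dash bullet, leaving the item text."""
    return DASH_BULLET.sub("", line, count=1)
```

Requiring whitespace after the dash keeps strings such as "-5 degrees" or "well-known" from being mistaken for list items.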

21 Dec 15:29

MthwRobinson

0.3.4

Python 3.7 compatibility

20 Dec 20:03

yuming-long

0.3.3

Removes BasicConfig from logger configuration
Adds the partition_email partitioning brick
Adds the replace_mime_encodings cleaning brick
Small fix to HTML parsing related to processing list items with sub-tags

15 Dec 22:20

MthwRobinson

0.3.2

Added translate_text brick for translating text between languages
Add an apply method to make it easier to apply cleaners to elements
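
The apply method mentioned above chains cleaners over an element's text. The sketch below shows one way that pattern could look. TextSketch is a hypothetical stand-in for the library's element class, and the assumption that each cleaner maps a string to a string is made for this example.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class TextSketch:
    """Hypothetical element holding a piece of text."""

    text: str

    def apply(self, *cleaners: Callable[[str], str]) -> None:
        """Run each cleaner in order and store the final result on the element."""
        cleaned = self.text
        for cleaner in cleaners:
            cleaned = cleaner(cleaned)
        if not isinstance(cleaned, str):
            raise ValueError("Cleaner produced a non-string output.")
        self.text = cleaned


def collapse_whitespace(text: str) -> str:
    """Example cleaner: squeeze runs of whitespace into single spaces."""
    return " ".join(text.split())


element = TextSketch("  HELLO   world  ")
element.apply(str.strip, str.lower, collapse_whitespace)
print(element.text)  # prints "hello world"
```

Passing cleaners as plain callables means any function from string to string can be used, including built-in string methods.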
