Releases: Unstructured-IO/unstructured
0.4.6
- Loosen the default cap threshold to `0.5`.
- Add a `UNSTRUCTURED_NARRATIVE_TEXT_CAP_THRESHOLD` environment variable for controlling the cap ratio threshold.
- Unknown text elements are identified as `Text` for HTML and plain text documents.
- `Body Text` styles no longer default to `NarrativeText` for Word documents. The style information is insufficient to determine that the text is narrative.
- Uppercase text is lowercased before checking for verbs. This helps avoid some missed verbs.
- Adds an `Address` element for capturing elements that only contain an address.
- Suppress the `UserWarning` when detectron is called.
- Checks that titles and narrative text have at least one English word.
- Checks that titles and narrative text are at least 50% alpha characters.
- Restricts titles to a maximum word length. Adds a `UNSTRUCTURED_TITLE_MAX_WORD_LENGTH` environment variable for controlling the max number of words in a title.
- Updated `partition_pptx` to order the elements on the page
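
The title and narrative text checks listed for this release can be sketched roughly as follows. This is a minimal illustration and not the library's implementation: the function names are hypothetical, the two environment variable names and the `0.5` cap threshold come from the notes above, and the fallback of `12` words for titles is an assumption made only for this sketch.

```python
import os
from typing import Optional


def cap_ratio_exceeded(text: str, threshold: Optional[float] = None) -> bool:
    """Return True when the share of capitalized words is above the threshold."""
    if threshold is None:
        # Variable name and 0.5 default are taken from the release notes.
        threshold = float(
            os.environ.get("UNSTRUCTURED_NARRATIVE_TEXT_CAP_THRESHOLD", "0.5")
        )
    # Only count tokens that start with a letter (assumption of this sketch).
    words = [w for w in text.split() if w[0].isalpha()]
    if not words:
        # Empty text cannot exceed the cap ratio.
        return False
    capitalized = sum(1 for w in words if w[0].isupper())
    return capitalized / len(words) > threshold


def alpha_ratio_ok(text: str, minimum: float = 0.5) -> bool:
    """Return True when at least `minimum` of non-space characters are letters."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return False
    return sum(c.isalpha() for c in chars) / len(chars) >= minimum


def title_length_ok(text: str, max_words: Optional[int] = None) -> bool:
    """Return True when the text has no more than `max_words` words."""
    if max_words is None:
        # 12 is an assumed fallback for this sketch, not a documented default.
        max_words = int(os.environ.get("UNSTRUCTURED_TITLE_MAX_WORD_LENGTH", "12"))
    return len(text.split()) <= max_words
```

Passing the thresholds explicitly, as in `cap_ratio_exceeded("ITEM 1A. RISK FACTORS", threshold=0.5)`, keeps the checks independent of the environment, which is convenient in tests.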
0.4.4
- Updated `partition_pdf` and `partition_image` to return `unstructured` `Element` objects
- Fixed the healthcheck URL path when partitioning images and PDFs via API
- Adds an optional `coordinates` attribute to document objects
- Adds `FigureCaption` and `CheckBox` document elements
- Added ability to split lists detected in `LayoutElement` objects
- Adds `partition_pptx` for partitioning PowerPoint documents
- LayoutParser models now download from Hugging Face Hub instead of Dropbox
- Fixed file type detection for XML and HTML files on Amazon Linux
0.4.3
- Adds `requests` as a base dependency
- Fix in `exceeds_cap_ratio` so the function doesn't break with empty text
- Fix bug in `_parse_received_data`.
- Update `detect_filetype` to properly handle `.doc`, `.xls`, and `.ppt`.
0.4.2
- Added `partition_image` to process documents in an image format.
- Fixed UTF-8 encoding error in `partition_email` with attachments for `text/html`
0.4.1
- Added support for text files in the `partition` function
- Pinned `opencv-python` for easier installation on Linux
0.4.0
- Added generic `partition` brick that detects the file type and routes a file to the appropriate partitioning brick.
- Added a file type detection module.
- Updated `partition_html` and `partition_eml` to support file-like objects in 'rb' mode.
- Cleaning brick for removing ordered bullets `clean_ordered_bullets`.
- Extract brick method for ordered bullets `extract_ordered_bullets`.
- Test for `clean_ordered_bullets`.
- Test for `extract_ordered_bullets`.
- Added `partition_docx` for pre-processing Word Documents.
- Added new REGEX patterns to extract email header information
- Added new functions to extract header information `parse_received_data` and `partition_header`
- Added new function to parse plain text files `partition_text`
- Added new cleaners functions `extract_ip_address`, `extract_ip_address_name`, `extract_mapi_id`, `extract_datetimetz`
- Add new `Image` element and function to find embedded images `find_embedded_images`
- Added `get_directory_file_info` for summarizing information about source documents
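
The "detect the file type, then route to the matching partitioning brick" idea behind the generic `partition` brick can be illustrated with a small dispatch table. This is only a sketch under stated assumptions: every name below is hypothetical, detection here relies on the file extension alone (the actual detection module may inspect file contents), and the two handlers are deliberately simplistic stand-ins for real partitioning bricks.

```python
import os
import re
from typing import Callable, Dict, List


def _split_text(raw: bytes) -> List[str]:
    # Treat blank-line-separated blocks as separate elements.
    blocks = raw.decode("utf-8").split("\n\n")
    return [block.strip() for block in blocks if block.strip()]


def _split_html(raw: bytes) -> List[str]:
    # Crude tag stripping, for illustration only (not a real HTML parser).
    text = re.sub(r"<[^>]+>", "\n", raw.decode("utf-8"))
    return [line.strip() for line in text.splitlines() if line.strip()]


# Hypothetical routing table: detected file type -> handler.
HANDLERS: Dict[str, Callable[[bytes], List[str]]] = {
    ".txt": _split_text,
    ".html": _split_html,
}


def detect_filetype_sketch(filename: str) -> str:
    # Assumption: the extension alone identifies the file type.
    return os.path.splitext(filename)[1].lower()


def partition_sketch(filename: str, raw: bytes) -> List[str]:
    """Detect the file type and route the content to the matching handler."""
    filetype = detect_filetype_sketch(filename)
    handler = HANDLERS.get(filetype)
    if handler is None:
        raise ValueError(f"No handler registered for file type {filetype!r}")
    return handler(raw)
```

A dispatch table keeps the router itself unchanged when a new format is added: supporting another file type only requires registering one more handler.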
0.3.5
- Add support for local inference
- Add new pattern to recognize plain text dash bullets
- Add test for bullet patterns
- Fix for `partition_html` that allows for processing div tags that have both text and child elements
- Add ability to extract document metadata from .docx, .xlsx, and .jpg files.
- Helper functions for identifying and extracting phone numbers
- Add new function `extract_attachment_info` that extracts and decodes the attachment of an email.
- Staging brick to convert a list of Elements to a pandas dataframe.
0.3.4
- Python 3.7 compatibility
0.3.3
- Removes BasicConfig from logger configuration
- Adds the `partition_email` partitioning brick
- Adds the `replace_mime_encodings` cleaning brick
- Small fix to HTML parsing related to processing list items with sub-tags
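
For context on what a MIME-encoding cleaner does, the sketch below decodes quoted-printable sequences, which are common in email bodies, using Python's standard `quopri` module. The function name and signature are hypothetical and chosen for illustration; the notes above do not describe how `replace_mime_encodings` is implemented.

```python
import quopri


def replace_mime_encodings_sketch(text: str, encoding: str = "utf-8") -> str:
    """Decode quoted-printable escapes such as '=E2=80=99' into characters."""
    # quopri works on bytes, so encode first and decode the result back.
    return quopri.decodestring(text.encode(encoding)).decode(encoding)


cleaned = replace_mime_encodings_sketch("5 w=E2=80=99s")
# `cleaned` now contains a right single quotation mark (U+2019)
# in place of the three escaped bytes.
```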
0.3.2
- Added `translate_text` brick for translating text between languages
- Add an `apply` method to make it easier to apply cleaners to elements
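
The convenience of an `apply` method can be shown with a small stand-in class. `TextSketch` below is hypothetical and exists only for this illustration; it assumes cleaners are plain functions that take a string and return a string, applied in the order given.

```python
from typing import Callable


class TextSketch:
    """Hypothetical stand-in for a text element with an `apply` method."""

    def __init__(self, text: str) -> None:
        self.text = text

    def apply(self, *cleaners: Callable[[str], str]) -> None:
        # Run each cleaner in order, feeding the output of one into the next.
        cleaned = self.text
        for cleaner in cleaners:
            cleaned = cleaner(cleaned)
        if not isinstance(cleaned, str):
            raise ValueError("Cleaners must return a string.")
        self.text = cleaned


element = TextSketch("  SOME Heading  ")
element.apply(str.strip, str.lower)
# element.text is now "some heading"
```

Chaining cleaners through one call avoids reassigning the element's text by hand after every cleaning step.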