Releases: Unstructured-IO/unstructured
0.8.4
Enhancements
- Additional tests and refactor of JSON detection.
- Update functionality to retrieve image metadata from a page for `document_to_element_list`.
- Links are now tracked in `partition_html` output.
- Set the file's current position to the beginning after reading the file in `convert_to_bytes`.
- Add `min_partition` kwarg that combines elements below a specified threshold and modifies splitting of strings longer than max partition so words are not split.
- Add slide notes to pptx.
- Add `--encoding` directive to ingest.
- Improve JSON detection by `detect_filetype`.
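The `min_partition` entry above describes two behaviours: long strings are split without breaking words, and short pieces are merged into their neighbours. The following is a minimal sketch of that idea, not the library's implementation. The function names `split_on_words` and `combine_small` are hypothetical, and lengths are assumed to be measured in characters.

```python
from typing import List


def split_on_words(text: str, max_partition: int) -> List[str]:
    """Split text into chunks of at most max_partition characters,
    breaking only at whitespace so that no word is cut in half.
    A single word longer than max_partition is kept whole."""
    chunks: List[str] = []
    current = ""
    for word in text.split():
        candidate = f"{current} {word}" if current else word
        if len(candidate) <= max_partition or not current:
            current = candidate
        else:
            chunks.append(current)
            current = word
    if current:
        chunks.append(current)
    return chunks


def combine_small(chunks: List[str], min_partition: int) -> List[str]:
    """Merge each chunk into the previous one while the previous one
    is still shorter than min_partition characters."""
    combined: List[str] = []
    for chunk in chunks:
        if combined and len(combined[-1]) < min_partition:
            combined[-1] = f"{combined[-1]} {chunk}"
        else:
            combined.append(chunk)
    return combined


print(split_on_words("the quick brown fox jumps", 10))
print(combine_small(["Hi", "there everyone", "ok"], 5))
```

In this sketch a merged chunk may grow past `max_partition`, and a trailing short chunk is left as-is. How the library resolves those cases is not stated in the notes above.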
Features
- Adds Outlook connector
- Add support for dpi parameter in inference library
- Adds OneDrive connector.
- Add Confluence connector for ingest CLI to pull the body text from all documents from all spaces in a Confluence domain.
Fixes
- Fixes issue with email partitioning where From field was being assigned the To field value.
- Use the `image_metadata` property of the `PageLayout` instance to get the page image info in the `document_to_element_list`.
- Add functionality to write images to computer storage temporarily instead of keeping them in memory for `ocr_only` strategy.
- Add functionality to convert a PDF in small chunks of pages at a time for `ocr_only` strategy.
- Adds `.txt`, `.text`, and `.tab` to list of extensions to check if file has a `text/plain` MIME type.
- Enables filters to be passed to `partition_doc` so it doesn't error with LibreOffice7.
- Removed old error message that's superseded by `requires_dependencies`.
- Removes using `hi_res` as the default strategy value for `partition_via_api` and `partition_multiple_via_api`.
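The chunked PDF conversion mentioned in the fixes above can be pictured as iterating over small page ranges, so that only a few page images exist at any time. This is a rough sketch under assumptions: `page_ranges` is a hypothetical helper, the chunk size of 10 is arbitrary, and page numbers are assumed to be 1-indexed and inclusive.

```python
from typing import Iterator, Tuple


def page_ranges(n_pages: int, chunk_size: int = 10) -> Iterator[Tuple[int, int]]:
    """Yield (first_page, last_page) pairs, 1-indexed and inclusive,
    covering n_pages in chunks of at most chunk_size pages."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    for first in range(1, n_pages + 1, chunk_size):
        yield first, min(first + chunk_size - 1, n_pages)


# Each range would be converted and OCR'd before the next one is loaded.
for first, last in page_ranges(25, chunk_size=10):
    print(f"process pages {first}-{last}")
```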
0.8.1
Enhancements
- Add support for Python 3.11
Features
Fixes
- Fixed `auto` strategy detecting a scanned document as having extractable text and using the `fast` strategy, resulting in no output.
- Fix list detection in MS Word documents.
- Don't instantiate an element with a coordinate system when there isn't a way to get its location data.
0.8.0
Enhancements
- Allow model used for hi res pdf partition strategy to be chosen when called.
- Updated inference package
Features
- Add metadata_filename parameter across all partition functions
Fixes
- Adjust encoding recognition threshold value in `detect_file_encoding`.
- Fix KeyError when `isd_to_elements` doesn't find a type.
- Fix `_output_filename` for local connector, allowing single files to be written correctly to the disk.
- Fix for cases where an invalid encoding is extracted from an email header.
BREAKING CHANGES
- Information about an element's location is no longer returned as top-level attributes of an element. Instead, it is returned in the `coordinates` attribute of the element's metadata.
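Code that has to handle elements produced both before and after this change could read the location defensively. The sketch below uses stand-in classes, not the library's own element types. `Element`, `Metadata`, and `get_coordinates` are hypothetical names, and the tuple-of-points shape for coordinates is an assumption.

```python
from dataclasses import dataclass, field
from typing import Any, Optional, Tuple

Points = Tuple[Tuple[float, float], ...]


@dataclass
class Metadata:
    # Stand-in for element metadata; only the field relevant here.
    coordinates: Optional[Points] = None


@dataclass
class Element:
    # Stand-in for an element that follows the new layout.
    text: str
    metadata: Metadata = field(default_factory=Metadata)


def get_coordinates(element: Any) -> Optional[Points]:
    """Prefer metadata.coordinates (new layout); fall back to a
    top-level coordinates attribute (old layout) if one exists."""
    metadata = getattr(element, "metadata", None)
    coords = getattr(metadata, "coordinates", None)
    if coords is not None:
        return coords
    return getattr(element, "coordinates", None)


title = Element("Title", Metadata(coordinates=((0.0, 0.0), (10.0, 5.0))))
print(get_coordinates(title))
```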
0.7.12
Enhancements
- Adds `include_metadata` kwarg to `partition_doc`, `partition_docx`, `partition_email`, `partition_epub`, `partition_json`, `partition_msg`, `partition_odt`, `partition_org`, `partition_pdf`, `partition_ppt`, `partition_pptx`, `partition_rst`, and `partition_rtf`.
Features
- Adds Dropbox connector
Fixes
- Fix tests that call unstructured-api by passing through an api-key
- Fixed page breaks being given (incorrect) page numbers
- Fix skipping download on ingest when a source document exists locally
0.7.11
Enhancements
- More deterministic element ordering when using `hi_res` PDF parsing strategy (from unstructured-inference bump to 0.5.4).
- Make large model available (from unstructured-inference bump to 0.5.3).
- Combine inferred elements with extracted elements (from unstructured-inference bump to 0.5.2).
- `partition_email` and `partition_msg` will now process attachments if `process_attachments=True` and an attachment partitioning function is passed through with `attachment_partitioner=partition`.
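The attachment entry above follows a callback pattern: the caller opts in with a flag and supplies the function used to partition each attachment. Here is a self-contained illustration of that pattern using only the standard library `email` package. It is not the library's code. `partition_message` and its partitioner signature `(payload, filename)` are hypothetical, and raising an error when no partitioner is supplied is an assumption of this sketch.

```python
from email.message import EmailMessage
from typing import Callable, List, Optional

Partitioner = Callable[[bytes, Optional[str]], List[str]]


def partition_message(
    msg: EmailMessage,
    process_attachments: bool = False,
    attachment_partitioner: Optional[Partitioner] = None,
) -> List[str]:
    """Return the plain-text body, followed by whatever the supplied
    partitioner produces for each attachment when the flag is set."""
    body = msg.get_body(preferencelist=("plain",))
    elements = [body.get_content().strip()] if body is not None else []
    if process_attachments:
        if attachment_partitioner is None:
            raise ValueError("attachment_partitioner is required when process_attachments=True")
        for part in msg.iter_attachments():
            payload = part.get_payload(decode=True) or b""
            elements.extend(attachment_partitioner(payload, part.get_filename()))
    return elements


def line_partitioner(payload: bytes, filename: Optional[str]) -> List[str]:
    # Toy partitioner: one element per line of a text attachment.
    return [f"{filename}: {line}" for line in payload.decode("utf-8").splitlines()]


message = EmailMessage()
message.set_content("Hello")
message.add_attachment(
    b"line one\nline two",
    maintype="application",
    subtype="octet-stream",
    filename="notes.txt",
)
print(partition_message(message, process_attachments=True, attachment_partitioner=line_partitioner))
```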
Features
Fixes
- Fix tests that call unstructured-api by passing through an api-key
- Fixed page breaks being given (incorrect) page numbers
- Fix skipping download on ingest when a source document exists locally
0.7.10
Enhancements
- Adds a `max_partition` parameter to `partition_text`, `partition_pdf`, `partition_email`, `partition_msg` and `partition_xml` that sets a limit for the size of individual document elements. Defaults to `1500` for everything except `partition_xml`, which has a default value of `None`.
- DRY connector refactor.
Features
- `hi_res` model for PDFs and images is selectable via environment variable.
Fixes
- CSV check now ignores escaped commas.
- Fix for filetype exploration util when file content does not have a comma.
- Adds negative lookahead to bullet pattern to avoid detecting plain text line breaks like `-------` as list items.
- Fix pre tag parsing for `partition_html`.
- Fix lookup error for annotated Arabic and Hebrew encodings.
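To illustrate the negative-lookahead fix listed above: a bullet marker immediately followed by another marker is treated as a rule line, not a list item. The pattern below is a simplified sketch, not the library's actual bullet pattern. The marker set `-`, `*`, `•` and the name `is_bullet` are assumptions.

```python
import re

# A marker, not immediately followed by another marker, then some text.
BULLET_RE = re.compile(r"^\s*[-*•](?![-*•])\s*\S")


def is_bullet(line: str) -> bool:
    """Return True if the line looks like a list item and not a rule."""
    return BULLET_RE.match(line) is not None


for sample in ["- first item", "  • nested item", "-------", "plain text"]:
    print(repr(sample), is_bullet(sample))
```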
0.7.9
0.7.8
Enhancements
Features
- Adds Google Cloud Service connector
Fixes
- Updates the `parse_email` for `partition_eml` so that `unstructured-api` passes the smoke tests.
- `partition_email` now works if there is no message content.
- Updates the `"fast"` strategy for `partition_pdf` so that it's able to recursively
- Adds recursive functionality to all fsspec connectors
- Adds generic --recursive ingest flag
0.7.7
Enhancements
- Adds functionality to replace the `MIME` encodings for `eml` files with one of the common encodings if a `unicode` error occurs.
- Adds missed file-like object handling in `detect_file_encoding`.
- Adds functionality to extract charset info from `eml` files.
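The encoding fallback described above can be sketched as follows: try the declared encoding first, then a short list of common ones. The candidate list, its order, and the function name `decode_with_fallback` are assumptions of this sketch. The release notes do not say which encodings the library tries.

```python
from typing import Tuple

# Hypothetical fallback order; latin-1 accepts any byte sequence, so it goes last.
COMMON_ENCODINGS = ("utf-8", "cp1252", "latin-1")


def decode_with_fallback(data: bytes, declared: str) -> Tuple[str, str]:
    """Decode data with the declared encoding, falling back to common
    encodings on a decode error or an unknown codec name.
    Returns (text, encoding_used)."""
    for encoding in (declared, *COMMON_ENCODINGS):
        try:
            return data.decode(encoding), encoding
        except (UnicodeDecodeError, LookupError):
            continue
    raise UnicodeError("none of the candidate encodings could decode the data")


print(decode_with_fallback("café".encode("utf-8"), "ascii"))
```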
Features
- Added coordinate system class to track coordinate types and convert to different coordinate
Fixes
- Adds an `html_assemble_articles` kwarg to `partition_html` to enable users to control whether content outside of `<article>` tags is captured when `<article>` tags are present.
- Check for the `xml` attribute on `element` before looking for pagebreaks in `partition_docx`.
0.7.6
Enhancements
- Convert fast strategy to ocr_only for images.
- Adds support for page numbers in `.docx` and `.doc` when user or renderer created page breaks are present.
- Adds retry logic for the unstructured-ingest Biomed connector.
Features
- Provides users with the ability to extract additional metadata via regex.
- Updates `partition_docx` to include headers and footers in the output.
- Create `partition_tsv` and associated tests. Make additional changes to `detect_filetype`.
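As an illustration of regex-based metadata extraction, the sketch below maps user-supplied names to patterns and records each match with its position. The function name `extract_regex_metadata` and the output shape (`text`, `start`, `end`) are assumptions for this example, not a statement of the library's interface.

```python
import re
from typing import Dict, List, Union

Match = Dict[str, Union[str, int]]


def extract_regex_metadata(text: str, patterns: Dict[str, str]) -> Dict[str, List[Match]]:
    """For each named pattern, collect every match in the text together
    with its start and end character offsets."""
    results: Dict[str, List[Match]] = {}
    for name, pattern in patterns.items():
        results[name] = [
            {"text": m.group(), "start": m.start(), "end": m.end()}
            for m in re.finditer(pattern, text)
        ]
    return results


found = extract_regex_metadata(
    "Contact: jane@example.com",
    {"email": r"[\w.+-]+@[\w-]+\.[\w.]+"},
)
print(found)
```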
Fixes
- Remove fake api key in test `partition_via_api` since we now require valid/empty api keys.
- Page number defaults to `None` instead of `1` when page number is not present in the metadata. A page number of `None` indicates that page numbers are not being tracked for the document or that page numbers do not apply to the element in question.
- Fixes an issue with some pptx files. Assume pptx shapes are found in top left position of slide in case the shape.top and shape.left attributes are `None`.