Releases: Unstructured-IO/unstructured
0.7.5
Enhancements
- Adds functionality to sort elements in `partition_pdf` for `fast` strategy
- Adds ingest tests with `--fast` strategy on PDF documents
- Adds `--api-key` to `unstructured-ingest`
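The release note does not say how elements are sorted. As a minimal sketch, assuming a simple reading order (page, then top to bottom, then left to right), the idea could look like the following. The `Box` class and `sort_reading_order` function are hypothetical stand-ins for illustration, not part of the `unstructured` API.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Hypothetical stand-in for a text element with a bounding box."""
    text: str
    page: int
    x0: float  # left edge
    y0: float  # top edge (smaller values are higher on the page)


def sort_reading_order(boxes):
    """Order boxes by page, then top-to-bottom, then left-to-right."""
    return sorted(boxes, key=lambda b: (b.page, b.y0, b.x0))


boxes = [
    Box("footer", 1, 50.0, 700.0),
    Box("title", 1, 50.0, 40.0),
    Box("next page", 2, 50.0, 40.0),
    Box("body", 1, 50.0, 120.0),
]
ordered = [b.text for b in sort_reading_order(boxes)]
```

A plain coordinate sort like this is only reasonable for single-column layouts; multi-column pages would need a different key.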
Features
- Adds `partition_rst` for processing reStructuredText documents.
Fixes
- Adds handling for emails that do not have a datetime to extract.
- Adds pdf2image package as core requirement of unstructured (with no extras)
0.7.4
Enhancements
- Allows passing kwargs to request data field for `partition_via_api` and `partition_multiple_via_api`
- Enable MIME type detection if libmagic is not available
- Adds handling for empty files in `detect_filetype` and `partition`.
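One way to keep MIME type detection working without libmagic is to fall back to the file extension. The sketch below is an assumption about the general approach, not the library's actual implementation. `EXTENSION_TO_MIME`, `guess_mime_from_extension`, and `detect_mime` are hypothetical names, and the `magic` import assumes the python-magic package.

```python
from pathlib import Path

# Hypothetical, deliberately small table; a real one would cover far more types.
EXTENSION_TO_MIME = {
    ".pdf": "application/pdf",
    ".html": "text/html",
    ".csv": "text/csv",
    ".txt": "text/plain",
}


def guess_mime_from_extension(filename, default=None):
    """Return a MIME type based only on the (case-insensitive) file extension."""
    return EXTENSION_TO_MIME.get(Path(filename).suffix.lower(), default)


def detect_mime(filename):
    """Prefer libmagic when it can be imported, else use the extension table."""
    try:
        import magic  # assumes python-magic; raises ImportError without libmagic
    except ImportError:
        return guess_mime_from_extension(filename)
    return magic.from_file(filename, mime=True)
```

The extension path never opens the file, so it cannot catch a mislabeled file, which is the trade-off of running without libmagic.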
Features
Fixes
- Resolve `grpcio` import issue on `weaviate.schema.validate_schema` for Python 3.9 and 3.10
- Remove building `detectron2` from source in Dockerfile
0.7.3
Enhancements
- Update IngestDoc abstractions and add data source metadata in ElementMetadata
Features
Fixes
- Pass `strategy` parameter down from `partition` for `partition_image`
- Filetype detection if a CSV has a `text/plain` MIME type
- `convert_office_doc` no longer prints file conversion info messages to stdout.
- `partition_via_api` reflects the actual filetype for the file processed in the API.
0.7.2
Enhancements
- Adds an optional encoding kwarg to `elements_to_json` and `elements_from_json`
- Bump version of base image to use new stable version of tesseract
Features
Fixes
- Update the `read_txt_file` utility function to keep using `spooled_to_bytes_io_if_needed` for xml
- Add functionality to the `read_txt_file` utility function to handle file-like object from URL
- Remove the unused parameter `encoding` from `partition_pdf`
- Change `auto.py` to have a `None` default for encoding
- Add functionality to try other common encodings for html and xml files if an error related to the encoding is raised and the user has not specified an encoding.
- Adds benchmark test with test docs in example-docs
- Re-enable test_upload_label_studio_data_with_sdk
- File detection now detects code files as plain text
- Adds `tabulate` explicitly to dependencies
- Fixes an issue in `metadata.page_number` of pptx files
- Adds showing help if no parameters passed
0.7.1
Enhancements
Features
- Add `stage_for_weaviate` to stage `unstructured` outputs for upload to Weaviate, along with a helper function for defining a class to use in Weaviate schemas.
- Builds from the Unstructured base image, built off of Rocky Linux 8.7; this resolves almost all CVEs in the image.
Fixes
0.7.0
Enhancements
- Installing `detectron2` from source is no longer required when using the `local-inference` extra.
- Updates `.pptx` parsing to include text in tables.
Features
Fixes
- Fixes an issue in `_add_element_metadata` that caused all elements to have `page_number=1` in the element metadata.
- Adds `.log` as a file extension for TXT files.
- Adds functionality to try other common encodings for email (`.eml`) files if an error related to the encoding is raised and the user has not specified an encoding.
- Allow passed encoding to be used in `replace_mime_encodings`
- Fixes page metadata for `partition_html` when `include_metadata=False`
- A `ValueError` is now raised if `file_filename` is not specified when you use `partition_via_api` with a file-like object.
0.6.11
Enhancements
- Supports epub tests since pandoc is updated in base image
Features
Fixes
0.6.10
0.6.9
Enhancements
- fast strategy for pdf now keeps element bounding box data
- setup.py refactor
Features
Fixes
- Adds functionality to try other common encodings if an error related to the encoding is raised and the user has not specified an encoding.
- Adds additional MIME types for CSV
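The encoding fallback described in the first fix can be sketched as follows. This is a minimal illustration under stated assumptions, not the library's code: `COMMON_ENCODINGS` and `decode_with_fallback` are hypothetical names, and the candidate list and its order are assumptions. Fallback only applies when the caller has not specified an encoding.

```python
# Hypothetical candidate list; the encodings the library actually tries may differ.
COMMON_ENCODINGS = ("utf-8", "cp1252", "latin-1")


def decode_with_fallback(data, encoding=None):
    """Decode bytes, returning (text, encoding_used).

    If the caller names an encoding, use it and let any error propagate.
    Otherwise try each common encoding in turn until one succeeds.
    """
    if encoding is not None:
        return data.decode(encoding), encoding

    last_error = None
    for candidate in COMMON_ENCODINGS:
        try:
            return data.decode(candidate), candidate
        except UnicodeDecodeError as exc:
            last_error = exc
    raise last_error


# b"caf\xe9" is not valid UTF-8, so the second candidate is used.
text, used = decode_with_fallback(b"caf\xe9")
```

Because `latin-1` maps every byte to a character, placing it last means the loop always returns something; whether that silent fallback is desirable depends on the use case.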
0.6.8
Enhancements
Features
- Add `partition_csv` for CSV files.