- Update `strategy` parameter to allow `'` and `"` as input surrounding the value.
- Bump to `unstructured` 0.15.10
- Add `include_slide_notes` parameter, indicating whether slide notes in `ppt` and `pptx` files should be partitioned. Default is `True`. Now, when slide notes are present in the file, they will be included alongside other elements, which may shift the index numbers of non-note elements.
- Bump to `unstructured` 0.15.7
- Resolve NLTK CVE.
- Bump to `unstructured` 0.15.6
- Bump to `unstructured` 0.15.5
- Use the library's `detect_filetype` in API to determine mimetype
- Add content_type api parameter
- Bump to `unstructured` 0.15.1
- Remove constraint on `safetensors` that was preventing us from bumping `transformers`.
- Bump to `unstructured` 0.15.0
- Bump to `unstructured` 0.14.10
- Fix certain filetypes failing mimetype lookup in the new base image
- Replace rockylinux with chainguard/wolfi as a base image for `amd64`
- Bump to `unstructured` 0.14.6
- Bump to `unstructured-inference` 0.7.35
- Bump to `unstructured` 0.14.4
- Add handling for `pdf_infer_table_structure` to reflect the "tables off by default" behavior in `unstructured`.
- Fix list params such as `extract_image_block_types` not working via the python/js clients
- Allow for a different server port with the PORT variable
- Change pdf_infer_table_structure parameter from being disabled in auto strategy.
- Add support for `unique_element_ids` parameter.
- Add max lifetime, via MAX_LIFETIME_SECONDS env-var, to API containers
- Bump unstructured to 0.13.5
- Change default values for `pdf_infer_table_structure` and `skip_infer_table_types`. Mark `pdf_infer_table_structure` deprecated.
- Add support for the `starting_page_number` param.
- Bump unstructured to 0.12.4
- Add support for both `list[str]` and `str` input formats for `ocr_languages` parameter
- Add support for additional MIME types from `unstructured`
- Document the support for gzip files and add additional testing
- Bump Pydantic to 2.5.x and remove it from explicit dependencies list (will be managed by fastapi)
- Introduce Form params description in the code, which will form openapi and swagger documentation
- Roll back some openapi customizations
- Keep backward compatibility for passing parameters in form of `list[str]` (will not be shown in the documentation)
- Bump unstructured to 0.12.2
- Fix bug that ignored `combine_under_n_chars` chunking option argument.
- Add hi_res_model_name to partition and deprecate model_name
- Bump unstructured to 0.12.0
- Add support for returning extracted image blocks as base64 encoded data stored in metadata fields
- Bump unstructured to 0.11.6
- Handle invalid hi_res_model_name kwarg
- Enable self-hosted authorization using UNSTRUCTURED_API_KEY env variable
- Bump unstructured to 0.11.0
- Bump unstructured to 0.10.30
- Make sure `multipage_sections` param defaults to `true` as per the readme
- Bump unstructured to 0.10.29
- Add `max_characters` param for chunking. This param gives users additional control to "chunk" elements into larger or smaller `CompositeElement`s
- Bump unstructured to 0.10.28
- Make sure chipperv2 is called when `hi_res_model_name==chipper`
- Bump unstructured to 0.10.26
- Bring parent_id metadata field back after fixing a backwards compatibility bug
- Restrict Chipper usage to one at a time. The model is very resource intense, and this will prevent issues while we improve it.
- Bump unstructured to 0.10.25
- Use a generator when splitting pdfs in parallel mode
- Add a default memory minimum for 503 check
- Fix an UnboundLocalError when an invalid docx file is caught
- Bump unstructured to 0.10.23
- Simplify the error message for BadZipFile errors
- Bump unstructured to 0.10.21
- Fix an unhandled error when a non pdf file is sent with content-type pdf
- Fix an unhandled error when a non docx file is sent with content-type docx
- Fix an unhandled error when a non-Unstructured json schema is sent
- Bump unstructured to 0.10.19
- Bump unstructured to 0.10.18
- Remove spurious whitespace in `app-start.sh`. This fixes deployments in some envs such as Google Cloud Run.
- Adds `languages` kwarg. `ocr_languages` will eventually be deprecated and replaced by `languages` to specify what languages to use for OCR
- Adds a startup log and other minor cleanups
- Adds `chunking_strategy` kwarg and associated params. These params allow users to "chunk" elements into larger or smaller `CompositeElement`s
- Remove `parent_id` from the element metadata. New metadata fields are causing errors with existing installs. We'll readd this once a fix is widely available.
- Fix some pdfs incorrectly returning a "file is encrypted" error. The `pypdf.is_encrypted` check caused us to return this error even if the file is readable.
- Bump unstructured to 0.10.16
- Drop `detection_class_prob` from the element metadata. This broke backwards compatibility when library users called `partition_via_api`.
- Bump unstructured to 0.10.15
- Bump unstructured to 0.10.14
- Improve parallel mode retry handling
- Improve logging during error handling. We don't need to log stack traces for expected errors.
- Bump unstructured to 0.10.13
- Bump unstructured-inference to 0.5.25
- Remove dependency on unstructured-api-tools
- Add a top level error handler for more consistent response bodies
- Tesseract minor version bump to 5.3.2
- Update readme for parameter `hi_res_model_name`
- Fix a bug using `hi_res_model_name` in parallel mode
- Bump unstructured library to 0.10.12
- Bump unstructured-inference to 0.5.22
- Bump unstructured library to 0.10.8
- Bump unstructured-inference to 0.5.17
- Reject traffic when overloaded via `UNSTRUCTURED_MEMORY_FREE_MINIMUM_MB`
- Docker image built with Python 3.10 rather than 3.8
- Fix incorrect handling on param skip_infer_table_types
- Pin `safetensors` to fix a build error with 0.0.38
- Fix page break has None page number bug
- Bump unstructured to 0.10.5
- Bump unstructured-ingest to 0.5.15
- Fix UnboundLocalError using pdfs in parallel mode
- Bump unstructured to 0.10.4
- Fix a bug in parallel mode causing `not a valid pdf` errors
- Bump unstructured to 0.10.2, unstructured-inference to 0.5.13
- Bump unstructured library to 0.9.2
- Fix a misleading error in make docker-test
- Bump unstructured library to 0.9.0
- Add table support for image with parameter `skip_infer_table_types`
- Add support for gzipped files
- Image tweak, move application entrypoint to scripts/app-start.sh
- Throw 400 error if a PDF is password protected
- Improve logging of params to single line json
- Add support for `include_page_breaks` parameter
- Support model name as api parameter
- Add retry parameters on fanout requests
- Bump unstructured library to 0.8.1
- Fix how to remove an element's coordinate information
- Add table extraction support for hi_res strategy
- Add support for `encoding` parameter
- Add support for `xml_keep_tags` parameter
- Add env variables for additional parallel mode tweaking
- Support .msg files
- Refactor parallel mode and add smoke test
- Fix header value for api key
- Bump unstructured library to 0.7.8 for bug fixes
- Update documentation and tests for filetypes to sync with partition.auto
- Add support for .rst, .tsv, .xml
- Move PYPDF2 to pypdf since PYPDF2 is deprecated
- Add support for `ocr_only` strategy and `ocr_languages` parameter
- Remove building `detectron2` from source in Dockerfile
- Convert strategy from fast to auto for images since there is no fast strategy for images
- Bump image to use python 3.8.17 instead of 3.8.15
- Add returning text/csv to pipeline_api
- Add support for csv files
- Add parallel processing mode for pages within a pdf
- Bump version of base image to use new stable version of tesseract
- Bump to unstructured==0.7.1 for various bug fixes.
- Supports additional filetypes: epub, odt, rtf
- Updating data type of optional os env var `ALLOWED_ORIGINS`
- Add optional CORS to api if os env var `ALLOWED_ORIGINS` is set
- Add config for unstructured.trace logger
- Fix image build steps to support detectron2 install from Mac M1/M2
- Upgrade to openssl 1.1.1 to accommodate the latest urllib3
- Bump unstructured for SpooledTemporaryFile fix
- Add msg and json types to supported
- Bump unstructured to the latest version
- Posting a bad .pdf results in a 400
- Remove coordinates field from response elements by default
- Add caching from the registry for `make docker-build`
- Add fix for empty content type error
- Bump unstructured-api-tools for better 'file type not supported' response messages
- Updated detectron version
- Update docker-build to use the public registry as a cache
- Adds a strategy parameter to pipeline_api
- Passing file, file_filename, and content_type to `partition`
- Sensible logging config
- Minor version bump
- Minor version bump
- Updated Dockerfile for public release
- Remove rate limiting in the API
- Add file type validation via UNSTRUCTURED_ALLOWED_MIMETYPES
- Major semver route also supported: /general/v0/general
- Changed pipeline name to `pipeline-general`
- Changed pipeline to handle a variety of documents not just emails
- Update Dockerfile, all supported library files.
- Add sample-docs for pdf and pdf image.
- Add emails pipeline Dockerfile
- Add pipeline notebook
- Initial pipeline setup