v0.3.6

rstrahan released this 03 Jul 22:15


44e2df7

[0.3.6]

Fixed

Updated Athena/Glue table configuration to use Parquet format instead of JSON #20
Fixed CloudFormation error when changing the evaluation bucket name #19

Added

Extended Document Format Support in OCR Service
- Added support for processing additional document formats beyond PDF and images:
  - Plain text (.txt) files with automatic pagination for large documents
  - CSV (.csv) files with table visualization and structured output
  - Excel workbooks (.xlsx, .xls) with multi-sheet support (each sheet as a page)
  - Word documents (.docx, .doc) with text extraction and visual representation
- Key Features:
  - Consistent processing model across all document formats
  - Standard page image generation for all formats
  - Structured text output in formats compatible with existing extraction pipelines
  - Confidence metrics for all document types
  - Automatic format detection from file content and extension
- Implementation Details:
  - Format-specific processing strategies for optimal results
  - Enhanced text rendering for plain text documents
  - Table visualization for CSV and Excel data
  - Word document paragraph extraction with formatting preservation
  - S3 storage integration matching existing PDF processing workflow
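
These notes do not show how format detection or text pagination work internally. As an illustration only, here is a minimal Python sketch of detection "from file content and extension" and of splitting a large plain text file into pages. All names (`detect_format`, `paginate_text`, `EXTENSION_FORMATS`), the format labels, and the 50-lines-per-page default are hypothetical and are not taken from the OCR service code.

```python
from pathlib import Path

# Hypothetical extension-to-format table; the labels are illustrative only.
EXTENSION_FORMATS = {
    ".pdf": "pdf", ".png": "image", ".jpg": "image", ".jpeg": "image",
    ".txt": "txt", ".csv": "csv",
    ".xlsx": "excel", ".xls": "excel",
    ".docx": "word", ".doc": "word",
}

# Well-known file signatures (leading "magic" bytes).
MAGIC_PREFIXES = [
    (b"%PDF-", "pdf"),
    (b"\x89PNG\r\n\x1a\n", "image"),
    (b"\xff\xd8\xff", "image"),  # JPEG
]
ZIP_MAGIC = b"PK\x03\x04"                         # .xlsx and .docx are ZIP containers
OLE_MAGIC = b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"   # legacy .xls and .doc containers


def detect_format(filename: str, content: bytes) -> str:
    """Guess a document format from leading bytes first, then the extension."""
    ext = Path(filename).suffix.lower()

    # Unambiguous signatures win even when the extension is wrong.
    for magic, fmt in MAGIC_PREFIXES:
        if content.startswith(magic):
            return fmt

    # Container signatures are shared by several formats, so the
    # extension is used to tell Excel from Word.
    if content.startswith(ZIP_MAGIC) or content.startswith(OLE_MAGIC):
        fmt = EXTENSION_FORMATS.get(ext, "unknown")
        return fmt if fmt in ("excel", "word") else "unknown"

    # No signature: only text-like formats are accepted by extension.
    fmt = EXTENSION_FORMATS.get(ext, "unknown")
    return fmt if fmt in ("txt", "csv") else "unknown"


def paginate_text(text: str, lines_per_page: int = 50) -> list[str]:
    """Split plain text into pages of at most `lines_per_page` lines."""
    lines = text.splitlines()
    if not lines:
        return [""]  # an empty document still yields one (empty) page
    return [
        "\n".join(lines[i:i + lines_per_page])
        for i in range(0, len(lines), lines_per_page)
    ]
```

For example, `detect_format("mislabeled.txt", b"%PDF-1.4")` returns `"pdf"` because the content signature takes precedence, and `paginate_text` on a 120-line file with the default setting yields three pages. The real service may use different signatures, thresholds, or page-splitting rules.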
