Skip unsupported image formats during OCR processing #1215

majiayu000 · 2025-12-30T03:37:27Z

Summary

Filters out unsupported image formats (such as .emf, .wmf, .svg) before passing files to tesseract
Prevents a crash when OCR encounters vector or other non-raster formats
Logs skipped files when realtime_progress is enabled

Supported formats

png, jpg, jpeg, gif, tiff, tif, bmp, ppm, pgm, pbm, webp
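A minimal sketch of how an extension filter like this might look. The names `SUPPORTED_OCR_FORMATS`, `is_supported_image`, and `split_supported` are hypothetical and are not taken from this PR's diff; only the list of extensions comes from the description above, and the use of `print` for logging is an assumption.

```python
import os

# Extensions listed as supported in the PR description.
SUPPORTED_OCR_FORMATS = frozenset({
    "png", "jpg", "jpeg", "gif", "tiff", "tif",
    "bmp", "ppm", "pgm", "pbm", "webp",
})


def is_supported_image(file_name):
    """Return True if the file extension is in the supported set (case-insensitive)."""
    ext = os.path.splitext(file_name)[1].lstrip(".").lower()
    return ext in SUPPORTED_OCR_FORMATS


def split_supported(file_names, realtime_progress=False):
    """Partition file names into (supported, skipped) lists.

    Hypothetical helper: skipped files are reported only when
    realtime_progress is enabled, mirroring the behaviour described above.
    """
    supported, skipped = [], []
    for name in file_names:
        if is_supported_image(name):
            supported.append(name)
        else:
            skipped.append(name)
            if realtime_progress:
                print(f"Skipping unsupported image format: {name}")
    return supported, skipped
```

With this sketch, only the `supported` list would be handed to tesseract, so files such as `diagram.emf` never reach the OCR call.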

Test plan

Added unit tests for format detection logic
Tests cover supported formats, unsupported formats, and case-insensitive matching
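The three cases in the test plan could be exercised roughly as follows. This is an illustrative sketch, not the tests added in this PR: `has_supported_extension` and `FormatDetectionTest` are hypothetical names, and the helper is defined inline only so that the example is self-contained.

```python
import os
import unittest

# Supported extensions as listed in the PR description.
SUPPORTED = {"png", "jpg", "jpeg", "gif", "tiff", "tif",
             "bmp", "ppm", "pgm", "pbm", "webp"}


def has_supported_extension(file_name):
    """Hypothetical stand-in for the format detection logic under test."""
    return os.path.splitext(file_name)[1].lstrip(".").lower() in SUPPORTED


class FormatDetectionTest(unittest.TestCase):
    def test_supported_formats(self):
        for name in ("a.png", "b.jpg", "c.tiff", "d.webp"):
            with self.subTest(name=name):
                self.assertTrue(has_supported_extension(name))

    def test_unsupported_formats(self):
        for name in ("a.emf", "b.wmf", "c.svg", "d.ico"):
            with self.subTest(name=name):
                self.assertFalse(has_supported_extension(name))

    def test_case_insensitive_matching(self):
        for name in ("A.PNG", "b.JpEg", "C.TIF"):
            with self.subTest(name=name):
                self.assertTrue(has_supported_extension(name))
```

Saved as a module, these cases can be run with `python -m unittest`.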

Fixes #1108

Signed-off-by: majiayu000 <1835304752@qq.com>

When running OCR on library images via run_ocr_on_images(), .emf and other vector/unsupported image formats would cause tesseract to crash. This fix adds a filter to skip unsupported formats before passing them to tesseract, with informative logging when files are skipped.

Supported formats: png, jpg, jpeg, gif, tiff, tif, bmp, ppm, pgm, pbm, webp
Skipped formats: emf, wmf, svg, ico, and other non-raster formats

Fixes llmware-ai#1108

Signed-off-by: majiayu000 <1835304752@qq.com>
