@majiayu000
Summary

  • Filters out unsupported image formats (like .emf, .wmf, .svg) before passing to tesseract
  • Prevents crash when OCR encounters vector or non-raster formats
  • Logs skipped files when realtime_progress is enabled

Supported formats

png, jpg, jpeg, gif, tiff, tif, bmp, ppm, pgm, pbm, webp
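A minimal sketch of how an extension filter like this could work, assuming a simple allowlist check on the file extension. The names `SUPPORTED_OCR_EXTENSIONS`, `is_supported_ocr_format`, and `partition_ocr_inputs` are hypothetical and are not taken from the PR diff; only the extension list and the `realtime_progress` flag come from the description above.

```python
import os

# Hypothetical constant: the raster extensions listed in this PR description.
SUPPORTED_OCR_EXTENSIONS = frozenset(
    {"png", "jpg", "jpeg", "gif", "tiff", "tif", "bmp", "ppm", "pgm", "pbm", "webp"}
)


def is_supported_ocr_format(file_name):
    """Return True if the file extension is in the allowlist (case-insensitive)."""
    ext = os.path.splitext(file_name)[1].lstrip(".").lower()
    return ext in SUPPORTED_OCR_EXTENSIONS


def partition_ocr_inputs(file_names, realtime_progress=False):
    """Split file names into (supported, skipped) without touching the files."""
    supported, skipped = [], []
    for name in file_names:
        if is_supported_ocr_format(name):
            supported.append(name)
        else:
            skipped.append(name)
            if realtime_progress:
                # Assumption: the real code may log differently; print is a stand-in.
                print(f"skipping unsupported image format for OCR: {name}")
    return supported, skipped
```

With this sketch, only the `supported` list would be handed to tesseract, so a file such as `diagram.emf` never reaches the OCR call.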

Test plan

  • Added unit tests for format detection logic
  • Tests cover supported formats, unsupported formats, and case-insensitive matching
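The three cases in the test plan could be exercised roughly as follows. This is an illustrative sketch in pytest style, not the tests added by the PR; the helper `has_supported_extension` and the test names are hypothetical and defined here only so the example is self-contained.

```python
import os

# Hypothetical helper mirroring the allowlist described in this PR.
_SUPPORTED = {"png", "jpg", "jpeg", "gif", "tiff", "tif", "bmp", "ppm", "pgm", "pbm", "webp"}


def has_supported_extension(file_name):
    """Case-insensitive check of the file extension against the allowlist."""
    return os.path.splitext(file_name)[1].lstrip(".").lower() in _SUPPORTED


def test_supported_formats():
    for ext in _SUPPORTED:
        assert has_supported_extension(f"image.{ext}")


def test_unsupported_formats():
    for name in ("figure.emf", "figure.wmf", "figure.svg", "favicon.ico"):
        assert not has_supported_extension(name)


def test_case_insensitive_matching():
    assert has_supported_extension("SCAN.PNG")
    assert has_supported_extension("photo.JpEg")
    assert not has_supported_extension("FIGURE.EMF")
```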

Fixes #1108

Signed-off-by: majiayu000 <1835304752@qq.com>

When running OCR on library images via run_ocr_on_images(), .emf and
other vector/unsupported image formats would cause tesseract to crash.

This fix adds a filter to skip unsupported formats before passing them
to tesseract, with informative logging when files are skipped.

Supported formats: png, jpg, jpeg, gif, tiff, tif, bmp, ppm, pgm, pbm, webp
Skipped formats: emf, wmf, svg, ico, and any other format not in the supported list

Fixes llmware-ai#1108

Signed-off-by: majiayu000 <1835304752@qq.com>