adding support for OCR in read_file#170
Conversation
|
I added a README.md that explain this. There's a few features you'd need to add to your Mac / env to do OCR (obviously), but they ARE NOT NEEDED for everyone, only if you want this feature (explained in that README.md - should we include these in docs somewhere?). Anyway, you can see examples of this working in that README. It's really rather cool - it tries to read a doc, sees it can't find any text regions, tries a fast OCR, tests the text region, if THAT fails, it goes for a big / timely (~15 seconds in my testing) OCR and then it reads THAT text region. All that works and it returns it via the read_file tool. Once tthe OCR file exists, it won't do it again (pay the cost again). This PR is ready for review. |
|
I am also interested in OCR capabilities. I wondered if URSA could support this via an MCP server. So the procedure would be to specify something like |
This is a great idea. I'd prefer to keep this separate for now so we can move fast and start using OCR work and then add this as a task we should do later - you're definitely right that supporting this via MCP makes sense! |
|
I want to merge this but am having a couple tests fail. The test print("EXTRACTED_LEN:", len(out))
print("EXTRACTED_PREVIEW:", out[:300])
print(out)
> assert len(out) == 5000
E AssertionError: assert 4 == 5000where the printed text is: EXTRACTED_LEN: 4
EXTRACTED_PREVIEW: tiny
tinyThe test print("EXTRACTED_LEN:", len(out))
print("EXTRACTED_PREVIEW:", out[:300])
print(out)
> assert out.startswith("OCR_TEXT_")
E AssertionError: assert Falsewhere the printed text is: EXTRACTED_LEN: 4
EXTRACTED_PREVIEW: tiny
tinyboth cases have a function that returns differently if the path ends with For instance, in the second case above: # Original tiny, OCR big
def fake_read_pdf_text(path: str) -> str:
return "tiny" if not path.endswith(".ocr.pdf") else "Z" * 5000but the test filename seems to be Any idea what is happening there? It happened both when I ran the pytest locally and when run through the github CI action. |
|
I can't tell if what's failing is something that shouldn't be failing... |
It looks like just the API key thing. I will test locally and if it's good, I will merge |
| Note that the first `[OCR]` line will only show up if the PDF reading fails and there | ||
| are no text layers discovered (this `skips` some complex / lengthy OCR techniques | ||
| and tries a quick and dirty one.). |
There was a problem hiding this comment.
| Note that the first `[OCR]` line will only show up if the PDF reading fails and there | |
| are no text layers discovered (this `skips` some complex / lengthy OCR techniques | |
| and tries a quick and dirty one.). | |
| Note that the first `[OCR]` line will show up only if the PDF reading fails and | |
| no text layers are discovered (this `skips` some complex / lengthy OCR | |
| techniques and tries a quick and dirty one.). |
| def _pdf_page_count(path: str) -> int: | ||
| try: | ||
| from pypdf import PdfReader | ||
|
|
||
| return len(PdfReader(path).pages) | ||
| except Exception: | ||
| return 0 |
There was a problem hiding this comment.
| def _pdf_page_count(path: str) -> int: | |
| try: | |
| from pypdf import PdfReader | |
| return len(PdfReader(path).pages) | |
| except Exception: | |
| return 0 | |
| def _pdf_page_count(path: str) -> int: | |
| from pypdf import PdfReader | |
| return len(PdfReader(path).pages) |
@ndebard pypdf is an existing dependency, so we don't need this check. Also, unless this import is slow, let's move this import to the top of the file.
|
|
||
|
|
||
| def _ocr_to_searchable_pdf( | ||
| src_pdf: str, out_pdf: str, *, mode: str = "skip" | ||
| ) -> None: | ||
| # mode: | ||
| # - "skip": only OCR pages that look like they need it (your current behavior) | ||
| # - "force": rasterize + OCR everything (fixes vector/outlined “no images” PDFs) | ||
| cmd = ["ocrmypdf", "--rotate-pages", "--deskew", "--clean"] |
There was a problem hiding this comment.
| def _ocr_to_searchable_pdf( | |
| src_pdf: str, out_pdf: str, *, mode: str = "skip" | |
| ) -> None: | |
| # mode: | |
| # - "skip": only OCR pages that look like they need it (your current behavior) | |
| # - "force": rasterize + OCR everything (fixes vector/outlined “no images” PDFs) | |
| cmd = ["ocrmypdf", "--rotate-pages", "--deskew", "--clean"] | |
| def ocrmypdf_is_installed() -> bool: | |
| return shutil.which("ocrmypdf") is not None | |
| def _ocr_to_searchable_pdf( | |
| src_pdf: str, out_pdf: str, *, mode: str = "skip" | |
| ) -> None: | |
| # mode: | |
| # - "skip": only OCR pages that look like they need it (your current behavior) | |
| # - "force": rasterize + OCR everything (fixes vector/outlined “no images” PDFs) | |
| if not ocrmypdf_is_installed(): | |
| raise ImportError( | |
| "ocrmypdf was not found in your path. " | |
| "See installation instructions:" | |
| "https://github.com/ocrmypdf/OCRmyPDF?tab=readme-ov-file#installation" | |
| ) | |
| cmd = ["ocrmypdf", "--rotate-pages", "--deskew", "--clean"] |
There was a problem hiding this comment.
We should check to see ocrmypdf is installed.
This PR adds OCR support for PDFs in the read_file tool - I came across this on some private internal docs. I also added unit tests that generate a PDF, saves it as an image, and then uses the functionality to read it back in using the OCR capability. That particular test requires
pip install reportlabso it'll skip it if that's not installed. it's possible that needs to be added or only added to the dev package? I'm unsure... I just needed this functionality.