Skip to content

adding support for OCR in read_file#170

Merged
mikegros merged 7 commits intolanl:mainfrom
ndebard:ndebard/ocr-read-file
Feb 4, 2026
Merged

adding support for OCR in read_file#170
mikegros merged 7 commits intolanl:mainfrom
ndebard:ndebard/ocr-read-file

Conversation

@ndebard
Copy link
Collaborator

@ndebard ndebard commented Feb 1, 2026

This PR adds OCR support for PDFs in the read_file tool - I came across this on some private internal docs. I also added unit tests that generate a PDF, saves it as an image, and then uses the functionality to read it back in using the OCR capability. That particular test requires pip install reportlab so it'll skip it if that's not installed. it's possible that needs to be added or only added to the dev package? I'm unsure... I just needed this functionality.

@ndebard
Copy link
Collaborator Author

ndebard commented Feb 1, 2026

I added a README.md that explain this. There's a few features you'd need to add to your Mac / env to do OCR (obviously), but they ARE NOT NEEDED for everyone, only if you want this feature (explained in that README.md - should we include these in docs somewhere?). Anyway, you can see examples of this working in that README. It's really rather cool - it tries to read a doc, sees it can't find any text regions, tries a fast OCR, tests the text region, if THAT fails, it goes for a big / timely (~15 seconds in my testing) OCR and then it reads THAT text region.

All that works and it returns it via the read_file tool. Once tthe OCR file exists, it won't do it again (pay the cost again).

This PR is ready for review.

@mikegros mikegros self-requested a review February 1, 2026 23:07
@jsta
Copy link
Member

jsta commented Feb 2, 2026

I am also interested in OCR capabilities. I wondered if URSA could support this via an MCP server. So the procedure would be to specify something like ursa_ocr_mcp_http in a yaml file that matches the MCP OCR service that you spin-up separately (or is otherwise provided by an always-on service). Afaik, ursa only has stdio type MCP implementations not remote (http) ones?

@ndebard
Copy link
Collaborator Author

ndebard commented Feb 2, 2026

I am also interested in OCR capabilities. I wondered if URSA could support this via an MCP server. So the procedure would be to specify something like ursa_ocr_mcp_http in a yaml file that matches the MCP OCR service that you spin-up separately (or is otherwise provided by an always-on service). Afaik, ursa only has stdio type MCP implementations not remote (http) ones?

This is a great idea. I'd prefer to keep this separate for now so we can move fast and start using OCR work and then add this as a task we should do later - you're definitely right that supporting this via MCP makes sense!

@mikegros
Copy link
Collaborator

mikegros commented Feb 3, 2026

I want to merge this but am having a couple tests fail.

The test test_ocr_cache_skips_second_run fails at the first assert here:

        print("EXTRACTED_LEN:", len(out))
        print("EXTRACTED_PREVIEW:", out[:300])
        print(out)
    
>       assert len(out) == 5000
E       AssertionError: assert 4 == 5000

where the printed text is:

EXTRACTED_LEN: 4
EXTRACTED_PREVIEW: tiny
tiny

The test test_ocr_runs_and_uses_ocr_pdf fails similarly:

        print("EXTRACTED_LEN:", len(out))
        print("EXTRACTED_PREVIEW:", out[:300])
        print(out)
    
>       assert out.startswith("OCR_TEXT_")
E       AssertionError: assert False

where the printed text is:

EXTRACTED_LEN: 4
EXTRACTED_PREVIEW: tiny
tiny

both cases have a function that returns differently if the path ends with .ocr.pdf

For instance, in the second case above:

        # Original tiny, OCR big
        def fake_read_pdf_text(path: str) -> str:
            return "tiny" if not path.endswith(".ocr.pdf") else "Z" * 5000

but the test filename seems to be scan.pdf so returns tiny not, the generated text.

Any idea what is happening there? It happened both when I ran the pytest locally and when run through the github CI action.

@ndebard
Copy link
Collaborator Author

ndebard commented Feb 3, 2026

@mikegros - please check if the 38ab6cb fixed this. thanks for finding this.

@ndebard
Copy link
Collaborator Author

ndebard commented Feb 3, 2026

I can't tell if what's failing is something that shouldn't be failing...

@mikegros
Copy link
Collaborator

mikegros commented Feb 3, 2026

I can't tell if what's failing is something that shouldn't be failing...

It looks like just the API key thing. I will test locally and if it's good, I will merge

@mikegros mikegros merged commit 38fbe3c into lanl:main Feb 4, 2026
1 of 2 checks passed
Comment on lines +30 to +32
Note that the first `[OCR]` line will only show up if the PDF reading fails and there
are no text layers discovered (this `skips` some complex / lengthy OCR techniques
and tries a quick and dirty one.).
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Note that the first `[OCR]` line will only show up if the PDF reading fails and there
are no text layers discovered (this `skips` some complex / lengthy OCR techniques
and tries a quick and dirty one.).
Note that the first `[OCR]` line will show up only if the PDF reading fails and
no text layers are discovered (this `skips` some complex / lengthy OCR
techniques and tries a quick and dirty one.).

Comment on lines +12 to +18
def _pdf_page_count(path: str) -> int:
try:
from pypdf import PdfReader

return len(PdfReader(path).pages)
except Exception:
return 0
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
def _pdf_page_count(path: str) -> int:
try:
from pypdf import PdfReader
return len(PdfReader(path).pages)
except Exception:
return 0
def _pdf_page_count(path: str) -> int:
from pypdf import PdfReader
return len(PdfReader(path).pages)

@ndebard pypdf is an existing dependency, so we don't need this check. Also, unless this import is slow, let's move this import to the top of the file.

Comment on lines +19 to +27


def _ocr_to_searchable_pdf(
src_pdf: str, out_pdf: str, *, mode: str = "skip"
) -> None:
# mode:
# - "skip": only OCR pages that look like they need it (your current behavior)
# - "force": rasterize + OCR everything (fixes vector/outlined “no images” PDFs)
cmd = ["ocrmypdf", "--rotate-pages", "--deskew", "--clean"]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
def _ocr_to_searchable_pdf(
src_pdf: str, out_pdf: str, *, mode: str = "skip"
) -> None:
# mode:
# - "skip": only OCR pages that look like they need it (your current behavior)
# - "force": rasterize + OCR everything (fixes vector/outlined “no images” PDFs)
cmd = ["ocrmypdf", "--rotate-pages", "--deskew", "--clean"]
def ocrmypdf_is_installed() -> bool:
return shutil.which("ocrmypdf") is not None
def _ocr_to_searchable_pdf(
src_pdf: str, out_pdf: str, *, mode: str = "skip"
) -> None:
# mode:
# - "skip": only OCR pages that look like they need it (your current behavior)
# - "force": rasterize + OCR everything (fixes vector/outlined “no images” PDFs)
if not ocrmypdf_is_installed():
raise ImportError(
"ocrmypdf was not found in your path. "
"See installation instructions:"
"https://github.com/ocrmypdf/OCRmyPDF?tab=readme-ov-file#installation"
)
cmd = ["ocrmypdf", "--rotate-pages", "--deskew", "--clean"]

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should check to see ocrmypdf is installed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants