adding support for OCR in read_file by ndebard · Pull Request #170 · lanl/ursa

ndebard · 2026-02-01T17:24:11Z

This PR adds OCR support for PDFs in the read_file tool - I came across this on some private internal docs. I also added unit tests that generate a PDF, saves it as an image, and then uses the functionality to read it back in using the OCR capability. That particular test requires pip install reportlab so it'll skip it if that's not installed. it's possible that needs to be added or only added to the dev package? I'm unsure... I just needed this functionality.

ndebard · 2026-02-01T19:01:06Z

I added a README.md that explain this. There's a few features you'd need to add to your Mac / env to do OCR (obviously), but they ARE NOT NEEDED for everyone, only if you want this feature (explained in that README.md - should we include these in docs somewhere?). Anyway, you can see examples of this working in that README. It's really rather cool - it tries to read a doc, sees it can't find any text regions, tries a fast OCR, tests the text region, if THAT fails, it goes for a big / timely (~15 seconds in my testing) OCR and then it reads THAT text region.

All that works and it returns it via the read_file tool. Once tthe OCR file exists, it won't do it again (pay the cost again).

This PR is ready for review.

jsta · 2026-02-02T16:01:25Z

I am also interested in OCR capabilities. I wondered if URSA could support this via an MCP server. So the procedure would be to specify something like ursa_ocr_mcp_http in a yaml file that matches the MCP OCR service that you spin-up separately (or is otherwise provided by an always-on service). Afaik, ursa only has stdio type MCP implementations not remote (http) ones?

ndebard · 2026-02-02T17:29:04Z

I am also interested in OCR capabilities. I wondered if URSA could support this via an MCP server. So the procedure would be to specify something like ursa_ocr_mcp_http in a yaml file that matches the MCP OCR service that you spin-up separately (or is otherwise provided by an always-on service). Afaik, ursa only has stdio type MCP implementations not remote (http) ones?

This is a great idea. I'd prefer to keep this separate for now so we can move fast and start using OCR work and then add this as a task we should do later - you're definitely right that supporting this via MCP makes sense!

mikegros · 2026-02-03T06:55:37Z

I want to merge this but am having a couple tests fail.

The test test_ocr_cache_skips_second_run fails at the first assert here:

        print("EXTRACTED_LEN:", len(out))
        print("EXTRACTED_PREVIEW:", out[:300])
        print(out)
    
>       assert len(out) == 5000
E       AssertionError: assert 4 == 5000

where the printed text is:

EXTRACTED_LEN: 4
EXTRACTED_PREVIEW: tiny
tiny

The test test_ocr_runs_and_uses_ocr_pdf fails similarly:

        print("EXTRACTED_LEN:", len(out))
        print("EXTRACTED_PREVIEW:", out[:300])
        print(out)
    
>       assert out.startswith("OCR_TEXT_")
E       AssertionError: assert False

where the printed text is:

EXTRACTED_LEN: 4
EXTRACTED_PREVIEW: tiny
tiny

both cases have a function that returns differently if the path ends with .ocr.pdf

For instance, in the second case above:

        # Original tiny, OCR big
        def fake_read_pdf_text(path: str) -> str:
            return "tiny" if not path.endswith(".ocr.pdf") else "Z" * 5000

but the test filename seems to be scan.pdf so returns tiny not, the generated text.

Any idea what is happening there? It happened both when I ran the pytest locally and when run through the github CI action.

ndebard · 2026-02-03T17:14:22Z

@mikegros - please check if the 38ab6cb fixed this. thanks for finding this.

ndebard · 2026-02-03T17:33:19Z

I can't tell if what's failing is something that shouldn't be failing...

mikegros · 2026-02-03T18:56:05Z

I can't tell if what's failing is something that shouldn't be failing...

It looks like just the API key thing. I will test locally and if it's good, I will merge

luiarthur · 2026-02-04T21:44:12Z

examples/single_agent_examples/hypothesizer_agent/README.md

+Note that the first `[OCR]` line will only show up if the PDF reading fails and there
+are no text layers discovered (this `skips` some complex / lengthy OCR techniques
+and tries a quick and dirty one.).


Suggested change

Note that the first `[OCR]` line will only show up if the PDF reading fails and there

are no text layers discovered (this `skips` some complex / lengthy OCR techniques

and tries a quick and dirty one.).

Note that the first `[OCR]` line will show up only if the PDF reading fails and

no text layers are discovered (this `skips` some complex / lengthy OCR

techniques and tries a quick and dirty one.).

examples/single_agent_examples/hypothesizer_agent/README.md

luiarthur · 2026-02-04T21:49:33Z

src/ursa/tools/read_file_tool.py

+def _pdf_page_count(path: str) -> int:
+    try:
+        from pypdf import PdfReader
+
+        return len(PdfReader(path).pages)
+    except Exception:
+        return 0


Suggested change

def _pdf_page_count(path: str) -> int:

try:

from pypdf import PdfReader

return len(PdfReader(path).pages)

except Exception:

return 0

def _pdf_page_count(path: str) -> int:

from pypdf import PdfReader

return len(PdfReader(path).pages)

@ndebard pypdf is an existing dependency, so we don't need this check. Also, unless this import is slow, let's move this import to the top of the file.

luiarthur · 2026-02-04T22:10:33Z

src/ursa/tools/read_file_tool.py

+
+
+def _ocr_to_searchable_pdf(
+    src_pdf: str, out_pdf: str, *, mode: str = "skip"
+) -> None:
+    # mode:
+    #  - "skip":  only OCR pages that look like they need it (your current behavior)
+    #  - "force": rasterize + OCR everything (fixes vector/outlined “no images” PDFs)
+    cmd = ["ocrmypdf", "--rotate-pages", "--deskew", "--clean"]


Suggested change

def _ocr_to_searchable_pdf(

src_pdf: str, out_pdf: str, *, mode: str = "skip"

) -> None:

# mode:

# - "skip": only OCR pages that look like they need it (your current behavior)

# - "force": rasterize + OCR everything (fixes vector/outlined “no images” PDFs)

cmd = ["ocrmypdf", "--rotate-pages", "--deskew", "--clean"]

def ocrmypdf_is_installed() -> bool:

return shutil.which("ocrmypdf") is not None

def _ocr_to_searchable_pdf(

src_pdf: str, out_pdf: str, *, mode: str = "skip"

) -> None:

# mode:

# - "skip": only OCR pages that look like they need it (your current behavior)

# - "force": rasterize + OCR everything (fixes vector/outlined “no images” PDFs)

if not ocrmypdf_is_installed():

raise ImportError(

"ocrmypdf was not found in your path. "

"See installation instructions:"

"https://github.com/ocrmypdf/OCRmyPDF?tab=readme-ov-file#installation"

)

cmd = ["ocrmypdf", "--rotate-pages", "--deskew", "--clean"]

We should check to see ocrmypdf is installed.

ndebard added 3 commits February 1, 2026 10:22

adding support for OCR in read_file

bff9464

forgot to add this

4f35fab

Improve OCR handling in read_file tool and add OCR README

faf04be

mikegros self-requested a review February 1, 2026 23:07

Fix OCR tool invocation tests; fall back on OCR errors

38ab6cb

ndebard added 2 commits February 3, 2026 10:28

ruff fix

0f17cca

ruff fix 2 lol

e278ad7

Merging in changes to this and hopefully is ready to go. Testing now.

55549a0

mikegros merged commit 38fbe3c into lanl:main Feb 4, 2026
1 of 2 checks passed

luiarthur reviewed Feb 4, 2026

View reviewed changes

mikegros mentioned this pull request Feb 4, 2026

Updates to address review to the other PR #176

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

adding support for OCR in read_file#170

adding support for OCR in read_file#170
mikegros merged 7 commits intolanl:mainfrom
ndebard:ndebard/ocr-read-file

ndebard commented Feb 1, 2026

Uh oh!

ndebard commented Feb 1, 2026

Uh oh!

jsta commented Feb 2, 2026

Uh oh!

ndebard commented Feb 2, 2026

Uh oh!

mikegros commented Feb 3, 2026

Uh oh!

ndebard commented Feb 3, 2026

Uh oh!

ndebard commented Feb 3, 2026

Uh oh!

mikegros commented Feb 3, 2026

Uh oh!

Uh oh!

luiarthur Feb 4, 2026

Uh oh!

Uh oh!

Uh oh!

luiarthur Feb 4, 2026

Uh oh!

luiarthur Feb 4, 2026

Uh oh!

luiarthur Feb 4, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

ndebard commented Feb 1, 2026

Uh oh!

ndebard commented Feb 1, 2026

Uh oh!

jsta commented Feb 2, 2026

Uh oh!

ndebard commented Feb 2, 2026

Uh oh!

mikegros commented Feb 3, 2026

Uh oh!

ndebard commented Feb 3, 2026

Uh oh!

ndebard commented Feb 3, 2026

Uh oh!

mikegros commented Feb 3, 2026

Uh oh!

Uh oh!

luiarthur Feb 4, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

luiarthur Feb 4, 2026

Choose a reason for hiding this comment

Uh oh!

luiarthur Feb 4, 2026

Choose a reason for hiding this comment

Uh oh!

luiarthur Feb 4, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants