
1552 implement needs ocr in sample caller #1554

Open
Luis-manzur wants to merge 8 commits into main from 1552-implement-needs_ocr-in-sample-caller

@Luis-manzur
Contributor

This pull request adds a new --ocr-available flag to the sample_caller script, enhancing its ability to detect when OCR (Optical Character Recognition) should be used for document extraction. It introduces a new utility module for robustly detecting when OCR is needed, and updates the workflow to leverage this logic when the flag is set.

This PR addresses #1552.
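For illustration only, here is a rough sketch of how a boolean flag such as `--ocr-available` might gate an OCR step. It assumes `argparse`; the actual option parsing in sample_caller may differ, and `build_parser` and `should_run_ocr` are hypothetical names that do not come from this PR.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser exposing only the flag discussed in this PR."""
    parser = argparse.ArgumentParser(prog="sample_caller")
    parser.add_argument(
        "--ocr-available",
        action="store_true",  # the flag takes no value; present means True
        default=False,
        help="Allow OCR when plain text extraction looks incomplete.",
    )
    return parser


def should_run_ocr(ocr_available: bool, text_needs_ocr: bool) -> bool:
    """Run OCR only if the caller opted in AND the text looks incomplete."""
    return ocr_available and text_needs_ocr


if __name__ == "__main__":
    args = build_parser().parse_args()
    # text_needs_ocr would come from a detection helper; hard-coded here.
    print(should_run_ocr(args.ocr_available, text_needs_ocr=True))
```

Keeping the opt-in check separate from the detection result means the default behavior stays unchanged when the flag is omitted.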

@Luis-manzur Luis-manzur requested a review from grossir August 21, 2025 16:06
@Luis-manzur Luis-manzur linked an issue Aug 21, 2025 that may be closed by this pull request
@Luis-manzur Luis-manzur requested a review from flooie August 21, 2025 16:06
@Luis-manzur Luis-manzur moved this to PRs to Review in Sprint (Case Law) Aug 21, 2025
@flooie
Contributor

flooie commented Aug 22, 2025

Can you explain the point of the ocr utils, and is_doc_common_header in particular? I'm not sure I understand why we are adding this.

Sample caller is meant to be just that, a sample caller?

@flooie flooie assigned Luis-manzur and unassigned flooie Aug 22, 2025
@Luis-manzur
Contributor Author

> Can you explain the point of the ocr utils, and is_doc_common_header in particular? I'm not sure I understand why we are adding this.
>
> Sample caller is meant to be just that, a sample caller?

I added it so I could test certain scenarios, such as in texbizct, where in some PDFs much of the information is in images rather than plain text.

So this is to simulate CL's behavior when necessary.

ocr_utils and its functions are the same ones originally used in CL to detect whether OCR is needed. The problem with that implementation is that it was written for PACER documents, so I added the option to detect missing pages, which I consider a more general approach.

These changes should also be applied in CL.
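As a minimal sketch of the idea described above (not the code in this PR), the check below treats extracted text as needing OCR when any page has no content left after header-like lines are ignored. It assumes pages are separated by a form feed (`\f`), as `pdftotext` output usually is. The names `HEADER_PREFIXES`, `looks_like_page_header`, `page_has_content`, and `needs_ocr`, and the prefix list itself, are illustrative assumptions.

```python
# Assumed PACER-style header prefixes; purely illustrative.
HEADER_PREFIXES = ("Case", "Appellate", "Appeal", "USCA")


def looks_like_page_header(line: str) -> bool:
    """Heuristic: a line starting with a known prefix is a stamped header."""
    return line.strip().startswith(HEADER_PREFIXES)


def page_has_content(page: str) -> bool:
    """True if the page holds at least one non-blank, non-header line."""
    for line in page.splitlines():
        stripped = line.strip()
        if stripped and not looks_like_page_header(stripped):
            return True
    return False


def needs_ocr(text: str, page_separator: str = "\f") -> bool:
    """True if any page appears to be missing its text layer."""
    pages = text.split(page_separator)
    # A trailing separator leaves one empty chunk at the end; drop it so it
    # is not mistaken for a blank page.
    if len(pages) > 1 and not pages[-1].strip():
        pages = pages[:-1]
    return any(not page_has_content(page) for page in pages)
```

Checking page by page, rather than the document as a whole, is what lets a mostly-text PDF with a few image-only pages still be flagged.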

@Luis-manzur Luis-manzur requested a review from Copilot August 22, 2025 15:43

This comment was marked as spam.



Projects

Status: PRs to Review

Development

Successfully merging this pull request may close these issues.

Implement needs_ocr in sample caller

3 participants