Extracting dates from a PDF #3354

mindphil · 2025-07-02T20:04:52Z

mindphil
Jul 2, 2025

Hello, I'm working on extracting dates from PDFs using PyPDF2 and re (for adding exceptions). What are some clever ways to increase the accuracy of the extracted strings? I'm running into false positives, and some of the extracted dates aren't even in the PDF to begin with.

stefan6419846 · 2025-07-03T07:48:54Z

stefan6419846
Jul 3, 2025
Maintainer

As a first step, consider switching to pypdf, as PyPDF2 is EOL. Further recommendations are hard to give in general, but it might help to know the approximate location on the page and combine that with a text visitor function to retrieve only the relevant values.
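A rough sketch of the visitor idea, in case it helps. pypdf's `extract_text()` accepts a `visitor_text` callback that receives each text fragment along with its matrices. The helper names `make_region_visitor` and `extract_region_text` and the y-range values are made up for illustration. Reading the vertical position from `tm[5]` is also an assumption: depending on how the PDF was produced, the offset may sit in `cm[5]` instead, so check it against your own files.

```python
def make_region_visitor(y_min, y_max):
    """Build a visitor that keeps only text whose y position is in [y_min, y_max].

    Hypothetical helper; the bounds are in PDF user-space units.
    """
    parts = []

    def visitor(text, cm, tm, font_dict, font_size):
        # Assumption: tm[5] holds the vertical offset of this fragment.
        y = tm[5]
        if y_min <= y <= y_max:
            parts.append(text)

    return visitor, parts


def extract_region_text(path, page_index=0, y_min=700, y_max=800):
    """Return the text found in a horizontal band of one page."""
    from pypdf import PdfReader  # imported lazily so the helper above stays stdlib-only

    visitor, parts = make_region_visitor(y_min, y_max)
    PdfReader(path).pages[page_index].extract_text(visitor_text=visitor)
    return "".join(parts)
```

Something like `extract_region_text("invoice.pdf")` would then give only the band near the top of the first page (the file name and the 700 to 800 band are placeholders), and the date regex can run on that instead of the full text.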

1 reply

mindphil Jul 3, 2025
Author

Switching to pypdf was the right call. I made some major headway by calling earlier functions that identify the document type. If it is type x, the right date is usually in the header, so we read only the first page and extract the first 200 characters. If it's type y, the date is generally on the last page.
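For anyone landing here later, the selection step looks roughly like the sketch below. This is a simplified illustration, not the real code: `candidate_text` and `read_page_texts` are placeholder names, "x" and "y" stand in for the actual document types, and the fallback to the full text for unknown types is an assumption.

```python
def read_page_texts(path):
    """Return the extracted text of every page (empty string where nothing is found)."""
    from pypdf import PdfReader  # imported lazily; only needed when reading a real file

    return [page.extract_text() or "" for page in PdfReader(path).pages]


def candidate_text(page_texts, doc_type, header_chars=200):
    """Narrow the text that the date patterns are applied to, based on document type."""
    if not page_texts:
        return ""
    if doc_type == "x":
        # Date is usually in the header: first page, first 200 characters.
        return page_texts[0][:header_chars]
    if doc_type == "y":
        # Date is generally on the last page.
        return page_texts[-1]
    # Unknown type (assumed fallback): search everything.
    return "\n".join(page_texts)
```

The date patterns then only see the output of `candidate_text(read_page_texts(path), doc_type)` rather than the whole document.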

I already had a comprehensive (as far as I could determine) set of date string patterns, but only having to apply them to a smaller selection was huge for accuracy. I don't think it is really helpful, but I also wrote a function to ignore obviously incorrect dates.
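A minimal sketch of what such a filter might look like, using only `re` and `datetime`. The two patterns, the `plausible_dates` name, and the 1990 lower bound are illustrative assumptions, not the actual pattern set. The slash format is read as month/day/year here, which is also an assumption.

```python
import re
from datetime import date, datetime

# Illustrative patterns only; a real set would cover more formats.
PATTERNS = [
    (re.compile(r"\b\d{4}-\d{2}-\d{2}\b"), "%Y-%m-%d"),
    (re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b"), "%m/%d/%Y"),
]


def plausible_dates(text, earliest=date(1990, 1, 1), latest=None):
    """Return parsed dates from text, dropping impossible or out-of-range ones."""
    latest = latest or date.today()
    found = []
    for pattern, fmt in PATTERNS:
        for match in pattern.finditer(text):
            try:
                parsed = datetime.strptime(match.group(), fmt).date()
            except ValueError:
                # Looks like a date but is not one (month 13, Feb 30, ...).
                continue
            if earliest <= parsed <= latest:
                found.append(parsed)
    return found
```

Parsing each match with `strptime` rejects strings that only look like dates, and the range check drops values that are valid dates but implausible for the documents in question.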
