Smart way to remove appendix from pdf #3881
Unanswered
paulgekeler asked this question in Q&A
Replies: 0 comments
I have a dataset of PDFs of scientific papers, several hundred of which contain appendices. I would like to limit the document length by removing them. Almost all appendices begin with a section heading of the form "A [....]", so the easiest approach would be to search for a text pattern matching this heading and then remove all content and pages after it. However, I am sceptical that this would work robustly given inconsistent spacing, fonts, etc. Is this concern unfounded? Also, scenarios like the one on the attached PDF page are fairly common.

Is there an obvious way to do this that I missed in the documentation? Has anybody else done this successfully?
Thanks.
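
For reference, the heading-search idea described above could be sketched roughly as follows. This is a minimal sketch under stated assumptions, not a tested solution. The regex, the 60-character length cap, and the helper names (`is_appendix_heading`, `first_page_to_delete`, `truncate_pdf`) are all hypothetical. The thread does not name a library, so the last function assumes PyMuPDF (`pymupdf.open`, `page.get_text`, `doc.delete_pages`, `doc.save`). Whitespace is normalized before matching, so inconsistent spacing should matter less; fonts are ignored entirely.

```python
import re
import unicodedata
from typing import List, Optional

# Heuristic pattern (an assumption; tune it on your own corpus). Intended matches:
#   "Appendix", "APPENDIX A: Proofs", "A Proofs", "A. Additional Results"
APPENDIX_HEADING = re.compile(r"(?i:appendix|appendices)\b.*|A\.?\s+[A-Z].*")


def is_appendix_heading(line: str) -> bool:
    """Return True if a text line looks like the first appendix heading."""
    # NFKC folds non-breaking spaces and ligatures; split/join collapses spacing.
    norm = " ".join(unicodedata.normalize("NFKC", line).split())
    # Headings are short and rarely end with a full stop (heuristic).
    if not norm or len(norm) > 60 or norm.endswith("."):
        return False
    return APPENDIX_HEADING.fullmatch(norm) is not None


def first_page_to_delete(page_texts: List[str]) -> Optional[int]:
    """Return the 0-based index of the first page that is entirely appendix.

    If the heading is the first non-blank line of page i, the result is i.
    If it appears mid-page, page i still holds main text, so the result is
    i + 1 (which may equal len(page_texts), meaning nothing to delete).
    Returns None when no heading is found.
    """
    for i, text in enumerate(page_texts):
        lines = text.splitlines()
        for j, line in enumerate(lines):
            if is_appendix_heading(line):
                at_top = all(not prev.strip() for prev in lines[:j])
                return i if at_top else i + 1
    return None


def truncate_pdf(src: str, dst: str) -> None:
    """Save a copy of src without its appendix pages (assumes PyMuPDF)."""
    import pymupdf  # third-party; older releases use `import fitz` instead

    doc = pymupdf.open(src)
    start = first_page_to_delete([page.get_text() for page in doc])
    # Guard: a PDF cannot lose all of its pages, so skip when start == 0.
    if start is not None and 0 < start < doc.page_count:
        doc.delete_pages(from_page=start, to_page=doc.page_count - 1)
    doc.save(dst)
    doc.close()
```

This only removes whole pages. When the appendix starts mid-page, the page carrying the heading is kept in full, so the appendix text on that page survives. Running headers or page numbers above the heading would also defeat the top-of-page check. A bibliography entry whose line happens to start with "A " followed by a capital letter could trigger a false positive, so restricting the search to the last pages of each paper may be worth trying.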