Split single PDF into multiple ones using a delimiter page #459

denvercoder21 · 2021-01-10T22:19:41Z

I was in the middle of implementing a tool for archiving and auto-uploading my documents when I stumbled across paperless, which is much more powerful and mature than my little pet project.

One thing my setup does, however: I can feed a stack of paper consisting of multiple documents (up to my device's 50-page limit) to my scanner, which uploads it to the inbox folder of my server as a single file (of course). My server application then searches the pages of the file for separator pages containing a QR code. Their positions are used to cut the large file into separate documents again.

Example: (numbers are page numbers)

1 doc 1
2 doc 1
3 doc 1
4 QR page
5 doc 2
6 QR page
7 doc 3
8 doc 3

Will be cut into:

1 doc 1
2 doc 1
3 doc 1

1 doc 2

1 doc 3
2 doc 3
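
The cut shown above can be sketched in a few lines of Python. This is only an illustration, not the code from split.py: `split_at_separators` and the string labels are made up for the example, and recognising the QR code on a page is assumed to happen elsewhere (here it is just a callback).

```python
from typing import Callable, List, Sequence, TypeVar

Page = TypeVar("Page")


def split_at_separators(
    pages: Sequence[Page], is_separator: Callable[[Page], bool]
) -> List[List[Page]]:
    """Group pages into documents, dropping every separator page."""
    documents: List[List[Page]] = []
    current: List[Page] = []
    for page in pages:
        if is_separator(page):
            # A separator closes the current document; empty groups
            # (leading separator, two separators in a row) are skipped.
            if current:
                documents.append(current)
            current = []
        else:
            current.append(page)
    if current:
        documents.append(current)
    return documents


# The eight-page stack from the example above.
stack = ["doc 1", "doc 1", "doc 1", "QR page", "doc 2", "QR page", "doc 3", "doc 3"]
documents = split_at_separators(stack, lambda page: page == "QR page")
print([len(doc) for doc in documents])  # [3, 1, 2]
```

With a real PDF library, the page objects would take the place of the strings, and each resulting group would be written out as its own file.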

Support for this by paperless would be cool! My code is here:
https://github.com/denvercoder21/split-pdf/blob/main/split.py

I haven't taken the time yet to look through paperless' code, so I can't tell yet whether I'm confident creating a PR myself.

Let me know what you think!

jonaswinkler · 2021-01-11T11:31:18Z

Maintainer

You scan the same QR code page in between documents, right? Sounds useful to me.

However, it's pretty hard to get that functionality into paperless. Let me elaborate on how the consumption pipeline works real quick:

In general, Paperless is able to consume any files. Different file types are supported by different "parsers", which convert files into text. There's one parser for PDF documents, which performs OCR.
After a file has been found in the consumption directory / uploaded on the dashboard / found in an email attachment, it is sent to the consumer.
The consumer checks the file type, loads the appropriate parser (or cancels if the file is not supported), and calls the parser to translate that file into text.
These parsers are designed to process one file at a time, and produce exactly one document for paperless.
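
To make the "one file in, one document out" constraint concrete, here is a rough sketch of such a pipeline. It is not Paperless's actual code: `Document`, `PARSERS` and `consume` are hypothetical names, the lookup by file suffix is an assumption made for brevity, and the PDF "parser" is a stub standing in for OCR.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Callable, Dict, Optional


@dataclass
class Document:
    source: Path
    text: str


# Hypothetical registry: file suffix -> parser that turns one file into text.
PARSERS: Dict[str, Callable[[Path], str]] = {
    ".txt": lambda path: path.read_text(encoding="utf-8"),
    # A real PDF parser would run OCR here; this stub only marks the spot.
    ".pdf": lambda path: f"<text extracted from {path.name}>",
}


def consume(path: Path) -> Optional[Document]:
    """One file in, at most one document out."""
    parser = PARSERS.get(path.suffix.lower())
    if parser is None:
        return None  # unsupported file type: consumption is cancelled
    return Document(source=path, text=parser(path))
```

Because `consume` returns a single `Document`, a parser that wanted to emit several documents from one PDF would have no way to hand them back without changing this signature and everything that calls it.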

The issue is as follows: The only place where we're sure we're dealing with PDF documents (and not text files / office documents) is inside the PDF parser. However, at that place, we're limited to producing exactly one document. Changing that requires many changes to how the consumption pipeline works, invalidates many test cases, etc. The key file is documents/consumer.py, and the method is try_consume_file.

Adding that to the consumption folder watcher (management/document_consumer.py, I need to rename that) is possible, but sounds like a rather special feature.

I've got a better idea:

Make this a stand-alone script that continuously watches a specified folder (just as paperless does). It checks the type of each file and, if it's a PDF, performs the slicing operations and moves the resulting files into the consumption folder of paperless. If it's not a PDF, it simply moves the file into the consumption directory.
Make a Docker image, so that people can add that easily to their compose files like this:

services:

  webserver:
    image: jonaswinkler/paperless-ng:latest
    ...
    volumes:
      - data:/usr/src/paperless/data
      - media:/usr/src/paperless/media
      - consume-internal:/usr/src/paperless/consume
  
  qrcode-splitter:
    image: someone/paperless-qrcode-splitter
    restart: unless-stopped
    volumes:
      - consume-internal:/output
      - ./consume:/input
    environment:
      QR_MAGIC_CONTENT: The decoded content of the qr code used to split documents

volumes:
  consume-internal:
  data:
  media:

The key here is that both containers would "communicate" through that internal consumption folder.
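
The script running inside such a splitter container could look roughly like the sketch below. Everything in it is an assumption for illustration: the function names are made up, `split_pdf` is a hypothetical hook that would do the QR-based slicing, and plain polling is used where a real service might prefer inotify. It also does not handle files that are still being written when they are picked up.

```python
import shutil
import time
from pathlib import Path
from typing import Callable, Iterable


def is_pdf(path: Path) -> bool:
    """Check the file signature instead of trusting the extension."""
    with path.open("rb") as handle:
        return handle.read(5) == b"%PDF-"


def process_once(
    input_dir: Path,
    output_dir: Path,
    split_pdf: Callable[[Path, Path], Iterable[Path]],
) -> None:
    """Handle every file currently in input_dir exactly once."""
    for path in sorted(input_dir.iterdir()):
        if not path.is_file():
            continue
        if is_pdf(path):
            # split_pdf is a hypothetical hook: it writes the parts into
            # output_dir and returns their paths.
            split_pdf(path, output_dir)
            path.unlink()
        else:
            # Anything that is not a PDF is passed through untouched.
            shutil.move(str(path), str(output_dir / path.name))


def watch(input_dir: Path, output_dir: Path, split_pdf, interval: float = 5.0) -> None:
    """Poll the input folder forever, sleeping between passes."""
    while True:
        process_once(input_dir, output_dir, split_pdf)
        time.sleep(interval)
```

In the compose file above, `input_dir` would be `/input` and `output_dir` would be `/output`, so that everything the splitter writes ends up in the consumption folder paperless watches.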

I'm not doing that though, but I can give some directions and hints if someone wants to take a stab at it.

mandomal · 2021-01-14T06:03:12Z

What a coincidence. I just wrote some code to do exactly this today. However, my delimiter is just a blank page. So far it works fine, but my code needs some improvement. I don't want to make this an ongoing project for myself, so once I've made it usable for myself I'll upload and link the project for anyone to fork and improve upon.

axlgit · 2021-04-02T11:47:00Z

Why not have different QR codes instruct Paperless-ng on what to do with the following pages? If an option is not understood, the default would apply. We would need a web page from which to download/print the QR codes understood by that Paperless-ng version.

Mannshoch · 2021-05-03T15:24:29Z

May I join this discussion? I'm not a developer, but might it be an idea to do this:
I would propose creating two workflows with some points where we could hook in our own scripts.

4 replies

allFunAndGames Feb 1, 2022

Curious if this discussion was ever taken as far as anyone releasing some code?

My thought was that paperless-ng already has the pre-consumption script feature. Couldn't this call the separator, which then writes successive documents into the consumption folder, leaving only the first document's pages in the original file? This then goes for processing, whereupon the consumer grabs the next file (not caring whether we dumped the file there or our script did), and the process repeats. Perhaps, to save time by not trying to 'resplit' already split documents, a tag could be added to the PDF, or perhaps a suffix to the filename.
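
The filename-suffix idea could be sketched like this. The marker `_split` and both helper names are invented for the example and are not a Paperless convention; how the pre-consumption script is handed the file path is left out on purpose.

```python
from pathlib import Path
from typing import List

SPLIT_MARKER = "_split"  # hypothetical filename marker, not a Paperless convention


def already_split(path: Path) -> bool:
    """True if the file name carries the marker, e.g. scan_split2.pdf."""
    head, sep, tail = path.stem.rpartition(SPLIT_MARKER)
    return bool(sep) and tail.isdigit()


def part_names(original: Path, count: int) -> List[Path]:
    """Names for documents 2..count; the first one stays in `original`."""
    return [
        original.with_name(f"{original.stem}{SPLIT_MARKER}{index}{original.suffix}")
        for index in range(2, count + 1)
    ]


print(part_names(Path("/consume/scan.pdf"), 3))
print(already_split(Path("/consume/scan_split2.pdf")))  # True
```

A script built on this would return early when `already_split` is true, so the parts it dropped into the consumption folder are not scanned for separator pages a second time.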

pkoerner81929 Feb 1, 2022

You may have a look at #1320

I have implemented a local solution to do the split based on a QR code sticker. "Works for me", but it may be adaptable for pre/post scripts etc. I am referring to some existing solutions (scanbd, pdf-splitter.sh, sane-scan-pdf etc.).

allFunAndGames Feb 1, 2022

excellent, thanks Peter! will have a look now, much appreciated

torwag Feb 4, 2022

I have something almost ready ... it will be an additional Docker service, which splits documents and removes empty pages. One could easily add this to the data input pipeline by placing it in front of paperless-ng.
I still struggle with the Docker part a bit. Once it is ready, I would like to release it into the wild.
It has a fairly open architecture, so it could perform other tasks as well: image processing, adding watermarks, etc.

Nevertheless, I have noticed that I have far less use for it than I originally thought. Most of the time I simply scan document by document, which doesn't take much longer than scanning everything in one go and having to add and remove separator pages.

torwag · 2022-02-04T14:33:12Z

I would also like to add to the discussion that attaching something to the pages (stickers, etc.) will most likely trigger the fault detection of ADF scanners. Thus, the only way I saw was to print out sheets with a QR code.
I briefly thought about being able to command what to do by using different QR codes (like telling paperless to add a tag, correspondent, etc.), but:

As the tool is external, it is trickier to communicate with paperless, since at that stage of processing paperless still has no clue about the incoming documents.
I feel this overcomplicates daily handling. If I have to choose between several separator pages and remove and sort them later on, I might spend more time preparing and post-processing the papers than I would simply tagging the documents once they have arrived in paperless.
Conversely, I thought about controlling the scanner: having a separator page that tells the scanner to scan the following pages, up to the next separator page, with certain settings (colour, b/w, greyscale, double-sided, single-sided, resolution, etc.). However, that solution would require code very specific to a particular scanner, definitely something that would not work anywhere else.
Finally, I came to the conclusion to stick to simple separator pages for now.
Maybe a future version would allow controlling the processing of the files itself, like switching blank-page removal "on"/"off", additional image processing, etc.

