Retrieve all the versions of a PDF when saved incrementally #3096

3051360 · 2024-01-24T17:02:03Z

I've observed that PyMuPDF provides functionality to determine the number of versions present in a document.

Currently, I'm utilizing pdfresurrect, a C tool/library designed to "extract older 'hidden' versions of a PDF from the current PDF."

While my larger application is built using PyMuPDF, I have to call the compiled C code through Python for version retrieval, which is not an ideal way of building applications.

I propose adding the ability to retrieve every saved version of a PDF directly within PyMuPDF.

cipri-tom · 2024-01-24T19:13:37Z

Here's how I do it:

        pattern = re.compile(b'[^t]xref.*?EOF', flags=re.M | re.S)

        # `inc_nb` is the number of incremental updates contained in the binary
        matches = list(pattern.finditer(doc.data))
        inc_nb = len(matches)
        if inc_nb <= 1:
            return result

        # Loop over the matches to separate each incremental update in
        # the binary and retrieve each version of the file.
        # The current (last) version is skipped.
        for sub in matches[:-1]:
            stream = doc.data[:sub.end()]  # binary content of one version
            if not stream:
                continue
            previous = Doc(stream).as_fitz()  # one of the previous versions of the file
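
For anyone reading along, here is a small sketch of what that pattern actually matches. The `fake_pdf` bytes are a made-up, heavily shortened stand-in for a file that was saved once and then updated incrementally, not a valid PDF. The `[^t]` in front of `xref` is what skips `startxref`, so each match runs from one `xref` keyword to the next `EOF`.

```python
import re

pattern = re.compile(rb"[^t]xref.*?EOF", flags=re.M | re.S)

# Hypothetical stand-in for a PDF with one incremental update (not a real PDF).
fake_pdf = (
    b"%PDF-1.7\n1 0 obj\nendobj\n"
    b"xref\n0 2\ntrailer\nstartxref\n25\n%%EOF\n"  # original save
    b"2 0 obj\nendobj\n"
    b"xref\n2 1\ntrailer\nstartxref\n80\n%%EOF\n"  # incremental update
)

matches = list(pattern.finditer(fake_pdf))

# Offsets at which each version ends; a version is fake_pdf[:end].
ends = [m.end() for m in matches]
print(len(matches), ends)
```

One caveat worth keeping in mind: files written with cross-reference streams (PDF 1.5 and later) may contain no standalone `xref` keyword at all, in which case this pattern would find nothing.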

2 replies

3051360 Jan 28, 2024
Author

@cipri-tom Thanks for sharing the approach. I get the essence of what you are trying to do. As I understand it, you have shared the part of the code that could ultimately retrieve all the versions of an incrementally saved PDF file.

I attempted to complete the overall code (with some dummy code in between) but obviously did not succeed.

import fitz
import re
import os

def extract_versions(input_pdf_path, output_folder):
    doc = fitz.open(input_pdf_path)

    pattern = re.compile(b'[^t]xref.*?EOF', flags=re.M + re.S)

    matches = list(pattern.finditer(doc.get_raw_data()))

    if len(matches) <= 1:
        print("Only one version found. No incremental updates.")
        return

    os.makedirs(output_folder, exist_ok=True)

    for i, sub in enumerate(matches[:-1]):
        start_offset, end_offset = sub.span()
        version_data = doc.get_raw_data()[start_offset:end_offset]

        version_doc = fitz.open()

        version_doc.insert_pdf(fitz.open(fitz.Stream(fitz.SOString, version_data)))

        output_path = os.path.join(output_folder, f"version_{i + 1}.pdf")
        version_doc.save(output_path)
        print(f"Version {i + 1} saved to: {output_path}")

        version_doc.close()

    doc.close()

I am not sure how to proceed further but maybe someone looking for a similar feature could find this code useful.

As @JorjMcKie suggested later in this thread, I will send a request for this feature to the people at MuPDF.

cipri-tom Feb 10, 2024

The problem here is that you take version_data from start_offset to end_offset. Instead, you should always slice from the start of the file up to end_offset. That is what makes the save "incremental": new data is appended and works in addition to the existing data.
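
A minimal sketch of that fix, reusing the pattern from this thread as-is. The helper names `split_versions` and `save_versions` are made up for illustration, and opening the bytes with `fitz.open(stream=..., filetype="pdf")` is my assumption about how to load each prefix; treat it as a starting point rather than a tested solution for every PDF.

```python
import os
import re

# Same pattern as earlier in the thread: "xref" not preceded by "t"
# (which skips "startxref"), lazily extended to the next "EOF".
XREF_TO_EOF = re.compile(rb"[^t]xref.*?EOF", flags=re.M | re.S)


def split_versions(data):
    """Return one bytes object per version found, oldest first.

    Every version is a prefix of the file: byte 0 up to the end of
    that version's EOF marker, never a slice starting at the match.
    """
    return [data[: m.end()] for m in XREF_TO_EOF.finditer(data)]


def save_versions(input_pdf_path, output_folder):
    """Write each version of input_pdf_path into output_folder."""
    import fitz  # PyMuPDF; imported here so split_versions works without it

    with open(input_pdf_path, "rb") as fh:
        data = fh.read()

    os.makedirs(output_folder, exist_ok=True)
    paths = []
    for i, version_data in enumerate(split_versions(data), start=1):
        # Open the prefix from memory and save it as its own file.
        version_doc = fitz.open(stream=version_data, filetype="pdf")
        path = os.path.join(output_folder, f"version_{i}.pdf")
        version_doc.save(path)
        version_doc.close()
        paths.append(path)
    return paths
```

Calling `save_versions("input.pdf", "versions")` (both names are placeholders) would write `version_1.pdf`, `version_2.pdf`, and so on. The last entry is the current version; drop it with `[:-1]` if only the older ones are wanted.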

JorjMcKie · 2024-01-25T10:42:35Z

Maintainer

This feature would have to be implemented by our base library MuPDF to make any sense.
Currently, you can only find out the number of versions of a PDF, via doc.version_count.

Please talk to the MuPDF colleagues, e.g. on their Discord channel, about options they may see.
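
For reference, a short sketch of reading that count. The helper name `count_versions` is made up, and loading the PDF from bytes via `fitz.open(stream=..., filetype="pdf")` is just one way to get a document; `version_count` is the property mentioned above.

```python
def count_versions(pdf_bytes):
    """Return PyMuPDF's version count for a PDF held in memory."""
    import fitz  # PyMuPDF

    doc = fitz.open(stream=pdf_bytes, filetype="pdf")
    try:
        return doc.version_count
    finally:
        doc.close()
```

This only tells you how many versions exist; it does not give access to their content.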

1 reply

3051360 Jan 28, 2024
Author

@JorjMcKie Thank you for your response, I will reach out to people at MuPDF to see if they could point me to something.
