pyPDF2 to PyMuPDF #2327

vicrem · 2023-04-09T09:19:57Z

Hi everyone,

Can someone please explain or show me how I can rewrite the following code to work in PyMuPDF? I have tried using widgets, but I am not fully happy with the result.

from PyPDF2 import PdfReader

# (methods excerpted from a class)
def sort_fields(self, field):
    # recurse into child fields if there are any, else return the field value
    form = field.get_object().get('/Kids')
    if form:
        return [self.sort_fields(f) for f in form]
    return field.get_object().get('/V')


def main(self):
    _pdf = PdfReader(self.pdf_file)
    fields = _pdf.trailer['/Root']['/AcroForm']['/Fields']
    for field in fields:
        _data = self.sort_fields(field)
        ...

Thanks in advance
//Victor

Answered by JorjMcKie

JorjMcKie · 2023-04-09T11:17:49Z

Maintainer

The issue here is that in PyMuPDF, fields (widgets) are accessed as children of pages. To simulate PyPDF2's behavior, one must use PyMuPDF's low-level functions, which still look similar enough to PyPDF2.
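For comparison, here is a minimal sketch of the page-based route. It walks each page's widgets and collects name/value pairs. `Page.widgets()`, `Widget.field_name` and `Widget.field_value` are part of PyMuPDF's widget API; the helper name `widget_values` is made up for illustration. This yields a flat mapping of the widgets found on pages, not the nested /Kids hierarchy, which is probably why widgets alone did not feel satisfying.

```python
def widget_values(doc):
    """Collect {field_name: field_value} from the widgets of every page.

    `doc` is expected to be an open PyMuPDF document (or anything iterable
    that yields pages offering a `widgets()` method).
    """
    values = {}
    for page in doc:
        for widget in page.widgets():
            # Widgets sharing a name (e.g. radio button groups) overwrite
            # each other here; the last one seen wins.
            values[widget.field_name] = widget.field_value
    return values


# Usage (requires PyMuPDF; "input.pdf" is a placeholder file name):
#   import fitz
#   doc = fitz.open("input.pdf")
#   print(widget_values(doc))
```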

With the low-level functions, you can access all of a PDF's objects directly in a syntax close to PDF source code. So your above code would look like this:

import fitz, sys

def sort_fields(doc, field_xref):  # invoke with a field's cross ref number
    kids = doc.xref_get_key(field_xref, "Kids")  # are there kids?
    if kids[0] == "array":  # extract xref numbers from the kids array
        xrefs = list(map(int, kids[1][1:-1].replace("0 R", "").split()))
        return [sort_fields(doc, i) for i in xrefs]  # return list of kid field values
    return doc.xref_get_key(field_xref, "V")[1]  # return the field value


def main():
    doc = fitz.open("input.pdf")  # open PDF
    root = doc.pdf_catalog()  # access its catalog
    field_xrefs = doc.xref_get_key(root, "AcroForm/Fields")[1]  # extract array of field xref numbers
    if field_xrefs == "null":
        sys.exit("Document has no fields")
    # the array looks like "[4711 0 R 4712 0 R 123 0 R ...]"
    # xref numbers of all fields
    field_xrefs = list(map(int, field_xrefs[1:-1].replace("0 R", "").split()))
    for xref in field_xrefs:
        value = sort_fields(doc, xref)
        print(f"Field {xref} has value '{value}'.")


if __name__ == "__main__":
    main()

Note that in PDF, the string "null" represents what is called None in Python.
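The `replace("0 R", "")` trick above assumes every reference in the array has generation number 0, which is the usual case but not guaranteed. A slightly more defensive sketch uses a regular expression and treats "null" as an empty list. The helper name `parse_xref_array` is hypothetical, not part of PyMuPDF.

```python
import re

# One indirect reference looks like "<xref> <generation> R"; capture the xref.
_REF = re.compile(r"(\d+)\s+\d+\s+R")


def parse_xref_array(value):
    """Return the xref numbers in a PDF array string like "[4711 0 R 123 0 R]".

    The string "null" (PDF's counterpart of Python's None) gives an empty list.
    """
    if value == "null":
        return []
    return [int(xref) for xref in _REF.findall(value)]


print(parse_xref_array("[4711 0 R 4712 0 R 123 0 R]"))  # [4711, 4712, 123]
print(parse_xref_array("[10 0 R 20 1 R]"))              # [10, 20]
print(parse_xref_array("null"))                         # []
```

In the code above, both `kids[1][1:-1].replace("0 R", "").split()` and the matching line in `main()` could be swapped for a call to this helper.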

vicrem · 2023-04-10T01:12:01Z

Author

Thank you @JorjMcKie! Your code does exactly what I want :) How can I buy you a beer?
