Commit b0e0526

Add Widget Support in Method "Document.insert_pdf"
We previously omitted form fields in source PDFs when merging PDFs via "target.insert_pdf(source)". This feature has frequently been requested. This fix now adds the feature as an optional category of page objects, alongside the already supported annotations and links.
1 parent a5fa4a8 commit b0e0526

File tree

8 files changed

+196
-37
lines changed

docs/document.rst

Lines changed: 7 additions & 7 deletions
@@ -1155,7 +1155,7 @@ For details on **embedded files** refer to Appendix 3.

 Please consider that annotations are complex objects and may consist of more data "underneath" their visual appearance. Examples are "Text" and "FileAttachment" annotations. When "baking in" annotations / widgets with this method, all this underlying information (attached files, comments, associated PopUp annotations, etc.) will be lost and be removed on next garbage collection.

-Use this feature for instance for methods :meth:`Document.insert_pdf` (which supports no copying of widgets) or :meth:`Page.show_pdf_page` (which supports neither annotations nor widgets) when the source pages should look exactly the same in the target.
+Use this feature for instance for :meth:`Page.show_pdf_page` (which supports neither annotations nor widgets) when the source pages should look exactly the same in the target.


 :arg bool annots: convert annotations.
@@ -1293,13 +1293,12 @@ For details on **embedded files** refer to Appendix 3.
 pair: rotate; Document.insert_pdf
 pair: links; Document.insert_pdf
 pair: annots; Document.insert_pdf
+pair: widgets; Document.insert_pdf
 pair: show_progress; Document.insert_pdf

-.. method:: insert_pdf(docsrc, from_page=-1, to_page=-1, start_at=-1, rotate=-1, links=True, annots=True, show_progress=0, final=1)
+.. method:: insert_pdf(docsrc, from_page=-1, to_page=-1, start_at=-1, rotate=-1, links=True, annots=True, widgets=True, show_progress=0, final=1)

-* Changed in v1.19.3 - as a fix to issue `#537 <https://github.com/pymupdf/PyMuPDF/issues/537>`_, form fields are always excluded.
-
-PDF only: Copy the page range **[from_page, to_page]** (including both) of PDF document *docsrc* into the current one. Inserts will start with page number *start_at*. Value -1 indicates default values. All pages thus copied will be rotated as specified. Links and annotations can be excluded in the target, see below. All page numbers are 0-based.
+PDF only: Copy the page range **[from_page, to_page]** (including both) of PDF document *docsrc* into the current one. Inserts will start with page number *start_at*. Value -1 indicates default values. All pages thus copied will be rotated as specified. Links, annotations and widgets can be excluded in the target, see below. All page numbers are 0-based.

 :arg docsrc: An opened PDF *Document* which must not be the current document. However, it may refer to the same underlying file.
 :type docsrc: *Document*
@@ -1313,13 +1312,14 @@ For details on **embedded files** refer to Appendix 3.
 :arg int rotate: All copied pages will be rotated by the provided value (degrees, integer multiple of 90).

 :arg bool links: Choose whether (internal and external) links should be included in the copy. Default is `True`. *Named* links (:data:`LINK_NAMED`) and internal links to outside the copied page range are **always excluded**.
-:arg bool annots: *(new in v1.16.1)* choose whether annotations should be included in the copy. Form **fields can never be copied** -- see below.
+:arg bool annots: choose whether annotations should be included in the copy.
+:arg bool widgets: choose whether form fields (widgets) should be included in the copy. If `True` and at least one of the source pages contains form fields, the target PDF will be turned into a Form PDF (if it is not one already).
 :arg int show_progress: *(new in v1.17.7)* specify an interval size greater zero to see progress messages on `sys.stdout`. After each interval, a message like `Inserted 30 of 47 pages.` will be printed.
 :arg int final: *(new in v1.18.0)* controls whether the list of already copied objects should be **dropped** after this method, default *True*. Set it to 0 except for the last one of multiple insertions from the same source PDF. This saves target file size and speeds up execution considerably.

 .. note::

-1. This is a page-based method. Document-level information of source documents is therefore ignored. Examples include Optional Content, Embedded Files, `StructureElem`, `AcroForm`, table of contents, page labels, metadata, named destinations (and other named entries) and some more. As a consequence, specifically, **Form Fields (widgets) can never be copied** -- although they seem to appear on pages only. Look at :meth:`Document.bake` for converting a source document if you need to retain at least widget **appearances.**
+1. This is a page-based method. Document-level information of source documents is therefore mostly ignored. Examples include Optional Content, Embedded Files, `StructureElem`, table of contents, page labels, metadata, named destinations (and other named entries) and some more.

 2. If `from_page > to_page`, pages will be **copied in reverse order**. If `0 <= from_page == to_page`, then one page will be copied.
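
As a hedged illustration of the `widgets` argument documented in this hunk, the sketch below wraps a single `insert_pdf` call. The helper name `append_pdf` and its `keep_widgets` parameter are hypothetical and not part of PyMuPDF. The function only assumes that `target` provides `insert_pdf(...)` and `page_count` as described above, so it should work with opened PDF documents from a PyMuPDF build that contains this commit.

```python
def append_pdf(target, source, *, keep_widgets=True):
    """Append all pages of `source` to `target` and return the new page count.

    Hypothetical helper for illustration only. `target` and `source` are
    expected to be opened PDF documents, e.g. from pymupdf.open(...).
    """
    # Links and annotations keep their documented defaults (True);
    # only the form-field behaviour is selected by the caller.
    target.insert_pdf(source, widgets=keep_widgets)
    return target.page_count
```

With real files this might be called as `append_pdf(pymupdf.open("a.pdf"), pymupdf.open("b.pdf"))`. Passing `keep_widgets=False` should leave form fields out of the copy, which matches the behavior before this commit.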

docs/the-basics.rst

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -198,7 +198,7 @@ With :meth:`Document.insert_file` you can invoke the method to merge :ref:`suppo
198198

199199
**Taking it further**
200200

201-
It is easy to join PDFs with :meth:`Document.insert_pdf` & :meth:`Document.insert_file`. Given open |PDF| documents, you can copy page ranges from one to the other. You can select the point where the copied pages should be placed, you can revert the page sequence and also change page rotation. This Wiki `article <https://github.com/pymupdf/PyMuPDF/wiki/Inserting-Pages-from-other-PDFs>`_ contains a full description.
201+
It is easy to join PDFs with :meth:`Document.insert_pdf` & :meth:`Document.insert_file`. Given open |PDF| documents, you can copy page ranges from one to the other. You can select the point where the copied pages should be placed, you can revert the page sequence and also change page rotation.
202202

203203
The GUI script `join.py <https://github.com/pymupdf/PyMuPDF-Utilities/blob/master/examples/join-documents/join.py>`_ uses this method to join a list of files while also joining the respective table of contents segments. It looks like this:
204204

src/__init__.py

Lines changed: 27 additions & 23 deletions
Original file line numberDiff line numberDiff line change
@@ -4589,12 +4589,14 @@ def insert_file(self,
45894589
def insert_pdf(
45904590
self,
45914591
docsrc,
4592+
*,
45924593
from_page=-1,
45934594
to_page=-1,
45944595
start_at=-1,
45954596
rotate=-1,
45964597
links=1,
45974598
annots=1,
4599+
widgets=1,
45984600
show_progress=0,
45994601
final=1,
46004602
_gmap=None,
@@ -4609,6 +4611,7 @@ def insert_pdf(
             rotate: (int) rotate copied pages, default -1 is no change.
             links: (int/bool) whether to also copy links.
             annots: (int/bool) whether to also copy annotations.
+            widgets: (int/bool) whether to also copy form fields.
             show_progress: (int) progress message interval, 0 is no messages.
             final: (bool) indicates last insertion from this source PDF.
             _gmap: internal use only
@@ -4626,6 +4629,26 @@ def insert_pdf(
         sa = start_at
         if sa < 0:
             sa = self.page_count
+        outCount = self.page_count
+        srcCount = docsrc.page_count
+
+        # local copies of page numbers
+        fp = from_page
+        tp = to_page
+        sa = start_at
+
+        # normalize page numbers
+        fp = max(fp, 0) # -1 = first page
+        fp = min(fp, srcCount - 1) # but do not exceed last page
+
+        if tp < 0:
+            tp = srcCount - 1 # -1 = last page
+        tp = min(tp, srcCount - 1) # but do not exceed last page
+
+        if sa < 0:
+            sa = outCount # -1 = behind last page
+        sa = min(sa, outCount) # but that is also the limit
+
         if len(docsrc) > show_progress > 0:
             inname = os.path.basename(docsrc.name)
             if not inname:
@@ -4663,25 +4686,6 @@ def insert_pdf(
         else:
             pdfout = _as_pdf_document(self)
             pdfsrc = _as_pdf_document(docsrc)
-            outCount = mupdf.fz_count_pages(self)
-            srcCount = mupdf.fz_count_pages(docsrc.this)
-
-            # local copies of page numbers
-            fp = from_page
-            tp = to_page
-            sa = start_at
-
-            # normalize page numbers
-            fp = max(fp, 0) # -1 = first page
-            fp = min(fp, srcCount - 1) # but do not exceed last page
-
-            if tp < 0:
-                tp = srcCount - 1 # -1 = last page
-            tp = min(tp, srcCount - 1) # but do not exceed last page
-
-            if sa < 0:
-                sa = outCount # -1 = behind last page
-            sa = min(sa, outCount) # but that is also the limit

             if not pdfout.m_internal or not pdfsrc.m_internal:
                 raise TypeError( "source or target not a PDF")
@@ -4692,7 +4696,9 @@ def insert_pdf(
         self._reset_page_refs()
         if links:
             #log( 'insert_pdf(): calling self._do_links()')
-            self._do_links(docsrc, from_page = from_page, to_page = to_page, start_at = sa)
+            self._do_links(docsrc, from_page=fp, to_page=tp, start_at=sa)
+        if widgets:
+            self._do_widgets(docsrc, _gmap, from_page=fp, to_page=tp, start_at=sa)
         if final == 1:
             self.Graftmaps[isrt] = None
         #log( 'insert_pdf(): returning')
@@ -20150,9 +20156,6 @@ def page_merge(doc_des, doc_src, page_from, page_to, rotate, links, copy_annots,
                 continue
             if mupdf.pdf_name_eq( subtype, PDF_NAME('Popup')):
                 continue
-            if mupdf.pdf_name_eq( subtype, PDF_NAME('Widget')):
-                mupdf.fz_warn( "skipping widget annotation")
-                continue
             if mupdf.pdf_name_eq(subtype, PDF_NAME('Widget')):
                 continue
             mupdf.pdf_dict_del( o, PDF_NAME('Popup'))
@@ -21295,6 +21298,7 @@ def _atexit():
 Annot.get_textbox = utils.get_textbox

 Document._do_links = utils.do_links
+Document._do_widgets = utils.do_widgets
 Document.del_toc_item = utils.del_toc_item
 Document.get_char_widths = utils.get_char_widths
 Document.get_oc = utils.get_oc
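
The page-number normalization that this file now performs before the link and widget passes can be sketched in isolation. This is a minimal, library-free restatement of the logic visible in the hunks above, not the library's actual code. The function names `normalize_page_args` and `source_page_order` are hypothetical, and the second function mirrors the range construction that `do_widgets` in src/utils.py uses for forward and reverse copies.

```python
def normalize_page_args(from_page, to_page, start_at, src_count, out_count):
    """Clamp page arguments as the hunk above does (hypothetical helper)."""
    fp = min(max(from_page, 0), src_count - 1)  # -1 means first page
    tp = src_count - 1 if to_page < 0 else min(to_page, src_count - 1)  # -1 means last page
    sa = out_count if start_at < 0 else min(start_at, out_count)  # -1 means append
    return fp, tp, sa


def source_page_order(fp, tp):
    """Source page numbers in copy order; reversed when fp > tp."""
    step = 1 if fp <= tp else -1
    return list(range(fp, tp + step, step))


fp, tp, sa = normalize_page_args(-1, -1, -1, src_count=5, out_count=10)
print(fp, tp, sa)                # 0 4 10
print(source_page_order(3, 1))   # [3, 2, 1]
```

Passing the normalized `fp`, `tp` and `sa` to both helper passes, rather than the raw arguments, keeps the link and widget handling aligned with the pages that were actually copied.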

src/extra.i

Lines changed: 0 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -346,11 +346,6 @@ static void page_merge(
346346
mupdf::PdfObj subtype = mupdf::pdf_dict_get(o, PDF_NAME(Subtype));
347347
if (mupdf::pdf_name_eq(subtype, PDF_NAME(Link))) continue;
348348
if (mupdf::pdf_name_eq(subtype, PDF_NAME(Popup))) continue;
349-
if (mupdf::pdf_name_eq(subtype, PDF_NAME(Widget)))
350-
{
351-
mupdf::fz_warn("skipping widget annotation");
352-
continue;
353-
}
354349
if (mupdf::pdf_name_eq(subtype, PDF_NAME(Widget))) continue;
355350
mupdf::pdf_dict_del(o, PDF_NAME(Popup));
356351
mupdf::pdf_dict_del(o, PDF_NAME(P));

src/utils.py

Lines changed: 126 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -1675,6 +1675,131 @@ def set_toc(
16751675
return toclen
16761676

16771677

1678+
def do_widgets(
1679+
tar: pymupdf.Document,
1680+
src: pymupdf.Document,
1681+
graftmap,
1682+
from_page: int = -1,
1683+
to_page: int = -1,
1684+
start_at: int = -1,
1685+
) -> None:
1686+
"""Insert widgets contained in copied page range into destination PDF.
1687+
1688+
Parameter values **must** equal those of method insert_pdf(). Method
1689+
insert_pdf() which must have been previously executed.
1690+
"""
1691+
if not src.is_form_pdf: # nothing to do: source PDF has no fields
1692+
return
1693+
1694+
def get_acroform(doc):
1695+
"""Retrieve the AcroForm dictionary form a PDF."""
1696+
pdf = mupdf.pdf_document_from_fz_document(doc)
1697+
# AcroForm (= central form field info)
1698+
return mupdf.pdf_dict_getp(mupdf.pdf_trailer(pdf), "Root/AcroForm")
1699+
1700+
tarpdf = mupdf.pdf_document_from_fz_document(tar)
1701+
srcpdf = mupdf.pdf_document_from_fz_document(src)
1702+
1703+
if tar.is_form_pdf:
1704+
# target is a Form PDF, so use its AcroForm to include source fields
1705+
acro = get_acroform(tar)
1706+
# Important arrays of indirect objects
1707+
tar_fields = mupdf.pdf_dict_get(acro, pymupdf.PDF_NAME("Fields"))
1708+
tar_co = mupdf.pdf_dict_get(acro, pymupdf.PDF_NAME("CO"))
1709+
if not mupdf.pdf_is_array(tar_co):
1710+
tar_co = mupdf.pdf_dict_put_array(acro, pymupdf.PDF_NAME("CO"), 5)
1711+
else:
1712+
# target is no Form PDF, so copy over source AcroForm
1713+
acro = mupdf.pdf_deep_copy_obj(get_acroform(src)) # make a copy
1714+
1715+
# Clear "Fields" and "CO" arrays: will be populated by page fields.
1716+
# This is required to avoid copying unneeded objects.
1717+
mupdf.pdf_dict_del(acro, pymupdf.PDF_NAME("Fields"))
1718+
mupdf.pdf_dict_put_array(acro, pymupdf.PDF_NAME("Fields"), 5)
1719+
mupdf.pdf_dict_del(acro, pymupdf.PDF_NAME("CO"))
1720+
mupdf.pdf_dict_put_array(acro, pymupdf.PDF_NAME("CO"), 5)
1721+
1722+
# Enrich AcroForm for copying to target
1723+
acro_graft = mupdf.pdf_graft_mapped_object(graftmap, acro)
1724+
1725+
# Insert AcroForm into target PDF
1726+
acro_tar = mupdf.pdf_add_object(tarpdf, acro_graft)
1727+
tar_fields = mupdf.pdf_dict_get(acro_tar, pymupdf.PDF_NAME("Fields"))
1728+
tar_co = mupdf.pdf_dict_get(acro_tar, pymupdf.PDF_NAME("CO"))
1729+
1730+
# get its xref and insert it into target catalog
1731+
tar_xref = mupdf.pdf_to_num(acro_tar)
1732+
acro_tar_ind = mupdf.pdf_new_indirect(tarpdf, tar_xref, 0)
1733+
root = mupdf.pdf_dict_get(mupdf.pdf_trailer(tarpdf), pymupdf.PDF_NAME("Root"))
1734+
mupdf.pdf_dict_put(root, pymupdf.PDF_NAME("AcroForm"), acro_tar_ind)
1735+
1736+
if from_page <= to_page:
1737+
src_range = range(from_page, to_page + 1)
1738+
else:
1739+
src_range = range(from_page, to_page - 1, -1)
1740+
1741+
for i in range(len(src_range)):
1742+
# read first page that was copied over
1743+
tar_page = tar[start_at + i]
1744+
1745+
# convert it to a formal PDF page
1746+
tar_page_pdf = mupdf.pdf_page_from_fz_page(tar_page)
1747+
1748+
# extract its annotations array
1749+
tar_annots = mupdf.pdf_dict_get(tar_page_pdf.obj(), pymupdf.PDF_NAME("Annots"))
1750+
if not mupdf.pdf_is_array(tar_annots):
1751+
tar_annots = mupdf.pdf_dict_put_array(
1752+
tar_page_pdf.obj(), pymupdf.PDF_NAME("Annots"), 5
1753+
)
1754+
1755+
# read the original page in the source PDF
1756+
src_page = src[src_range[i]]
1757+
1758+
# now walk through source page widgets and copy over
1759+
w_xrefs = [ # widget xrefs of the source page
1760+
xref
1761+
for xref, wtype, _ in src_page.annot_xrefs()
1762+
if wtype == pymupdf.PDF_ANNOT_WIDGET # pylint: disable=no-member
1763+
]
1764+
1765+
# Remove page references from widgets to prevent duplicate copies
1766+
# of the page in the target.
1767+
for xref in w_xrefs:
1768+
w_obj = mupdf.pdf_load_object(srcpdf, xref)
1769+
mupdf.pdf_dict_del(w_obj, pymupdf.PDF_NAME("P"))
1770+
1771+
for xref in w_xrefs:
1772+
w_obj = mupdf.pdf_load_object(srcpdf, xref)
1773+
1774+
# check if field is a member of inter-field validations
1775+
temp = mupdf.pdf_dict_getp(w_obj, "AA/C")
1776+
if mupdf.pdf_is_dict(temp):
1777+
is_aac = True
1778+
else:
1779+
is_aac = False
1780+
1781+
# recursively complete the widget object with all referenced objects
1782+
w_obj_graft = mupdf.pdf_graft_mapped_object(graftmap, w_obj)
1783+
1784+
# add the completed widget object to the target PDF
1785+
w_obj_tar = mupdf.pdf_add_object(tarpdf, w_obj_graft)
1786+
1787+
# extract its generated target xref number
1788+
tar_xref = mupdf.pdf_to_num(w_obj_tar)
1789+
1790+
# create an indirect object from it
1791+
w_obj_tar_ind = mupdf.pdf_new_indirect(tarpdf, tar_xref, 0)
1792+
1793+
# insert this xref reference into the page,
1794+
mupdf.pdf_array_push(tar_annots, w_obj_tar_ind)
1795+
1796+
# and also into "AcroForm/Fields",
1797+
mupdf.pdf_array_push(tar_fields, w_obj_tar_ind)
1798+
# and also into "AcroForm/CO" if a computation field.
1799+
if is_aac:
1800+
mupdf.pdf_array_push(tar_co, w_obj_tar_ind)
1801+
1802+
16781803
def do_links(
16791804
doc1: pymupdf.Document,
16801805
doc2: pymupdf.Document,
@@ -5354,7 +5479,7 @@ def has_annots(doc: pymupdf.Document) -> bool:
     for i in range(doc.page_count):
         for item in doc.page_annot_xrefs(i):
             # pylint: disable=no-member
-            if not (item[1] == pymupdf.PDF_ANNOT_LINK or item[1] == pymupdf.PDF_ANNOT_WIDGET):
+            if not (item[1] == pymupdf.PDF_ANNOT_LINK or item[1] == pymupdf.PDF_ANNOT_WIDGET): # pylint: disable=no-member
                 return True
     return False
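
The bookkeeping at the end of `do_widgets` (every copied widget is pushed to the page's annotations and to `AcroForm/Fields`, and additionally to `AcroForm/CO` when it has an `AA/C` dictionary) can be modeled without any PDF library. The following is a toy sketch that uses plain Python dicts and lists in place of PDF objects. The name `register_widgets` is hypothetical and the code is not taken from PyMuPDF.

```python
def register_widgets(widgets, fields, calc_order):
    """Toy model of the Fields / CO registration (hypothetical helper).

    widgets:    list of dicts standing in for widget dictionaries
    fields:     list standing in for the AcroForm "Fields" array
    calc_order: list standing in for the AcroForm "CO" array
    """
    for widget in widgets:
        fields.append(widget)  # every widget becomes a form field entry
        additional_actions = widget.get("AA")
        # Only widgets carrying an AA/C (calculate) dictionary join "CO".
        if isinstance(additional_actions, dict) and isinstance(
            additional_actions.get("C"), dict
        ):
            calc_order.append(widget)
    return fields, calc_order


sample = [
    {"T": "price"},
    {"T": "total", "AA": {"C": {"S": "JavaScript"}}},
]
fields, calc_order = register_widgets(sample, [], [])
print(len(fields), [w["T"] for w in calc_order])  # 2 ['total']
```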

tests/resources/cms-etc-filled.pdf

204 KB
Binary file not shown.
6.82 KB
Binary file not shown.

tests/test_insertpdf.py

Lines changed: 35 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -186,3 +186,38 @@ def test_3789():
186186
# If this is the last split file, exit the loop
187187
if to_page == -1:
188188
break
189+
190+
191+
def test_widget_insert():
192+
"""Confirm copy of form fields / widgets."""
193+
from pymupdf import mupdf
194+
tar = pymupdf.open(os.path.join(resources, "cms-etc-filled.pdf"))
195+
pc0 = tar.page_count # for later assertion
196+
src = pymupdf.open(os.path.join(resources, "interfield-calculation.pdf"))
197+
pc1 = src.page_count # for later assertion
198+
199+
tarpdf = pymupdf._as_pdf_document(tar)
200+
tar_field_count = mupdf.pdf_array_len(
201+
mupdf.pdf_dict_getp(mupdf.pdf_trailer(tarpdf), "Root/AcroForm/Fields")
202+
)
203+
tar_co_count = mupdf.pdf_array_len(
204+
mupdf.pdf_dict_getp(mupdf.pdf_trailer(tarpdf), "Root/AcroForm/CO")
205+
)
206+
srcpdf = pymupdf._as_pdf_document(src)
207+
src_field_count = mupdf.pdf_array_len(
208+
mupdf.pdf_dict_getp(mupdf.pdf_trailer(srcpdf), "Root/AcroForm/Fields")
209+
)
210+
src_co_count = mupdf.pdf_array_len(
211+
mupdf.pdf_dict_getp(mupdf.pdf_trailer(srcpdf), "Root/AcroForm/CO")
212+
)
213+
214+
tar.insert_pdf(src)
215+
new_field_count = mupdf.pdf_array_len(
216+
mupdf.pdf_dict_getp(mupdf.pdf_trailer(tarpdf), "Root/AcroForm/Fields")
217+
)
218+
new_co_count = mupdf.pdf_array_len(
219+
mupdf.pdf_dict_getp(mupdf.pdf_trailer(tarpdf), "Root/AcroForm/CO")
220+
)
221+
assert tar.page_count == pc0 + pc1
222+
assert new_field_count == tar_field_count + src_field_count
223+
assert new_co_count == tar_co_count + src_co_count
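
The test above counts entries of the low-level `AcroForm/Fields` array. A complementary check could count widgets page by page through the higher-level API. The sketch below assumes only that iterating a document yields pages and that each page offers a `widgets()` iterator, as PyMuPDF pages do. The helper name `count_widgets` is hypothetical, and its result need not equal the `Fields` array length for PDFs whose fields are organized hierarchically.

```python
def count_widgets(doc):
    """Count widgets over all pages of `doc` (hypothetical helper).

    Works with any iterable of page-like objects exposing widgets().
    """
    return sum(1 for page in doc for _ in page.widgets())
```

Calling it on the target before and after `insert_pdf` should show the count growing by the number of widgets on the copied source pages, under the assumptions stated above.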