ENH: Wrap and align text in flattened PDF forms #3465
Codecov Report❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## main #3465 +/- ##
==========================================
+ Coverage 97.10% 97.12% +0.01%
==========================================
Files 57 57
Lines 9711 9778 +67
Branches 1759 1773 +14
==========================================
+ Hits 9430 9497 +67
Misses 168 168
Partials 113 113
This is a reworked version on top of #3466. Not for review right now.
mypy complained that the .from_font_resource method's return type is Optional[FontDescriptor]. Change the code to not confuse mypy.
This adds a method to calculate the width of a text string. This method can later be used to wrap text at a certain length. Code blatantly copied from the _font.py file in the text extractor code.
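A minimal sketch of what such a width calculation might look like, assuming character widths are stored in glyph-space units (1/1000 of the font size) in a plain dictionary. The function name `text_width`, the `default_width` fallback and the sample width table are illustrative assumptions, not the code from this PR:

```python
def text_width(
    text: str,
    char_widths: dict[str, int],
    font_size: float,
    default_width: int = 500,  # assumed fallback for characters without metrics
) -> float:
    """Return the width of `text` at `font_size`, in text-space units."""
    # Sum the per-character widths (glyph space), then scale by the font size.
    glyph_units = sum(char_widths.get(char, default_width) for char in text)
    return glyph_units * font_size / 1000


# Illustrative sample widths for three characters.
SAMPLE_WIDTHS = {"H": 722, "i": 222, " ": 278}

width = text_width("Hi", SAMPLE_WIDTHS, font_size=12)
```

With these sample values, "Hi" at 12 pt comes out at (722 + 222) * 12 / 1000 text-space units.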
@stefan6419846 First, thanks very much for merging the refactoring of appearance stream code from _writer.py to generic_appearance_stream.py! With that in place, it should now be easier to review this PR, which adds text wrapping, scaling and alignment for text appearance streams.
This patch adds a method to scale and wrap text, depending on whether or not text is allowed to be wrapped. It takes a couple of arguments, including the text string itself, field width and height, font size, a FontDescriptor with character widths, and a bool specifying whether or not text is allowed to wrap. Returns the text as a list of tuples, each containing the length of a line and its contents, together with the font size for these lines and lengths.
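To illustrate the general idea (not the actual implementation in this PR), here is a rough sketch of greedy wrapping combined with font shrinking. All names, the fixed-width `measure` callback, the 0.5 pt step, the 1.2 line-height factor and the 4.0 pt minimum are assumptions made for the example:

```python
from typing import Callable, List, Tuple

Measure = Callable[[str, float], float]  # (text, font_size) -> width


def fixed_width_measure(text: str, font_size: float) -> float:
    """Hypothetical metric: every glyph is half an em wide."""
    return len(text) * 0.5 * font_size


def wrap_paragraph(
    paragraph: str, max_width: float, font_size: float, measure: Measure
) -> List[Tuple[float, str]]:
    """Greedily pack words into lines; a single overlong word stays on its own line."""
    lines: List[Tuple[float, str]] = []
    current = ""
    for word in paragraph.split(" "):
        candidate = word if not current else current + " " + word
        if current and measure(candidate, font_size) > max_width:
            lines.append((measure(current, font_size), current))
            current = word
        else:
            current = candidate
    lines.append((measure(current, font_size), current))
    return lines


def scale_and_wrap(
    text: str,
    field_width: float,
    field_height: float,
    font_size: float,
    measure: Measure,
    is_multiline: bool,
    min_font_size: float = 4.0,  # assumed lower bound
    line_height: float = 1.2,    # assumed line spacing factor
    step: float = 0.5,           # assumed shrink step
) -> Tuple[List[Tuple[float, str]], float]:
    """Shrink the font until the (optionally wrapped) text fits the field."""
    while True:
        if is_multiline:
            lines: List[Tuple[float, str]] = []
            for paragraph in text.replace("\n", "\r").split("\r"):
                lines.extend(wrap_paragraph(paragraph, field_width, font_size, measure))
        else:
            lines = [(measure(text, font_size), text)]
        fits_width = all(width <= field_width for width, _ in lines)
        fits_height = len(lines) * font_size * line_height <= field_height
        if (fits_width and fits_height) or font_size <= min_font_size:
            return lines, font_size
        font_size = max(min_font_size, font_size - step)


wrapped = scale_and_wrap("aaaa bbbb cccc", 30, 100, 12.0, fixed_width_measure, True)
scaled = scale_and_wrap("aaaaaaaaaa", 30, 100, 12.0, fixed_width_measure, False)
```

In this sketch, a multiline field wraps first and only shrinks the font when the wrapped lines still do not fit, while a single-line field can only shrink.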
This patch scales and/or wraps text that does not fit into a text field unaltered, under the condition that font size was set to 0 in the default appearance stream. We only wrap text if the multiline bit was set in the corresponding annotation's field flags, otherwise we just scale the font until it fits. We move the escaping of parentheses below, so that it does not interfere with calculating the width of a text string.
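As a small illustration of the two points above: the PDF specification defines the Multiline text-field flag as bit position 13 of `/Ff` (value 4096), and escaping must come after measuring because it adds backslashes that are never rendered. The helper names below are hypothetical and only sketch the ordering issue:

```python
MULTILINE_FLAG = 1 << 12  # bit position 13 of the /Ff field flags


def is_multiline(field_flags: int) -> bool:
    """Return True if the Multiline bit is set in the field flags."""
    return bool(field_flags & MULTILINE_FLAG)


def escape_pdf_literal(text: str) -> str:
    """Escape backslashes and parentheses for a PDF literal string."""
    return text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")


raw = "f(x)"
escaped = escape_pdf_literal(raw)

# Measuring the escaped string would count two backslashes that never
# show up on the page, so the width has to be computed from `raw`.
extra_characters = len(escaped) - len(raw)
```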
Make sure that we always have Helvetica as a viable font resource, for which we surely have all necessary font metrics needed for text wrapping.
This patch changes the TextAppearanceStream code so that it can deal with right alignment and centered text. Note that both require correct font metrics in order to work.
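The dependency on font metrics becomes clear when sketching how a horizontal offset could be derived from the quadding value `/Q` (0 = left, 1 = centered, 2 = right in the PDF specification). The function name and the 2-unit padding are assumptions for illustration, not necessarily what this PR does:

```python
def line_x_offset(
    alignment: int,
    field_width: float,
    line_width: float,
    padding: float = 2.0,  # assumed inner padding
) -> float:
    """Return the x position at which a line of `line_width` should start."""
    if alignment == 1:  # centered
        return (field_width - line_width) / 2
    if alignment == 2:  # right-aligned
        return field_width - line_width - padding
    return padding  # left-aligned (default)


left = line_x_offset(0, field_width=100, line_width=40)
centered = line_x_offset(1, field_width=100, line_width=40)
right = line_x_offset(2, field_width=100, line_width=40)
```

Only the left-aligned case is independent of `line_width`; if the measured width is wrong, centered and right-aligned text will be visibly misplaced.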
We need the info that is in CORE_FONT_METRICS, and that is the same information as in _default_fonts_space_width anyway. So this patch removes a bit of redundancy.
Add tests for the TextStreamAppearance.
| If you want to flatten your form, that is, keeping all form field contents while | ||
| removing the form fields themselves, you can set `flatten=True` to convert form | ||
| field contents to regular pdf content, and then use |
| field contents to regular pdf content, and then use | |
| field contents to regular PDF content, and then use |
| removing the form fields themselves, you can set `flatten=True` to convert form | ||
| field contents to regular pdf content, and then use | ||
| `writer.remove_annotations(subtypes="/Widget")` to remove all form fields. This | ||
| will result in a flattened pdf. |
| will result in a flattened pdf. | |
| will result in a flattened PDF. |
| font_size: float, | ||
| field_width: float, | ||
| field_height: float, | ||
| txt: str, |
| txt: str, | |
| text: str, |
| return [(test_width, txt)], font_size | ||
| # Multiline: | ||
| orig_txt = txt | ||
| paragraphs = re.sub(r"\n", "\r", txt).split("\r") |
Why do we need a regex for replacing plain newlines?
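For reference, a quick check suggesting that plain `str.replace` gives the same paragraphs as the regex variant for this pattern (the sample string is made up for the comparison):

```python
import re

txt = "first line\nsecond line\rthird line"

# Variant from the diff: regex substitution of "\n" by "\r".
with_regex = re.sub(r"\n", "\r", txt).split("\r")

# Plain string replacement, no regex needed for a literal character.
with_replace = txt.replace("\n", "\r").split("\r")
```

Both variants treat `\r\n` as two separators, so they yield an empty paragraph in between in either case.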
| return [(test_width, txt)], font_size | ||
| return [(test_width, txt)], font_size | ||
| # Multiline: | ||
| orig_txt = txt |
| orig_txt = txt | |
| original_text = text |
| """ | ||
| selection = "Option A" | ||
| assert (b"4.0 Tf") in appearance_stream.get_data() | ||
| text = "pneumonoultramicroscopicsilicovolcanoconiosis" |
| text = "pneumonoultramicroscopicsilicovolcanoconiosis" | |
| text = "pneumonoultramicroscopicsilicovolcanoconiosis" |
| appearance_stream = TextStreamAppearance( | ||
| text, selection, rectangle=rectangle, font_size=font_size, is_multiline=is_multiline | ||
| ) | ||
| assert (b"7.2 Tf") in appearance_stream.get_data() |
| assert (b"7.2 Tf") in appearance_stream.get_data() | |
| assert b"7.2 Tf" in appearance_stream.get_data() |
| text, selection, rectangle=rectangle, font_size=font_size, is_multiline=is_multiline | ||
| ) | ||
| assert (b"7.2 Tf") in appearance_stream.get_data() | ||
| rectangle = (0, 0, 10, 100) |
| rectangle = (0, 0, 10, 100) | |
| rectangle = (0, 0, 10, 100) |
| appearance_stream = TextStreamAppearance( | ||
| text, rectangle=rectangle, font_size=font_size, is_multiline=is_multiline | ||
| ) | ||
| assert (b"OneWord") in appearance_stream.get_data() |
| assert (b"OneWord") in appearance_stream.get_data() | |
| assert b"OneWord" in appearance_stream.get_data() |
| assert writer.pages[0]["/Annots"][13].get_object()["/AP"]["/N"].get_data() == ( | ||
| b"q\n/Tx BMC \nq\n1 1 105.29520000000001 10.835000000000036 re\n" | ||
| b"W\nBT\n/Arial 8.0 Tf 0 g\n2 2.8350000000000364 Td\n(0) Tj\nET\n" | ||
| b"W\nBT\n/Helv 8.0 Tf 0 g\n2 2.8350000000000364 Td\n(0) Tj\nET\n" |
Why do we change the font default?
This patch implements text wrapping and alignment in appearance streams.
The scale_text method was vibe-coded, as well as the code for right-aligned text and centered text, but they both work great.
The result offers a good basis for text wrapping. I did notice, however, that the results with pdftk are better. In the future, it would be nice to read the info for the annotation border from the annotation instead of just adding some padding here and there (which is the case now). Also, I notice there's an annotation option called "comb" that is not taken into account. Then there is annotation text colour... Finally, pdftk takes into account the font bounding box / ascent in deciding scaled font size.
For now, however, this PR "finishes" PDF flattening in the sense that it correctly wraps long texts and aligns them as intended.
Related but not fixed here: #2153
I think this does fix the alignment part of #1919