Watermarking script #6017
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,215 @@ | ||
| #!/usr/bin/env python3 | ||
| """ | ||
| Add ACL-like footer (first page) and optional page numbers (all pages). | ||
| Inline italics with <i>…</i>. | ||
| Examples: | ||
| python add_footer.py in.pdf out.pdf \ | ||
| "<i>Proceedings … pages 8697–8727</i>\nJuly 27 - August 1, 2025 ©2025 ACL" | ||
| python add_footer.py -p 199 in.pdf out.pdf "…" | ||
| python add_footer.py -p 199 --footer-size 9 --pagenum-size 10 --bottom-margin 14 in.pdf out.pdf "…" | ||
| Copyright 2025, Matt Post | ||
|
Member
Not Apache license, like all the other scripts?
Member
Author
Just an oversight. This is mostly just a proof of concept.
||
| """ | ||
|
|
||
|
|
||
| import io | ||
| import re | ||
| import argparse | ||
| from pathlib import Path | ||
| from pypdf import PdfReader, PdfWriter | ||
| from reportlab.pdfgen import canvas | ||
|
|
||
| # Defaults tuned for ACL footer look | ||
| DEFAULT_BOTTOM_MARGIN_PT = 14 | ||
| DEFAULT_LINE_SPACING = 1.2 | ||
| DEFAULT_FOOTER_SIZE = 9 # footer text size | ||
| DEFAULT_PAGENUM_SIZE = 11 # page number size | ||
|
|
||
| FONT_REG = "Times-Roman" | ||
| FONT_ITAL = "Times-Italic" | ||
|
|
||
| TAG_RE = re.compile(r"(</?i>)") | ||
|
|
||
|
|
||
| def parse_inline_italics(s): | ||
| """Yield (text, is_italic) spans from a string with <i>…</i> regions.""" | ||
| parts = TAG_RE.split(s) | ||
| italic = False | ||
| for tok in parts: | ||
| if tok == "<i>": | ||
| italic = True | ||
| elif tok == "</i>": | ||
| italic = False | ||
| elif tok: | ||
| yield tok, italic | ||
|
|
||
|
|
||
| def measure_line(c, line, size): | ||
| """Total width of a mixed-style line.""" | ||
| w = 0.0 | ||
| for txt, it in parse_inline_italics(line): | ||
| font = FONT_ITAL if it else FONT_REG | ||
| w += c.stringWidth(txt, font, size) | ||
| return w | ||
|
|
||
|
|
||
| def draw_rich_centered(c, page_w, y, line, size): | ||
| """Draw a mixed-style line centered at y.""" | ||
| total_w = measure_line(c, line, size) | ||
| x = (page_w - total_w) / 2.0 | ||
| for txt, it in parse_inline_italics(line): | ||
| font = FONT_ITAL if it else FONT_REG | ||
| c.setFont(font, size) | ||
| c.drawString(x, y, txt) | ||
| x += c.stringWidth(txt, font, size) | ||
|
|
||
|
|
||
| def mk_footer_overlay(w, h, text_block, bottom_margin, size, line_spacing): | ||
| """Footer block near bottom: render lines in given order, stacking downward.""" | ||
| buf = io.BytesIO() | ||
| c = canvas.Canvas(buf, pagesize=(w, h)) | ||
| lines = text_block.split("\n") if text_block else [] | ||
| if not lines: | ||
| c.showPage() | ||
| c.save() | ||
| buf.seek(0) | ||
| return buf | ||
|
|
||
| line_h = size * line_spacing | ||
| # Start y so that the FIRST line appears above subsequent lines, | ||
| # with the LAST line's baseline at bottom_margin. | ||
| y = bottom_margin + (len(lines) - 1) * line_h | ||
| for line in lines: | ||
| draw_rich_centered(c, w, y, line, size) | ||
| y -= line_h # next line goes BELOW | ||
| c.showPage() | ||
| c.save() | ||
| buf.seek(0) | ||
| return buf | ||
|
|
||
|
|
||
| def mk_pagenum_overlay(w, h, page_num, bottom_margin, size): | ||
| buf = io.BytesIO() | ||
| c = canvas.Canvas(buf, pagesize=(w, h)) | ||
| c.setFont(FONT_REG, size) | ||
| text = str(page_num) | ||
| tw = c.stringWidth(text, FONT_REG, size) | ||
| x = (w - tw) / 2.0 | ||
|
Member
@mjpost I know nothing at all about reportlab, I was just commenting that its API seems to be very low-level, making it not very intuitive to follow what's going on in the code. For example, lines like `x = (w - tw) / 2.0` are not very descriptive; I can guess what this does, but it's hard to review because of it. Maybe not super important for a script like this, though.
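One way to address the readability concern, as a minimal sketch rather than what the PR does: give the centering arithmetic a name. The helper `centered_x` below is hypothetical and not part of the diff; it only wraps the expression `(w - tw) / 2.0` quoted in the comment.

```python
def centered_x(page_width, text_width):
    """Return the left x-coordinate (in points) that horizontally centers
    a run of text of width `text_width` on a page of width `page_width`.

    Hypothetical helper, not in the PR; it names the `(w - tw) / 2.0`
    expression so the call site reads as "center this text".
    """
    return (page_width - text_width) / 2.0


# Example: a 12pt-wide page number on a US Letter page (612pt wide).
x = centered_x(612.0, 12.0)
print(x)  # 300.0
```

In the script, `text_width` would come from `c.stringWidth(text, FONT_REG, size)` as in the diff, so the line under review would become `x = centered_x(w, tw)`.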
||
| y = bottom_margin | ||
| c.drawString(x, y, text) | ||
| c.showPage() | ||
| c.save() | ||
| buf.seek(0) | ||
| return buf | ||
|
|
||
|
|
||
| def process( | ||
| input_pdf, | ||
| output_pdf, | ||
| text_block, | ||
| page_start, | ||
| bottom_margin, | ||
| footer_size, | ||
| pagenum_size, | ||
| line_spacing, | ||
| ): | ||
| reader = PdfReader(str(input_pdf)) | ||
| writer = PdfWriter() | ||
|
|
||
| footer_cache, pnum_cache = {}, {} | ||
|
|
||
| for idx, page in enumerate(reader.pages, start=1): | ||
| w = float(page.mediabox.width) | ||
| h = float(page.mediabox.height) | ||
|
|
||
| disp_num = None if page_start is None else page_start + idx - 1 | ||
|
|
||
| # Page number: SAME bottom margin on every page | ||
| if disp_num is not None: | ||
| nkey = (w, h, disp_num, pagenum_size, bottom_margin) | ||
| if nkey not in pnum_cache: | ||
| pnum_cache[nkey] = PdfReader( | ||
| mk_pagenum_overlay(w, h, disp_num, bottom_margin, pagenum_size) | ||
| ).pages[0] | ||
| page.merge_page(pnum_cache[nkey]) | ||
|
|
||
| # Footer only on first page; place it ABOVE the fixed page number | ||
|
Member
Just noting again for the record that this is not where *ACL proceedings currently place the footer.
Member
Author
Yeah... but the current ACL choice is ugly, and also (I suspect) just some random person's quick decision. Witness (from ACL 2025):
It's different even from ten years ago (source):
Maybe I shouldn't in turn just arbitrarily change it, but I think it looks better.
Member
Agree that the second one looks better, but it still has the footer below the page number on the first page, which my (completely subjective) gut reaction finds more appealing :)
Member
Regardless of subjective appeal, there is an argument though for making the footer of revisions consistent with the original.
Member
Author
The footstamp offset varies by conference and year. Our options are (a) come up with a good default, and ideally get ACL to consolidate on that, or (b) provide more knobs in this user interface to allow users to fiddle and match the original. I guess we should do both.
Member
Now I obviously haven't checked all conferences, but regarding the two examples you posted, it seems to me that the difference between them is not actually the placement of the footer, but the margins of the page content. In other words, I think the absolute placement of the footer may actually be the same there.
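To make the placement discussion concrete, here is a rough sketch of the baseline arithmetic with the "more knobs" of option (b) exposed. The function `footer_layout` and its `gap_factor` and `footer_above` parameters are assumptions for illustration, not part of the PR. The `footer_above=True` branch mirrors the script's logic (page number at `bottom_margin`, footer raised by `pagenum_size` plus a gap of `0.6 * footer_size`); the `footer_above=False` branch is a hypothetical alternative that puts the footer below the page number.

```python
def footer_layout(n_lines, bottom_margin=14.0, footer_size=9.0,
                  pagenum_size=11.0, line_spacing=1.2, gap_factor=0.6,
                  footer_above=True):
    """Return (page_number_baseline, footer_baselines) in points from the
    bottom of the page; footer_baselines is ordered top line first.

    Hypothetical helper for illustration only.
    """
    line_h = footer_size * line_spacing
    gap = gap_factor * footer_size
    if footer_above:
        # Layout used in the PR: page number stays at bottom_margin and the
        # footer's last line sits above it, separated by a small gap.
        pagenum_y = bottom_margin
        footer_bottom = bottom_margin + pagenum_size + gap
    else:
        # Assumed alternative: footer's last line at bottom_margin, page
        # number raised above the footer's first line by the same gap.
        footer_bottom = bottom_margin
        pagenum_y = bottom_margin + (n_lines - 1) * line_h + footer_size + gap
    # First footer line is highest; each later line is one line_h lower.
    baselines = [footer_bottom + (n_lines - 1 - i) * line_h
                 for i in range(n_lines)]
    return pagenum_y, baselines


pagenum_y, baselines = footer_layout(2)
print(pagenum_y, [round(y, 1) for y in baselines])  # 14.0 [41.2, 30.4]
```

Note that in the `footer_above=False` variant the first-page number would sit higher than on later pages, which is the consistency trade-off raised in the thread.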
||
| if idx == 1 and text_block: | ||
| # raise footer so its LAST line sits above the page number by a small gap | ||
| gap = 0.6 * footer_size | ||
| footer_bottom = bottom_margin + pagenum_size + gap | ||
| fkey = (w, h, "footer", footer_size, footer_bottom, line_spacing, text_block) | ||
| if fkey not in footer_cache: | ||
| footer_cache[fkey] = PdfReader( | ||
| mk_footer_overlay( | ||
| w, h, text_block, footer_bottom, footer_size, line_spacing | ||
| ) | ||
| ).pages[0] | ||
| page.merge_page(footer_cache[fkey]) | ||
|
|
||
| writer.add_page(page) | ||
|
|
||
| with open(output_pdf, "wb") as f: | ||
| writer.write(f) | ||
|
|
||
|
|
||
| def main(): | ||
| ap = argparse.ArgumentParser( | ||
| description="Add ACL-like footer (first page) and optional page numbers (all pages)." | ||
| ) | ||
| ap.add_argument( | ||
| "--page-number", | ||
| "-p", | ||
| type=int, | ||
| metavar="N", | ||
| help="Enable page numbers starting at N (e.g., -p 5).", | ||
| ) | ||
| ap.add_argument( | ||
| "--bottom-margin", | ||
| type=float, | ||
| default=14, | ||
| help="Baseline distance from bottom (pt).", | ||
| ) | ||
| ap.add_argument( | ||
| "--footer-size", | ||
| type=float, | ||
| default=DEFAULT_FOOTER_SIZE, | ||
| help="Footer font size (pt).", | ||
| ) | ||
| ap.add_argument( | ||
| "--pagenum-size", | ||
| type=float, | ||
| default=DEFAULT_PAGENUM_SIZE, | ||
| help="Page number font size (pt).", | ||
| ) | ||
| ap.add_argument( | ||
| "--line-spacing", type=float, default=1.2, help="Footer line spacing multiplier." | ||
| ) | ||
| ap.add_argument("input_pdf", type=Path) | ||
| ap.add_argument("output_pdf", type=Path) | ||
| ap.add_argument( | ||
| "text_block", | ||
| nargs="?", | ||
| default="", | ||
| help="Footer text for FIRST page only. Use \\n for newlines. Use <i>…</i> for inline italics.", | ||
| ) | ||
| args = ap.parse_args() | ||
|
|
||
| # normalize literal "\n" | ||
| args.text_block = args.text_block.replace("\\n", "\n") | ||
|
|
||
| process( | ||
| args.input_pdf, | ||
| args.output_pdf, | ||
| args.text_block, | ||
| args.page_number, | ||
| args.bottom_margin, | ||
| args.footer_size, | ||
| args.pagenum_size, | ||
| args.line_spacing, | ||
| ) | ||
|
|
||
|
|
||
| if __name__ == "__main__": | ||
| main() | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -1,7 +1,7 @@ | ||
| #! /usr/bin/env python3 | ||
| # -*- coding: utf-8 -*- | ||
| # | ||
| # Copyright 2019 Matt Post <[email protected]> | ||
| # Copyright 2019–2025 Matt Post <[email protected]> | ||
| # | ||
| # Licensed under the Apache License, Version 2.0 (the "License"); | ||
| # you may not use this file except in compliance with the License. | ||
|
|
@@ -41,6 +41,7 @@ | |
| import shutil | ||
| import sys | ||
| import tempfile | ||
| import io | ||
|
|
||
| from git.repo.base import Repo | ||
|
|
||
|
|
@@ -58,6 +59,66 @@ | |
| import lxml.etree as ET | ||
|
|
||
| from datetime import datetime | ||
| from pypdf import PdfReader, PdfWriter | ||
| from reportlab.pdfgen import canvas | ||
|
|
||
| WATERMARK_FONT = "Times-Roman" | ||
| WATERMARK_SIZE = 16 | ||
| WATERMARK_LEFT_OFFSET_PT = ( | ||
| 27 # distance from left edge in points (50% increase for margin) | ||
| ) | ||
| WATERMARK_GRAY = 0.55 # medium gray like arXiv | ||
|
|
||
|
|
||
| def _make_vertical_watermark_page(w, h, text): | ||
| """Return a single-page PDF with vertical (rotated 90° CCW) watermark at left.""" | ||
| buf = io.BytesIO() | ||
| c = canvas.Canvas(buf, pagesize=(w, h)) | ||
| c.saveState() | ||
| c.setFont(WATERMARK_FONT, WATERMARK_SIZE) | ||
| c.setFillGray(WATERMARK_GRAY) | ||
| # Translate slightly from left then rotate so text reads bottom-to-top along left side. | ||
| c.translate(WATERMARK_LEFT_OFFSET_PT, 0) | ||
| c.rotate(90) | ||
| text_w = c.stringWidth(text, WATERMARK_FONT, WATERMARK_SIZE) | ||
| # Center along original page height (which becomes horizontal span after rotation) | ||
| x_draw = (h - text_w) / 2.0 | ||
| y_draw = 0 | ||
| c.drawString(x_draw, y_draw, text) | ||
| c.restoreState() | ||
| c.showPage() | ||
| c.save() | ||
| buf.seek(0) | ||
| return buf | ||
|
|
||
|
|
||
| def add_revision_watermark(pdf_path, anth_id, revno, date): | ||
| """Return path to temp PDF with watermark added to first page (revisions only).""" | ||
| reader = PdfReader(pdf_path) | ||
| if not reader.pages: | ||
| return pdf_path | ||
| writer = PdfWriter() | ||
| first = reader.pages[0] | ||
| w = float(first.mediabox.width) | ||
| h = float(first.mediabox.height) | ||
| # Format date as "DD Mon YYYY" (e.g., 17 Sep 2025) for watermark display only. | ||
| try: | ||
| dt = datetime.strptime(date, "%Y-%m-%d") | ||
| display_date = dt.strftime("%d %b %Y") | ||
| except ValueError: | ||
| # If already in some unexpected format, just use original string. | ||
| display_date = date | ||
| text = f"ACL Anthology ID {anth_id} / revision {revno} / {display_date}" | ||
| overlay = PdfReader(_make_vertical_watermark_page(w, h, text)).pages[0] | ||
| first.merge_page(overlay) | ||
| writer.add_page(first) | ||
| for p in reader.pages[1:]: | ||
| writer.add_page(p) | ||
| fd, tmp_path = tempfile.mkstemp(suffix=".pdf") | ||
| os.close(fd) | ||
| with open(tmp_path, "wb") as out_f: | ||
| writer.write(out_f) | ||
| return tmp_path | ||
|
|
||
|
|
||
| def validate_file_type(path): | ||
|
|
@@ -101,7 +162,7 @@ def maybe_copy(file_from, file_to): | |
|
|
||
| change_letter = "e" if change_type == "erratum" else "v" | ||
|
|
||
| checksum = compute_hash_from_file(pdf_path) | ||
| # checksum will be computed after potential watermark insertion | ||
|
|
||
| # Files for old-style IDs are stored under anthology-files/pdf/P/P19/* | ||
| # Files for new-style IDs are stored under anthology-files/pdf/2020.acl/* | ||
|
|
@@ -130,6 +191,14 @@ def maybe_copy(file_from, file_to): | |
| for revision in revisions: | ||
| revno = int(revision.attrib["id"]) + 1 | ||
|
|
||
| # Insert watermark for revisions before computing checksum / updating XML | ||
| watermarked_temp_path = None | ||
| if change_type == "revision": | ||
| watermarked_temp_path = add_revision_watermark(pdf_path, anth_id, revno, date) | ||
| pdf_path = watermarked_temp_path | ||
|
|
||
| checksum = compute_hash_from_file(pdf_path) | ||
|
|
||
| if not dry_run: | ||
| # Update the URL hash on the <url> tag | ||
| if change_type != "erratum": | ||
|
|
@@ -201,6 +270,17 @@ def maybe_copy(file_from, file_to): | |
| if change_type == "revision": | ||
| maybe_copy(pdf_path, canonical_path) | ||
|
|
||
| # Cleanup temp watermarked file if created | ||
| if ( | ||
| 'watermarked_temp_path' in locals() | ||
| and watermarked_temp_path | ||
| and os.path.exists(watermarked_temp_path) | ||
| ): | ||
| try: | ||
| os.remove(watermarked_temp_path) | ||
| except OSError: | ||
| pass | ||
|
|
||
|
|
||
| def main(args): | ||
| change_type = "erratum" if args.erratum else "revision" | ||
|
|
@@ -222,6 +302,7 @@ def main(args): | |
| args.explanation, | ||
| change_type=change_type, | ||
| dry_run=args.dry_run, | ||
| date=args.date, | ||
| ) | ||
|
|
||
| if args.path.startswith("http"): | ||
|
|
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -20,6 +20,7 @@ pytest-cov | |
| python-slugify>=2.0 | ||
| pytz | ||
| PyYAML>=3.0 | ||
| reportlab | ||
|
Member
If reportlab is here, pypdf should also be.
||
| requests | ||
| ruff~=0.3.4 | ||
| setuptools | ||
|
|
||


A general comment: I have been wondering about throwing all of our scripts into `bin/`, which has become a mixture of (i) core build scripts, (ii) data ingestion & modification scripts, (iii) one-off scripts that are probably outdated by now, and (iv) other miscellaneous stuff. It's quite unclear which of these scripts are still useful and what for, unless you look into each of them.
I was wondering if we could start categorizing them into subfolders, or at least name them more explicitly (e.g. here I would prefer `add_footer_to_pdf.py`).

This is definitely long overdue for a reorg. `bin/` itself isn't that great of a name. One suggestion is to use `scripts/` instead, and then have some kind of minimal one-level nesting within it, following your taxonomy above: build, data, misc.