Detect omitted gaps (_____) in recent volumes #294

@joewiz

As described in #282, several recent volumes exhibit a problem where certain gaps (namely, a horizontal line under a segment of text that represents a word 'omitted' or 'to be filled in', as on a form) are missing from the TEI deliveries from our typesetter. The lines are present in the PDF but not in the TEI.

An omission like this is fiendishly difficult to detect.

That PR found a pattern commonly associated with this omission: a space preceding a punctuation character. It added Schematron rules to flag such cases. But these rules also produce false positives (sometimes simply typos), and they aren't guaranteed to catch every omitted gap.
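To illustrate the heuristic (not the actual Schematron rules from #282, which aren't reproduced here), a minimal Python sketch of "whitespace followed by punctuation" might look like this. The function name, the punctuation set, and the context width are all assumptions made for the example.

```python
import re

# Assumed punctuation set; the real Schematron rules may test a different one.
SPACE_BEFORE_PUNCT = re.compile(r"\s+[,.;:!?]")


def flag_space_before_punctuation(text, context=20):
    """Return (offset, snippet) pairs where whitespace precedes punctuation.

    Each hit is only a candidate: it may mark an omitted gap line,
    or it may simply be a typo.
    """
    hits = []
    for m in SPACE_BEFORE_PUNCT.finditer(text):
        start = max(0, m.start() - context)
        hits.append((m.start(), text[start:m.end() + context]))
    return hits
```

A sentence such as `"The sum of  , payable on demand."` would be flagged, while ordinary prose would not. An ellipsis preceded by a space would also be flagged, which shows how this kind of check generates false positives.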

As an alternative to a page-by-page review, a post in the DH Slack alerted me to a utility, pdfplumber, described as follows:

Plumb a PDF for detailed information about each text character, rectangle, and line... Works best on machine-generated, rather than scanned, PDFs.

One of the objects that pdfplumber reports on is "lines". Running the utility on a volume known to have blanks, I was happy to find that pdfplumber identifies these lines, or rather, all lines in our volumes: lines beneath running heads, footnote separators, and underlined text in table headings. The common feature of the gap lines we're looking for is that they all appear to have a length of "30". I ran the utility on all volumes with PDFs and wrote an XQuery report to reveal the instances:
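A rough sketch of the filtering step, for illustration only. It assumes the "length" is the distance between a line's endpoints as given by the `x0`, `x1`, `top`, and `bottom` keys that pdfplumber reports for each line object; the tolerance and the function names are hypothetical, and this is not the script actually used for the report.

```python
import math

GAP_LENGTH = 30.0  # value observed in this issue; units assumed to be PDF points
TOLERANCE = 0.5    # hypothetical tolerance, not taken from the issue


def line_length(line):
    """Distance between the endpoints of a pdfplumber-style line dict."""
    return math.hypot(line["x1"] - line["x0"], line["bottom"] - line["top"])


def find_gap_lines(lines, target=GAP_LENGTH, tol=TOLERANCE):
    """Keep only the lines whose length is within tol of target."""
    return [ln for ln in lines if abs(line_length(ln) - target) <= tol]


def scan_pdf(path):
    """Return (page_number, x0, top) for each candidate gap line in a PDF."""
    import pdfplumber  # third-party; imported here so the helpers work without it

    hits = []
    with pdfplumber.open(path) as pdf:
        for page in pdf.pages:
            for ln in find_gap_lines(page.lines):
                hits.append((page.page_number, ln["x0"], ln["top"]))
    return hits
```

Calling `scan_pdf` on a volume's PDF (the path is whatever file you have locally) yields the pages and positions to compare against the corresponding TEI.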

Screen Shot 2021-12-08 at 1 19 02 PM

Selecting a volume, the report shows each page where a matching line was detected, alongside the corresponding TEI, to help us identify if the TEI needs to be fixed:

Screen Shot 2021-12-08 at 1 24 53 PM

Further testing will be needed to confirm whether we can count on the value of "30" for the length of these lines. But this appears to be a promising approach for identifying these gaps.
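One way such testing could proceed is to tally the lengths of all detected lines in a volume and see whether "30" stands apart from running-head rules, footnote separators, and table underlines. The sketch below assumes horizontal lines described by pdfplumber-style `x0`/`x1` keys; the function name and the rounding precision are hypothetical.

```python
from collections import Counter


def length_histogram(lines, ndigits=1):
    """Count horizontal line lengths, rounded to ndigits decimal places."""
    return Counter(round(abs(ln["x1"] - ln["x0"]), ndigits) for ln in lines)
```

If lengths other than 30 turn out to correspond to gaps in some volumes, or if 30 also matches lines that aren't gaps, the histogram should make that visible.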

As with the FRUS XPath Explorer, the tool can craft links that open oXygen to the exact location of the page shown, to facilitate editing of the source TEI document.
