Modifying PDF Page Content Streams In-Place #1770

fergusonjason · 2025-12-28T20:44:00Z

I've asked this on Stack Overflow, but hey, can't hurt to ask here.

I'm trying to modify PDF page content streams in-place, specifically to add MCIDs (marked-content identifiers) to text blocks. Right now my approach is fairly brute-force, but it's early days.
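As a rough illustration of the brute-force idea, the sketch below wraps each `BT ... ET` text object of an already-decoded content stream in a `BDC`/`EMC` marked-content pair carrying a sequential MCID. The name `tagTextBlocks`, the `/P` tag, and the regex matching are assumptions made for illustration, not the transformer used in this post. A regex like this will misfire on `BT` or `ET` tokens that appear inside string literals or inline image data, so anything beyond a prototype would likely need a real content-stream tokenizer.

```typescript
// Hypothetical helper (not from the original post): wraps every BT ... ET
// text object in a marked-content sequence with a sequential MCID.
// Assumes `content` is the decoded (uncompressed) content stream as text.
function tagTextBlocks(
  content: string,
  startMcid = 0,
  tag = "P", // assumed structure tag; pick whatever role fits the block
): { content: string; nextMcid: number } {
  let mcid = startMcid;

  // Match "BT" and "ET" only when they stand alone as whitespace-delimited
  // operators. The lazy quantifier stops at the first closing ET.
  const textObject = /(^|\s)BT(?=\s)[\s\S]*?\sET(?=\s|$)/g;

  const tagged = content.replace(textObject, (match: string, lead: string) => {
    // `lead` is the whitespace (or empty string) consumed before BT;
    // keep it outside the marked-content wrapper.
    const body = match.slice(lead.length);
    return `${lead}/${tag} <</MCID ${mcid++}>> BDC\n${body}\nEMC`;
  });

  return { content: tagged, nextMcid: mcid };
}
```

Returning `nextMcid` lets the caller carry the counter across several content streams of the same page, which is the role the `MCIDCounter` object appears to play in the code further down.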

I'm using the code below to follow the trail from Catalog -> Pages -> Page -> page content (PDFRawStream), grab each stream, reprocess it, and assign the result back under the same object reference.
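The `getAllPageRefs` helper used in the code is not shown in this post. As a library-independent sketch of what such a walk typically does, the example below models indirect objects as a plain `Map` and collects the references of all `/Type /Page` leaves in document order. `Ref`, `PageTreeNode`, and `collectPageRefs` are hypothetical names, and the simplified node shape is an assumption; it does not reflect any particular library's object model.

```typescript
// Simplified stand-ins for PDF objects (assumptions for illustration only).
type Ref = string; // e.g. "3 0 R"
interface PageTreeNode {
  Type: "Pages" | "Page";
  Kids?: Ref[]; // present on intermediate /Pages nodes
}

// Depth-first walk of the page tree that returns leaf page refs in order.
function collectPageRefs(rootRef: Ref, objects: Map<Ref, PageTreeNode>): Ref[] {
  const pages: Ref[] = [];
  const seen = new Set<Ref>(); // guards against cycles in malformed files
  const stack: Ref[] = [rootRef];

  while (stack.length > 0) {
    const ref = stack.pop();
    if (ref === undefined || seen.has(ref)) continue;
    seen.add(ref);

    const node = objects.get(ref);
    if (!node) continue; // dangling reference: skip it

    if (node.Type === "Page") {
      pages.push(ref);
      continue;
    }

    // Push kids in reverse so they are popped in document order.
    for (const kid of [...(node.Kids ?? [])].reverse()) {
      stack.push(kid);
    }
  }
  return pages;
}
```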

What's happening is that I can build the PDFContext, but when I call PDFDocument.save(), I get a malformed PDF that ONLY contains the objects for the reprocessed streams. EVERY OTHER OBJECT IS GONE in the new PDF.

I'd prefer an in-place replacement over rebuilding the entire PDF object tree, but frankly I'm taking this project on out of spite: the few libraries that do this are commercial and charge more than $1k per seat.

```typescript
async reprocessPDF2(input: Uint8Array): Promise<Uint8Array> {
  const pdfDoc: PDFDocument = await PDFDocument.load(input);
  const pdfContext: PDFContext = pdfDoc.context;
  const catalogRef = pdfContext.trailerInfo.Root; // PDFRef
  const catalog = pdfContext.lookup(catalogRef); // PDFDict

  if (!(catalog instanceof PDFDict)) {
    throw new Error("Invalid PDF: Catalog is not a dictionary");
  }

  const pagesRef = catalog.get(PDF_NAME_PAGES);
  if (!pagesRef || !(pagesRef instanceof PDFRef)) {
    throw new Error("Invalid PDF: Catalog missing /Pages reference");
  }

  // let's get the /Kids and walk them to get PDFRef for a /Type /Page
  const pagesDict: PDFDict = pdfContext.lookup(pagesRef) as PDFDict;
  const pageRefs: PDFRef[] = getAllPageRefs(pagesDict, pdfContext);

  // okay, so we're not getting the references of the page content objects, so
  // we're not running them through the transformers

  for (const pageRef of pageRefs) {
    const pageObj: PDFObject | undefined = pdfContext.lookup(pageRef);
    if (!pageObj || !(pageObj instanceof PDFDict)) {
      continue;
    }

    // Get the /Contents entry from the page dictionary
    const contentsEntry = pageObj.get(PDFName.of('Contents'));
    if (!contentsEntry) {
      continue; // Page has no content
    }

    // Contents can be a single stream reference or an array of stream references
    const contentRefs: PDFRef[] = [];
    if (contentsEntry instanceof PDFRef) {
      contentRefs.push(contentsEntry);
    } else if (contentsEntry instanceof PDFArray) {
      // Handle array of content streams
      for (let i = 0; i < contentsEntry.size(); i++) {
        const ref = contentsEntry.get(i);
        if (ref instanceof PDFRef) {
          contentRefs.push(ref);
        }
      }
    }

    const mcidCounter: MCIDCounter = new MCIDCounter(0);

    for (const contentRef of contentRefs) {
      const contentStream = pdfContext.lookup(contentRef);
      if (!(contentStream instanceof PDFStream)) {
        continue;
      }

      const filters = contentStream.dict.get(PDFName.of('Filter'));
      const filterNames: PDFName[] = this.normalizeFilters(filters);

      // Build the stream transform pipeline
      const pipeline = [];
      const hasFlateDecodeFilter = this.hasFilter('FlateDecode', filterNames);

      if (hasFlateDecodeFilter) {
        pipeline.push(inflateStream);
      }

      console.log("MCID counter before pipeline:", mcidCounter.current());
      pipeline.push(createMcidStreamTransformer(mcidCounter));
      console.log("MCID counter after pipeline:", mcidCounter.current());

      if (hasFlateDecodeFilter) {
        pipeline.push(deflateStream);
      }

      // Process the stream through the pipeline
      let processedStream = contentStream.getContents();
      for (const transformFn of pipeline) {
        processedStream = await transformFn(processedStream);
      }

      const newDict = contentStream.dict.clone(pdfContext);
      newDict.set(PDFName.of('Length'), PDFNumber.of(processedStream.length));
      const newStream = PDFRawStream.of(newDict, processedStream);

      // Update the content stream reference in the context
      pdfContext.assign(contentRef, newStream);
    }
  }

  return pdfDoc.save();
}
```
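The `normalizeFilters` and `hasFilter` helpers are not shown above. In a PDF stream dictionary, `/Filter` may be absent, a single name, or an array of names applied in order, so the pipeline is only safe when the whole chain is understood. The sketch below models filter names as plain strings (a simplification; the code above uses `PDFName` values) and adds a guard for streams whose chain is anything other than "no filter" or a lone `FlateDecode`. `FilterEntry`, `normalizeFilterNames`, and `canRoundTrip` are hypothetical names, not the helpers from this post.

```typescript
// Simplified model of a /Filter entry (assumption: names arrive as strings,
// with or without the leading slash).
type FilterEntry = string | string[] | undefined;

// Normalizes the three legal shapes of /Filter into a list of bare names.
function normalizeFilterNames(entry: FilterEntry): string[] {
  if (entry === undefined) return [];
  const names = Array.isArray(entry) ? entry : [entry];
  return names.map((name) => name.replace(/^\//, ""));
}

// True only when an inflate -> transform -> deflate pipeline (or a plain
// transform for unfiltered streams) would see the real content bytes.
function canRoundTrip(filters: string[]): boolean {
  if (filters.length === 0) return true;
  return filters.length === 1 && filters[0] === "FlateDecode";
}
```

A stream with a chain such as `[/ASCII85Decode /FlateDecode]` would fail this check; skipping or logging such streams avoids running the MCID transformer over still-encoded bytes.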

Replies: 0 comments
