feat(frontend): migrate pdfMetadataService from PDF.js to PDFium #6238
web-dev0521 wants to merge 11 commits into Stirling-Tools:main
Conversation
Replaces pdfWorkerManager / pdfjs-dist with the existing pdfiumService primitives (openRawDocument, FPDF_GetMetaText, closeDocAndFreeBuffer). All standard metadata fields (title, author, subject, keywords, creator, producer, creationDate, modDate, trapped) are read via FPDF_GetMetaText. customMetadata returns an empty array — the PDFium C API does not expose enumeration of arbitrary /Info dictionary keys. Closes Stirling-Tools#6232
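To illustrate the FPDF_GetMetaText read path described above: PDFium writes metadata values into a buffer as UTF-16LE with a terminating NUL. A minimal decode helper might look like the sketch below (the helper name is hypothetical; the actual bindings in pdfiumService may differ).

```typescript
// Hypothetical helper: decode the UTF-16LE buffer that FPDF_GetMetaText
// writes into WASM memory. PDFium appends a terminating NUL code unit,
// which is stripped here. The real pdfiumService binding names may differ.
function decodeMetaText(bytes: Uint8Array): string {
  const text = new TextDecoder("utf-16le").decode(bytes);
  const nul = text.indexOf("\u0000");
  return nul === -1 ? text : text.slice(0, nul);
}
```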
Hello @Frooodle,

Have you tested this locally and compared the outputs to ensure they're the same?
Hi @Frooodle, thank you for your consideration. I've thoroughly tested the changes again and wanted to share the details below.

✅ Acceptance Criteria Verified
The aim is to lose no functionality in any migration, so yes, custom metadata is required.
```typescript
function extractInfoDictCustomEntries(
  arrayBuffer: ArrayBuffer,
): CustomMetadataEntry[] {
  const text = new TextDecoder("latin1").decode(new Uint8Array(arrayBuffer));
```
Decoding the entire PDF ArrayBuffer into one latin1 string is memory/CPU heavy for large PDFs; consider scanning bytes or parsing only the trailer/info object without allocating the full decoded string.
Details
✨ AI Reasoning
The new function extractInfoDictCustomEntries decodes the entire PDF ArrayBuffer into a single latin1 string via new TextDecoder().decode(new Uint8Array(arrayBuffer)). This allocates a large JS string proportional to the file size and then performs additional scans/slices over that string. Since extractPDFMetadata runs this for each file processed, memory and CPU usage grow linearly with file size and can be significant for large PDFs. The work appears unnecessary if a streaming or targeted scan over relevant byte ranges could be used instead. The change was introduced in this PR and increases per-file memory/CPU cost.
🔧 How do I fix it?
Move constant work outside loops. Use StringBuilder instead of string concatenation in loops. Cache compiled regex patterns. Use hash-based lookups instead of nested loops. Batch database operations instead of N+1 queries.
🧠 Memory Optimization

Before

- `new TextDecoder().decode(new Uint8Array(arrayBuffer))` was used
- This allocates a full JavaScript string proportional to the file size (approximately 2× memory usage for latin-1)

After

- Decoding is limited to two small slices:
  - The last ~4 KB (trailer section)
  - ~2 KB around the Info object
- Memory usage is now effectively constant, regardless of overall file size

Note

- A byte-level scan (`findLastBytes`) is still required to locate the Info object
- This operation works directly on raw bytes and does not introduce large string allocations
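The byte-level scan described above can be sketched roughly as follows. This is a minimal sketch assuming the PR's `findLastBytes` is a backward pattern search over raw bytes; the actual implementation may differ.

```typescript
// Sketch of a backward byte-level search: find the last occurrence of
// `pattern` in `haystack` without decoding the whole buffer to a string.
// Returns the byte offset of the match, or -1 if not found.
function findLastBytes(haystack: Uint8Array, pattern: Uint8Array): number {
  outer: for (let i = haystack.length - pattern.length; i >= 0; i--) {
    for (let j = 0; j < pattern.length; j++) {
      if (haystack[i + j] !== pattern[j]) continue outer;
    }
    return i;
  }
  return -1;
}
```

Only a small region around the match would then be decoded, e.g. `new TextDecoder("latin1").decode(bytes.subarray(start, start + 2048))`, keeping string allocations constant regardless of file size.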
Hi @Frooodle, thank you for the clarification. Completely agreed, no functionality should be lost. I've now implemented custom metadata extraction to close that gap.

The root issue was that PDFium's FPDF_GetMetaText can only read by key name and has no API to enumerate all keys in the Info dictionary (unlike PDF.js, which returned non-standard keys under info.Custom). To replicate the old behaviour, the service now parses the raw PDF Info dictionary directly from the file bytes, extracts all non-standard key-value pairs, and returns them as customMetadata, matching the shape the old implementation produced. The parser handles the standard PDF string encodings (literal strings with escape sequences, hex strings, and UTF-16BE with BOM).

For PDFs using cross-reference streams (PDF 1.5+ compressed xref), it degrades gracefully to an empty array rather than throwing. This is the same effective result as before, since PDF.js also did not reliably expose custom metadata from xref-stream-only files.

TypeScript compiles with no new errors across the service and all its callers. Please let me know if you'd like any changes or have further questions; happy to iterate. Thank you again for your time and feedback.
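As an illustration of the string handling mentioned above, a hex-string decoder could look like the simplified sketch below. This is not the PR's actual parser: the function name is hypothetical, and latin1 is used as an approximation of PDFDocEncoding.

```typescript
// Simplified sketch: decode a PDF hex string such as <48656C6C6F>.
// A 0xFE 0xFF prefix marks UTF-16BE text; otherwise the bytes are
// treated as latin1 (an approximation of PDFDocEncoding).
function decodePdfHexString(hex: string): string {
  let digits = hex.replace(/[<>\s]/g, "");
  if (digits.length % 2 === 1) digits += "0"; // PDF pads an odd final digit with 0
  const bytes = new Uint8Array(digits.length / 2);
  for (let i = 0; i < bytes.length; i++) {
    bytes[i] = parseInt(digits.slice(2 * i, 2 * i + 2), 16);
  }
  if (bytes.length >= 2 && bytes[0] === 0xfe && bytes[1] === 0xff) {
    // Decode UTF-16BE manually, code unit by code unit; paired
    // surrogates combine correctly via String.fromCharCode.
    let out = "";
    for (let i = 2; i + 1 < bytes.length; i += 2) {
      out += String.fromCharCode((bytes[i] << 8) | bytes[i + 1]);
    }
    return out;
  }
  return new TextDecoder("latin1").decode(bytes);
}
```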
frontend validation is failing

Hi @Frooodle,
Description of Changes

- Migrated `pdfMetadataService.ts` from PDF.js (pdfWorkerManager/pdfjs-dist) to PDFium WASM (@embedpdf/pdfium) using the existing `pdfiumService` infrastructure
- All standard metadata fields are read via `FPDF_GetMetaText`
- `convertTrappedStatus` adapted to handle PDFium's plain string return ("True"/"False") instead of PDF.js's `{name: "True"}` object
- `customMetadata` returns an empty array: the PDFium C API provides no mechanism to enumerate arbitrary `/Info` dictionary keys
- The `MetadataExtractionResponse` interface is unchanged; no callers required updates

Closes #6232
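The `convertTrappedStatus` adaptation described in the list above can be sketched as follows. The signature and return type are assumptions for illustration; the actual helper in the PR may differ.

```typescript
// Sketch of the adaptation: PDF.js returned the trapped value as an
// object like { name: "True" }, while PDFium's FPDF_GetMetaText yields
// the plain string "True" / "False". Accept both shapes.
type TrappedStatus = "True" | "False" | "Unknown";

function convertTrappedStatus(value: unknown): TrappedStatus {
  const name =
    typeof value === "string"
      ? value
      : ((value as { name?: string } | null)?.name ?? "");
  if (name === "True" || name === "False") return name;
  return "Unknown";
}
```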
Checklist
General
Documentation
Translations (if applicable)
`scripts/counter_translation.py`

UI Changes (if applicable)
Testing (if applicable)
`task check` to verify linters, typechecks, and tests pass