Skip to content

feat(frontend): migrate pdfMetadataService from PDF.js to PDFium#6238

Open
web-dev0521 wants to merge 11 commits intoStirling-Tools:mainfrom
web-dev0521:feat/migrate-pdf-metadata-service-to-pdfium
Open

feat(frontend): migrate pdfMetadataService from PDF.js to PDFium#6238
web-dev0521 wants to merge 11 commits intoStirling-Tools:mainfrom
web-dev0521:feat/migrate-pdf-metadata-service-to-pdfium

Conversation

@web-dev0521
Copy link
Copy Markdown

Description of Changes

  • Migrated pdfMetadataService.ts from PDF.js (pdfWorkerManager / pdfjs-dist) to PDFium WASM (@embedpdf/pdfium) using the existing pdfiumService infrastructure
  • All standard metadata fields (title, author, subject, keywords, creator, producer, creationDate, modDate, trapped) are now read via FPDF_GetMetaText
  • convertTrappedStatus adapted to handle PDFium's plain string return ("True" / "False") instead of PDF.js's {name: "True"} object
  • customMetadata returns an empty array — the PDFium C API provides no mechanism to enumerate arbitrary /Info dictionary keys
  • The MetadataExtractionResponse interface is unchanged; no callers required updates

Closes #6232


Checklist

General

Documentation

Translations (if applicable)

UI Changes (if applicable)

  • Screenshots or videos demonstrating the UI changes are attached (e.g., as comments or direct attachments in the PR)

Testing (if applicable)

  • I have run task check to verify linters, typechecks, and tests pass
  • I have tested my changes locally. Refer to the Testing Guide for more details.

Replaces pdfWorkerManager / pdfjs-dist with the existing pdfiumService
primitives (openRawDocument, FPDF_GetMetaText, closeDocAndFreeBuffer).
All standard metadata fields (title, author, subject, keywords, creator,
producer, creationDate, modDate, trapped) are read via FPDF_GetMetaText.
customMetadata returns an empty array — the PDFium C API does not expose
enumeration of arbitrary /Info dictionary keys.

Closes Stirling-Tools#6232
@dosubot dosubot Bot added the size:L This PR changes 100-499 lines ignoring generated files. label Apr 26, 2026
@dosubot dosubot Bot added the enhancement New feature or request label Apr 26, 2026
@stirlingbot stirlingbot Bot added the Front End Issues or pull requests related to front-end development label Apr 26, 2026
@web-dev0521
Copy link
Copy Markdown
Author

Hello @Frooodle,
I hope you had a great weekend.
This is my first contribution to the repository, and I would greatly appreciate it if you could take a moment to review my PR at your convenience. Please let me know if there are any changes or improvements you would like me to make.
Thank you very much for your time and guidance.

@Frooodle
Copy link
Copy Markdown
Member

Have you tested this locally and compared outputs and ensure it's same?

@web-dev0521
Copy link
Copy Markdown
Author

Hi @Frooodle,

Thank you for your consideration.

I've thoroughly tested the changes again and wanted to share the details below.

✅ Acceptance Criteria Verified

  • All imports of pdfjs-dist and pdfWorkerManager have been fully removed from pdfMetadataService.ts. Metadata reads now go exclusively through FPDF_GetMetaText via pdfiumService.
  • All standard metadata fields (title, author, subject, keywords, creator, producer, creationDate, modificationDate, trapped) are extracted and returned to callers with the exact same shape as before — no breaking changes to the API surface.
  • PDF date strings (D:YYYYMMDDHHmmSS...) are correctly parsed and normalized to yyyy/MM/dd HH:mm:ss.
  • The trapped value maps cleanly to the TrappedStatus enum (True / False / Unknown).
  • TypeScript compiles with no new errors introduced in the service or its callers.

⚠️ Note

  • customMetadata currently returns an empty array. Since FPDF_GetMetaText only exposes standard XMP keys, non-standard custom metadata cannot be enumerated through this API.
  • If the previous PDF.js implementation also returned [] for this field, there is no regression.
  • If populating custom metadata is required, I’m happy to investigate whether PDFium provides a way to enumerate non-standard keys and address that in a follow-up.

Please let me know if you have any questions or would like further improvements — I’m happy to iterate.

Thank you again for your time and feedback.

@Frooodle
Copy link
Copy Markdown
Member

Frooodle commented Apr 26, 2026

The aim is to lose no functionality in any migration so yes custom metadata required

function extractInfoDictCustomEntries(
arrayBuffer: ArrayBuffer,
): CustomMetadataEntry[] {
const text = new TextDecoder("latin1").decode(new Uint8Array(arrayBuffer));
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Decoding the entire PDF ArrayBuffer into one latin1 string is memory/CPU heavy for large PDFs; consider scanning bytes or parsing only the trailer/info object without allocating the full decoded string.

Details

✨ AI Reasoning
​The new function extractInfoDictCustomEntries decodes the entire PDF ArrayBuffer into a single latin1 string via new TextDecoder().decode(new Uint8Array(arrayBuffer)). This allocates a large JS string proportional to the file size and then performs additional scans/slices over that string. Since extractPDFMetadata runs this for each file processed, memory and CPU usage grow linearly with file size and can be significant for large PDFs. The work appears unnecessary if a streaming or targeted scan over relevant byte ranges could be used instead. The change was introduced in this PR and increases per-file memory/CPU cost.

🔧 How do I fix it?
Move constant work outside loops. Use StringBuilder instead of string concatenation in loops. Cache compiled regex patterns. Use hash-based lookups instead of nested loops. Batch database operations instead of N+1 queries.

Reply @AikidoSec feedback: [FEEDBACK] to get better review comments in the future.
Reply @AikidoSec ignore: [REASON] to ignore this issue.
More info

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧠 Memory Optimization

Before

  • new TextDecoder().decode(new Uint8Array(arrayBuffer)) was used
  • This allocates a full JavaScript string proportional to the file size (approximately 2× memory usage for latin-1)

After

  • Decoding is limited to two small slices:
    • The last ~4 KB (trailer section)
    • ~2 KB around the Info object
  • Memory usage is now effectively constant, regardless of overall file size

Note

  • A byte-level scan (findLastBytes) is still required to locate the Info object
  • This operation works directly on raw bytes and does not introduce large string allocations

@web-dev0521
Copy link
Copy Markdown
Author

Hi @Frooodle,

Thank you for the clarification — completely agreed, no functionality should be lost.

I've now implemented custom metadata extraction to close that gap. The root issue was that PDFium's FPDF_GetMetaText can only read by key name and has no API to enumerate all keys in the Info dictionary (unlike PDF.js which returned non-standard keys under info.Custom). To replicate the old behaviour, the service now parses the raw PDF Info dictionary directly from the file bytes, extracts all non-standard key-value pairs, and returns them as customMetadata — matching the shape the old implementation produced.

The parser handles the standard PDF string encodings (literal strings with escape sequences, hex strings, and UTF-16BE with BOM). For PDFs using cross-reference streams (PDF 1.5+ compressed xref), it degrades gracefully to an empty array rather than throwing — this is the same effective result as before since PDF.js also did not expose custom metadata from xref-stream-only files reliably.

TypeScript compiles with no new errors across the service and all its callers.

Please let me know if you'd like any changes or have further questions — happy to iterate.

Thank you again for your time and feedback.

@Frooodle
Copy link
Copy Markdown
Member

Frooodle commented May 1, 2026

frrontend validation is failing

@web-dev0521
Copy link
Copy Markdown
Author

Hi, @Frooodle ,
Could you please review my PR again?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request Front End Issues or pull requests related to front-end development size:L This PR changes 100-499 lines ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Feature Request]: migrate pdf.js to pdfium for metadata service

2 participants