feat(frontend): migrate pdfMetadataService from PDF.js to PDFium #6238
web-dev0521 wants to merge 11 commits into Stirling-Tools:main
Conversation
Replaces pdfWorkerManager / pdfjs-dist with the existing pdfiumService primitives (openRawDocument, FPDF_GetMetaText, closeDocAndFreeBuffer). All standard metadata fields (title, author, subject, keywords, creator, producer, creationDate, modDate, trapped) are read via FPDF_GetMetaText. customMetadata returns an empty array — the PDFium C API does not expose enumeration of arbitrary /Info dictionary keys. Closes Stirling-Tools#6232
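To illustrate the FPDF_GetMetaText read path described above: PDFium writes metadata values into a buffer as UTF-16LE with a terminating NUL. A minimal decode helper might look like the sketch below (the helper name is hypothetical; the actual bindings in pdfiumService may differ).

```typescript
// Hypothetical helper: decode the UTF-16LE buffer that FPDF_GetMetaText
// writes into WASM memory. PDFium appends a terminating NUL code unit,
// which is stripped here. The real pdfiumService binding names may differ.
function decodeMetaText(bytes: Uint8Array): string {
  const text = new TextDecoder("utf-16le").decode(bytes);
  const nul = text.indexOf("\u0000");
  return nul === -1 ? text : text.slice(0, nul);
}
```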
Hello @Frooodle,

Have you tested this locally and compared the outputs to ensure they're the same?
Hi @Frooodle, thank you for your consideration. I've thoroughly tested the changes again and wanted to share the details below.

✅ Acceptance Criteria Verified
The aim is to lose no functionality in any migration, so yes, custom metadata is required.
```typescript
function extractInfoDictCustomEntries(
  arrayBuffer: ArrayBuffer,
): CustomMetadataEntry[] {
  const text = new TextDecoder("latin1").decode(new Uint8Array(arrayBuffer));
```
Decoding the entire PDF ArrayBuffer into one latin1 string is memory/CPU heavy for large PDFs; consider scanning bytes or parsing only the trailer/info object without allocating the full decoded string.
Details
✨ AI Reasoning
The new function extractInfoDictCustomEntries decodes the entire PDF ArrayBuffer into a single latin1 string via new TextDecoder().decode(new Uint8Array(arrayBuffer)). This allocates a large JS string proportional to the file size and then performs additional scans/slices over that string. Since extractPDFMetadata runs this for each file processed, memory and CPU usage grow linearly with file size and can be significant for large PDFs. The work appears unnecessary if a streaming or targeted scan over relevant byte ranges could be used instead. The change was introduced in this PR and increases per-file memory/CPU cost.
🔧 How do I fix it?
Move constant work outside loops. Use StringBuilder instead of string concatenation in loops. Cache compiled regex patterns. Use hash-based lookups instead of nested loops. Batch database operations instead of N+1 queries.
🧠 Memory Optimization

Before

- `new TextDecoder().decode(new Uint8Array(arrayBuffer))` was used
- This allocates a full JavaScript string proportional to the file size (approximately 2× memory usage for latin-1)

After

- Decoding is limited to two small slices:
  - The last ~4 KB (trailer section)
  - ~2 KB around the Info object
- Memory usage is now effectively constant, regardless of overall file size

Note

- A byte-level scan (`findLastBytes`) is still required to locate the Info object
- This operation works directly on raw bytes and does not introduce large string allocations
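The byte-level scan described above can be sketched roughly as follows. This is a minimal sketch assuming the PR's `findLastBytes` is a backward pattern search over raw bytes; the actual implementation may differ.

```typescript
// Sketch of a backward byte-level search: find the last occurrence of
// `pattern` in `haystack` without decoding the whole buffer to a string.
// Returns the byte offset of the match, or -1 if not found.
function findLastBytes(haystack: Uint8Array, pattern: Uint8Array): number {
  outer: for (let i = haystack.length - pattern.length; i >= 0; i--) {
    for (let j = 0; j < pattern.length; j++) {
      if (haystack[i + j] !== pattern[j]) continue outer;
    }
    return i;
  }
  return -1;
}
```

Only a small region around the match would then be decoded, e.g. `new TextDecoder("latin1").decode(bytes.subarray(start, start + 2048))`, keeping string allocations constant regardless of file size.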
Hi @Frooodle, thank you for the clarification. Completely agreed, no functionality should be lost. I've now implemented custom metadata extraction to close that gap.

The root issue was that PDFium's FPDF_GetMetaText can only read by key name and has no API to enumerate all keys in the Info dictionary (unlike PDF.js, which returned non-standard keys under info.Custom). To replicate the old behaviour, the service now parses the raw PDF Info dictionary directly from the file bytes, extracts all non-standard key-value pairs, and returns them as customMetadata, matching the shape the old implementation produced. The parser handles the standard PDF string encodings (literal strings with escape sequences, hex strings, and UTF-16BE with BOM).

For PDFs using cross-reference streams (PDF 1.5+ compressed xref), it degrades gracefully to an empty array rather than throwing. This is the same effective result as before, since PDF.js also did not reliably expose custom metadata from xref-stream-only files.

TypeScript compiles with no new errors across the service and all its callers. Please let me know if you'd like any changes or have further questions; happy to iterate. Thank you again for your time and feedback.
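As an illustration of the string handling mentioned above, a hex-string decoder could look like the simplified sketch below. This is not the PR's actual parser: the function name is hypothetical, and latin1 is used as an approximation of PDFDocEncoding.

```typescript
// Simplified sketch: decode a PDF hex string such as <48656C6C6F>.
// A 0xFE 0xFF prefix marks UTF-16BE text; otherwise the bytes are
// treated as latin1 (an approximation of PDFDocEncoding).
function decodePdfHexString(hex: string): string {
  let digits = hex.replace(/[<>\s]/g, "");
  if (digits.length % 2 === 1) digits += "0"; // PDF pads an odd final digit with 0
  const bytes = new Uint8Array(digits.length / 2);
  for (let i = 0; i < bytes.length; i++) {
    bytes[i] = parseInt(digits.slice(2 * i, 2 * i + 2), 16);
  }
  if (bytes.length >= 2 && bytes[0] === 0xfe && bytes[1] === 0xff) {
    // Decode UTF-16BE manually, code unit by code unit; paired
    // surrogates combine correctly via String.fromCharCode.
    let out = "";
    for (let i = 2; i + 1 < bytes.length; i += 2) {
      out += String.fromCharCode((bytes[i] << 8) | bytes[i + 1]);
    }
    return out;
  }
  return new TextDecoder("latin1").decode(bytes);
}
```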
frontend validation is failing

Hi @Frooodle,
Description of Changes

- Migrated `pdfMetadataService.ts` from PDF.js (pdfWorkerManager/pdfjs-dist) to PDFium WASM (@embedpdf/pdfium) using the existing `pdfiumService` infrastructure
- All standard metadata fields are read via `FPDF_GetMetaText`
- `convertTrappedStatus` adapted to handle PDFium's plain string return ("True"/"False") instead of PDF.js's `{name: "True"}` object
- `customMetadata` returns an empty array: the PDFium C API provides no mechanism to enumerate arbitrary `/Info` dictionary keys
- The `MetadataExtractionResponse` interface is unchanged; no callers required updates

Closes #6232
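The `convertTrappedStatus` adaptation described in the list above can be sketched as follows. The signature and return type are assumptions for illustration; the actual helper in the PR may differ.

```typescript
// Sketch of the adaptation: PDF.js returned the trapped value as an
// object like { name: "True" }, while PDFium's FPDF_GetMetaText yields
// the plain string "True" / "False". Accept both shapes.
type TrappedStatus = "True" | "False" | "Unknown";

function convertTrappedStatus(value: unknown): TrappedStatus {
  const name =
    typeof value === "string"
      ? value
      : ((value as { name?: string } | null)?.name ?? "");
  if (name === "True" || name === "False") return name;
  return "Unknown";
}
```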
Checklist
General
Documentation
Translations (if applicable)
`scripts/counter_translation.py`

UI Changes (if applicable)
Testing (if applicable)
`task check` to verify linters, typechecks, and tests pass