Custom n8n node for PDF inspection and splitting using pure npm packages.
- Analyzes PDF structure
- Counts pages
- Detects if PDF is vectorial (text-based) or rasterized (image-based)
- Extracts text from first page
- Performance: Very fast (tens of milliseconds)
- Splits multi-page PDFs into individual pages
- Creates one output item per page
- Preserves PDF quality and structure
npm install n8n-nodes-pdf-utils
- Clone this repository
- Install dependencies:
npm install
- Build the node:
npm run build
- Link to your n8n installation:
npm link
cd ~/.n8n/nodes
npm link n8n-nodes-pdf-utils
- Restart n8n
- Go to Settings > Community Nodes
- Click Install
- Enter: n8n-nodes-pdf-utils
- Click Install
Input: Binary data containing a PDF file
Parameters:
- Binary Property: Name of the binary property (default: "data")
- Text Threshold: Minimum text length to consider PDF as vectorial (default: 50)
Output: Single item with analysis + original PDF binary
{
"json": {
"pageCount": 5,
"isMultiPage": true,
"isVectorial": false,
"textLength": 23,
"firstPageText": "Preview of first 200 characters..."
},
"binary": {
"data": "<original PDF>"
}
}
Example workflow:
HTTP Request (download PDF)
→ PDF Utils (Inspect)
→ IF (isVectorial)
→ Route A (text processing with PDF)
→ Route B (OCR processing with PDF)
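The Text Threshold decision behind `isVectorial` can be sketched roughly as follows. This is a minimal illustration and not the node's actual source: the function name `inspectFromText` and the `InspectResult` interface are hypothetical, and it assumes the check is "trimmed first-page text length is at least the threshold", which may differ from the real implementation.

```typescript
// Hypothetical shape mirroring the JSON output documented above.
interface InspectResult {
  pageCount: number;
  isMultiPage: boolean;
  isVectorial: boolean;
  textLength: number;
  firstPageText: string;
}

// Assumption: the PDF counts as vectorial when the trimmed first-page text
// has at least `textThreshold` characters. The real node may measure text
// differently.
function inspectFromText(
  firstPageText: string,
  pageCount: number,
  textThreshold = 50,
): InspectResult {
  const text = firstPageText.trim();
  return {
    pageCount,
    isMultiPage: pageCount > 1,
    isVectorial: text.length >= textThreshold,
    textLength: text.length,
    // Keep only a short preview, matching the 200-character preview above.
    firstPageText: text.slice(0, 200),
  };
}
```

With the default threshold of 50, a five-page scan whose first page yields only 23 characters of text would be reported as non-vectorial and routed to OCR by the IF node.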
Input: Binary data containing a PDF file
Parameters:
- Binary Property: Name of the binary property (default: "data")
- Text Threshold: Minimum text length to consider PDF as vectorial (default: 50)
- Output Binary Property: Name for output binary property (default: "data")
Output:
- If vectorial: Single item with analysis + original PDF (pass-through)
- If not vectorial: Multiple items, one per page (split)
Example workflow:
HTTP Request (download PDF)
→ PDF Utils (Inspect and Split)
→ Vectorial PDFs pass through as-is
→ Scanned PDFs split into pages automatically
Use case: Automatically handle different PDF types without manual branching:
- Text-based PDFs (vectorial) → process as whole document
- Scanned PDFs (non-vectorial) → OCR each page individually
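The pass-through versus split behavior described above could look roughly like this. It is a sketch under stated assumptions: `PdfItem`, `inspectAndSplit`, and the injected `splitPages` callback are hypothetical names, and the item shape is a simplification of real n8n items (which carry binary data in their own structure).

```typescript
// Hypothetical item shape, loosely modeled on n8n items (json + binary).
interface PdfItem {
  json: Record<string, unknown>;
  binary: Record<string, Uint8Array>;
}

// splitPages is injected so the sketch stays independent of any PDF library.
function inspectAndSplit(
  pdf: Uint8Array,
  analysis: { isVectorial: boolean; pageCount: number },
  splitPages: (pdf: Uint8Array) => Uint8Array[],
  outputBinaryProperty = "data",
): PdfItem[] {
  if (analysis.isVectorial) {
    // Text-based PDF: emit a single item carrying the original document.
    return [{ json: { ...analysis }, binary: { [outputBinaryProperty]: pdf } }];
  }
  // Scanned PDF: emit one item per page so each page can be sent to OCR.
  return splitPages(pdf).map((page, index) => ({
    json: { ...analysis, pageNumber: index + 1 },
    binary: { [outputBinaryProperty]: page },
  }));
}
```

Keeping the branch inside one operation is what removes the need for a separate IF node in the workflow: downstream nodes simply receive either one item or one item per page.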
Input: Binary data containing a multi-page PDF
Parameters:
- Binary Property: Name of the input binary property (default: "data")
- Output Binary Property: Name for output binary property (default: "data")
Output: Multiple items, one per page
- Each item contains binary data with a single-page PDF
- JSON includes pageNumber and originalFileName
Example workflow:
HTTP Request (download PDF)
→ PDF Utils (Split)
→ Loop Over Items
→ Process each page individually
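As an illustration of the per-page output described above, the sketch below builds one item per page with `pageNumber` and `originalFileName` in the JSON. The helper `toSplitItems`, the `SplitItem` interface, and the `_page_N.pdf` naming scheme are assumptions made for the example, not the node's documented behavior; the page buffers are assumed to have been produced already by the splitting step.

```typescript
// Hypothetical per-page item; real n8n binary data has its own structure.
interface SplitItem {
  json: { pageNumber: number; originalFileName: string };
  binary: Record<string, { bytes: Uint8Array; fileName: string; mimeType: string }>;
}

function toSplitItems(
  pageBuffers: Uint8Array[],
  originalFileName: string,
  outputBinaryProperty = "data",
): SplitItem[] {
  // Strip a trailing ".pdf" so page files can get a readable suffix.
  const baseName = originalFileName.replace(/\.pdf$/i, "");
  return pageBuffers.map((bytes, index) => {
    const pageNumber = index + 1; // pages are numbered from 1
    return {
      json: { pageNumber, originalFileName },
      binary: {
        [outputBinaryProperty]: {
          bytes,
          // Assumed naming scheme, chosen only for this example.
          fileName: `${baseName}_page_${pageNumber}.pdf`,
          mimeType: "application/pdf",
        },
      },
    };
  });
}
```

Carrying `originalFileName` on every item lets a later step regroup or rename pages after Loop Over Items has processed them individually.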
- pdfjs-dist (v5.4.394): For PDF analysis and text extraction (uses legacy build for Node.js)
- pdf-lib (v1.17.1): For PDF manipulation and splitting
- pdfjs-dist: Mozilla's PDF.js library - battle-tested, used in Firefox (headless mode, no canvas needed). We use the legacy build (pdfjs-dist/legacy/build/pdf.mjs) which is specifically designed for Node.js environments without DOM dependencies.
- pdf-lib: Pure JavaScript, no native dependencies, excellent for manipulation
- 100% npm packages: No system-level dependencies (like Poppler, Ghostscript) and no canvas/native modules!
- Inspect: Very fast (~10-50ms for typical PDFs)
- Split: Fast, scales linearly with page count (~50-200ms per page)
# Install dependencies
npm install
# Build
npm run build
# Watch mode for development
npm run dev
# Lint
npm run lint
# Format code
npm run format
- Ensure n8n is restarted after installation
- Check that the node is in ~/.n8n/nodes or installed globally
- Verify package.json has the correct n8n.nodes configuration
If you encounter issues with pdfjs-dist, ensure you're using Node.js 16 or higher:
node --version # Should be v16.0.0 or higher
MIT
Roberto Michelena - INFINITEK S.A.C.
Contributions are welcome! Please open an issue or submit a pull request.
- Add merge operation
- Add extract pages by range
- Add rotate pages operation
- Add compress PDF operation
- Add watermark operation