Commit 7a19ee0

Initial commit
0 parents  commit 7a19ee0

18 files changed: +6737 -0 lines changed

.babelrc

Lines changed: 7 additions & 0 deletions
@@ -0,0 +1,7 @@
{
  "presets": [["airbnb", {
    "targets": {
      "node": 8
    }
  }]]
}

.eslintrc

Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
{
  "extends": "airbnb-base"
}

.gitignore

Lines changed: 2 additions & 0 deletions
@@ -0,0 +1,2 @@
node_modules
*.log

LICENSE

Lines changed: 21 additions & 0 deletions
@@ -0,0 +1,21 @@
MIT License

Copyright (c) 2020 Malcolm Nihlén

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

README.md

Lines changed: 166 additions & 0 deletions
@@ -0,0 +1,166 @@
# simple-pdf

`simple-pdf` aims to be a simple drop-in module for extracting text and images
from PDF files. It exposes a promise-based and an event-based API.

## Table of contents
- [Features](#features)
- [Reasons not to use this library](#reasons-not-to-use-this-library)
- [Minimal example](#minimal-example)
- [Installation](#installation)
- [Docs](#docs)
  - [Options](#options)
  - [Basic parsing](#basic-parsing)
  - [Advanced parsing](#advanced-parsing)
- [Contributing](#contributing)
- [License](#license)

## Features

- Extracts both text and images
- Handles most image encodings

## Reasons not to use this library

Let's be real. This might not be the library for you. Here are a few reasons why.

- **Slow with images** - Images can be embedded in a PDF in many different ways. To ensure that all types of images can be extracted, we render the whole PDF and then use [sharp](https://github.com/lovell/sharp) to extract the images from the rendered page. This adds extra processing time for pages that contain images (unless you disable image extraction).
- **New to the game** - This library is brand new and hasn't been battle-tested yet. If you're looking for a reliable solution, this library might not be the best choice for you.
- **No automated testing** - Though I'm working on this 🙃

## Minimal example

More examples can be found in the `examples` directory.

```javascript
const fs = require('fs');
const { SimplePDFParser } = require('simple-pdf');

const fileBuffer = fs.readFileSync('somefile.pdf');

const parser = new SimplePDFParser(fileBuffer);

parser.parse().then((result) => {
  console.log(result);
});
```

## Installation

```bash
npm i simple-pdf
```

## Docs

The only exposed interface is the `SimplePDFParser` class. It takes a `Buffer` containing a PDF file as well as an optional options object.

```javascript
new SimplePDFParser(fileBuffer, {
  // options
})
```

### Options
|Option|Value type|Default value|Description|
|-|-|-|-|
|`paragraphThreshold`|integer|`25`|The minimum distance on the y-axis between two lines for them to be considered part of separate paragraphs. This option only affects the `parse` method.|
|`lineThreshold`|integer|`1`|The maximum distance on the y-axis between two text elements for them to be considered part of the same line. Coordinates in PDFs often suffer from floating-point imprecision, and this value gives a little room for error. You shouldn't have to change it unless you're dealing with PDFs generated with OCR or other odd PDFs.|
|`imageScale`|integer|`2`|Scaling applied to the PDF before extracting images. A higher value results in greater image resolution, but increases rendering time quadratically.|
|`extractImages`|boolean|`true`|Controls whether or not to extract images. Image extraction requires rendering each page, which might take a long time depending on the size of the PDF, the configured `imageScale` and the underlying hardware. If you don't need to extract images, setting this option to `false` is recommended.|
|`ignoreEmptyText`|boolean|`true`|Controls whether or not to ignore empty text elements. Text elements are considered empty if their text content contains nothing but whitespace.|
|`joinParagraphs`|boolean|`false`|Controls whether or not to join paragraphs. Enabling this option joins all lines that are not separated by a non-text element (paragraph break or image), so that each line of the output effectively contains a whole paragraph. Paragraph breaks are omitted from the final output. This option only affects the `parse` method.|
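
To make the two thresholds more concrete, here is a minimal sketch of how positioned text elements could be grouped into lines and paragraphs by their y-coordinates. It is an illustration only and not necessarily how this library does it: `groupLines`, the `paragraphBreak` type name and the sample coordinates are all assumptions, and the sketch assumes that `y` grows upwards, as in PDF user space.

```javascript
// Hypothetical helper, not part of simple-pdf: groups positioned text
// elements into lines and inserts paragraph breaks between distant lines.
function groupLines(elements, { lineThreshold = 1, paragraphThreshold = 25 } = {}) {
  // Assumption: y grows upwards, so a larger y means higher up on the page.
  const sorted = [...elements].sort((a, b) => b.y - a.y);
  const output = [];
  let line = null; // elements collected for the line being built
  let lineY = null; // y-coordinate of the first element on that line

  // Emit the current line with its elements ordered left to right.
  const flush = () => {
    if (!line) return;
    line.sort((a, b) => a.x - b.x);
    output.push({ type: 'text', items: [].concat(...line.map((el) => el.items)) });
  };

  sorted.forEach((el) => {
    const gap = line ? Math.abs(lineY - el.y) : 0;
    if (line && gap <= lineThreshold) {
      line.push(el); // within tolerance: same line
      return;
    }
    flush();
    if (line && gap >= paragraphThreshold) {
      output.push({ type: 'paragraphBreak' }); // assumed type name
    }
    line = [el];
    lineY = el.y;
  });
  flush();
  return output;
}

const grouped = groupLines([
  { x: 50, y: 700.0, items: [{ text: 'world' }] },
  { x: 10, y: 700.4, items: [{ text: 'Hello ' }] },
  { x: 10, y: 688, items: [{ text: 'Second line' }] },
  { x: 10, y: 640, items: [{ text: 'New paragraph' }] },
]);

console.log(grouped.map((entry) => entry.type));
// prints [ 'text', 'text', 'paragraphBreak', 'text' ]
```

With the default thresholds, the two elements that are 0.4 units apart end up on the same line, while the 48-unit gap before the last element produces a paragraph break.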

### Basic parsing

This is probably the easiest way to use this library. It parses all pages in parallel and returns the result when finished. Paragraphs and lines are automatically joined based on the options passed to the constructor.

*Example:*
```javascript
const parser = new SimplePDFParser(fileBuffer);

const result = await parser.parse();
```

*Result:*
```javascript
[
  {
    "type": "text",
    "pageIndex": 0,
    "items": [
      {
        "text": "Lorem ipsum",
        "font": "g_d0_f1"
      }
    ]
  },
  {
    "type": "image",
    "pageIndex": 0,
    "imageBuffer": Buffer
  }
]
```
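
As a rough illustration of how a result with this shape could be consumed, the sketch below flattens it into plain text. `toPlainText` and `sampleResult` are hypothetical names made up for this example; the sample data only mimics the structure shown above, so the snippet runs without a PDF.

```javascript
// Sample data shaped like the documented result of parse().
const sampleResult = [
  {
    type: 'text',
    pageIndex: 0,
    items: [
      { text: 'Lorem ', font: 'g_d0_f1' },
      { text: 'ipsum', font: 'g_d0_f2' },
    ],
  },
  { type: 'image', pageIndex: 0, imageBuffer: Buffer.alloc(0) },
  {
    type: 'text',
    pageIndex: 1,
    items: [{ text: 'Dolor sit amet', font: 'g_d0_f1' }],
  },
];

// Hypothetical helper: keeps only text elements, joins the items of each
// element into one string and puts every element on its own line.
function toPlainText(result) {
  return result
    .filter((element) => element.type === 'text')
    .map((element) => element.items.map((item) => item.text).join(''))
    .join('\n');
}

console.log(toPlainText(sampleResult));
```

Items within one element are joined without a separator because a single line may be split into several items, for example when the font changes mid-line.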

### Advanced parsing

If you need more granular control over the resulting data structure, you might want to use advanced parsing. You can either just await the result or use the events to process each page as soon as it has been parsed. Note that pages are not guaranteed to be returned in order.

*Example:*
```javascript
const parser = new SimplePDFParser(fileBuffer);

// Called with each page
parser.on('page', (page) => {
  console.log(`Page ${page.index}:`);
  console.log('Text elements: ', page.textElements);
  console.log('Image elements:', page.imageElements);
});

// Called when the parsing is finished
parser.on('done', () => {
  console.log('Parser done');
});

// parseRaw() must be called even if you only use the events API; in that case you may ignore the return value
const result = await parser.parseRaw();
```

*Result (each page):*

```javascript
{
  index: 0, // Page index
  textElements: [{
    x: 123.456,
    y: 654.321,
    items: [{
      text: 'Lorem ipsum',
      font: 'g_d0_f1'
    }]
  }],
  imageElements: [{
    x: 4.2,
    y: 83.11,
    width: 120,
    height: 80,
    imageBuffer: Buffer
  }]
}
```
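
Because pages may arrive out of order, one option is to collect them and sort by `index` once `done` fires. The sketch below shows that idea under a few assumptions: `collectPagesInOrder` is a hypothetical helper, and a plain Node.js `EventEmitter` stands in for the parser so that the snippet runs without a PDF. With the real library you would pass a `SimplePDFParser` instance and call `parser.parseRaw()` instead of emitting events by hand.

```javascript
const { EventEmitter } = require('events');

// Hypothetical helper: collects 'page' events and resolves with the pages
// sorted by index once 'done' fires.
function collectPagesInOrder(parser) {
  return new Promise((resolve) => {
    const pages = [];
    parser.on('page', (page) => pages.push(page));
    parser.on('done', () => resolve(pages.sort((a, b) => a.index - b.index)));
  });
}

// Stand-in for the parser: emits three pages out of order, then 'done'.
const fakeParser = new EventEmitter();
const ordered = collectPagesInOrder(fakeParser);

[2, 0, 1].forEach((index) => {
  fakeParser.emit('page', { index, textElements: [], imageElements: [] });
});
fakeParser.emit('done');

ordered.then((pages) => {
  console.log(pages.map((page) => page.index)); // prints [ 0, 1, 2 ]
});
```

Note that this buffers every page in memory, which gives up the main benefit of the events API. If memory matters, process each page as it arrives and use `page.index` to place the output.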

## Contributing

Contributions and PRs are very welcome! PRs should target the `develop` branch.

We use the [Airbnb style guide](https://github.com/airbnb/javascript). Please
run ESLint before committing any changes:
```bash
npx eslint src
npx eslint src --fix
```

## License

This project is licensed under the MIT license.

examples/events.js

Lines changed: 37 additions & 0 deletions
@@ -0,0 +1,37 @@
/* eslint-disable no-restricted-syntax */

const fs = require('fs');
const util = require('util');
const path = require('path');

const { SimplePDFParser } = require('../lib');

// Promisifying fs.readFile so that we can use it with promises
const readFile = util.promisify(fs.readFile);

// Wrap everything in an async function so that we can use async/await
async function start() {
  // const fileBuffer = await readFile(path.join(__dirname, './pdfs/text-only.pdf'));
  const fileBuffer = await readFile(path.join(__dirname, './pdfs/images-and-formatting.pdf'));

  // Create a new parser without options
  const parser = new SimplePDFParser(fileBuffer);

  // Called with each page
  parser.on('page', (page) => {
    console.log(`Page ${page.index}:`);
    console.log('Text elements: ', page.textElements);
    console.log('Image elements:', page.imageElements);
  });

  // Called when the parsing is finished
  parser.on('done', () => {
    console.log('Parser done');
  });

  // Start the parser and wait for it to finish
  await parser.parseRaw();
}

// Start the program
start();

examples/pdfs/images-and-formatting.pdf

546 KB
Binary file not shown.

examples/pdfs/text-only.pdf

26.4 KB
Binary file not shown.

examples/promises.js

Lines changed: 37 additions & 0 deletions
@@ -0,0 +1,37 @@
/* eslint-disable no-restricted-syntax */

const fs = require('fs');
const util = require('util');
const path = require('path');

const { SimplePDFParser } = require('../lib');

// Promisifying fs.readFile so that we can use it with promises
const readFile = util.promisify(fs.readFile);

// Wrap everything in an async function so that we can use async/await
async function start() {
  // const fileBuffer = await readFile(path.join(__dirname, './pdfs/text-only.pdf'));
  const fileBuffer = await readFile(path.join(__dirname, './pdfs/images-and-formatting.pdf'));

  // Create a new parser with joinParagraphs enabled
  const parser = new SimplePDFParser(fileBuffer, {
    joinParagraphs: true,
  });

  // Run the parser
  const result = await parser.parse();

  // Print each line
  for (const line of result) {
    // If it's a text line, print it. Else, print the type.
    if (line.type === 'text') {
      console.log(line.items.map((item) => item.text).join(''));
    } else {
      console.log(`[${line.type}]`);
    }
  }
}

// Start the program
start();

lib/NodeCanvasFactory.js

Lines changed: 53 additions & 0 deletions
@@ -0,0 +1,53 @@
"use strict";

var _interopRequireDefault = require("@babel/runtime/helpers/interopRequireDefault");

Object.defineProperty(exports, "__esModule", {
  value: true
});
exports["default"] = void 0;

var _canvas = _interopRequireDefault(require("canvas"));

var _assert = require("assert");

/* eslint-disable no-param-reassign */

/* eslint-disable class-methods-use-this */

/*
 * This code was taken from https://github.com/mozilla/pdf.js/blob/master/examples/node/pdf2png/pdf2png.js
 */
class NodeCanvasFactory {
  create(width, height) {
    (0, _assert.strict)(width > 0 && height > 0, 'Invalid canvas size');

    const canvas = _canvas["default"].createCanvas(width, height);

    const context = canvas.getContext('2d');
    return {
      canvas,
      context
    };
  }

  reset(canvasAndContext, width, height) {
    (0, _assert.strict)(canvasAndContext.canvas, 'Canvas is not specified');
    (0, _assert.strict)(width > 0 && height > 0, 'Invalid canvas size');
    canvasAndContext.canvas.width = width;
    canvasAndContext.canvas.height = height;
  }

  destroy(canvasAndContext) {
    (0, _assert.strict)(canvasAndContext.canvas, 'Canvas is not specified'); // Zeroing the width and height cause Firefox to release graphics
    // resources immediately, which can greatly reduce memory consumption.

    canvasAndContext.canvas.width = 0;
    canvasAndContext.canvas.height = 0;
    canvasAndContext.canvas = null;
    canvasAndContext.context = null;
  }

}

exports["default"] = NodeCanvasFactory;
