|
| 1 | +# simple-pdf |
| 2 | + |
| 3 | +`simple-pdf` aims to be a simple drop-in module for extracting text and images |
| 4 | +from PDF files. It exposes a promise-based and an event-based API. |
| 5 | + |
| 6 | +## Table of contents |
| 7 | +- [Features](#features) |
| 8 | +- [Reasons not to use this library](#reasons-not-to-use-this-library) |
| 9 | +- [Minimal example](#minimal-example) |
| 10 | +- [Installation](#installation) |
| 11 | +- [Docs](#docs) |
| 12 | +  - [Options](#options) |
| 13 | +  - [Basic parsing](#basic-parsing) |
| 14 | +  - [Advanced parsing](#advanced-parsing) |
| 15 | +- [Contributing](#contributing) |
| 16 | +- [License](#license) |
| 17 | + |
| 18 | +## Features |
| 19 | + |
| 20 | +- Extracts both text and images |
| 21 | +- Handles most image encodings |
| 22 | + |
| 23 | +## Reasons not to use this library |
| 24 | + |
| 25 | +Let's be real. This might not be the library for you. Here are a few reasons why. |
| 26 | + |
| 27 | +- **Slow with images** - Images can be embedded in a PDF in many different ways. To ensure that all types of images can be extracted, we render the whole PDF and then use [sharp](https://github.com/lovell/sharp) to extract the images from the rendered page. This adds extra processing time for pages that contain images (unless you disable image extraction). |
| 28 | +- **New to the game** - This library is brand new and hasn't been battle-tested yet. If you're looking for a reliable solution, this library might not be the best choice for you. |
| 29 | +- **No automated testing** - Though I'm working on this 🙃 |
| 30 | + |
| 31 | +## Minimal example |
| 32 | + |
| 33 | +More examples can be found in the `examples` directory. |
| 34 | + |
| 35 | +```javascript |
| 36 | +const fs = require('fs'); |
| 37 | +const { SimplePDFParser } = require('simple-pdf'); |
| 38 | + |
| 39 | +const fileBuffer = fs.readFileSync('somefile.pdf'); |
| 40 | + |
| 41 | +const parser = new SimplePDFParser(fileBuffer); |
| 42 | + |
| 43 | +parser.parse().then((result) => { |
| 44 | +  console.log(result); |
| 45 | +}); |
| 46 | +``` |
| 47 | + |
| 48 | +## Installation |
| 49 | + |
| 50 | +```bash |
| 51 | +npm i simple-pdf |
| 52 | +``` |
| 53 | + |
| 54 | +## Docs |
| 55 | + |
| 56 | +The only exposed interface is the `SimplePDFParser` class. It takes a `Buffer` containing a PDF file as well as an optional options object. |
| 57 | + |
| 58 | +```javascript |
| 59 | +new SimplePDFParser(fileBuffer, { |
| 60 | +  // options |
| 61 | +}) |
| 62 | +``` |
| 63 | + |
| 64 | +### Options |
| 65 | +|Option|Value type|Default value|Description| |
| 66 | +|-|-|-|-| |
| 67 | +|`paragraphThreshold`|integer|`25`|The minimum distance between two lines on the y-axis to consider them part of separate paragraphs. This option only affects the `parse` method. |
| 68 | +|`lineThreshold`|integer|`1`|The maximum distance between two text elements on the y-axis to consider them part of the same line. PDFs usually suffer from floating-point imprecision, so this value gives a little room for error. You shouldn't have to change this value unless you're dealing with PDFs generated with OCR or other odd PDFs. |
| 69 | +|`imageScale`|integer|`2`|Scaling applied to the PDF before extracting images. A higher value results in greater image resolution, but quadratically increases rendering times. |
| 70 | +|`extractImages`|boolean|`true`|Controls whether or not to extract images. Image extraction requires rendering of each page, which might take a long time depending on the size of the PDF, configured `imageScale` and underlying hardware. If you don't need to extract images, setting this option to `false` is recommended. |
| 71 | +|`ignoreEmptyText`|boolean|`true`|Controls whether or not to ignore empty text elements. Text elements are considered empty if their text content contains nothing but whitespace. |
| 72 | +|`joinParagraphs`|boolean|`false`|Controls whether or not to join paragraphs. Enabling this option will join each line that's not separated by a non-text element (paragraph break or image) which will effectively make each line contain a paragraph. Paragraph breaks will be omitted from the final output. This option only affects the `parse` method. |
| 73 | + |
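To make the two threshold options more concrete, here is a minimal sketch of how a vertical gap between two text elements could be classified. The function name `classifyGap` and the exact comparisons are assumptions made for illustration, based only on the descriptions in the table above; this is not the library's actual implementation.

```javascript
// Hypothetical helper (not part of simple-pdf): classifies the vertical gap
// between two text elements using the documented default thresholds.
function classifyGap(previousY, currentY, options = {}) {
  const { lineThreshold = 1, paragraphThreshold = 25 } = options;
  const gap = Math.abs(previousY - currentY);

  // Tiny differences are treated as floating-point noise on the same line.
  if (gap <= lineThreshold) return 'same line';
  // Large gaps are treated as a paragraph break.
  if (gap >= paragraphThreshold) return 'new paragraph';
  // Everything in between is an ordinary line break.
  return 'new line';
}

console.log(classifyGap(700.0, 700.4)); // same line
console.log(classifyGap(700, 688)); // new line
console.log(classifyGap(700, 660)); // new paragraph
```

Under these assumptions, raising `lineThreshold` merges slightly misaligned text (for example from OCR output) into one line, while lowering `paragraphThreshold` produces more paragraph breaks.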
| 74 | +### Basic parsing |
| 75 | + |
| 76 | +This is probably the easiest way to use this library. It parses all pages in parallel and returns the result when finished. Paragraphs and lines are automatically joined based on the options passed to the constructor. |
| 77 | + |
| 78 | +*Example:* |
| 79 | +```javascript |
| 80 | +const parser = new SimplePDFParser(fileBuffer) |
| 81 | + |
| 82 | +const result = await parser.parse() |
| 83 | +``` |
| 84 | + |
| 85 | +*Result:* |
| 86 | +```javascript |
| 87 | +[ |
| 88 | +  { |
| 89 | +    "type": "text", |
| 90 | +    "pageIndex": 0, |
| 91 | +    "items": [ |
| 92 | +      { |
| 93 | +        "text": "Lorem ipsum", |
| 94 | +        "font": "g_d0_f1" |
| 95 | +      } |
| 96 | +    ] |
| 97 | +  }, |
| 98 | +  { |
| 99 | +    "type": "image", |
| 100 | +    "pageIndex": 0, |
| 101 | +    "imageBuffer": Buffer |
| 102 | +  } |
| 103 | +] |
| 104 | +``` |
| 105 | + |
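As a rough sketch of how the result above might be consumed, the helpers below turn the element array into plain text and collect the image buffers. The names `toPlainText` and `collectImages` are hypothetical and not part of the library, the sample data only mirrors the shape shown above, and joining the items of an element without a separator is an assumption.

```javascript
// Sample data in the shape of the result shown above (values are made up).
const sampleResult = [
  { type: 'text', pageIndex: 0, items: [{ text: 'Lorem ipsum', font: 'g_d0_f1' }] },
  {
    type: 'text',
    pageIndex: 0,
    items: [
      { text: 'dolor ', font: 'g_d0_f1' },
      { text: 'sit amet', font: 'g_d0_f2' },
    ],
  },
  { type: 'image', pageIndex: 0, imageBuffer: Buffer.from([0x89, 0x50]) },
];

// Hypothetical helper: joins the items of each text element into one string
// and puts each text element on its own line.
function toPlainText(result) {
  return result
    .filter((element) => element.type === 'text')
    .map((element) => element.items.map((item) => item.text).join(''))
    .join('\n');
}

// Hypothetical helper: collects the image buffers, e.g. to write them to disk.
function collectImages(result) {
  return result
    .filter((element) => element.type === 'image')
    .map((element) => element.imageBuffer);
}

console.log(toPlainText(sampleResult));
console.log(collectImages(sampleResult).length);
```

In real use, `sampleResult` would be replaced by the value resolved from `parser.parse()`.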
| 106 | +### Advanced parsing |
| 107 | + |
| 108 | +If you need more granular control of the resulting data structure, you might want to use advanced parsing. You can either await the result or use the events to process each page as soon as it has been parsed. Note that pages are not guaranteed to be returned in order. |
| 109 | + |
| 110 | +*Example:* |
| 111 | +```javascript |
| 112 | +const parser = new SimplePDFParser(fileBuffer) |
| 113 | + |
| 114 | +// Called with each page |
| 115 | +parser.on('page', (page) => { |
| 116 | +  console.log(`Page ${page.index}:`); |
| 117 | +  console.log('Text elements: ', page.textElements); |
| 118 | +  console.log('Image elements:', page.imageElements); |
| 119 | +}); |
| 120 | + |
| 121 | +// Called when the parsing is finished |
| 122 | +parser.on('done', () => { |
| 123 | +  console.log('Parser done'); |
| 124 | +}); |
| 125 | + |
| 126 | +// This must be run even if you just use the events API, but then you may ignore the return value |
| 127 | +const result = await parser.parseRaw() |
| 128 | +``` |
| 129 | + |
| 130 | +*Result (each page):* |
| 131 | + |
| 132 | +```javascript |
| 133 | +{ |
| 134 | +  index: 0, // Page index |
| 135 | +  textElements: [{ |
| 136 | +    x: 123.456, |
| 137 | +    y: 654.321, |
| 138 | +    items: [{ |
| 139 | +      text: 'Lorem ipsum', |
| 140 | +      font: 'g_d0_f1' |
| 141 | +    }] |
| 142 | +  }], |
| 143 | +  imageElements: [{ |
| 144 | +    x: 4.2, |
| 145 | +    y: 83.11, |
| 146 | +    width: 120, |
| 147 | +    height: 80, |
| 148 | +    imageBuffer: Buffer |
| 149 | +  }] |
| 150 | +} |
| 151 | +``` |
| 152 | + |
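Because pages are not guaranteed to arrive in order, you may want to sort them by `index` before further processing. The following is a minimal sketch; `orderPages` is a hypothetical helper rather than part of the library, and the sample pages only mirror the page shape shown above.

```javascript
// Hypothetical helper: returns a new array of pages sorted by page index,
// leaving the input array untouched.
function orderPages(pages) {
  return [...pages].sort((a, b) => a.index - b.index);
}

// Pages as they might arrive from the 'page' event, out of order.
const receivedPages = [
  { index: 2, textElements: [], imageElements: [] },
  { index: 0, textElements: [], imageElements: [] },
  { index: 1, textElements: [], imageElements: [] },
];

console.log(orderPages(receivedPages).map((page) => page.index)); // [ 0, 1, 2 ]
```

With the events API, one way to use this would be to push each page into an array in the `page` handler and call `orderPages` on that array in the `done` handler.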
| 153 | +## Contributing |
| 154 | + |
| 155 | +Contributions and PRs are very welcome! PRs should go towards the `develop` branch. |
| 156 | + |
| 157 | +We use the [Airbnb style guide](https://github.com/airbnb/javascript). Please |
| 158 | +run ESLint before committing any changes: |
| 159 | +```bash |
| 160 | +npx eslint src |
| 161 | +npx eslint src --fix |
| 162 | +``` |
| 163 | + |
| 164 | +## License |
| 165 | + |
| 166 | +This project is licensed under the MIT license. |