PDF extractor using PDFBox.
The jar file is available on the releases page.
Extracts text, draw operations, and images from PDF files.
java -classpath pdfextract.jar PDFExtractor <file or directory> <options...>
- -text: extracts texts
  - -bounding: extracts bounding coordinates
  - -glyph: extracts glyph coordinates
  - -fontName: extracts fontName
- -draw: extracts draws
- -image: extracts images
For example,
java -classpath pdfextract.jar PDFExtractor xxx.pdf -text -bounding
extracts only texts with bounding coordinates from xxx.pdf.
In the figure, the blue square indicates the bounding coordinates and the red square indicates the glyph coordinates.
Each output line is a "TEXT", "DRAW", or "IMAGE" record, or is empty.
TEXT line fields, in order:
- Page number
- "TEXT"
- Character
- [Optional] bounding x coordinate
- [Optional] bounding y coordinate
- [Optional] bounding width
- [Optional] bounding height
- [Optional] glyph x coordinate
- [Optional] glyph y coordinate
- [Optional] glyph width
- [Optional] glyph height
- [Optional] Font name

DRAW line fields, in order:
- Page number
- "DRAW"
- Operation ("LINE_TO", "CURVE_TO", etc.)

IMAGE line fields, in order:
- Page number
- "IMAGE"
- x coordinate
- y coordinate
- width
- height

Example output (TEXT lines with bounding, glyph, and font name fields):
1 TEXT P 106.4301 754.63226 5.478471 10.705882 106.4301 757.06213 5.424672 5.8550596 LMQTGC+NimbusRomNo9L-ReguItal
1 TEXT r 111.90857 754.63226 3.4879298 10.705882 112.31206 758.963 3.290669 3.9541826 LMQTGC+NimbusRomNo9L-ReguItal
1 TEXT o 114.99301 754.63226 4.4832 10.705882 115.23511 758.963 3.9541826 4.052813 LMQTGC+NimbusRomNo9L-ReguItal
1 TEXT c 119.47621 754.63226 3.981082 10.705882 119.7452 758.963 3.5417283 4.052813 LMQTGC+NimbusRomNo9L-ReguItal
1 TEXT e 123.45729 754.63226 3.981082 10.705882 123.73525 758.963 3.4161987 4.052813 LMQTGC+NimbusRomNo9L-ReguItal
1 TEXT e 127.43837 754.63226 3.981082 10.705882 127.71633 758.963 3.4161987 4.052813 LMQTGC+NimbusRomNo9L-ReguItal
1 TEXT d 131.41945 754.63226 4.4832 10.705882 131.55394 756.79315 4.590797 6.240615 LMQTGC+NimbusRomNo9L-ReguItal
1 TEXT i 135.90265 754.63226 2.4926593 10.705882 136.342 757.05316 1.9277761 5.9626565 LMQTGC+NimbusRomNo9L-ReguItal
1 TEXT n 138.39531 754.63226 4.4832 10.705882 138.52084 758.963 4.124544 4.03488 LMQTGC+NimbusRomNo9L-ReguItal
1 TEXT g 142.87851 754.63226 4.4832 10.705882 142.95024 758.963 4.16041 5.801261 LMQTGC+NimbusRomNo9L-ReguItal
1 TEXT s 147.36171 754.63226 3.4879298 10.705882 147.50517 758.95404 3.13824 4.0797124 LMQTGC+NimbusRomNo9L-ReguItal
1 TEXT o 153.09125 754.63226 4.4832 10.705882 153.33334 758.963 3.9541826 4.052813 LMQTGC+NimbusRomNo9L-ReguItal
1 TEXT f 157.57445 754.63226 2.4926593 10.705882 156.2564 756.83795 5.119815 7.9352646 LMQTGC+NimbusRomNo9L-ReguItal
1 TEXT t 162.30872 754.63226 2.4926593 10.705882 162.64047 758.02155 2.3222978 4.994285 LMQTGC+NimbusRomNo9L-ReguItal
1 TEXT h 164.80138 754.63226 4.4832 10.705882 164.97174 756.79315 4.1155777 6.204749 LMQTGC+NimbusRomNo9L-ReguItal
1 TEXT e 169.28458 754.63226 3.981082 10.705882 169.56253 758.963 3.4161987 4.052813 LMQTGC+NimbusRomNo9L-ReguItal
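
The TEXT lines above can be read back into a small data structure. The following is a minimal sketch, not part of pdfextract itself: the class `PdfExtractLines`, the type `TextRecord`, and the method `parseText` are hypothetical names. It assumes that fields are separated by whitespace, that the character is a single non-whitespace token, and that all optional fields (bounding, glyph, font name) are present, as in the sample output.

```java
public class PdfExtractLines {

    /** One TEXT record with every optional field present (hypothetical type). */
    public static final class TextRecord {
        public final int page;
        public final String character;
        public final double[] bounding; // x, y, width, height
        public final double[] glyph;    // x, y, width, height
        public final String fontName;

        TextRecord(int page, String character, double[] bounding,
                   double[] glyph, String fontName) {
            this.page = page;
            this.character = character;
            this.bounding = bounding;
            this.glyph = glyph;
            this.fontName = fontName;
        }
    }

    /**
     * Parses one output line. Returns null for empty lines, for DRAW and
     * IMAGE lines, and for TEXT lines that do not carry all 12 fields.
     */
    public static TextRecord parseText(String line) {
        // Assumption: whitespace-separated fields, as in the sample output.
        String[] f = line.trim().split("\\s+");
        if (f.length != 12 || !f[1].equals("TEXT")) {
            return null;
        }
        double[] bounding = new double[4];
        double[] glyph = new double[4];
        for (int i = 0; i < 4; i++) {
            bounding[i] = Double.parseDouble(f[3 + i]); // fields 4 to 7
            glyph[i] = Double.parseDouble(f[7 + i]);    // fields 8 to 11
        }
        return new TextRecord(Integer.parseInt(f[0]), f[2], bounding, glyph, f[11]);
    }

    public static void main(String[] args) {
        String line = "1 TEXT P 106.4301 754.63226 5.478471 10.705882 "
                + "106.4301 757.06213 5.424672 5.8550596 LMQTGC+NimbusRomNo9L-ReguItal";
        TextRecord r = parseText(line);
        System.out.println(r.character + " on page " + r.page + ", font " + r.fontName);
    }
}
```

If only some options are passed (for example `-text -bounding`), the lines contain fewer fields and the length check would need to be adjusted accordingly.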
