
Commit 3db6d91

Simplified the text extraction API
Author: Andrey
1 parent f1ed753, commit 3db6d91


2 files changed: +49 -26 lines changed


files/webviewer.pdf.txt

Lines changed: 42 additions & 0 deletions
@@ -0,0 +1,42 @@
+A lower-quality library also encounters
+performance and memory issues, such as large
+documents with frustratingly long wait times for
+your users as well as complex documents that
+crash the viewer. This is often due to the absence
+of features such as PDF tiling, parallelization,
+and linearization that a more mature PDF SDK
+will incorporate.
+Some solutions (e.g., image servers) perform
+excellently when tested on a small number of
+documents and users but then inflict unexpected
+hidden costs when scaled up. When hundreds
+or thousands of users later view, mark up, comment
+on, and otherwise interact with (i.e.,scroll,
+pan, and zoom) documents, server resource and
+network data usage explodes. To maintain your
+desired UX, you have to pay higher fees or invest
+in more servers.
+The following types of documents have much
+more demanding rendering requirements:
+• CAD-based PDFs such as construction and
+engineering drawings with very large and
+complex designs.
+• Reports, textbooks, and marketing material
+using advanced PDF graphics such as shadings,
+gradients, soft masks, and patterns.
+• Geospatial maps with OCG layers that are
+switched off by default.
+• Pre-press documents which require an SDK
+with advanced color management features to
+print colors accurately.
+• High-speed accurate rendering (especially on
+native mobile apps and mobile browsers).
+• Context extraction of tables, text, etc. with
+document structure (e.g., text read order or
+table arrangement) in tact.
+To prevent crashes, slowness, and rendering
+issues from disrupting your UX, test functionality
+with the types of documents your users will work
+with. Also test a server-based solution at the
+anticipated load and usage.
+6

index.js

Lines changed: 7 additions & 26 deletions
@@ -128,9 +128,8 @@ app.get('/convert/:filename', (req, res) => {
   PDFNetEndpoint(main, outputPath, res);
 });

-app.get('/textextract/:filename-:outext-:pagenumber', (req, res) => {
+app.get('/textextract/:filename-:pagenumber', (req, res) => {
   const filename = req.params.filename;
-  let outputExt = req.params.outext;
   let pageNumber = Number(req.params.pagenumber);
   let ext = path.parse(filename).ext;

@@ -139,16 +138,8 @@ app.get('/textextract/:filename-:outext-:pagenumber', (req, res) => {
     res.end(`File is not a PDF. Please convert it first.`);
   }

-  if (!outputExt) {
-    outputExt = 'txt';
-  }
-
   const inputPath = path.resolve(__dirname, filesPath, filename);
-  const outputPath = path.resolve(
-    __dirname,
-    filesPath,
-    `${filename}.${outputExt}`,
-  );
+  const outputPath = path.resolve(__dirname, filesPath, `${filename}.txt`);

   const main = async () => {
     await PDFNet.initialize();
@@ -167,21 +158,11 @@ app.get('/textextract/:filename-:outext-:pagenumber', (req, res) => {
       const rect = new PDFNet.Rect(0, 0, 612, 794);
       txt.begin(page, rect);
       let text;
-      if (outputExt === 'xml') {
-        text = await txt.getAsXML(
-          PDFNet.TextExtractor.XMLOutputFlags.e_words_as_elements |
-            PDFNet.TextExtractor.XMLOutputFlags.e_output_bbox |
-            PDFNet.TextExtractor.XMLOutputFlags.e_output_style_info,
-        );
-        fs.writeFile(outputPath, text, (err) => {
-          if (err) return console.log(err);
-        });
-      } else {
-        text = await txt.getAsText();
-        fs.writeFile(outputPath, text, (err) => {
-          if (err) return console.log(err);
-        });
-      }
+
+      text = await txt.getAsText();
+      fs.writeFile(outputPath, text, (err) => {
+        if (err) return console.log(err);
+      });
       await PDFNet.endDeallocateStack();
     } catch (err) {
       console.log(err);
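
Note for callers: the route drops the `:outext` segment, so requests now take the form `/textextract/<filename>-<pagenumber>` and the output is always written as `<filename>.txt`. The snippet below is a minimal sketch of how a client might build that path. `textExtractPath` is a hypothetical helper that is not part of this commit, and the check that page numbers start at 1 is an assumption, since the diff does not show how `pageNumber` is used.

```javascript
// Hypothetical helper (not part of the commit): builds the request path for
// the simplified route '/textextract/:filename-:pagenumber'.
function textExtractPath(filename, pageNumber) {
  // Assumption: pages are numbered from 1; the diff does not confirm this.
  if (!Number.isInteger(pageNumber) || pageNumber < 1) {
    throw new RangeError('pageNumber must be a positive integer');
  }
  // encodeURIComponent escapes '/' and other reserved characters so the
  // filename stays inside a single path segment.
  return `/textextract/${encodeURIComponent(filename)}-${pageNumber}`;
}

console.log(textExtractPath('webviewer.pdf', 1)); // prints "/textextract/webviewer.pdf-1"
```

Because the route uses `-` as the separator between the two parameters, a filename that itself contains a hyphen may not be split the way a caller expects. That case is worth testing against the running server before relying on it.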
