docs/user/extract-text.md
+7 -7 (7 additions & 7 deletions)
@@ -72,7 +72,7 @@ operator, operand-arguments, current transformation matrix and text matrix.

### Example 1: Ignore header and footer

-The following example reads the text of page four of [this PDF document](https://github.com/py-pdf/pypdf/blob/main/resources/GeoBase_NHNC1_Data_Model_UML_EN.pdf), but ignores the header (y < 720) and footer (y > 50).
+The following example reads the text of page four of [this PDF document](https://github.com/py-pdf/pypdf/blob/main/resources/GeoBase_NHNC1_Data_Model_UML_EN.pdf), but ignores the header (y > 720) and footer (y < 50).

```python
from pypdf import PdfReader
@@ -171,17 +171,17 @@ Then there are issues where most people would agree on the correct output, but
the way PDF stores information just makes it hard to achieve that:

1. **Tables**: Typically, tables are just absolutely positioned text. In the worst
-case, ever single letter could be absolutely positioned. That makes it hard
+case, every single letter could be absolutely positioned. That makes it hard
to tell where columns / rows are.
-2. **Images**: Sometimes PDFs do not contain the text as it's displayed, but
+2. **Images**: Sometimes PDFs do not contain the text as it is displayed, but
instead an image. You notice that when you cannot copy the text. Then there
are PDF files that contain an image and a text layer in the background.
That typically happens when a document was scanned. Although the scanning
software (OCR) is pretty good today, it still fails once in a while. pypdf
is no OCR software; it will not be able to detect those failures. pypdf
will also never be able to extract text from images.
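
The image-only case from item 2 can be flagged with a rough heuristic. The sketch below assumes a page object that exposes `extract_text()` and an `images` sequence, as pypdf's page objects do. The helper name `looks_image_only` is hypothetical and not part of pypdf.

```python
def looks_image_only(page) -> bool:
    """Guess whether a page shows its text only as an image.

    Heuristic, not a guarantee: a page with no extractable text but at
    least one embedded image probably needs separate OCR software.
    """
    text = page.extract_text() or ""
    return not text.strip() and len(page.images) > 0


# Usage with pypdf (the file name is a placeholder):
# from pypdf import PdfReader
# reader = PdfReader("example.pdf")
# needs_ocr = [i for i, p in enumerate(reader.pages) if looks_image_only(p)]
```

This check does not catch a scanned page whose OCR text layer is faulty, because such a page still contains extractable text.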

-And finally there are issues that pypdf will deal with. If you find such a
+Finally there are issues that pypdf will deal with. If you find such a
text extraction bug, please share the PDF with us so we can work on it!

### Missing Semantic Layer
@@ -196,7 +196,7 @@ find heuristics to make educated guesses, but there is no way of being certain.

This is a shortcoming of the PDF file format, not of pypdf.

-It would be possible to apply machine learning on PDF documents to make good
+It is possible to apply machine learning on PDF documents to make good
heuristics, but that will not be part of pypdf. However, pypdf could be used to
feed such a machine learning system with the relevant information.
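
As a rough illustration of feeding such a system, the sketch below collects one record per text fragment. It assumes the `visitor_text` callback of `extract_text()` is called with the text, the current transformation matrix, the text matrix, the font dictionary and the font size. `make_collector` and the record layout are hypothetical, not part of pypdf.

```python
def make_collector(records: list):
    """Return a visitor that appends one feature record per text fragment."""

    def visitor(text, cm, tm, font_dict, font_size):
        if not text.strip():
            return  # skip whitespace-only fragments
        records.append(
            {
                "text": text,
                # Translation components of both matrices; which one carries
                # the useful position depends on how the PDF was produced.
                "cm_xy": (cm[4], cm[5]),
                "tm_xy": (tm[4], tm[5]),
                "font": font_dict.get("/BaseFont") if font_dict else None,
                "font_size": font_size,
            }
        )

    return visitor


# Usage with pypdf (the file name is a placeholder):
# from pypdf import PdfReader
# records = []
# page = PdfReader("example.pdf").pages[0]
# page.extract_text(visitor_text=make_collector(records))
```

Position and font size are the kind of signal a downstream model could use to guess headings or columns. Any such guess remains a heuristic.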
@@ -229,7 +229,7 @@ More information:
Optical Character Recognition (OCR) is the process of extracting text from
images. Software which does this is called *OCR software*. The
[tesseract OCR engine](https://github.com/tesseract-ocr/tesseract) is the
-most commonly known Open Source OCR software.
+most commonly known open source OCR software.

pypdf is **not** OCR software.

@@ -279,7 +279,7 @@ pypdf also has an edge when it comes to characters which are rare, e.g.

## Attempts to prevent text extraction

-If people who share PDF documents want to prevent text extraction, there are
+If people who share PDF documents want to prevent text extraction, they have