Commit a435eaa

DOC: Small improvements and corrections (#2631)
1 parent a584fb5 commit a435eaa

3 files changed: +15 -15 lines changed

docs/user/extract-text.md

Lines changed: 7 additions & 7 deletions
@@ -72,7 +72,7 @@ operator, operand-arguments, current transformation matrix and text matrix.

 ### Example 1: Ignore header and footer

-The following example reads the text of page four of [this PDF document](https://github.com/py-pdf/pypdf/blob/main/resources/GeoBase_NHNC1_Data_Model_UML_EN.pdf), but ignores the header (y < 720) and footer (y > 50).
+The following example reads the text of page four of [this PDF document](https://github.com/py-pdf/pypdf/blob/main/resources/GeoBase_NHNC1_Data_Model_UML_EN.pdf), but ignores the header (y > 720) and footer (y < 50).

 ```python
 from pypdf import PdfReader
@@ -171,17 +171,17 @@ Then there are issues where most people would agree on the correct output, but
 the way PDF stores information just makes it hard to achieve that:

 1. **Tables**: Typically, tables are just absolutely positioned text. In the worst
-case, ever single letter could be absolutely positioned. That makes it hard
+case, every single letter could be absolutely positioned. That makes it hard
 to tell where columns / rows are.
-2. **Images**: Sometimes PDFs do not contain the text as it's displayed, but
+2. **Images**: Sometimes PDFs do not contain the text as it is displayed, but
 instead an image. You notice that when you cannot copy the text. Then there
 are PDF files that contain an image and a text layer in the background.
 That typically happens when a document was scanned. Although the scanning
 software (OCR) is pretty good today, it still fails once in a while. pypdf
 is no OCR software; it will not be able to detect those failures. pypdf
 will also never be able to extract text from images.

-And finally there are issues that pypdf will deal with. If you find such a
+Finally there are issues that pypdf will deal with. If you find such a
 text extraction bug, please share the PDF with us so we can work on it!

 ### Missing Semantic Layer
@@ -196,7 +196,7 @@ find heuristics to make educated guesses, but there is no way of being certain.

 This is a shortcoming of the PDF file format, not of pypdf.

-It would be possible to apply machine learning on PDF documents to make good
+It is possible to apply machine learning on PDF documents to make good
 heuristics, but that will not be part of pypdf. However, pypdf could be used to
 feed such a machine learning system with the relevant information.

@@ -229,7 +229,7 @@ More information:
 Optical Character Recognition (OCR) is the process of extracting text from
 images. Software which does this is called *OCR software*. The
 [tesseract OCR engine](https://github.com/tesseract-ocr/tesseract) is the
-most commonly known Open Source OCR software.
+most commonly known open source OCR software.

 pypdf is **not** OCR software.

@@ -279,7 +279,7 @@ pypdf also has an edge when it comes to characters which are rare, e.g.

 ## Attempts to prevent text extraction

-If people who share PDF documents want to prevent text extraction, there are
+If people who share PDF documents want to prevent text extraction, they have
 multiple ways to do so:

 1. Store the contents of the PDF as an image

pypdf/_doc_common.py

Lines changed: 5 additions & 5 deletions
@@ -308,7 +308,7 @@ def _repr_mimebundle_(
 """
 Integration into Jupyter Notebooks.

-This method returns a dictionary that maps a mime-type to it's
+This method returns a dictionary that maps a mime-type to its
 representation.

 See https://ipython.readthedocs.io/en/stable/config/integrating.html
@@ -848,9 +848,9 @@ def threads(self) -> Optional[ArrayObject]:
 """
 Read-only property for the list of threads.

-See §8.3.2 from PDF 1.7 spec.
+See §12.4.3 from the PDF 1.7 or 2.0 specification.

-It's an array of dictionaries with "/F" and "/I" properties or
+It is an array of dictionaries with "/F" and "/I" properties or
 None if there are no articles.
 """
 catalog = self.root_object
@@ -1005,9 +1005,9 @@ def pages(self) -> List[PageObject]:

 For PdfWriter Only:
 It provides also capability to remove a page/range of page from the list
-(through del operator)
+(using the del operator)
 Note: only the page entry is removed. As the objects beneath can be used
-somewhere else.
+elsewhere.
 A solution to completely remove them - if they are not used anywhere -
 is to write to a buffer/temporary file and to load it into a new PdfWriter
 object afterwards.

pypdf/_page.py

Lines changed: 3 additions & 3 deletions
@@ -1996,11 +1996,11 @@ def extract_text(
 will change if this function is made more sophisticated.

 Arabic and Hebrew are extracted in the correct order.
-If required an custom RTL range of characters can be defined;
+If required a custom RTL range of characters can be defined;
 see function set_custom_rtl

-Additionally you can provide visitor-methods to get informed on all
-operations and all text-objects.
+Additionally you can provide visitor methods to get informed on all
+operations and all text objects.
 For example in some PDF files this can be useful to parse tables.

 Args:
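
Note on the corrected bounds in docs/user/extract-text.md: with the header at y > 720 and the footer at y < 50, a text visitor keeps only what lies between the two. The following is a minimal sketch, not the example from the documentation. The helper name `make_body_visitor` and the synthetic calls are hypothetical, and reading the vertical position from the sixth entry of the text matrix is an assumption; depending on the PDF it may have to be taken from the current transformation matrix instead.

```python
from typing import Any, Callable, List, Sequence


def make_body_visitor(
    parts: List[str], footer_y: float = 50.0, header_y: float = 720.0
) -> Callable[[str, Sequence[float], Sequence[float], Any, Any], None]:
    """Return a visitor that keeps only text between footer and header."""

    def visitor_body(text, cm, tm, font_dict, font_size):
        # Assumption: the sixth entry of the text matrix is the y position.
        y = tm[5]
        # Header is y > header_y, footer is y < footer_y; keep the rest.
        if footer_y < y < header_y:
            parts.append(text)

    return visitor_body


# Synthetic calls standing in for what a page would report (hypothetical data).
identity = [1.0, 0.0, 0.0, 1.0, 0.0, 0.0]
parts: List[str] = []
visitor = make_body_visitor(parts)
visitor("Page header\n", identity, [1.0, 0.0, 0.0, 1.0, 72.0, 750.0], None, 10.0)
visitor("Body text\n", identity, [1.0, 0.0, 0.0, 1.0, 72.0, 400.0], None, 10.0)
visitor("Page 4\n", identity, [1.0, 0.0, 0.0, 1.0, 72.0, 30.0], None, 10.0)
body_text = "".join(parts)
```

With pypdf installed, such a visitor would be passed as `page.extract_text(visitor_text=visitor)`; the thresholds 50 and 720 come from the documentation text changed in this commit and only suit pages with that layout.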
