docs/textpage.rst: 31 additions & 4 deletions
@@ -48,11 +48,21 @@ For a description of what this class is all about, see Appendix 2.
Textpage content as a list of text lines grouped by block. Each list item looks like this::
- (x0, y0, x1, y1, "lines in the block", block_no, block_type)
+ ``(x0, y0, x1, y1, "lines in the block", block_no, block_type)``
- The first four entries are the block's bbox coordinates, *block_type* is 1 for an image block, 0 for text. *block_no* is the block sequence number. Multiple text lines are joined via line breaks.
+ The first four entries are the block's bbox coordinates, *block_type* is 1 for an image block, 3 for a vector block, and 0 for text. *block_no* is the block sequence number. Multiple text lines are joined via line breaks.
- For an image block, its bbox and a text line with some image meta information is included -- **not the image content**.
+ For an **image block**, its bbox and a text line with some image meta information are included -- not the image **content**. Image blocks are included only if the extraction flag bit :data:`TEXT_PRESERVE_IMAGES` is set. An image block tuple will look like this::
+ For a **vector block**, the following item will be included. Vector blocks are included only if the extraction flag bit :data:`TEXT_COLLECT_VECTORS` is set. A vector block tuple will look like this::
+ The keyword "vector" is followed by either "stroked" or "filled". The color is given in HTML (hexadecimal RGB) format. Property ``is-rect`` is true if the vector is not a curve and is parallel to the x- or y-axis, so in essence it is either a true rectangle or a line segment. Property ``continues`` indicates whether the vector is part of a path (and not the first item).
+
+ .. note:: When no further details are needed (as provided by :meth:`Page.get_drawings`), then this extraction method is an inexpensive way to extract vector graphics. Another advantage is that all block types (text, images and vectors) are included in the output in the same order as they are present in the page's :data:`contents` stream.
This is a high-speed method with just enough information to output plain text in desired reading sequence.
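As a minimal sketch of how such block tuples might be consumed, the following groups them by their *block_type* value (0 = text, 1 = image, 3 = vector, as described above). The helper name ``group_blocks`` and the sample tuples are hypothetical; in particular, the strings used for the image and vector entries are placeholders, not the library's real output.

```python
from collections import defaultdict

# block_type values as described in the text above
BLOCK_KINDS = {0: "text", 1: "image", 3: "vector"}


def group_blocks(blocks):
    """Group block tuples (x0, y0, x1, y1, text, block_no, block_type) by kind."""
    grouped = defaultdict(list)
    for x0, y0, x1, y1, text, block_no, block_type in blocks:
        kind = BLOCK_KINDS.get(block_type, "unknown")
        grouped[kind].append((block_no, (x0, y0, x1, y1), text))
    return dict(grouped)


# Hypothetical sample data; the image and vector strings are placeholders.
sample = [
    (72.0, 72.0, 300.0, 100.0, "first line\nsecond line", 0, 0),
    (72.0, 120.0, 200.0, 220.0, "<image meta information>", 1, 1),
    (72.0, 240.0, 300.0, 241.0, "<vector meta information>", 2, 3),
]

grouped = group_blocks(sample)
for kind, items in grouped.items():
    print(kind, [block_no for block_no, _, _ in items])
```

With a real page, the input would be the list returned by the block extraction described above, created with the relevant flag bits set so that image and vector blocks are present.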
@@ -200,7 +210,24 @@ blocks *list* of block dictionaries
Block Dictionaries
~~~~~~~~~~~~~~~~~~
- Block dictionaries come in two different formats for **image blocks** and for **text blocks**.
+ Block dictionaries come in different formats for **vector blocks**, **image blocks** and **text blocks**. Vector blocks are included only if the extraction flag bit :data:`TEXT_COLLECT_VECTORS` is set. Image blocks are included only if the extraction flag bit :data:`TEXT_PRESERVE_IMAGES` is set.
+ This information is a proper subset of the output of :meth:`Page.get_drawings`. Its advantages are speed (it is extracted with a single :ref:`TextPage` creation) and the fact that vector blocks are included in the overall page content sequence together with text and images.
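Because all three block kinds appear in one list, a consumer typically needs to tell them apart first. The sketch below counts blocks per kind. It assumes (this is not shown in the text above) that each block dictionary carries an integer ``"type"`` key using the same codes as *block_type* in the tuple output; the helper name ``count_block_types`` and the sample dictionary are hypothetical.

```python
from collections import Counter

# Assumption: each block dictionary has an integer "type" key using the same
# codes as block_type in the tuple output (0 = text, 1 = image, 3 = vector).
TYPE_NAMES = {0: "text", 1: "image", 3: "vector"}


def count_block_types(page_dict):
    """Count the blocks of each kind in a dictionary holding a "blocks" list."""
    return Counter(
        TYPE_NAMES.get(block.get("type"), "unknown")
        for block in page_dict.get("blocks", [])
    )


# Hypothetical, heavily abbreviated stand-in for real extraction output.
page_dict = {"blocks": [{"type": 0}, {"type": 3}, {"type": 3}, {"type": 1}]}
counts = count_block_types(page_dict)
print(counts["text"], counts["image"], counts["vector"])
```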