|
5 | 5 |  |
6 | 6 |  |
7 | 7 |
|
8 | | -- Parse text, table and layout from PDF file with `PyMuPDF` |
| 8 | +- Parse layout (text, image and table) from PDF file with `PyMuPDF` |
9 | 9 | - Generate docx with `python-docx` |
10 | 10 |
|
11 | 11 | ## Features |
12 | 12 |
|
13 | 13 | - [x] Parse and re-create paragraph |
14 | | - - [x] text in horizontal direction: from left to right |
15 | | - - [x] text in vertical direction: from bottom to top |
| 14 | + - [x] text in horizontal/vertical direction: from left to right, from bottom to top |
16 | 15 | - [x] font style, e.g. font name, size, weight, italic and color |
17 | 16 | - [x] text format, e.g. highlight, underline, strike-through |
18 | | - - [x] text alignment, e.g. left/right/center/justify |
19 | | - - [ ] list style |
| 17 | + - [x] text alignment, e.g. left/right/center/justify |
20 | 18 | - [x] paragraph layout: horizontal alignment and vertical spacing |
| 19 | + - [ ] list style |
| 20 | + - [ ] href link |
21 | 21 |
|
22 | 22 | - [x] Parse and re-create image |
23 | 23 | - [x] in-line image |
24 | 24 | - [x] image in Gray/RGB/CMYK mode |
25 | 25 | - [x] transparent image |
| 26 | + - [x] floating image, i.e. picture behind text |
26 | 27 |
|
27 | 28 | - [x] Parse and re-create table |
28 | 29 | - [x] border style, e.g. width, color |
|
41 | 42 | - Normal reading direction only |
42 | 43 | - horizontal/vertical paragraph/line/word |
43 | 44 | - no word transformation, e.g. rotation |
44 | | -- No floating images |
45 | 45 |
|
46 | 46 |
|
47 | 47 | ## Installation |
@@ -74,80 +74,110 @@ $ pip uninstall pdf2docx |
74 | 74 |
|
75 | 75 | ## Usage |
76 | 76 |
|
| 77 | +`pdf2docx` can be used as either a CLI or a library. |
| 78 | + |
| 79 | +### Command Line Interface |
| 80 | + |
77 | 81 | ``` |
78 | 82 | $ pdf2docx --help |
79 | 83 |
|
80 | 84 | NAME |
81 | | - pdf2docx - Run the pdf2docx parser. |
| 85 | + pdf2docx - Command line interface for pdf2docx. |
82 | 86 |
|
83 | 87 | SYNOPSIS |
84 | | - pdf2docx PDF_FILE <flags> |
| 88 | + pdf2docx COMMAND | - |
85 | 89 |
|
86 | 90 | DESCRIPTION |
87 | | - Run the pdf2docx parser. |
| 91 | + Command line interface for pdf2docx. |
88 | 92 |
|
89 | | -POSITIONAL ARGUMENTS |
90 | | - PDF_FILE |
91 | | - PDF filename to read from |
| 93 | +COMMANDS |
| 94 | + COMMAND is one of the following: |
92 | 95 |
|
93 | | -FLAGS |
94 | | - --docx_file=DOCX_FILE |
95 | | - DOCX filename to write to |
96 | | - --start=START |
97 | | - first page to process, starting from zero |
98 | | - --end=END |
99 | | - last page to process, starting from zero |
100 | | - --pages=PAGES |
101 | | - range of pages |
102 | | - --multi_processing=MULTI_PROCESSING |
| 96 | + convert |
| 97 | + Convert pdf file to docx file. |
103 | 98 |
|
104 | | -NOTES |
105 | | - You can also use flags syntax for POSITIONAL ARGUMENTS |
| 99 | + debug |
| 100 | + Convert one PDF page and plot layout information for debugging. |
| 101 | +
|
| 102 | + table |
| 103 | + Extract table content from pdf pages. |
106 | 104 | ``` |
107 | 105 |
|
108 | | -### By range of pages |
| 106 | +- By range of pages |
109 | 107 |
|
110 | | -``` |
111 | | -$ pdf2docx test.pdf test.docx --start=5 --end=10 |
112 | | -``` |
| 108 | +Specify the page range with `--start` (defaults to the first page if omitted) and `--end` (defaults to the last page if omitted). Page indexes are zero-based by default; pass `--zero_based_index=False` to make them one-based, i.e. the first page has index 1. As the examples below show, the page at `--end` itself is not included. |
113 | 109 |
|
114 | | -### By page numbers |
115 | 110 |
|
116 | | -``` |
117 | | -$ pdf2docx test.pdf test.docx --pages=5,7,9 |
| 111 | +```bash |
| 112 | +$ pdf2docx convert test.pdf test.docx # all pages |
| 113 | + |
| 114 | +$ pdf2docx convert test.pdf test.docx --start=1 # from the second page to the end |
| 115 | + |
| 116 | +$ pdf2docx convert test.pdf test.docx --end=3 # from the first page to the third (index=2) |
| 117 | + |
| 118 | +$ pdf2docx convert test.pdf test.docx --start=1 --end=3 # the second and third pages |
| 119 | + |
| 120 | +$ pdf2docx convert test.pdf test.docx --start=1 --end=3 --zero_based_index=False # the first and second pages |
| 121 | + |
118 | 122 | ``` |
119 | 123 |
|
120 | | -### Multi-Processing |
| 124 | +- By page numbers |
121 | 125 |
|
| 126 | +```bash |
| 127 | +$ pdf2docx convert test.pdf test.docx --pages=0,2,4 # the first, third and fifth pages |
122 | 128 | ``` |
123 | | -$ pdf2docx test.pdf --multi_processing=True |
| 129 | + |
| 130 | +- Multi-Processing |
| 131 | + |
| 132 | +```bash |
| 133 | +$ pdf2docx convert test.pdf test.docx --multi_processing=True # default CPU count |
| 134 | + |
| 135 | +$ pdf2docx convert test.pdf test.docx --multi_processing=True --cpu_count=4 |
124 | 136 | ``` |
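The page-selection flags above can be summarised in a few lines of Python. This is only a sketch of the semantics implied by the commented examples, not the library's actual implementation; `select_pages` and `page_count` are hypothetical names introduced for illustration.

```python
from typing import List, Optional


def select_pages(page_count: int,
                 start: int = 0,
                 end: Optional[int] = None,
                 pages: Optional[List[int]] = None,
                 zero_based_index: bool = True) -> List[int]:
    """Return the zero-based indexes of the pages that would be converted.

    Hypothetical helper: it mirrors the behaviour shown in the CLI examples
    (`--start`, `--end`, `--pages`, `--zero_based_index`), nothing more.
    """
    # One-based input is shifted down by one to get zero-based indexes.
    offset = 0 if zero_based_index else 1

    if pages:
        # An explicit page list takes the place of the start/end range.
        indexes = [p - offset for p in pages]
    else:
        first = max(start - offset, 0)
        # `end` is exclusive; when omitted, run through the last page.
        last = page_count if end is None else end - offset
        indexes = list(range(first, last))

    # Drop anything that falls outside the document.
    return [i for i in indexes if 0 <= i < page_count]


if __name__ == "__main__":
    # Same arguments as the CLI examples, for a five-page document.
    print(select_pages(5, start=1, end=3))
    print(select_pages(5, start=1, end=3, zero_based_index=False))
    print(select_pages(5, pages=[0, 2, 4]))
```

Treating `end` as exclusive matches the comments in the examples: `--start=1 --end=3` selects the second and third pages, and the same flags with `--zero_based_index=False` select the first and second.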
125 | 137 |
|
126 | 138 |
|
127 | | -### As a library |
| 139 | +### Python Library |
| 140 | + |
| 141 | +We can use either the `Converter` class or a wrapped method `parse()`. |
| 142 | + |
| 143 | +- `Converter` |
128 | 144 |
|
129 | 145 | ```python |
130 | | -''' With this library installed with |
131 | | - `pip install pdf2docx`, or `python setup.py install`. |
132 | | -''' |
| 146 | +from pdf2docx import Converter |
| 147 | + |
| 148 | +pdf_file = '/path/to/sample.pdf' |
| 149 | +docx_file = '/path/to/sample.docx' |
133 | 150 |
|
| 151 | +# convert pdf to docx |
| 152 | +cv = Converter(pdf_file) |
| 153 | +cv.convert(docx_file, start=0, end=None) |
| 154 | +cv.close() |
| 155 | +``` |
| 156 | + |
| 157 | + |
| 158 | +- Wrapped method `parse()` |
| 159 | + |
| 160 | +```python |
134 | 161 | from pdf2docx import parse |
135 | 162 |
|
136 | 163 | pdf_file = '/path/to/sample.pdf' |
137 | 164 | docx_file = '/path/to/sample.docx' |
138 | 165 |
|
139 | 166 | # convert pdf to docx |
140 | | -parse(pdf_file, docx_file, start=0, end=1) |
| 167 | +parse(pdf_file, docx_file, start=0, end=None) |
141 | 168 | ``` |
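The `Converter` example above calls `close()` explicitly, so an exception raised inside `convert()` would skip it. One way to guarantee the call is `contextlib.closing` from the standard library, sketched below. `StubConverter` is a hypothetical stand-in that only imitates the `convert()`/`close()` shape shown in this README so the snippet runs without `pdf2docx` installed; `convert_and_close` is likewise an illustrative name, not part of the library.

```python
from contextlib import closing


class StubConverter:
    """Hypothetical stand-in with the same convert()/close() shape as `Converter`."""

    def __init__(self, pdf_file):
        self.pdf_file = pdf_file
        self.calls = []      # records each convert() call
        self.closed = False  # set to True by close()

    def convert(self, docx_file, start=0, end=None):
        self.calls.append((docx_file, start, end))

    def close(self):
        self.closed = True


def convert_and_close(converter_cls, pdf_file, docx_file, start=0, end=None):
    """Run a conversion and always call close(), even if convert() raises."""
    with closing(converter_cls(pdf_file)) as cv:
        cv.convert(docx_file, start=start, end=end)
    return cv


if __name__ == "__main__":
    cv = convert_and_close(StubConverter, '/path/to/sample.pdf', '/path/to/sample.docx')
    print(cv.closed)
```

Assuming the real `Converter` behaves as the example above shows, passing it in place of `StubConverter` should work the same way.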
142 | 169 |
|
143 | 170 | Or, to extract only the tables: |
144 | 171 |
|
145 | 172 | ```python |
146 | | -from pdf2docx import extract_tables |
| 173 | +from pdf2docx import Converter |
147 | 174 |
|
148 | 175 | pdf_file = '/path/to/sample.pdf' |
149 | 176 |
|
150 | | -tables = extract_tables(pdf_file, start=0, end=1) |
| 177 | +cv = Converter(pdf_file) |
| 178 | +tables = cv.extract_tables(start=0, end=1) |
| 179 | +cv.close() |
| 180 | + |
151 | 181 | for table in tables: |
152 | 182 | print(table) |
153 | 183 |
|
|