Commit 0a3916e

refactor Converter #61 #64
2 parents 4ac966f + adb5997 commit 0a3916e

33 files changed

+522358
-522261
lines changed

README.md

Lines changed: 70 additions & 40 deletions
@@ -5,24 +5,25 @@
 ![pdf2docx-publish](https://github.com/dothinking/pdf2docx/workflows/pdf2docx-publish/badge.svg)
 ![GitHub](https://img.shields.io/github/license/dothinking/pdf2docx)

-- Parse text, table and layout from PDF file with `PyMuPDF`
+- Parse layout (text, image and table) from PDF file with `PyMuPDF`
 - Generate docx with `python-docx`

 ## Features

 - [x] Parse and re-create paragraph
-    - [x] text in horizontal direction: from left to right
-    - [x] text in vertical direction: from bottom to top
+    - [x] text in horizontal/vertical direction: from left to right, from bottom to top
     - [x] font style, e.g. font name, size, weight, italic and color
     - [x] text format, e.g. highlight, underline, strike-through
-    - [x] text alignment, e.g. left/right/center/justify
-    - [ ] list style
+    - [x] text alignment, e.g. left/right/center/justify
     - [x] paragraph layout: horizontal alignment and vertical spacing
+    - [ ] list style
+    - [ ] href link

 - [x] Parse and re-create image
     - [x] in-line image
     - [x] image in Gray/RGB/CMYK mode
     - [x] transparent image
+    - [x] floating image, i.e. picture behind text

 - [x] Parse and re-create table
     - [x] border style, e.g. width, color
@@ -41,7 +42,6 @@
 - Normal reading direction only
     - horizontal/vertical paragraph/line/word
     - no word transformation, e.g. rotation
-- No floating images


 ## Installation
@@ -74,80 +74,110 @@ $ pip uninstall pdf2docx

 ## Usage

+`pdf2docx` can be used as either CLI or a library.
+
+### Command Line Interface
+
 ```
 $ pdf2docx --help

 NAME
-    pdf2docx - Run the pdf2docx parser.
+    pdf2docx - Command line interface for pdf2docx.

 SYNOPSIS
-    pdf2docx PDF_FILE <flags>
+    pdf2docx COMMAND | -

 DESCRIPTION
-    Run the pdf2docx parser.
+    Command line interface for pdf2docx.

-POSITIONAL ARGUMENTS
-    PDF_FILE
-        PDF filename to read from
+COMMANDS
+    COMMAND is one of the following:

-FLAGS
-    --docx_file=DOCX_FILE
-        DOCX filename to write to
-    --start=START
-        first page to process, starting from zero
-    --end=END
-        last page to process, starting from zero
-    --pages=PAGES
-        range of pages
-    --multi_processing=MULTI_PROCESSING
+     convert
+       Convert pdf file to docx file.

-NOTES
-    You can also use flags syntax for POSITIONAL ARGUMENTS
+     debug
+       Convert one PDF page and plot layout information for debugging.
+
+     table
+       Extract table content from pdf pages.
 ```

-### By range of pages
+- By range of pages

-```
-$ pdf2docx test.pdf test.docx --start=5 --end=10
-```
+Specify pages range by `--start` (from the first page if omitted) and `--end` (to the last page if omitted). Note the page index is zero-based by default, but can turn it off by `--zero_based_index=False`, i.e. the first page index starts from 1.

-### By page numbers

-```
-$ pdf2docx test.pdf test.docx --pages=5,7,9
+```bash
+$ pdf2docx convert test.pdf test.docx # all pages
+
+$ pdf2docx convert test.pdf test.docx --start=1 # from the second page to the end
+
+$ pdf2docx convert test.pdf test.docx --end=3 # from the first page to the third (index=2)
+
+$ pdf2docx convert test.pdf test.docx --start=1 --end=3 # the second and third pages
+
+$ pdf2docx convert test.pdf test.docx --start=1 --end=3 --zero_based_index=False # the first and second pages
+
 ```

-### Multi-Processing
+- By page numbers

+```bash
+$ pdf2docx convert test.pdf test.docx --pages=0,2,4 # the first, third and 5th pages
 ```
-$ pdf2docx test.pdf --multi_processing=True
+
+- Multi-Processing
+
+```bash
+$ pdf2docx convert test.pdf test.docx --multi_processing=True # default count of CPU
+
+$ pdf2docx convert test.pdf test.docx --multi_processing=True --cpu_count=4
 ```


-### As a library
+### Python Library
+
+We can use either the `Converter` class or a wrapped method `parse()`.
+
+- `Converter`

 ```python
-''' With this library installed with
-`pip install pdf2docx`, or `python setup.py install`.
-'''
+from pdf2docx import Converter
+
+pdf_file = '/path/to/sample.pdf'
+docx_file = 'path/to/sample.docx'

+# convert pdf to docx
+cv = Converter(pdf_file)
+cv.convert(docx_file, start=0, end=None)
+cv.close()
+```
+
+
+- Wrapped method `parse()`
+
+```python
 from pdf2docx import parse

 pdf_file = '/path/to/sample.pdf'
 docx_file = 'path/to/sample.docx'

 # convert pdf to docx
-parse(pdf_file, docx_file, start=0, end=1)
+parse(pdf_file, docx_file, start=0, end=None)
 ```

 Or just to extract tables,

 ```python
-from pdf2docx import extract_tables
+from pdf2docx import Converter

 pdf_file = '/path/to/sample.pdf'

-tables = extract_tables(pdf_file, start=0, end=1)
+cv = Converter(pdf_file)
+tables = cv.extract_tables(start=0, end=1)
+cv.close()
+
 for table in tables:
     print(table)


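Review note: the page-selection rules in the README changes (`--start`, an exclusive `--end`, `--pages`, and `--zero_based_index`) can be sketched in a few lines of Python. This is only an illustration of the documented behaviour, not the library's implementation. `resolve_pages` is a hypothetical helper, and two details are assumptions not stated in the README: that an explicit page list takes precedence over `start`/`end`, and that out-of-range indices are dropped.

```python
from typing import List, Optional, Sequence


def resolve_pages(
    num_pages: int,
    start: int = 0,
    end: Optional[int] = None,
    pages: Optional[Sequence[int]] = None,
    zero_based_index: bool = True,
) -> List[int]:
    """Return the zero-based page indices selected by the options.

    Hypothetical helper for illustration; not part of pdf2docx.
    """
    # With one-based input, every user-supplied index is shifted down by one.
    offset = 0 if zero_based_index else 1

    if pages:
        # Assumption: an explicit page list wins over start/end.
        indices = [p - offset for p in pages]
    else:
        first = start - offset
        # `end` is exclusive, matching "--end=3 ... to the third (index=2)".
        last = num_pages if end is None else end - offset
        indices = list(range(first, last))

    # Assumption: indices outside the document are dropped.
    return [i for i in indices if 0 <= i < num_pages]


print(resolve_pages(10, start=1, end=3))                          # [1, 2]
print(resolve_pages(10, start=1, end=3, zero_based_index=False))  # [0, 1]
print(resolve_pages(10, pages=[0, 2, 4]))                         # [0, 2, 4]
```

The three calls mirror the README examples: the second and third pages, the first and second pages once one-based indexing is enabled, and an explicit page list.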
pdf2docx/__init__.py

Lines changed: 1 addition & 1 deletion
@@ -1,3 +1,3 @@
 from .converter import Converter
-from .layout.Layout import Layout
+from .page.Page import Page
 from .main import parse

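Review note: the README examples in this commit call `cv.close()` explicitly, so the call is skipped if `convert()` raises. One way to guard against that, sketched here under assumptions, is the standard library's `contextlib.closing`, which works with any object that has a `close()` method. The diff does not show whether `Converter` supports `with` directly, so the sketch does not rely on it. `StubConverter` and `convert_and_close` are hypothetical names used only to keep the example runnable without a PDF; with the real package the stub would be replaced by `Converter(pdf_file)`.

```python
from contextlib import closing


class StubConverter:
    """Stand-in exposing the `convert`/`close` names used in the README."""

    def __init__(self, pdf_file):
        self.pdf_file = pdf_file
        self.closed = False

    def convert(self, docx_file, start=0, end=None):
        if self.closed:
            raise ValueError("converter is already closed")
        # The stub only echoes its arguments; no file is read or written.
        return (self.pdf_file, docx_file, start, end)

    def close(self):
        self.closed = True


def convert_and_close(converter, docx_file, **kwargs):
    """Run convert() and call close() even if the conversion raises."""
    with closing(converter) as cv:
        return cv.convert(docx_file, **kwargs)


cv = StubConverter('/path/to/sample.pdf')
convert_and_close(cv, '/path/to/sample.docx', start=0, end=None)
print(cv.closed)  # True
```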
pdf2docx/common/share.py

Lines changed: 6 additions & 6 deletions
@@ -198,15 +198,15 @@ def inner(*args, **kwargs):
             # execute function
             objects = func(*args, **kwargs)

-            # check if plot layout
-            layout = args[0] # Layout object
-            debug = layout.settings.get('debug', False)
-            doc = layout.settings.get('debug_doc', None)
-            filename = layout.settings.get('debug_filename', None)
+            # check if plot page
+            page = args[0] # Page object
+            debug = page.settings.get('debug', False)
+            doc = page.settings.get('debug_doc', None)
+            filename = page.settings.get('debug_filename', None)

             if objects and debug and doc is not None:
                 # create a new page
-                page = new_page(doc, layout.width, layout.height, title)
+                page = new_page(doc, page.width, page.height, title)
                 # plot objects, e.g. text blocks, shapes, tables...
                 objects.plot(page)
                 doc.save(filename)

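Review note: after the rename, `page` is first the `Page` object taken from `args[0]` and is then rebound to the value returned by `new_page(...)`. The code still works because the right-hand side is evaluated before the assignment, but the two roles are easy to confuse. Below is a minimal sketch of the same wrapper with a separate name for the debug page. Only the body of `inner` comes from the diff. The factory name `debug_plot`, passing `new_page` in as a parameter, the `canvas` variable, and the final `return objects` are assumptions made to keep the sketch self-contained.

```python
import functools


def debug_plot(title, new_page):
    """Hypothetical decorator factory mirroring the wrapper in the diff.

    `new_page` is injected here so the sketch runs on its own; in the
    diff it is a function available in the enclosing module.
    """
    def wrapper(func):
        @functools.wraps(func)
        def inner(*args, **kwargs):
            # execute function
            objects = func(*args, **kwargs)

            # read the debug settings from the decorated method's `self`
            page = args[0]
            debug = page.settings.get('debug', False)
            doc = page.settings.get('debug_doc', None)
            filename = page.settings.get('debug_filename', None)

            if objects and debug and doc is not None:
                # a distinct name keeps the parsed Page and the debug page apart
                canvas = new_page(doc, page.width, page.height, title)
                objects.plot(canvas)
                doc.save(filename)

            # Assumption: the wrapper hands the result back to the caller.
            return objects
        return inner
    return wrapper
```

Whether the rename is worth a follow-up change is a judgment call, since the behaviour is unchanged either way.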