Commit 264bec5

minor update
2 parents 64f07ef + 30835d0 commit 264bec5

8 files changed

+141
-44
lines changed


README.md

Lines changed: 30 additions & 28 deletions
Original file line number | Diff line number | Diff line change
@@ -1,3 +1,5 @@
1+
English | [中文](README_CN.md)
2+
13
# pdf2docx
24

35
![python-version](https://img.shields.io/badge/python->=3.6-green.svg)
@@ -12,35 +14,35 @@
1214

1315
## Features
1416

15-
- [x] Parse and re-create page layout
16-
- [x] page margin
17-
- [x] section and column (1 or 2 columns only)
18-
- [ ] page header and footer
19-
20-
- [x] Parse and re-create paragraph
21-
- [ ] OCR text
22-
- [x] text in horizontal/vertical direction: from left to right, from bottom to top
23-
- [x] font style, e.g. font name, size, weight, italic and color
24-
- [x] text format, e.g. highlight, underline, strike-through
25-
- [ ] list style
26-
- [x] external hyper link
27-
- [x] paragraph horizontal alignment (left/right/center/justify) and vertical spacing
17+
- Parse and re-create page layout
18+
- page margin
19+
- section and column (1 or 2 columns only)
20+
- page header and footer [TODO]
21+
22+
- Parse and re-create paragraph
23+
- OCR text [TODO]
24+
- text in horizontal/vertical direction: from left to right, from bottom to top
25+
- font style, e.g. font name, size, weight, italic and color
26+
- text format, e.g. highlight, underline, strike-through
27+
- list style [TODO]
28+
- external hyper link
29+
- paragraph horizontal alignment (left/right/center/justify) and vertical spacing
2830

29-
- [x] Parse and re-create image
30-
- [x] in-line image
31-
- [x] image in Gray/RGB/CMYK mode
32-
- [x] transparent image
33-
- [x] floating image, i.e. picture behind text
34-
35-
- [x] Parse and re-create table
36-
- [x] border style, e.g. width, color
37-
- [x] shading style, i.e. background color
38-
- [x] merged cells
39-
- [x] vertical direction cell
40-
- [x] table with partly hidden borders
41-
- [x] nested tables
42-
43-
- [x] Parsing pages with multi-processing
31+
- Parse and re-create image
32+
- in-line image
33+
- image in Gray/RGB/CMYK mode
34+
- transparent image
35+
- floating image, i.e. picture behind text
36+
37+
- Parse and re-create table
38+
- border style, e.g. width, color
39+
- shading style, i.e. background color
40+
- merged cells
41+
- vertical direction cell
42+
- table with partly hidden borders
43+
- nested tables
44+
45+
- Parsing pages with multi-processing
4446

4547
*It can also be used as a tool to extract table contents since both table content and format/style are parsed.*
4648

README_CN.md

Lines changed: 70 additions & 0 deletions
@@ -0,0 +1,70 @@
1+
[English](README.md) | 中文
2+
3+
# pdf2docx
4+
5+
![python-version](https://img.shields.io/badge/python->=3.6-green.svg)
6+
[![codecov](https://codecov.io/gh/dothinking/pdf2docx/branch/master/graph/badge.svg)](https://codecov.io/gh/dothinking/pdf2docx)
7+
[![pypi-version](https://img.shields.io/pypi/v/pdf2docx.svg)](https://pypi.python.org/pypi/pdf2docx/)
8+
![license](https://img.shields.io/pypi/l/pdf2docx.svg)
9+
![pypi-downloads](https://img.shields.io/pypi/dm/pdf2docx)
10+
11+
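- Extract raw data such as text, images and vector graphics with `PyMuPDF`
12+
- Parse the layout and style of sections, paragraphs, tables, images and text with rules
13+
- Create the Word document with `python-docx`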
14+
15+
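## Features
16+
17+
- Parse and re-create page layout
18+
- page margin
19+
- section and column (at most two columns are supported for now)
20+
- page header and footer [TODO]
21+
22+
- Parse and re-create paragraph
23+
- OCR text [TODO]
24+
- text in horizontal (left to right) or vertical (bottom to top) direction
25+
- font style, e.g. font name, size, bold/italic and color
26+
- text format, e.g. highlight, underline and strike-through
27+
- list style [TODO]
28+
- external hyperlink
29+
- paragraph horizontal alignment (left/right/center/justify) and spacing before/after
30+
31+
- Parse and re-create image
32+
- in-line image
33+
- image in Gray/RGB/CMYK and other color spaces
34+
- image with a transparency channel
35+
- floating image (behind text)
36+
37+
- Parse and re-create table
38+
- border style, e.g. width and color
39+
- cell background color
40+
- merged cells
41+
- vertical text in cell
42+
- table with partly hidden borders
43+
- nested tables
44+
45+
- Conversion with multi-processing
46+
47+
*`pdf2docx` parses both table content and style, so it can also be used as a tool to extract table contents.*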
48+
49+
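## Limitations
50+
51+
- Text recognition (OCR) for scanned PDFs is not supported yet
52+
- Only languages written from left to right are supported (so Arabic is not)
53+
- Rotated text is not supported
54+
- Rule-based parsing cannot guarantee a 100% reproduction of the PDF style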
55+
56+
57+
## Usage
58+
59+
- [Installation](https://dothinking.github.io/pdf2docx/installation.html)
60+
- [Quickstart](https://dothinking.github.io/pdf2docx/quickstart.html)
61+
- [Convert PDF](https://dothinking.github.io/pdf2docx/quickstart.convert.html)
62+
- [Extract tables](https://dothinking.github.io/pdf2docx/quickstart.table.html)
63+
- [Command line arguments](https://dothinking.github.io/pdf2docx/quickstart.cli.html)
64+
- [Simple GUI](https://dothinking.github.io/pdf2docx/quickstart.gui.html)
65+
- [Technical documentation](https://dothinking.github.io/pdf2docx/techdoc.html)
66+
- [API documentation](https://dothinking.github.io/pdf2docx/modules.html)
67+
68+
## Sample
69+
70+
![sample_compare.png](https://s1.ax1x.com/2020/08/04/aDryx1.png)

doc/installation.rst

Lines changed: 24 additions & 7 deletions
@@ -4,24 +4,41 @@ Installation
44
``pdf2docx`` can be installed from either Pypi or the source code.
55

66

7-
From Pypi
8-
----------------
9-
::
7+
Install from Pypi
8+
-------------------
9+
10+
Type the command below for a new installation::
1011

1112
$ pip install pdf2docx
1213

14+
Or, upgrade this library with::
1315

14-
From source code
15-
-------------------
16+
$ pip install --upgrade pdf2docx
17+
18+
19+
Install from source code remotely
20+
--------------------------------------
1621

17-
Clone or download `pdf2docx <https://github.com/dothinking/pdf2docx>`_, and navigate to the root directory::
22+
Install ``pdf2docx`` directly from the ``master`` branch::
23+
24+
$ pip install git+https://github.com/dothinking/pdf2docx.git@master --upgrade
25+
26+
.. note::
27+
Installed this way, ``pdf2docx`` may be newer than the latest version on Pypi, because ``master`` can contain changes that are not released yet.
28+
29+
30+
Install from source code locally
31+
---------------------------------------
32+
33+
Clone or download `pdf2docx <https://github.com/dothinking/pdf2docx>`_, navigate to the root directory and run::
1834

1935
$ python setup.py install
2036

21-
Or install it in developing mode::
37+
Or, install it in developing mode::
2238

2339
$ python setup.py develop
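
After any of the installation routes above, it can be useful to confirm which version actually ended up in the environment. The following is a minimal sketch rather than part of the project: the helper name ``installed_version`` is hypothetical, and it assumes Python 3.8 or later, where ``importlib.metadata`` is in the standard library.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str = "pdf2docx") -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed.

    Hypothetical helper for illustration; it is not shipped with pdf2docx.
    """
    try:
        return version(package)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version()
    print(found if found else "pdf2docx is not installed")
```

A version read this way reflects the package metadata, so an install from ``master`` may report a version that differs from the latest release.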
2440

41+
2542
Uninstall
2643
--------------
2744

pdf2docx/common/Element.py

Lines changed: 2 additions & 2 deletions
@@ -50,7 +50,7 @@ def pure_rotation_matrix(cls):
5050

5151
def __init__(self, raw:dict=None, parent=None):
5252
''' Initialize Element and convert to the real (rotation considered) page coordinate system.'''
53-
self.bbox = fitz.Rect()
53+
self.bbox = fitz.Rect() # type: fitz.Rect
5454
self._parent = parent # type: Element
5555

5656
# NOTE: Any coordinates provided in raw are in the original page CS (without considering page rotation).
@@ -130,7 +130,7 @@ def union_bbox(self, e):
130130
# --------------------------------------------
131131
# location relationship to other Element instance
132132
# --------------------------------------------
133-
def contains(self, e, threshold:float=1.0):
133+
def contains(self, e:'Element', threshold:float=1.0):
134134
"""Whether given element is contained in this instance, with margin considered.
135135
136136
Args:
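
The hunk only shows the signature and the first docstring line of `contains`, so the rule it applies is not visible here. As a rough illustration of "contained, with margin considered", the sketch below treats `threshold` as an absolute tolerance added to every side of the outer box. That interpretation, the `Box` type and the free function are all assumptions; the real method may use a different criterion.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """Axis-aligned rectangle; a hypothetical stand-in for fitz.Rect."""
    x0: float
    y0: float
    x1: float
    y1: float


def contains(outer: Box, inner: Box, threshold: float = 1.0) -> bool:
    """True if `inner` lies within `outer` after growing `outer` by `threshold` on every side.

    Assumption: the margin is an absolute tolerance in page units.
    """
    return (
        inner.x0 >= outer.x0 - threshold
        and inner.y0 >= outer.y0 - threshold
        and inner.x1 <= outer.x1 + threshold
        and inner.y1 <= outer.y1 + threshold
    )
```

With this reading, a box that pokes out by half a unit still counts as contained at the default threshold, while one that pokes out by two units does not.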

pdf2docx/font/Fonts.py

Lines changed: 4 additions & 2 deletions
@@ -72,8 +72,10 @@ def extract(cls, fitz_doc):
7272
name = cls._normalized_font_name(basename)
7373

7474
try:
75-
# process embedded and supported fonts (true type) only
76-
assert ext not in ('n/a', 'ccf'), "base font or not supported font"
75+
# supported fonts: open/true type only
76+
# - n/a: base 14 fonts
77+
# - cff: Adobe Compact Font Format, i.e. Type 1 font
78+
assert ext not in ('n/a', 'cff'), "base font or not supported font"
7779

7880
# try to get more font metrics with fontTools
7981
tt = TTFont(BytesIO(buffer))
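
The effect of the `'ccf'` to `'cff'` correction can be shown in isolation. The sketch below is not the project's code: `UNSUPPORTED_EXTS`, `is_supported_font` and `filter_supported` are hypothetical names, and it uses an explicit membership test instead of `assert`, since `assert` statements are removed when Python runs with `-O`.

```python
# Extensions skipped, following the comments in the hunk above:
#   'n/a' -> base 14 fonts
#   'cff' -> Compact Font Format data
UNSUPPORTED_EXTS = frozenset({"n/a", "cff"})


def is_supported_font(ext: str) -> bool:
    """True if a font with this extension should be processed further."""
    return ext not in UNSUPPORTED_EXTS


def filter_supported(fonts):
    """Keep the (name, ext) pairs whose extension is supported."""
    return [(name, ext) for name, ext in fonts if is_supported_font(ext)]
```

With the misspelled tuple `('n/a', 'ccf')`, the check `'cff' not in ('n/a', 'ccf')` evaluates to true, so fonts with the `cff` extension would have passed the guard and reached the `TTFont` call.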

setup.py

Lines changed: 6 additions & 5 deletions
@@ -4,6 +4,7 @@
44
import os
55
from setuptools import find_packages, setup
66

7+
DESCRIPTION = 'Open source Python library converting pdf to docx.'
78
EXCLUDE_FROM_PACKAGES = ["build", "dist", "test"]
89

910
# read version number from version.txt, otherwise alpha version
@@ -23,7 +24,7 @@ def load_long_description(fname):
2324
with open(fname, 'r') as f:
2425
long_description = f.read()
2526
else:
26-
long_description = 'Parse PDF file with PyMuPDF and generate docx with python-docx.'
27+
long_description = DESCRIPTION
2728

2829
return long_description
2930

@@ -45,11 +46,11 @@ def load_requirements(fname):
4546

4647
setup(
4748
name="pdf2docx",
48-
version=get_version('version.txt'),
49+
version=get_version("version.txt"),
4950
keywords=["pdf-to-word", "pdf-to-docx"],
50-
description="parse PDF files to docx",
51-
long_description=load_long_description('README.md'),
52-
long_description_content_type='text/markdown',
51+
description=DESCRIPTION,
52+
long_description=load_long_description("README.md"),
53+
long_description_content_type="text/markdown",
5354
license="GPL v3",
5455
author="dothinking",
5556
author_email="train8808@gmail.com",
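
The fallback introduced in this hunk can be read as "use the README if it is available, otherwise the one-line `DESCRIPTION`". The sketch below reconstructs that pattern under two assumptions that the hunk does not show: the condition is a file-existence check, and the file is read as UTF-8. The explicit encoding is my addition, motivated by the non-ASCII language link that this commit adds to `README.md`.

```python
import os

DESCRIPTION = "Open source Python library converting pdf to docx."


def load_long_description(fname: str, fallback: str = DESCRIPTION) -> str:
    """Return the text of `fname` if it exists, else `fallback`.

    Assumptions: existence check as the condition, UTF-8 as the encoding.
    """
    if os.path.exists(fname):
        with open(fname, "r", encoding="utf-8") as f:
            return f.read()
    return fallback
```

Keeping the short description in a single constant means the `description` field and the fallback long description cannot drift apart.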

test/samples/demo-text-hidden.pdf

82.9 KB
Binary file not shown.

test/test.py

Lines changed: 5 additions & 0 deletions
@@ -148,6 +148,10 @@ def test_unnamed_fonts(self):
148148
def test_text_scaling(self):
149149
'''test font size. In this case, the font size is set precisely with character scaling.'''
150150
self.convert('demo-text-scaling')
151+
152+
def test_text_hidden(self):
153+
'''test hidden text, which is ignored by default.'''
154+
self.convert('demo-text-hidden')
151155

152156
# ------------------------------------------
153157
# image styles
@@ -303,6 +307,7 @@ class TestQuality:
303307
'demo-text-alignment.pdf': 0.90,
304308
'demo-text-scaling.pdf': 0.80,
305309
'demo-text-unnamed-fonts.pdf': 0.80,
310+
'demo-text-hidden.pdf': 0.90,
306311
'demo-text.pdf': 0.80
307312
}
308313
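
The second hunk registers a minimum value of 0.90 for the new sample in what appears to be a per-file threshold table. How `TestQuality` computes and compares its scores is not visible in this diff, so the following is only a sketch of one way such a table could be checked. `below_threshold` and the scores passed to it are hypothetical, and treating a missing score as 0.0 is my choice.

```python
from typing import Dict

# Thresholds copied from the hunk above; the scores fed in are made-up inputs.
THRESHOLDS: Dict[str, float] = {
    "demo-text-scaling.pdf": 0.80,
    "demo-text-unnamed-fonts.pdf": 0.80,
    "demo-text-hidden.pdf": 0.90,
    "demo-text.pdf": 0.80,
}


def below_threshold(
    scores: Dict[str, float],
    thresholds: Dict[str, float] = THRESHOLDS,
) -> Dict[str, float]:
    """Return the samples whose score is missing or lower than the required minimum."""
    return {
        name: scores.get(name, 0.0)
        for name, minimum in thresholds.items()
        if scores.get(name, 0.0) < minimum
    }
```

An empty result means every listed sample met its minimum; any entry in the result names a sample that needs attention.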
