minor update

dothinking · dothinking · commit 264bec5a153e · 2022-05-05T09:24:37.000+08:00
diff --git a/README.md b/README.md
@@ -1,3 +1,5 @@
+English | [中文](README_CN.md)
+
 # pdf2docx 
 
 ![python-version](https://img.shields.io/badge/python->=3.6-green.svg)
@@ -12,35 +14,35 @@
 
 ## Features
 
-- [x] Parse and re-create page layout
-    - [x] page margin
-    - [x] section and column (1 or 2 columns only)
-    - [ ] page header and footer
-
-- [x] Parse and re-create paragraph
-    - [ ] OCR text
-    - [x] text in horizontal/vertical direction: from left to right, from bottom to top
-    - [x] font style, e.g. font name, size, weight, italic and color
-    - [x] text format, e.g. highlight, underline, strike-through
-    - [ ] list style
-    - [x] external hyper link
-    - [x] paragraph horizontal alignment (left/right/center/justify) and vertical spacing
+- Parse and re-create page layout
+    - page margin
+    - section and column (1 or 2 columns only)
+    - page header and footer [TODO]
+
+- Parse and re-create paragraph
+    - OCR text [TODO]
+    - text in horizontal/vertical direction: from left to right, from bottom to top
+    - font style, e.g. font name, size, weight, italic and color
+    - text format, e.g. highlight, underline, strike-through
+    - list style [TODO]
+    - external hyper link
+    - paragraph horizontal alignment (left/right/center/justify) and vertical spacing
     
-- [x] Parse and re-create image
-	- [x] in-line image
-    - [x] image in Gray/RGB/CMYK mode
-    - [x] transparent image
-    - [x] floating image, i.e. picture behind text
-
-- [x] Parse and re-create table
-    - [x] border style, e.g. width, color
-    - [x] shading style, i.e. background color
-    - [x] merged cells
-    - [x] vertical direction cell
-    - [x] table with partly hidden borders
-    - [x] nested tables
-
-- [x] Parsing pages with multi-processing
+- Parse and re-create image
+    - in-line image
+    - image in Gray/RGB/CMYK mode
+    - transparent image
+    - floating image, i.e. picture behind text
+
+- Parse and re-create table
+    - border style, e.g. width, color
+    - shading style, i.e. background color
+    - merged cells
+    - vertical direction cell
+    - table with partly hidden borders
+    - nested tables
+
+- Parsing pages with multi-processing
 
 *It can also be used as a tool to extract table contents since both table content and format/style are parsed.*
 
diff --git a/README_CN.md b/README_CN.md
@@ -0,0 +1,70 @@
+[English](README.md) | 中文
+
+# pdf2docx 
+
+![python-version](https://img.shields.io/badge/python->=3.6-green.svg)
+[![codecov](https://codecov.io/gh/dothinking/pdf2docx/branch/master/graph/badge.svg)](https://codecov.io/gh/dothinking/pdf2docx)
+[![pypi-version](https://img.shields.io/pypi/v/pdf2docx.svg)](https://pypi.python.org/pypi/pdf2docx/)
+![license](https://img.shields.io/pypi/l/pdf2docx.svg)
+![pypi-downloads](https://img.shields.io/pypi/dm/pdf2docx)
+
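+- Extract raw data such as text, images and vector graphics with `PyMuPDF`
+- Parse layout and styles of sections, paragraphs, tables, images and text based on rules
+- Create the Word document with `python-docx`
+
+## Features
+
+- Parse and re-create page layout
+    - page margin
+    - section and column (at most two-column layouts for now)
+    - page header and footer [TODO]
+
+- Parse and re-create paragraph
+    - OCR text [TODO]
+    - text in horizontal (left to right) or vertical (bottom to top) direction
+    - font style, e.g. font name, size, bold/italic and color
+    - text format, e.g. highlight, underline and strike-through
+    - list style [TODO]
+    - external hyperlink
+    - paragraph horizontal alignment (left/right/center/justify) and spacing before/after
+
+- Parse and re-create image
+    - in-line image
+    - image in Gray/RGB/CMYK color space
+    - image with transparency channel
+    - floating image (behind text)
+
+- Parse and re-create table
+    - border style, e.g. width and color
+    - cell background color
+    - merged cells
+    - vertical text in cell
+    - table with partly hidden borders
+    - nested tables
+
+- Multi-processing conversion
+
+*`pdf2docx` parses both table content and style, so it can also be used as a tool to extract table contents.*
+
+## Limitations
+
+- Text recognition for scanned PDFs is not supported yet
+- Only left-to-right languages are supported (so Arabic is not supported)
+- Rotated text is not supported
+- Rule-based parsing cannot guarantee 100% restoration of the PDF style
+
+
+## Documentation
+
+- [Installation](https://dothinking.github.io/pdf2docx/installation.html)
+- [Quickstart](https://dothinking.github.io/pdf2docx/quickstart.html)
+    - [Convert PDF](https://dothinking.github.io/pdf2docx/quickstart.convert.html)
+    - [Extract tables](https://dothinking.github.io/pdf2docx/quickstart.table.html)
+    - [Command line arguments](https://dothinking.github.io/pdf2docx/quickstart.cli.html)
+    - [Simple graphical interface](https://dothinking.github.io/pdf2docx/quickstart.gui.html)
+- [Technical documentation](https://dothinking.github.io/pdf2docx/techdoc.html)
+- [API documentation](https://dothinking.github.io/pdf2docx/modules.html)
+
+## Sample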
+
+![sample_compare.png](https://s1.ax1x.com/2020/08/04/aDryx1.png)
diff --git a/doc/installation.rst b/doc/installation.rst
@@ -4,24 +4,41 @@ Installation
 ``pdf2docx`` can be installed from either PyPI or the source code.
 
 
-From Pypi
-----------------
-::
+Install from PyPI
+-------------------
+
+Type the command below for a new installation::
 
   $ pip install pdf2docx
 
+Or, upgrade this library with::
 
-From source code
--------------------
+  $ pip install --upgrade pdf2docx
+
+
+Install from source code remotely
+--------------------------------------
 
-Clone or download `pdf2docx <https://github.com/dothinking/pdf2docx>`_, and navigate to the root directory::
+Install ``pdf2docx`` directly from the ``master`` branch::
+
+  $ pip install git+https://github.com/dothinking/pdf2docx.git@master --upgrade
+
+.. note::
+  Installed this way, ``pdf2docx`` may be newer than the latest version released on PyPI.
+
+
+Install from source code locally
+---------------------------------------
+
+Clone or download `pdf2docx <https://github.com/dothinking/pdf2docx>`_, navigate to the root directory and run::
 
   $ python setup.py install 
 
-Or install it in developing mode::
+Or, install it in development mode::
 
   $ python setup.py develop
 
+
 Uninstall
 --------------
 
diff --git a/pdf2docx/common/Element.py b/pdf2docx/common/Element.py
@@ -50,7 +50,7 @@ def pure_rotation_matrix(cls):
 
     def __init__(self, raw:dict=None, parent=None):
         ''' Initialize Element and convert to the real (rotation considered) page coordinate system.'''        
-        self.bbox = fitz.Rect()
+        self.bbox = fitz.Rect()  # type: fitz.Rect
         self._parent = parent # type: Element
 
         # NOTE: Any coordinates provided in raw is in original page CS (without considering page rotation).
@@ -130,7 +130,7 @@ def union_bbox(self, e):
     # --------------------------------------------
     # location relationship to other Element instance
     # -------------------------------------------- 
-    def contains(self, e, threshold:float=1.0):
+    def contains(self, e:'Element', threshold:float=1.0):
         """Whether given element is contained in this instance, with margin considered.
 
         Args:
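
The body of `contains` is not part of this hunk, so the following is only a minimal sketch of one plausible reading of "contained, with margin considered": the share of the inner box that overlaps the outer box must reach `threshold`. It uses plain tuples instead of `fitz.Rect`, and the helper names `area` and `intersection` are hypothetical, not taken from the library.

```python
from typing import Tuple

# Hypothetical stand-in for fitz.Rect: (x0, y0, x1, y1)
Rect = Tuple[float, float, float, float]


def area(r: Rect) -> float:
    """Area of a rectangle; empty or inverted rectangles count as zero."""
    x0, y0, x1, y1 = r
    return max(0.0, x1 - x0) * max(0.0, y1 - y0)


def intersection(a: Rect, b: Rect) -> Rect:
    """Overlapping region of two rectangles (may be inverted if disjoint)."""
    return (max(a[0], b[0]), max(a[1], b[1]), min(a[2], b[2]), min(a[3], b[3]))


def contains(outer: Rect, inner: Rect, threshold: float = 1.0) -> bool:
    """True if at least `threshold` of inner's area lies inside outer."""
    inner_area = area(inner)
    if inner_area == 0:
        return False  # assumption: an empty box is never "contained"
    return area(intersection(outer, inner)) / inner_area >= threshold
```

With the default `threshold=1.0` this reduces to strict containment; lower values tolerate a box that sticks out slightly.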
diff --git a/pdf2docx/font/Fonts.py b/pdf2docx/font/Fonts.py
@@ -72,8 +72,10 @@ def extract(cls, fitz_doc):
             name = cls._normalized_font_name(basename)
             
             try:
-                # process embedded and supported fonts (true type) only
-                assert ext not in ('n/a', 'ccf'), "base font or not supported font"
+                # supported fonts: open/true type only
+                # - n/a: base 14 fonts
+                # - cff: Adobe Compact Font Format, i.e. Type 1 font
+                assert ext not in ('n/a', 'cff'), "base font or not supported font"
 
                 # try to get more font metrics with fontTools
                 tt = TTFont(BytesIO(buffer))
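
The corrected extension check above can be illustrated in isolation. This is a sketch only: `UNSUPPORTED_EXTENSIONS`, `is_supported_font` and `supported_fonts` are hypothetical names, and the `(name, ext)` tuples stand in for whatever `extract` actually iterates over. It uses an explicit condition rather than `assert`, since assertions are skipped when Python runs with `-O`.

```python
from typing import Iterable, List, Tuple

# Extensions named in the hunk above:
# - "n/a": base 14 fonts (nothing embedded)
# - "cff": Adobe Compact Font Format
UNSUPPORTED_EXTENSIONS = ("n/a", "cff")


def is_supported_font(ext: str) -> bool:
    """True for fonts that are neither base-14 nor CFF."""
    return ext.lower() not in UNSUPPORTED_EXTENSIONS


def supported_fonts(fonts: Iterable[Tuple[str, str]]) -> List[str]:
    """Keep the names of fonts whose extension is supported."""
    return [name for name, ext in fonts if is_supported_font(ext)]


if __name__ == "__main__":
    sample = [("Arial", "ttf"), ("Helvetica", "n/a"), ("Minion", "cff")]
    print(supported_fonts(sample))
```

Note that with the previous spelling `'ccf'`, a font reported as `cff` would have passed the check, which is what the one-character fix in this hunk addresses.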
diff --git a/setup.py b/setup.py
@@ -4,6 +4,7 @@
 import os
 from setuptools import find_packages, setup
 
+DESCRIPTION = 'Open source Python library converting pdf to docx.'
 EXCLUDE_FROM_PACKAGES = ["build", "dist", "test"]
 
 # read version number from version.txt, otherwise alpha version
@@ -23,7 +24,7 @@ def load_long_description(fname):
         with open(fname, 'r') as f:
             long_description = f.read()
     else:
-        long_description = 'Parse PDF file with PyMuPDF and generate docx with python-docx.'
+        long_description = DESCRIPTION
 
     return long_description
 
@@ -45,11 +46,11 @@ def load_requirements(fname):
 
 setup(
     name="pdf2docx",    
-    version=get_version('version.txt'),
+    version=get_version("version.txt"),
     keywords=["pdf-to-word", "pdf-to-docx"],
-    description="parse PDF files to docx",
-    long_description=load_long_description('README.md'),
-    long_description_content_type='text/markdown',
+    description=DESCRIPTION,
+    long_description=load_long_description("README.md"),
+    long_description_content_type="text/markdown",
     license="GPL v3", 
     author="dothinking",
     author_email="train8808@gmail.com",
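
The fallback introduced in `load_long_description` can be sketched on its own. The existence check is an assumption, since the hunk shows only the `with open(...)` and `else` branches. The explicit `encoding="utf-8"` is also an addition not present in the diff; it may be worth considering now that `README.md` contains non-ASCII characters in its language switcher.

```python
import os

DESCRIPTION = 'Open source Python library converting pdf to docx.'


def load_long_description(fname: str) -> str:
    """Return the file content if it exists, else the short description."""
    if os.path.exists(fname):  # assumed condition; not visible in the hunk
        # utf-8 is an assumption; the diff opens the file with the default encoding
        with open(fname, 'r', encoding='utf-8') as f:
            return f.read()
    return DESCRIPTION
```

Sharing one `DESCRIPTION` constant keeps the `description` and the fallback `long_description` from drifting apart.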
diff --git a/test/samples/demo-text-hidden.pdf b/test/samples/demo-text-hidden.pdf
diff --git a/test/test.py b/test/test.py
@@ -148,6 +148,10 @@ def test_unnamed_fonts(self):
     def test_text_scaling(self):
         '''test font size. In this case, the font size is set precisely with character scaling.'''
         self.convert('demo-text-scaling')
+    
+    def test_text_hidden(self):
+        '''test hidden text, which is ignored by default.'''
+        self.convert('demo-text-hidden')
 
     # ------------------------------------------
     # image styles
@@ -303,6 +307,7 @@ class TestQuality:
         'demo-text-alignment.pdf': 0.90,
         'demo-text-scaling.pdf': 0.80,
         'demo-text-unnamed-fonts.pdf': 0.80,
+        'demo-text-hidden.pdf': 0.90,
         'demo-text.pdf': 0.80
     }