Commit 366c06b

feat(ocr): add OCR fallback for scanned PDFs and images (#20)
1 parent 0674253 commit 366c06b

17 files changed, with 2028 additions and 416 deletions.

MarkItDown.spec

Lines changed: 21 additions & 2 deletions
@@ -8,6 +8,17 @@ hiddenimports = [
     "requests",
 ]
 hiddenimports += collect_submodules("markitdown")
+for package in (
+    "azure.ai.documentintelligence",
+    "azure.identity",
+    "pypdfium2",
+    "pypdfium2_raw",
+    "pytesseract",
+):
+    try:
+        hiddenimports += collect_submodules(package)
+    except Exception as e:
+        print(f"Warning: Could not collect hidden imports for {package}: {e}")
 
 datas = [
     ("markitdowngui/resources/markitdown-gui.ico", "markitdowngui/resources"),
@@ -21,6 +32,12 @@ try:
 except Exception as e:
     print(f"Warning: Could not collect magika data files: {e}")
 
+for package in ("pypdfium2", "pypdfium2_raw"):
+    try:
+        datas += collect_data_files(package)
+    except Exception as e:
+        print(f"Warning: Could not collect data files for {package}: {e}")
+
 a = Analysis(
     ["markitdowngui/main.py"],
     pathex=[],
@@ -30,7 +47,10 @@ a = Analysis(
     hookspath=[],
     hooksconfig={},
     runtime_hooks=[],
-    excludes=[],
+    excludes=[
+        "tkinter", "_tkinter",
+        "pytest", "_pytest", "pygments",
+    ],
     noarchive=False,
     optimize=1,
 )
@@ -67,4 +87,3 @@ coll = COLLECT(
     upx_exclude=[],
     name="MarkItDown",
 )
-

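The two loops added to this spec follow the same pattern: try to collect something for each optional package, and print a warning instead of aborting the build when collection fails. A minimal sketch of that pattern as a reusable helper follows. The name `collect_optional` and its `label` parameter are hypothetical and do not appear in the commit; the collector is passed in as a callable so the sketch runs without PyInstaller installed.

```python
from typing import Callable, Iterable


def collect_optional(
    packages: Iterable[str],
    collector: Callable[[str], list],
    label: str,
) -> list:
    """Run `collector` on each package and gather the results.

    A failure for one package prints a warning and is skipped,
    mirroring the try/except loops in the spec file.
    """
    collected: list = []
    for package in packages:
        try:
            collected += collector(package)
        except Exception as e:
            print(f"Warning: Could not collect {label} for {package}: {e}")
    return collected


# In a spec file this could be used as (assumption, not part of the commit):
#   hiddenimports += collect_optional(
#       ("pypdfium2", "pytesseract"), collect_submodules, "hidden imports"
#   )
```

Passing the collector as an argument keeps one helper usable for both `collect_submodules` and `collect_data_files`.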
README.md

Lines changed: 11 additions & 2 deletions
@@ -15,7 +15,8 @@ It focuses on fast multi-file conversion to Markdown with a modern Fluent-style
 - Preview modes: rendered Markdown view and raw Markdown view.
 - Save modes: export as one combined file or separate files.
 - Quick actions: copy Markdown, save output, back to queue, start over.
-- Settings for output folder, batch size, header style, table style, and theme mode (light/dark/system).
+- Optional OCR for scanned PDFs and image files, with Azure Document Intelligence first and local Tesseract fallback.
+- Settings for output folder, batch size, header style, table style, OCR, and theme mode (light/dark/system).
 - Built-in shortcuts dialog, update check action, and about dialog.
 
 ## Installation
@@ -39,6 +40,15 @@ Alternative:
 pip install -e .[dev]
 ```
 
+### OCR Notes
+
+- OCR is optional and disabled by default.
+- Local OCR requires a system `tesseract` binary. Install it from the [official Tesseract project](https://github.com/tesseract-ocr/tesseract). If it is not on your `PATH`, set the executable path in Settings.
+- Azure OCR requires an Azure Document Intelligence endpoint in Settings.
+- Azure Document Intelligence pricing includes [500 free pages per month](https://azure.microsoft.com/en-us/products/ai-foundry/tools/document-intelligence#Pricing) at the time of writing.
+- For API-key auth, set `AZURE_OCR_API_KEY`.
+- If `AZURE_OCR_API_KEY` is not set, Azure OCR falls back to Azure identity credentials supported by `DefaultAzureCredential`.
+
 ## Run the App
 
 ```sh
@@ -97,4 +107,3 @@ uv run pytest -q
 - PySide6 ([LGPLv3 License](https://www.gnu.org/licenses/lgpl-3.0.html))
 - PySide6-Fluent-Widgets / QFluentWidgets ([Project site](https://qfluentwidgets.com))
 
-

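The README changes describe two fallbacks: OCR tries Azure Document Intelligence first and then local Tesseract, and Azure authentication uses `AZURE_OCR_API_KEY` when set and `DefaultAzureCredential` otherwise. The sketch below illustrates that ordering only. The function names `select_azure_auth` and `ocr_with_fallback` are hypothetical, the backends are plain callables standing in for the real Azure and Tesseract calls, and treating an empty key as unset is an assumption; none of this is taken from the commit's implementation.

```python
import os
from typing import Callable, Mapping, Optional, Sequence, Tuple

# A backend takes a file path and returns recognized text (hypothetical type).
OcrBackend = Callable[[str], str]


def select_azure_auth(env: Optional[Mapping[str, str]] = None) -> str:
    """Pick the auth mode: API key if AZURE_OCR_API_KEY is set, else identity.

    Assumption: an empty value is treated the same as an unset variable.
    """
    env = os.environ if env is None else env
    return "api_key" if env.get("AZURE_OCR_API_KEY") else "default_credential"


def ocr_with_fallback(
    path: str, backends: Sequence[Tuple[str, OcrBackend]]
) -> Tuple[str, str]:
    """Try each named backend in order; return (backend name, text).

    Raises RuntimeError listing every failure if no backend succeeds.
    """
    errors = []
    for name, backend in backends:
        try:
            return name, backend(path)
        except Exception as e:
            errors.append(f"{name}: {e}")
    raise RuntimeError("All OCR backends failed: " + "; ".join(errors))
```

A caller would pass the backends in priority order, for example `[("azure", azure_ocr), ("tesseract", tesseract_ocr)]`, where both callables are placeholders for the application's real OCR functions.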
README_zh.md

Lines changed: 11 additions & 1 deletion
@@ -13,7 +13,8 @@
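 - Preview modes: rendered view and raw Markdown view.
 - Save modes: merge into a single file or save multiple files separately.
 - Common actions: copy Markdown, save output, back to queue, start over.
-- Settings include output folder, batch size, header style, table style, and theme mode (light/dark/follow system).
+- Optional OCR for scanned PDFs and image files, using Azure Document Intelligence first and falling back to local Tesseract when it is unavailable.
+- Settings include output folder, batch size, header style, table style, OCR, and theme mode (light/dark/follow system).
 - Built-in shortcuts panel, update check entry, and about dialog.
 
 ## Installation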
@@ -37,6 +38,15 @@ uv sync
 pip install -e .[dev]
 ```
 
+### OCR Notes
+
+- OCR is an optional feature and is disabled by default.
+- Local OCR requires `tesseract` to be installed on the system. It can be installed from the [official Tesseract project](https://github.com/tesseract-ocr/tesseract). If it is not on `PATH`, you can specify the executable path on the Settings page.
+- Azure OCR requires an Azure Document Intelligence endpoint to be entered on the Settings page.
+- The Azure Document Intelligence pricing page currently lists [500 free pages per month](https://azure.microsoft.com/en-us/products/ai-foundry/tools/document-intelligence#Pricing)
+- For API key authentication, set the `AZURE_OCR_API_KEY` environment variable.
+- If `AZURE_OCR_API_KEY` is not set, Azure OCR falls back to the Azure identity credentials supported by `DefaultAzureCredential`.
+
 ## Run the App
 
 ```sh
