When processing a text-based PDF, can table text be extracted directly? #4664

Reiji777 · 2026-03-26T13:13:54Z

The parameters I'm currently using are:
"backend": "hybrid-auto-engine",
"parse_method": "auto",
"formula_enable": False,
"table_enable": True,

I've found that the text in the extracted tables comes from OCR rather than being extracted directly from the PDF.

@Reiji777 · 2026-03-26T13:16:42Z

dosubot[bot]
bot Mar 26, 2026

Hi @Reiji777! I'm Dosu and I’m helping the MinerU team.

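Yes, MinerU supports extracting table text directly from text-based PDFs without using OCR.

The problem you're seeing arises because, with parse_method='auto', MinerU decides automatically whether a PDF needs OCR, and it may mistakenly apply OCR to a text-based PDF.

Solution: change parse_method to 'txt' to force direct text extraction:

"backend": "hybrid-auto-engine",
"parse_method": "txt",   # force direct text extraction, no OCR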
"formula_enable": False,
"table_enable": True,

Valid values for parse_method:

'auto': automatic detection (default)
'txt': force direct text extraction
'ocr': force OCR
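
As a rough illustration of the three values listed above, here is a minimal sketch that checks a parse_method value before it goes into the options. The helper `check_parse_method` and the `options` dict are hypothetical names used only for this example; they are not part of MinerU.

```python
# Hypothetical helper, not part of MinerU: validates a parse_method value
# against the three options listed in this reply.
VALID_PARSE_METHODS = ("auto", "txt", "ocr")


def check_parse_method(value: str) -> str:
    """Return the value unchanged if it is a listed option, else raise."""
    if value not in VALID_PARSE_METHODS:
        raise ValueError(
            f"parse_method must be one of {VALID_PARSE_METHODS}, got {value!r}"
        )
    return value


# Same keys as in the question, with parse_method switched to "txt".
options = {
    "backend": "hybrid-auto-engine",
    "parse_method": check_parse_method("txt"),
    "formula_enable": False,
    "table_enable": True,
}
```

A typo such as "text" would raise a ValueError here instead of being passed along silently.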

Reference: related discussion

2 replies

Reiji777 Mar 26, 2026
Author

@dosu I tested setting it to txt, but the text in the tables is still coming from OCR.

dosubot[bot] bot Mar 26, 2026

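Thanks for the feedback. This happens because, when the hybrid-auto-engine backend runs in VLM mode, VLM mode cannot extract the PDF's native text layer, so OCR is used even when parse_method='txt' is set.

Solution: set an environment variable to force pipeline mode: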

export MINERU_HYBRID_FORCE_PIPELINE_ENABLE=1

This environment variable disables the automatic switch to VLM OCR and forces pipeline mode for text extraction.
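
If MinerU is launched from a Python script rather than a shell, the same variable can be set through `os.environ`. This is a minimal sketch that assumes MinerU reads the variable from the process environment when it runs, which is not confirmed in this thread. The child process below is only a stand-in that prints the value, so the example does not call MinerU itself.

```python
import os
import subprocess
import sys

# Variable name taken from the reply above. Assumption: it must be set
# before MinerU is imported or started for it to be picked up.
os.environ["MINERU_HYBRID_FORCE_PIPELINE_ENABLE"] = "1"

# Stand-in child process: shows that anything started from this script
# inherits the variable. Replace with the real MinerU invocation.
result = subprocess.run(
    [
        sys.executable,
        "-c",
        "import os; print(os.environ.get('MINERU_HYBRID_FORCE_PIPELINE_ENABLE'))",
    ],
    capture_output=True,
    text=True,
    check=True,
)
print(result.stdout.strip())
```

Setting the variable in the script keeps the workaround next to the code that needs it, whereas `export` only affects the shell session it is run in.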

Alternatively, you can use the pipeline backend directly instead of hybrid-auto-engine:

"backend": "pipeline",
"parse_method": "txt",
"table_enable": True,

Try both approaches and see whether the table text is now extracted directly from the PDF.
