Is there a marker in the parsed text results that shows whether the text went through OCR? #4556

yinghao-xue · 2026-02-27T09:16:03Z


In the list of text extracted from a PDF file, is there a way to tell which text was already text in the original PDF and which was recognized by OCR?

Answered by dosubot[bot]

Feb 27, 2026

Hi @yinghao-xue! I'm Dosu and I’m helping the MinerU team.

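Yes, MinerU's output includes a marker that distinguishes native PDF text from OCR-recognized text.

The distinction is made mainly through the category_id field:

category_id: 15 - text recognized by OCR (OcrText)
category_id: 16 - low-confidence OCR text (LowScoreText, used when the confidence is below 0.5)
Other category_id values (0-14) - native PDF text and other content types (e.g. 0 = title, 1 = plain text)

In addition, OCR-recognized text carries a score field holding its confidence score.

Note: these markers are most complete in the model.json output file. The final Markdown and the simplified content_list.json output may merge all text sources and no longer distinguish between them. If you need to tell text sources apart, use the model.json file.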
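
As a rough illustration of the filtering described above, here is a minimal Python sketch. It assumes model.json is a list of pages, each with a "layout_dets" list whose entries carry "category_id", "score", and (for text entries) "text". Those key names and the helper names split_by_ocr and load_model_json are assumptions made for this example, so check them against your own model.json before relying on it.

```python
import json
from pathlib import Path

# Category ids as described in the answer above.
OCR_TEXT = 15        # OcrText
LOW_SCORE_TEXT = 16  # LowScoreText (confidence below 0.5)


def split_by_ocr(pages):
    """Split detections into OCR-tagged entries and everything else.

    `pages` is the parsed content of model.json. The "layout_dets",
    "category_id", "score", and "text" keys are assumed names.
    """
    ocr, other = [], []
    for page_index, page in enumerate(pages):
        for det in page.get("layout_dets", []):
            entry = {
                "page": page_index,
                "category_id": det.get("category_id"),
                "score": det.get("score"),
                "text": det.get("text", ""),
            }
            if entry["category_id"] in (OCR_TEXT, LOW_SCORE_TEXT):
                ocr.append(entry)
            else:
                other.append(entry)
    return ocr, other


def load_model_json(path):
    """Read and parse a model.json file (the path is supplied by the caller)."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


# Small hand-written sample in the assumed shape, for demonstration only.
sample = [
    {
        "layout_dets": [
            {"category_id": 1, "score": 0.98},
            {"category_id": 15, "score": 0.93, "text": "scanned line"},
            {"category_id": 16, "score": 0.41, "text": "blurry line"},
        ]
    }
]

ocr_items, other_items = split_by_ocr(sample)
for item in ocr_items:
    print(item["page"], item["category_id"], item["score"], item["text"])
```

For a real file, replace the sample with load_model_json("path/to/model.json"), where the path is wherever your run wrote its output.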


@yinghao-xue · 2026-02-27T09:18:58Z


1 reply

yinghao-xue Feb 28, 2026
Author

Thanks a lot, I'll give it a try.
