Replies: 3 comments 4 replies
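MinerU does not currently support parsing Word documents (.doc and .docx) directly; uploading a DOCX file produces an unsupported-format error. The officially recommended approach is to convert the Word document to PDF first with a tool such as LibreOffice, and then parse the PDF with MinerU; see the official discussion for details.

If you need to extract a Word document's content and styles directly (font, font size, heading level, indentation, and so on), use a dedicated parsing library such as python-docx (Python), Apache POI (Java), Aspose.Words (commercial), or docx4j (Java). For example, to extract paragraph and style information with python-docx:

    from docx import Document  # python-docx; reads .docx, not legacy .doc

    doc = Document('your.docx')
    for para in doc.paragraphs:
        # Only the first run is inspected; font values are None when
        # they are inherited from the paragraph style.
        print({
            "text": para.text,
            "style": para.style.name,
            "font": para.runs[0].font.name if para.runs else None,
            "size": para.runs[0].font.size.pt if para.runs and para.runs[0].font.size else None,
            "indent": para.paragraph_format.left_indent.pt if para.paragraph_format.left_indent else None
        })

To convert Word content to Markdown or JSON, you can build a custom export on top of these tools. For more details and tool recommendations, see the related discussions.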
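As a rough illustration of the custom export mentioned above, here is a minimal sketch that turns python-docx paragraphs into Markdown or JSON. The function names (`heading_level`, `paragraphs_to_markdown`, `paragraphs_to_json`, `read_paragraphs`) are hypothetical. It also assumes the document uses the built-in English style names "Heading 1" to "Heading 9" and "List Bullet"; documents with custom or localized style names would need a different mapping.

```python
import json
import re


def heading_level(style_name):
    """Return 1-9 for a built-in 'Heading N' style name, else None."""
    match = re.fullmatch(r"Heading ([1-9])", style_name or "")
    return int(match.group(1)) if match else None


def paragraphs_to_markdown(paragraphs):
    """Render (style_name, text) pairs as a Markdown string."""
    lines = []
    for style_name, text in paragraphs:
        text = text.strip()
        if not text:
            continue  # skip empty paragraphs
        level = heading_level(style_name)
        if level:
            lines.append("#" * level + " " + text)
        elif style_name and style_name.startswith("List Bullet"):
            lines.append("- " + text)
        else:
            lines.append(text)
    return "\n\n".join(lines)


def paragraphs_to_json(paragraphs):
    """Serialize (style_name, text) pairs as a JSON array of objects."""
    records = [{"style": style, "text": text} for style, text in paragraphs]
    return json.dumps(records, ensure_ascii=False, indent=2)


def read_paragraphs(path):
    """Load (style_name, text) pairs from a .docx file via python-docx."""
    from docx import Document  # imported lazily; needs `pip install python-docx`

    return [(p.style.name, p.text) for p in Document(path).paragraphs]
```

A typical call would be `print(paragraphs_to_markdown(read_paragraphs("your.docx")))`. Tables, images, and inline formatting such as bold or italics are ignored in this sketch.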
Converting DOCX to PDF strips the structural layout information from the original document, which the MinerU pipeline then has to reconstruct through computationally expensive methods. That round trip is inefficient. I suggest using a specialized library such as markitdown to convert DOCX directly to Markdown instead.
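For reference, a minimal sketch of the markitdown route might look like the following. It assumes the package is installed with DOCX support (for example `pip install "markitdown[docx]"`) and uses the `MarkItDown().convert(...)` call and `text_content` attribute as documented in the project README at the time of writing. The helper names `markdown_path_for` and `docx_to_markdown` are hypothetical.

```python
from pathlib import Path


def markdown_path_for(docx_path):
    """Hypothetical helper: put the .md file next to the source .docx."""
    return Path(docx_path).with_suffix(".md")


def docx_to_markdown(docx_path):
    """Convert a .docx file to Markdown and return the output path."""
    # Imported lazily so the helper above works without markitdown installed.
    from markitdown import MarkItDown

    result = MarkItDown().convert(str(docx_path))
    out_path = markdown_path_for(docx_path)
    out_path.write_text(result.text_content, encoding="utf-8")
    return out_path
```

Calling `docx_to_markdown("your.docx")` would write `your.md` alongside the input. This skips the PDF step entirely, so headings and lists come from the DOCX structure rather than from layout analysis.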
Update to version 3.0 and MinerU can parse DOCX documents.
How can I parse Word documents (.doc and .docx) with MinerU?