Hi @Aaaattack! I'm Dosu and I’m helping the MinerU team.

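MinerU currently has no built-in option to skip files that have already been parsed; every run overwrites the existing output. To get this behavior, you need to modify the code.

Suggested code changes:

  1. Modify the prepare_env function in mineru/cli/common.py to add a file-existence check: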
def prepare_env(output_dir, pdf_file_name, parse_method, skip_existing=False):
    local_md_dir = str(os.path.join(output_dir, pdf_file_name, parse_method))
    local_image_dir = os.path.join(str(local_md_dir), "images")
    
    # Check whether this file has already been parsed
    md_file = os.path.join(local_md_dir, f"{pdf_file_name}.md")
    if skip_existing and os.path.exists(md_file):
        return None, None  # Returning None signals that this file should be skipped
    
    os.makedirs(local_image_dir, exist_ok=True)
    os.makedirs(local_md_dir, exist_ok=True)
    return local_image_dir, 
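
To try the same existence check outside MinerU, here is a minimal standalone sketch. The helper names `is_already_parsed` and `select_pending` are hypothetical and not part of MinerU; the sketch only assumes the output layout used in the snippet above (`<output_dir>/<pdf_file_name>/<parse_method>/<pdf_file_name>.md`), and `"auto"` is just an example value for `parse_method`.

```python
import os
import tempfile


def is_already_parsed(output_dir, pdf_file_name, parse_method):
    """Hypothetical helper: True if the Markdown output for this file exists.

    Assumes the layout from the snippet above:
    <output_dir>/<pdf_file_name>/<parse_method>/<pdf_file_name>.md
    """
    md_file = os.path.join(
        output_dir, pdf_file_name, parse_method, f"{pdf_file_name}.md"
    )
    return os.path.isfile(md_file)


def select_pending(output_dir, pdf_file_names, parse_method):
    """Hypothetical helper: keep only names that have no Markdown output yet."""
    return [
        name
        for name in pdf_file_names
        if not is_already_parsed(output_dir, name, parse_method)
    ]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as out:
        # Simulate a previous run that already produced a/auto/a.md
        done_dir = os.path.join(out, "a", "auto")
        os.makedirs(done_dir)
        open(os.path.join(done_dir, "a.md"), "w").close()

        # "a" is filtered out, "b" still needs parsing
        print(select_pending(out, ["a", "b"], "auto"))
```

Filtering the input list before parsing starts keeps the skip logic in one place, so the code that calls prepare_env would not need to handle a `None, None` return. Whether that suits your setup depends on how you invoke MinerU.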

Answer selected by Aaaattack