How to skip data that has already been parsed
#4541
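When I use MinerU, I sometimes hit a problem and have to rerun a task. Some of the data has already been parsed and does not need to be parsed again, but MinerU still clears the earlier results and re-parses everything. I would like to skip that data, but I could not find a setting for it in the documentation. Is this feature currently supported? If not, where should I modify the code?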
Answered by
dosubot[bot]
Feb 16, 2026
Replies: 1 comment
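Hi @Aaaattack! I'm Dosu and I'm helping the MinerU team.

MinerU currently has no built-in way to skip files that have already been parsed; every run overwrites the existing output. To get this behavior, you need to modify the code.

Suggested code change: in the `prepare_env` function in `mineru/cli/common.py`, add a file-existence check: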
```python
import os


def prepare_env(output_dir, pdf_file_name, parse_method, skip_existing=False):
    local_md_dir = str(os.path.join(output_dir, pdf_file_name, parse_method))
    local_image_dir = os.path.join(str(local_md_dir), "images")
    # Check whether this file has already been parsed
    md_file = os.path.join(local_md_dir, f"{pdf_file_name}.md")
    if skip_existing and os.path.exists(md_file):
        return None, None  # None signals that this file should be skipped
    os.makedirs(local_image_dir, exist_ok=True)
    os.makedirs(local_md_dir, exist_ok=True)
    return local_image_dir, local_md_dir
```
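Returning `None, None` only signals the skip; whatever calls `prepare_env` still has to check for it before parsing. Below is a minimal sketch of that caller-side check. The `parse_all` loop and the `parse_one` callback are hypothetical stand-ins for MinerU's actual parsing code, and neither name comes from MinerU:

```python
import os


def prepare_env(output_dir, pdf_file_name, parse_method, skip_existing=False):
    # Same logic as the suggested change above, repeated so the sketch runs alone.
    local_md_dir = os.path.join(output_dir, pdf_file_name, parse_method)
    local_image_dir = os.path.join(local_md_dir, "images")
    md_file = os.path.join(local_md_dir, f"{pdf_file_name}.md")
    if skip_existing and os.path.exists(md_file):
        return None, None
    # Creating the images directory also creates local_md_dir.
    os.makedirs(local_image_dir, exist_ok=True)
    return local_image_dir, local_md_dir


def parse_all(pdf_names, output_dir, parse_method, parse_one, skip_existing=False):
    """Hypothetical driver loop: parse each file unless its output already exists."""
    parsed, skipped = [], []
    for name in pdf_names:
        image_dir, md_dir = prepare_env(output_dir, name, parse_method, skip_existing)
        if md_dir is None:
            # prepare_env found an existing .md file, so leave this one alone.
            skipped.append(name)
            continue
        parse_one(name, image_dir, md_dir)
        parsed.append(name)
    return parsed, skipped
```

One caveat with this approach: the check only looks for the `.md` file, so a file whose `.md` was written but whose other outputs are incomplete would also be skipped.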
```python
@click.option(
    '--skip-existing',
    is_flag=True,
    default=False,
    help='Skip files that have already been parsed',
)
```

After this change, at runtime add …
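For the flag to have any effect, its value has to reach `prepare_env`. Here is a minimal sketch of how a click flag becomes a function parameter. The `main` command is a stand-in rather than MinerU's real entry point, and the command body and the `-o` option are assumptions made for illustration:

```python
import click


@click.command()
@click.option('-o', '--output', 'output_dir', default='output', help='Output directory')
@click.option(
    '--skip-existing',
    'skip_existing',
    is_flag=True,
    default=False,
    help='Skip files that have already been parsed',
)
def main(output_dir, skip_existing):
    # click passes the flag in as a boolean keyword argument; a real command
    # would forward it, e.g. prepare_env(..., skip_existing=skip_existing).
    click.echo(f"output_dir={output_dir} skip_existing={skip_existing}")
```

Because the option is declared with `is_flag=True`, it takes no value on the command line: it is `False` when omitted and `True` when present.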
Answer selected by
Aaaattack