@fatty-belly fatty-belly commented Jan 16, 2026

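Substantially refactored the PDF2VQA pipeline to reuse existing dataflow operators.

Bug fixes:

  1. When no questions are recognized, the pipeline now writes an empty output file instead of raising an error.
  2. Improved the chapter-matching logic for question-answer pairs.
  3. Fixed the file path in the pipeline example.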
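
The first fix (empty output instead of an error) can be illustrated with a minimal sketch. The function name `write_qa_pairs` and the JSON Lines format are assumptions made for illustration; the actual pipeline writes through its own storage layer, which is not shown here.

```python
import json
import tempfile
from pathlib import Path


def write_qa_pairs(qa_pairs, output_path):
    """Hypothetical helper: write QA pairs as JSON Lines.

    An empty list is not treated as an error; it simply yields an empty file.
    """
    path = Path(output_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as f:
        for pair in qa_pairs:
            f.write(json.dumps(pair, ensure_ascii=False) + "\n")
    return path


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        out = write_qa_pairs([], Path(tmp) / "qa.jsonl")
        print(out.exists(), out.stat().st_size)
```

Downstream steps can then treat a zero-length file as "no questions found" rather than having to catch an exception.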

@fatty-belly fatty-belly changed the title from "Some fixes for Pdf2vqa" to "PDF2VQA refactor" Jan 19, 2026
# Save the updated dataframe to the output file
output_file = storage.write(dataframe)
return output_key
class ChunkedPromptedGenerator(OperatorABC):
Contributor

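I think ChunkedPromptedGenerator could live in its own file, chunked_prompted_generator. We generally put one class per file, with the class name and file name nearly identical (snake_case for the file, CamelCase for the class).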
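
The convention in this comment (one class per file, snake_case file name, CamelCase class name) can be checked mechanically. A minimal sketch, assuming hypothetical helpers `snake_to_camel` and `matches_convention` that are not part of the repository:

```python
def snake_to_camel(module_name: str) -> str:
    """Convert a snake_case module name to the CamelCase class name it should contain."""
    return "".join(part.capitalize() for part in module_name.split("_") if part)


def matches_convention(module_name: str, class_name: str) -> bool:
    """Return True when the class name is the CamelCase form of the module name."""
    return snake_to_camel(module_name) == class_name


if __name__ == "__main__":
    print(snake_to_camel("chunked_prompted_generator"))
    print(matches_convention("chunked_prompted_generator", "ChunkedPromptedGenerator"))
```

Note that this simple conversion lowercases everything after the first letter of each part, so acronyms would need special handling if class names keep them in upper case.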

Contributor Author

Moved to a separate file.

else:
mid = len(text) // 2
left, right = text[:mid], text[mid:]
return self._split_recursive(left) + self._split_recursive(right)
Contributor

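Why use recursive bisection here? Could we just split directly by chunk_len? Also, please document what chunk_len measures: characters or tokens. 韩朝阳's operator counts tokens using the Qwen tokenizer.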

Contributor Author

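  1. Bisection needs fewer tokenizer calls. Otherwise we would have to advance one character at a time and call the tokenizer each time to measure the length.
  2. chunk_len is currently a token count. Any tokenizer that supports the len(self.enc.encode(text)) pattern works, for example the commonly used tiktoken or AutoTokenizer. The current default is tiktoken.get_encoding("cl100k_base").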
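
The bisection approach described in this reply can be sketched as follows. This is an illustration under stated assumptions, not the merged implementation: `CharEncoder` is a stand-in tokenizer (one token per character) so the example runs without downloading an encoding, the class name `RecursiveSplitter` is hypothetical, and the base case and the single-character guard are assumptions because the quoted diff shows only the `else` branch.

```python
class CharEncoder:
    """Stand-in tokenizer (assumption): one token per character.

    Any object with an encode(text) method returning a sequence could be
    swapped in, e.g. tiktoken.get_encoding("cl100k_base").
    """

    def encode(self, text):
        return list(text)


class RecursiveSplitter:
    """Hypothetical wrapper around the bisection logic quoted in the diff."""

    def __init__(self, enc, chunk_len):
        self.enc = enc
        self.chunk_len = chunk_len  # maximum number of tokens per chunk

    def _token_count(self, text):
        return len(self.enc.encode(text))

    def _split_recursive(self, text):
        # Base case (assumed): the text fits in one chunk, or it cannot be
        # split further. The length guard prevents infinite recursion when a
        # single character already exceeds chunk_len.
        if self._token_count(text) <= self.chunk_len or len(text) <= 1:
            return [text]
        else:
            mid = len(text) // 2
            left, right = text[:mid], text[mid:]
            return self._split_recursive(left) + self._split_recursive(right)


if __name__ == "__main__":
    splitter = RecursiveSplitter(CharEncoder(), chunk_len=4)
    print(splitter._split_recursive("abcdefghij"))
```

Each level of recursion calls the tokenizer once per piece, so the number of calls grows with the number of resulting chunks rather than with the number of characters. The trade-off is that chunks are not packed as tightly as a greedy split by chunk_len would pack them.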

self.qa_merger = QA_Merger(output_dir="./cache", strict_title_match=False)
def forward(self):
# Single operator: covers preprocessing, QA extraction, and post-processing
self.mineru_executor.run(
Contributor

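I understand this part now. Still, it would be best to add a comment explaining why this runs twice: the question and the answer go through the same operations. Try to keep it user-friendly. After all, users do not know what this new operator/pipeline is.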

Contributor Author

Comment added.

haolpku commented Jan 20, 2026

Finally, I suggest updating the pipeline and operator docs soon as a follow-up.

Contributor

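This operator converts formats. If it is going to exist as an operator, it should also follow our operator naming convention: for example, name the file mineru_to_llm_formatter and give the class the same name in CamelCase.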

@fatty-belly fatty-belly merged commit d250586 into OpenDCAI:main Jan 20, 2026
9 checks passed
@fatty-belly fatty-belly deleted the pdf2vqa_dev branch January 20, 2026 08:10