PDF2VQA refactor #443
| # Save the updated dataframe to the output file | ||
| output_file = storage.write(dataframe) | ||
| return output_key | ||
| class ChunkedPromptedGenerator(OperatorABC): |
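I think ChunkedPromptedGenerator could get its own file, chunked_prompted_generator. We generally put only one class in each file, and the class name and file name are nearly identical (snake_case for the file, CamelCase for the class).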
Moved to a separate file.
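The naming convention discussed in this thread (snake_case file name, CamelCase class name) can be illustrated with a small sketch. The helper `camel_to_snake` is hypothetical and not part of the repository; it only handles plain CamelCase names, not names with acronyms or underscores.

```python
import re


def camel_to_snake(class_name: str) -> str:
    """Derive a snake_case module name from a plain CamelCase class name.

    Hypothetical helper for illustration only. It inserts an underscore
    before every uppercase letter except the first, then lowercases.
    """
    return re.sub(r"(?<!^)(?=[A-Z])", "_", class_name).lower()


# File name that would match the class under this convention.
module_name = camel_to_snake("ChunkedPromptedGenerator")
print(module_name + ".py")  # chunked_prompted_generator.py
```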
| else: | ||
| mid = len(text) // 2 | ||
| left, right = text[:mid], text[mid:] | ||
| return self._split_recursive(left) + self._split_recursive(right) |
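Why use recursive bisection here? Could we just cut the text directly by chunk_len? Also, please document what chunk_len means: is it a character count or a token count? Han Chaoyang's operator measures length in tokens using the qwen tokenizer.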
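- Bisection keeps the number of tokenizer calls low. Otherwise we would have to advance one character at a time and call the tokenizer at each step to measure the length.
- chunk_len is currently a token count. Any tokenizer that supports the form len(self.enc.encode(text)) works, for example the commonly used tiktoken or AutoTokenizer. The current default is tiktoken.get_encoding("cl100k_base").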
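To make the reply concrete, here is a minimal sketch of token-bounded recursive bisection. It is not the PR's actual implementation: the class `RecursiveSplitter` and the stand-in `CharEncoder` (one token per character) are assumptions made so the example runs without extra dependencies. Only the `len(self.enc.encode(text))` check and the halving step come from the thread.

```python
class CharEncoder:
    """Hypothetical stand-in tokenizer: treats every character as one token."""

    def encode(self, text):
        return list(text)


class RecursiveSplitter:
    """Illustrative splitter; any object with an encode(text) method can be passed as enc."""

    def __init__(self, enc, chunk_len):
        self.enc = enc
        self.chunk_len = chunk_len  # maximum number of tokens per chunk

    def _split_recursive(self, text):
        # One tokenizer call per recursion node, instead of one per character.
        # The len(text) <= 1 guard stops recursion on text that cannot be halved.
        if len(self.enc.encode(text)) <= self.chunk_len or len(text) <= 1:
            return [text]
        mid = len(text) // 2
        left, right = text[:mid], text[mid:]
        return self._split_recursive(left) + self._split_recursive(right)


# With tiktoken installed, tiktoken.get_encoding("cl100k_base") could be
# passed as enc instead of the stand-in encoder.
splitter = RecursiveSplitter(CharEncoder(), chunk_len=4)
text = "abcdefghij"
chunks = splitter._split_recursive(text)
print(chunks)
```

The trade-off in this sketch is that chunks can end up noticeably shorter than chunk_len, because the text is always halved by character position rather than cut at the token limit.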
| self.qa_merger = QA_Merger(output_dir="./cache", strict_title_match=False) | ||
| def forward(self): | ||
| # Single operator: covers preprocessing, QA extraction, and post-processing | ||
| self.mineru_executor.run( |
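I understand this part now. Still, it would be better to add a comment explaining why it runs twice: the question and the answer go through the same operation. Please keep it as user-friendly as possible, since users will not know what this new operator/pipeline does.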
Comment added.
Finally, I suggest updating the pipeline and operator docs as soon as possible.
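This operator converts formats. If it is going to exist as an operator, it should follow our operator naming convention too: for example, name the file mineru_to_llm_formatter and give the class the same name in CamelCase.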
Substantially refactored the PDF2VQA pipeline to reuse existing dataflow operators.
Bug fixes: