|
1362 | 1362 | ''') |
1363 | 1363 |
|
1364 | 1364 | add_chinese_doc('rag.readers.MineruPDFReader', '''\ |
1365 | | -用于通过 MinerU 服务解析 PDF 文件内容的模块。支持上传文件或通过 URL 方式调用解析接口,解析结果经过回调函数处理成文档节点列表。 |
1366 | | -
|
1367 | | -Args: |
1368 | | - url (str): MineruPDFReader 服务的接口 URL。 |
1369 | | - upload_mode (bool): 是否采用文件上传模式调用接口,默认为 False,即通过 JSON 请求文件路径。 |
1370 | | - extract_table (bool): 是否提取表格,默认为 True。 |
1371 | | - extract_formula (bool): 是否提取公式,默认为 True。 |
1372 | | - split_doc (bool): 是否分割文档,默认为 True。 |
1373 | | - post_func (Optional[Callable]): 后处理函数。 |
| 1365 | +基于Mineru服务的PDF解析器,通过调用Mineru服务的API来解析PDF文件,支持丰富的文档结构识别。 |
| 1366 | +
|
| 1367 | +Args: |
| 1368 | + url (str): Mineru服务的完整API端点URL。 |
| 1369 | + backend (str, optional): 解析引擎类型。可选值: |
| 1370 | + - 'pipeline': 标准处理流水线 |
| 1371 | + - 'vlm-transformers': 基于Transformers的视觉语言模型 |
| 1372 | + - 'vlm-vllm-async-engine': 基于异步VLLM的视觉语言模型 |
| 1373 | + 默认为 'pipeline'。 |
| 1374 | + upload_mode (bool, optional): 文件传输模式。 |
| 1375 | + - True: 使用multipart/form-data上传文件内容 |
| 1376 | + - False: 通过文件路径传递(需确保服务端可访问该路径) |
| 1377 | + 默认为 False。 |
| 1378 | + extract_table (bool, optional): 是否提取表格内容并转换为Markdown格式。默认为 True。 |
| 1379 | + extract_formula (bool, optional): 是否提取公式文本。 |
| 1380 | + - True: 提取为LaTeX等文本格式 |
| 1381 | + - False: 将公式保留为图片 |
| 1382 | + 默认为 True。 |
| 1383 | + split_doc (bool, optional): 是否将文档分割为多个DocNode节点。默认为 True。 |
| 1384 | + clean_content (bool, optional): 是否清理冗余内容(页眉、页脚、页码等)。默认为 True。 |
| 1385 | + post_func (Optional[Callable[[List[DocNode]], Any]], optional): 后处理函数, |
| 1386 | + 接收DocNode列表作为参数,用于自定义结果处理。默认为 None。 |
1374 | 1387 | ''') |
1375 | 1388 |
|
1376 | 1389 | add_english_doc('rag.readers.MineruPDFReader', '''\ |
1377 | | -Module to parse PDF content via the MineruPDFReader service. Supports file upload or URL-based parsing, with a callback to process the parsed elements into document nodes. |
1378 | | -
|
1379 | | -Args: |
1380 | | - url (str): The MineruPDFReader service API URL. |
1381 | | - upload_mode (bool): Whether to use file upload mode for the API call. Default is False, meaning JSON request with file path. |
1382 | | - extract_table (bool): Whether to extract tables. Default is True. |
1383 | | - extract_formula (bool): Whether to extract formulas. Default is True. |
1384 | | - split_doc (bool): Whether to split the document. Default is True. |
1385 | | - post_func (Optional[Callable]): Post-processing function. |
| 1390 | +PDF reader that parses files by calling the Mineru service's API, with support for rich document structure recognition. |
| 1391 | +
|
| 1392 | +Args: |
| 1393 | + url (str): The complete API endpoint URL for the Mineru service. |
| 1394 | + backend (str, optional): Type of parsing engine. Available options: |
| 1395 | + - 'pipeline': Standard processing pipeline |
| 1396 | + - 'vlm-transformers': Vision-language model based on Transformers |
| 1397 | + - 'vlm-vllm-async-engine': Vision-language model based on async VLLM engine |
| 1398 | + Defaults to 'pipeline'. |
| 1399 | + upload_mode (bool, optional): File transfer mode. |
| 1400 | + - True: Upload file content using multipart/form-data |
| 1401 | + - False: Pass by file path (ensure the server can access the path) |
| 1402 | + Defaults to False. |
| 1403 | + extract_table (bool, optional): Whether to extract table content and convert |
| 1404 | + to Markdown format. Defaults to True. |
| 1405 | + extract_formula (bool, optional): Whether to extract formula text. |
| 1406 | + - True: Extract as text format (e.g., LaTeX) |
| 1407 | + - False: Keep formulas as images |
| 1408 | + Defaults to True. |
| 1409 | + split_doc (bool, optional): Whether to split the document into multiple |
| 1410 | + DocNodes. Defaults to True. |
| 1411 | + clean_content (bool, optional): Whether to clean redundant content |
| 1412 | + (headers, footers, page numbers, etc.). Defaults to True. |
| 1413 | + post_func (Optional[Callable[[List[DocNode]], Any]], optional): Post-processing |
| 1414 | + function that takes a list of DocNodes as input for custom result handling. |
| 1415 | + Defaults to None. |
1386 | 1416 | ''') |
1387 | 1417 |
|
| 1418 | + |
1388 | 1419 | add_chinese_doc('rag.readers.MarkdownReader', '''\ |
1389 | 1420 | 用于读取和解析 Markdown 文件的模块。支持去除超链接和图片,按标题和内容将 Markdown 划分成若干文本段落节点。 |
1390 | 1421 |
|
@@ -2864,6 +2895,11 @@ def _lazy_load_data(self, file_paths: list, **kwargs) -> Iterable[DocNode]: |
2864 | 2895 | documents = reader.forward(file_paths=["doc1.txt", "doc2.txt"]) |
2865 | 2896 | ''') |
2866 | 2897 |
|
| 2898 | +add_example('rag.readers.MineruPDFReader', '''\ |
| 2899 | +from lazyllm.tools.rag.readers import MineruPDFReader |
| 2900 | +reader = MineruPDFReader("http://0.0.0.0:8888") # Mineru server address |
| 2901 | +nodes = reader("path/to/pdf") |
| 2902 | +''') |
2867 | 2903 |
|
2868 | 2904 | add_chinese_doc('rag.doc_node.QADocNode', '''\ |
2869 | 2905 | 问答文档节点类,用于存储问答对数据。 |
|
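The new docstring describes `post_func` as a callable that receives the list of DocNodes, but the added example does not exercise it. Below is a minimal sketch of what such a callable might look like. `StubNode` and `drop_short_nodes` are hypothetical names used only for illustration, and the sketch assumes nothing about DocNode beyond a `text` attribute. The reader call is left commented out because it needs a running Mineru server.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class StubNode:
    """Hypothetical stand-in for DocNode; only a `text` attribute is assumed."""
    text: str


def drop_short_nodes(nodes: List[StubNode], min_chars: int = 5) -> List[StubNode]:
    """Keep only nodes whose stripped text has at least `min_chars` characters."""
    return [n for n in nodes if len(n.text.strip()) >= min_chars]


# Exercise the callable on stub data, without any server.
sample = [
    StubNode("1 Introduction"),
    StubNode("  "),
    StubNode("p. 3"),
    StubNode("E = mc^2 holds."),
]
kept = drop_short_nodes(sample)
print([n.text for n in kept])

# With a reachable Mineru server, the same callable could be passed as post_func
# (parameter names taken from the docstring in this diff; not executed here):
# from lazyllm.tools.rag.readers import MineruPDFReader
# reader = MineruPDFReader(url="http://0.0.0.0:8888", backend="pipeline",
#                          post_func=drop_short_nodes)
# nodes = reader("path/to/pdf")
```

Keeping the filter a plain function over a list makes it testable without the service. How the reader uses the callable's return value is not specified in the docstring, so this sketch returns a list of nodes as a conservative choice.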