
Commit 32467ff

ZhaoyangHan04 authored
add support for fetching from arxiv links (#101)
Co-authored-by: ZhaoyangHan04 <319926404@qq.com>
1 parent 5d6d29d commit 32467ff

2 files changed: +16 -0 lines changed

docs/en/notes/guide/quickstart/knowledge_cleaning.md

Lines changed: 8 additions & 0 deletions
@@ -72,6 +72,14 @@ For detailed descriptions of each operator, refer to the "Knowledge Base Cleanin
 > ...
 > ```
 >
+>
+> Or you can just put the URLs of the papers in the JSONL file. For example:
+> ```jsonl
+> {"raw_content": "https://arxiv.org/pdf/2505.07773"}
+> {"raw_content": "https://arxiv.org/pdf/2503.09516"}
+> ...
+> ```
+>
 > Then, configure your path file `/path/to/all_pdf.jsonl` as shown below to enable batch cleaning of the knowledge base.
 >
 > ```python

docs/zh/notes/guide/quickstart/knowledge_cleaning.md

Lines changed: 8 additions & 0 deletions
@@ -74,8 +74,16 @@ python kbcleaning_pipeline_batch_sglang.py
 > ...
 > ```
 >
+> Alternatively, you can put the **URLs of the papers** directly into the following format:
+> ```jsonl
+> {"raw_content": "https://arxiv.org/pdf/2505.07773"}
+> {"raw_content": "https://arxiv.org/pdf/2503.09516"}
+> ...
+> ```
+>
 > Then configure your path file /path/to/all_pdf.jsonl as shown below to clean the knowledge base in large batches.
 >
+>
 > ```python
 > self.storage = FileStorage(
 >         first_entry_file_name="/path/to/all_pdf.jsonl",
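
The URL-based JSONL input documented in this diff can be generated with a short script. The following is a minimal sketch, not part of the commit: the helper name `write_url_jsonl` and the output file name are hypothetical, and only the `raw_content` key and the two arXiv URLs are taken from the diff.

```python
import json
from pathlib import Path


def write_url_jsonl(urls, out_path):
    """Write one {"raw_content": <url>} JSON object per line.

    Hypothetical helper for illustration; the commit itself only
    documents the resulting file format.
    """
    out_path = Path(out_path)
    with out_path.open("w", encoding="utf-8") as f:
        for url in urls:
            # Each line is an independent JSON object (JSON Lines).
            f.write(json.dumps({"raw_content": url}) + "\n")
    return out_path
```

Calling `write_url_jsonl(["https://arxiv.org/pdf/2505.07773", "https://arxiv.org/pdf/2503.09516"], "all_pdf.jsonl")` would produce a file in the format shown in the added documentation lines, which can then be referenced from `first_entry_file_name`.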
