add support for fetching from arxiv links (#101)

ZhaoyangHan04 · ZhaoyangHan04 · web-flow · commit 32467ff40427 · 2025-07-24T11:42:41.000+08:00
Co-authored-by: ZhaoyangHan04 <319926404@qq.com>
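
For reviewers who want to try the URL-based input this commit documents, here is a minimal sketch of generating such a JSONL file. It assumes only what the docs show: one JSON object per line with a `raw_content` key holding the paper URL. The helper name `write_url_jsonl` and the output file name are hypothetical and not part of the repository.

```python
import json
from pathlib import Path


def write_url_jsonl(urls, out_path):
    """Write one {"raw_content": url} object per line.

    Hypothetical helper for illustration; not part of the repository.
    """
    out_path = Path(out_path)
    with out_path.open("w", encoding="utf-8") as f:
        for url in urls:
            # Each line is a standalone JSON object, as in the documented format.
            f.write(json.dumps({"raw_content": url}) + "\n")
    return out_path


if __name__ == "__main__":
    # The two example URLs are taken from the documentation change in this commit.
    paper_urls = [
        "https://arxiv.org/pdf/2505.07773",
        "https://arxiv.org/pdf/2503.09516",
    ]
    write_url_jsonl(paper_urls, "all_pdf.jsonl")
```

The resulting file can then be passed as `first_entry_file_name`, in the same way as the `/path/to/all_pdf.jsonl` path in the docs.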
diff --git a/docs/en/notes/guide/quickstart/knowledge_cleaning.md b/docs/en/notes/guide/quickstart/knowledge_cleaning.md
@@ -72,6 +72,14 @@ For detailed descriptions of each operator, refer to the "Knowledge Base Cleanin
 > ...
 > ```
 >
+>
+> Alternatively, you can list the paper URLs directly in the JSONL file. For example:
+> ```jsonl
+> {"raw_content": "https://arxiv.org/pdf/2505.07773"}
+> {"raw_content": "https://arxiv.org/pdf/2503.09516"}
+> ...
+> ```
+>
 > Then, configure your path file `/path/to/all_pdf.jsonl` as shown below to enable batch cleaning of the knowledge base.
 >
 > ```python
diff --git a/docs/zh/notes/guide/quickstart/knowledge_cleaning.md b/docs/zh/notes/guide/quickstart/knowledge_cleaning.md
@@ -74,8 +74,16 @@ python kbcleaning_pipeline_batch_sglang.py
 > ...
 > ```
 >
+> Alternatively, you can list the **paper URLs** directly in the following format:
+> ```jsonl
+> {"raw_content": "https://arxiv.org/pdf/2505.07773"}
+> {"raw_content": "https://arxiv.org/pdf/2503.09516"}
+> ...
+> ```
+>
 > Then configure your path file /path/to/all_pdf.jsonl as shown below to enable large-scale batch cleaning of the knowledge base.
 >
+>
 > ```python
 > self.storage = FileStorage(
 >     first_entry_file_name="/path/to/all_pdf.jsonl",