docs: update README

ChenZiHong-Gavin · ChenZiHong-Gavin · commit e95165526bd6 · 2025-12-01T19:32:54.000+08:00
diff --git a/README.md b/README.md
@@ -62,13 +62,14 @@ After data generation, you can use [LLaMA-Factory](https://github.com/hiyouga/LL
 
 ## 📌 Latest Updates
 
+- **2025.12.1**: Added search support for [NCBI](https://www.ncbi.nlm.nih.gov/) and [RNAcentral](https://rnacentral.org/) databases, enabling extraction of DNA and RNA data from these bioinformatics databases.
 - **2025.10.30**: We support several new LLM clients and inference backends including [Ollama_client](https://github.com/open-sciencelab/GraphGen/blob/main/graphgen/models/llm/api/ollama_client.py), [http_client](https://github.com/open-sciencelab/GraphGen/blob/main/graphgen/models/llm/api/http_client.py), [HuggingFace Transformers](https://github.com/open-sciencelab/GraphGen/blob/main/graphgen/models/llm/local/hf_wrapper.py) and [SGLang](https://github.com/open-sciencelab/GraphGen/blob/main/graphgen/models/llm/local/sglang_wrapper.py).
 - **2025.10.23**: We support VQA(Visual Question Answering) data generation now. Run script: `bash scripts/generate/generate_vqa.sh`.
-- **2025.10.21**: We support PDF as input format for data generation now via [MinerU](https://github.com/opendatalab/MinerU).
 
 <details>
 <summary>History</summary>
 
+- **2025.10.21**: We support PDF as input format for data generation now via [MinerU](https://github.com/opendatalab/MinerU).
 - **2025.09.29**: We auto-update gradio demo on [Hugging Face](https://huggingface.co/spaces/chenzihong/GraphGen) and [ModelScope](https://modelscope.cn/studios/chenzihong/GraphGen).
 - **2025.08.14**: We have added support for community detection in knowledge graphs using the Leiden algorithm, enabling the synthesis of Chain-of-Thought (CoT) data.
 - **2025.07.31**: We have added Google, Bing, Wikipedia, and UniProt as search back-ends.
@@ -82,9 +83,10 @@ After data generation, you can use [LLaMA-Factory](https://github.com/hiyouga/LL
 We support various LLM inference servers, API servers, inference clients, input file formats, data modalities, output data formats, and output data types.
 Users can flexibly configure according to the needs of synthetic data.
 
-| Inference Server                             | Api Server                                                                     | Inference Client                                           | Input File Format                  | Data Modal    | Data Format                  | Data Type                                       |
-|----------------------------------------------|--------------------------------------------------------------------------------|------------------------------------------------------------|------------------------------------|---------------|------------------------------|-------------------------------------------------|
-| [![hf-icon]HF][hf]<br>[![sg-icon]SGLang][sg] | [![sif-icon]Silicon][sif]<br>[![oai-icon]OpenAI][oai]<br>[![az-icon]Azure][az] | HTTP<br>[![ol-icon]Ollama][ol]<br>[![oai-icon]OpenAI][oai] | CSV<br>JSON<br>JSONL<br>PDF<br>TXT | TEXT<br>IMAGE | Alpaca<br>ChatML<br>Sharegpt | Aggregated<br>Atomic<br>CoT<br>Multi-hop<br>VQA |
+
+| Inference Server                             | Api Server                                                                     | Inference Client                                           | Data Source                                                                                                                                                                                                                                                                           | Data Modal    | Data Type                                       |
+|----------------------------------------------|--------------------------------------------------------------------------------|------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------------|-------------------------------------------------|
+| [![hf-icon]HF][hf]<br>[![sg-icon]SGLang][sg] | [![sif-icon]Silicon][sif]<br>[![oai-icon]OpenAI][oai]<br>[![az-icon]Azure][az] | HTTP<br>[![ol-icon]Ollama][ol]<br>[![oai-icon]OpenAI][oai] | Files(CSV, JSON, PDF, TXT, etc.)<br>Databases([![uniprot-icon]UniProt][uniprot], [![ncbi-icon]NCBI][ncbi], [![rnacentral-icon]RNAcentral][rnacentral])<br>Search Engines([![bing-icon]Bing][bing], [![google-icon]Google][google])<br>Knowledge Graphs([![wiki-icon]Wikipedia][wiki]) | TEXT<br>IMAGE | Aggregated<br>Atomic<br>CoT<br>Multi-hop<br>VQA |
 
 <!-- links -->
 [hf]: https://huggingface.co/docs/transformers/index
@@ -93,6 +95,13 @@ Users can flexibly configure according to the needs of synthetic data.
 [oai]: https://openai.com
 [az]: https://azure.microsoft.com/en-us/services/cognitive-services/openai-service/
 [ol]: https://ollama.com
+[uniprot]: https://www.uniprot.org/
+[ncbi]: https://www.ncbi.nlm.nih.gov/
+[rnacentral]: https://rnacentral.org/
+[wiki]: https://www.wikipedia.org/
+[bing]: https://www.bing.com/
+[google]: https://www.google.com
+
 
 <!-- icons -->
 [hf-icon]: https://www.google.com/s2/favicons?domain=https://huggingface.co
@@ -102,6 +111,12 @@ Users can flexibly configure according to the needs of synthetic data.
 [az-icon]: https://www.google.com/s2/favicons?domain=https://azure.microsoft.com
 [ol-icon]: https://www.google.com/s2/favicons?domain=https://ollama.com
 
+[uniprot-icon]: https://www.google.com/s2/favicons?domain=https://www.uniprot.org
+[ncbi-icon]: https://www.google.com/s2/favicons?domain=https://www.ncbi.nlm.nih.gov/
+[rnacentral]: https://www.google.com/s2/favicons?domain=https://rnacentral.org/
+[wiki-icon]: https://www.google.com/s2/favicons?domain=https://www.wikipedia.org/
+[bing-icon]: https://www.google.com/s2/favicons?domain=https://www.bing.com/
+[google-icon]: https://www.google.com/s2/favicons?domain=https://www.google.com
 
 
 ## 🚀 Quick Start
diff --git a/README_zh.md b/README_zh.md
@@ -62,13 +62,14 @@ GraphGen 首先根据源文本构建细粒度的知识图谱，然后利用期
 在数据生成后，您可以使用[LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory) 和 [xtuner](https://github.com/InternLM/xtuner)对大语言模型进行微调。
 
 ## 📌 最新更新
-- **2025.10.30** 我们支持多种新的 LLM 客户端和推理后端，包括 [Ollama_client]([Ollama_client](https://github.com/open-sciencelab/GraphGen/blob/main/graphgen/models/llm/api/ollama_client.py), [http_client](https://github.com/open-sciencelab/GraphGen/blob/main/graphgen/models/llm/api/http_client.py), [HuggingFace Transformers](https://github.com/open-sciencelab/GraphGen/blob/main/graphgen/models/llm/local/hf_wrapper.py) 和 [SGLang](https://github.com/open-sciencelab/GraphGen/blob/main/graphgen/models/llm/local/sglang_wrapper.py).
+- **2025.12.1**：新增对 [NCBI](https://www.ncbi.nlm.nih.gov/) 和 [RNAcentral](https://rnacentral.org/) 数据库的检索支持，现在可以从这些生物信息学数据库中提取DNA和RNA数据。
+- **2025.10.30**：我们支持多种新的 LLM 客户端和推理后端，包括 [Ollama_client]([Ollama_client](https://github.com/open-sciencelab/GraphGen/blob/main/graphgen/models/llm/api/ollama_client.py), [http_client](https://github.com/open-sciencelab/GraphGen/blob/main/graphgen/models/llm/api/http_client.py), [HuggingFace Transformers](https://github.com/open-sciencelab/GraphGen/blob/main/graphgen/models/llm/local/hf_wrapper.py) 和 [SGLang](https://github.com/open-sciencelab/GraphGen/blob/main/graphgen/models/llm/local/sglang_wrapper.py)。
 - **2025.10.23**：我们现在支持视觉问答（VQA）数据生成。运行脚本：`bash scripts/generate/generate_vqa.sh`。
-- **2025.10.21**：我们现在通过 [MinerU](https://github.com/opendatalab/MinerU) 支持 PDF 作为数据生成的输入格式。
 
 <details>
 <summary>历史更新</summary>
 
+- **2025.10.21**：我们现在通过 [MinerU](https://github.com/opendatalab/MinerU) 支持 PDF 作为数据生成的输入格式。
 - **2025.09.29**：我们在 [Hugging Face](https://huggingface.co/spaces/chenzihong/GraphGen) 和 [ModelScope](https://modelscope.cn/studios/chenzihong/GraphGen) 上自动更新 Gradio 应用。
 - **2025.08.14**：支持利用 Leiden 社区发现算法对知识图谱进行社区划分，合成 CoT 数据。
 - **2025.07.31**：新增 Google、Bing、Wikipedia 和 UniProt 作为搜索后端，帮助填补数据缺口。  
@@ -81,9 +82,9 @@ GraphGen 首先根据源文本构建细粒度的知识图谱，然后利用期
 我们支持多种 LLM 推理服务器、API 服务器、推理客户端、输入文件格式、数据模态、输出数据格式和输出数据类型。
 可以根据合成数据的需求进行灵活配置。
 
-| 推理服务器                                        | API 服务器                                                                        | 推理客户端                                                      | 输入文件格式                             | 数据模态         | 输出数据格式                       | 输出数据类型                                          |
-|----------------------------------------------|--------------------------------------------------------------------------------|------------------------------------------------------------|------------------------------------|--------------|------------------------------|-------------------------------------------------|
-| [![hf-icon]HF][hf]<br>[![sg-icon]SGLang][sg] | [![sif-icon]Silicon][sif]<br>[![oai-icon]OpenAI][oai]<br>[![az-icon]Azure][az] | HTTP<br>[![ol-icon]Ollama][ol]<br>[![oai-icon]OpenAI][oai] | CSV<br>JSON<br>JSONL<br>PDF<br>TXT | TEXT<br>TEXT | Alpaca<br>ChatML<br>Sharegpt | Aggregated<br>Atomic<br>CoT<br>Multi-hop<br>VQA |
+| 推理服务器                                        | API 服务器                                                                        | 推理客户端                                                      | 输入文件格式                                                                                                                                                                                                                                                   | 数据模态          | 输出数据类型                                          |
+|----------------------------------------------|--------------------------------------------------------------------------------|------------------------------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------------|-------------------------------------------------|
+| [![hf-icon]HF][hf]<br>[![sg-icon]SGLang][sg] | [![sif-icon]Silicon][sif]<br>[![oai-icon]OpenAI][oai]<br>[![az-icon]Azure][az] | HTTP<br>[![ol-icon]Ollama][ol]<br>[![oai-icon]OpenAI][oai] | 文件(CSV, JSON, JSONL, PDF, TXT等)<br>数据库([![uniprot-icon]UniProt][uniprot], [![ncbi-icon]NCBI][ncbi], [![rnacentral-icon]RNAcentral][rnacentral])<br>搜索引擎([![bing-icon]Bing][bing], [![google-icon]Google][google])<br>知识图谱([![wiki-icon]Wikipedia][wiki]) | TEXT<br>IMAGE | Aggregated<br>Atomic<br>CoT<br>Multi-hop<br>VQA |
 
 <!-- links -->
 [hf]: https://huggingface.co/docs/transformers/index
@@ -92,6 +93,13 @@ GraphGen 首先根据源文本构建细粒度的知识图谱，然后利用期
 [oai]: https://openai.com
 [az]: https://azure.microsoft.com/en-us/services/cognitive-services/openai-service/
 [ol]: https://ollama.com
+[uniprot]: https://www.uniprot.org/
+[ncbi]: https://www.ncbi.nlm.nih.gov/
+[rnacentral]: https://rnacentral.org/
+[wiki]: https://www.wikipedia.org/
+[bing]: https://www.bing.com/
+[google]: https://www.google.com
+
 
 <!-- icons -->
 [hf-icon]: https://www.google.com/s2/favicons?domain=https://huggingface.co
@@ -101,6 +109,13 @@ GraphGen 首先根据源文本构建细粒度的知识图谱，然后利用期
 [az-icon]: https://www.google.com/s2/favicons?domain=https://azure.microsoft.com
 [ol-icon]: https://www.google.com/s2/favicons?domain=https://ollama.com
 
+[uniprot-icon]: https://www.google.com/s2/favicons?domain=https://www.uniprot.org
+[ncbi-icon]: https://www.google.com/s2/favicons?domain=https://www.ncbi.nlm.nih.gov/
+[rnacentral]: https://www.google.com/s2/favicons?domain=https://rnacentral.org/
+[wiki-icon]: https://www.google.com/s2/favicons?domain=https://www.wikipedia.org/
+[bing-icon]: https://www.google.com/s2/favicons?domain=https://www.bing.com/
+[google-icon]: https://www.google.com/s2/favicons?domain=https://www.google.com
+
 
 ## 🚀 快速开始