Commit b8ad101

Merge branch 'main' of https://github.com/open-sciencelab/GraphGen into baseline
2 parents 661f730 + 720ba9b commit b8ad101

117 files changed (+3527, -1343 lines changed)

.env.example

Lines changed: 8 additions & 0 deletions
Original file line number | Diff line number | Diff line change
@@ -14,6 +14,14 @@ TRAINEE_MODEL=gpt-4o-mini
1414
TRAINEE_BASE_URL=
1515
TRAINEE_API_KEY=
1616

17+
# azure_openai_api
18+
# SYNTHESIZER_BACKEND=azure_openai_api
19+
# The following is the same as your "Deployment name" in Azure
20+
# SYNTHESIZER_MODEL=<your-deployment-name>
21+
# SYNTHESIZER_BASE_URL=https://<your-resource-name>.openai.azure.com/openai/deployments/<your-deployment-name>/chat/completions
22+
# SYNTHESIZER_API_KEY=
23+
# SYNTHESIZER_API_VERSION=<api-version>
24+
1725
# # ollama_api
1826
# SYNTHESIZER_BACKEND=ollama_api
1927
# SYNTHESIZER_MODEL=gemma3
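
The Azure block above implies a deployment-scoped endpoint plus a separate API version. As a minimal sketch (not GraphGen's own loader), the Python below shows one way these variables could be read and combined, assuming the version is sent as the `api-version` query parameter, which is what the Azure OpenAI REST API expects. The helper names `load_synthesizer_settings` and `build_azure_chat_url`, and the sample values `my-resource`, `my-deployment`, and `2024-02-01`, are illustrative assumptions.

```python
import os
from urllib.parse import urlencode, urlsplit, urlunsplit


def load_synthesizer_settings(env=os.environ) -> dict:
    """Collect the SYNTHESIZER_* variables listed in .env.example."""
    return {
        "backend": env.get("SYNTHESIZER_BACKEND", ""),
        "model": env.get("SYNTHESIZER_MODEL", ""),
        "base_url": env.get("SYNTHESIZER_BASE_URL", ""),
        "api_key": env.get("SYNTHESIZER_API_KEY", ""),
        "api_version": env.get("SYNTHESIZER_API_VERSION", ""),
    }


def build_azure_chat_url(base_url: str, api_version: str) -> str:
    """Attach the api-version query parameter to a deployment-scoped URL."""
    parts = urlsplit(base_url)
    query = urlencode({"api-version": api_version})
    return urlunsplit((parts.scheme, parts.netloc, parts.path, query, ""))


if __name__ == "__main__":
    # Hypothetical placeholder values; substitute your own resource and deployment.
    sample_env = {
        "SYNTHESIZER_BACKEND": "azure_openai_api",
        "SYNTHESIZER_MODEL": "my-deployment",
        "SYNTHESIZER_BASE_URL": (
            "https://my-resource.openai.azure.com/openai/deployments/"
            "my-deployment/chat/completions"
        ),
        "SYNTHESIZER_API_KEY": "",
        "SYNTHESIZER_API_VERSION": "2024-02-01",
    }
    settings = load_synthesizer_settings(sample_env)
    print(build_azure_chat_url(settings["base_url"], settings["api_version"]))
```

Passing the environment as a mapping keeps the sketch easy to exercise without touching real credentials.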

.github/contributing.md

Lines changed: 34 additions & 0 deletions
@@ -0,0 +1,34 @@
1+
## Contribution Guide
2+
Here are the steps to contribute to this project:
3+
4+
1. Star this repository.
5+
2. Fork this repository.
6+
7+
Run the following command in a Git Bash console:
8+
```bash
9+
git clone https://github.com/open-sciencelab/GraphGen.git
10+
```
11+
12+
3. Create a new branch
13+
14+
Now before making changes to the files, go to your terminal under the repo you just cloned, and type the following:
15+
16+
```bash
17+
git checkout -b add-my-name
18+
```
19+
20+
The command above creates a new branch called add-my-name and checks it out. The new branch starts from the commit history of the branch you were on previously, such as main.
21+
22+
4. Make your changes and push your code.
23+
24+
```
25+
git add .
26+
git commit -m "xxx"
27+
git push
28+
```
29+
30+
This will create a new commit with the changes you made.
31+
32+
5. Now create a pull request and add the title.
33+
34+
Sit back and relax while your pull request is being reviewed and merged.

.pylintrc

Lines changed: 2 additions & 1 deletion
@@ -100,7 +100,7 @@ source-roots=
100100

101101
# When enabled, pylint would attempt to guess common misconfiguration and emit
102102
# user-friendly hints instead of false-positive error messages.
103-
suggestion-mode=yes
103+
# suggestion-mode=yes
104104

105105
# Allow loading of arbitrary C extensions. Extensions are imported into the
106106
# active Python interpreter and may run arbitrary code.
@@ -452,6 +452,7 @@ disable=raw-checker-failed,
452452
R0917, # Too many positional arguments (6/5) (too-many-positional-arguments)
453453
C0103,
454454
E0401,
455+
W0718, # Catching too general exception Exception (broad-except)
455456

456457
# Enable the message, report, category or checker with the given id(s). You can
457458
# either give multiple identifier separated by comma (,) or put this option
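
For context, W0718 fires when a handler catches the blanket `Exception` type; recent pylint releases list it under the name `broad-exception-caught`. The sketch below is illustrative only and is not taken from GraphGen: `parse_or_default` shows the pattern that the newly disabled check would report, and `parse_or_default_narrow` shows the narrower handler the check normally nudges toward. Both function names are assumptions made for this example.

```python
import json


def parse_or_default(raw: str, default=None):
    """Parse JSON, falling back to a default on any failure."""
    try:
        return json.loads(raw)
    except Exception:  # W0718 would be reported here if the check were enabled
        return default


def parse_or_default_narrow(raw: str, default=None):
    """Same behavior for malformed JSON, but only that error is swallowed."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        return default
```

Disabling the check project-wide trades that nudge for less noise in code that deliberately guards against arbitrary failures, such as calls to external services.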

README.md

Lines changed: 21 additions & 7 deletions
@@ -16,7 +16,6 @@
1616

1717
[![Hugging Face](https://img.shields.io/badge/Demo-on%20HF-blue?logo=huggingface&logoColor=yellow)](https://huggingface.co/spaces/chenzihong/GraphGen)
1818
[![Model Scope](https://img.shields.io/badge/%F0%9F%A4%96%20Demo-on%20MS-green)](https://modelscope.cn/studios/chenzihong/GraphGen)
19-
[![OpenXLab](https://img.shields.io/badge/Demo-on%20OpenXLab-blue?logo=openxlab&logoColor=yellow)](https://g-app-center-120612-6433-jpdvmvp.openxlab.space)
2019

2120

2221
GraphGen: Enhancing Supervised Fine-Tuning for LLMs with Knowledge-Driven Synthetic Data Generation
@@ -63,13 +62,14 @@ After data generation, you can use [LLaMA-Factory](https://github.com/hiyouga/LL
6362

6463
## 📌 Latest Updates
6564

65+
- **2025.12.1**: Added search support for [NCBI](https://www.ncbi.nlm.nih.gov/) and [RNAcentral](https://rnacentral.org/) databases, enabling extraction of DNA and RNA data from these bioinformatics databases.
6666
- **2025.10.30**: We support several new LLM clients and inference backends including [Ollama_client](https://github.com/open-sciencelab/GraphGen/blob/main/graphgen/models/llm/api/ollama_client.py), [http_client](https://github.com/open-sciencelab/GraphGen/blob/main/graphgen/models/llm/api/http_client.py), [HuggingFace Transformers](https://github.com/open-sciencelab/GraphGen/blob/main/graphgen/models/llm/local/hf_wrapper.py) and [SGLang](https://github.com/open-sciencelab/GraphGen/blob/main/graphgen/models/llm/local/sglang_wrapper.py).
6767
- **2025.10.23**: We support VQA(Visual Question Answering) data generation now. Run script: `bash scripts/generate/generate_vqa.sh`.
68-
- **2025.10.21**: We support PDF as input format for data generation now via [MinerU](https://github.com/opendatalab/MinerU).
6968

7069
<details>
7170
<summary>History</summary>
7271

72+
- **2025.10.21**: We support PDF as input format for data generation now via [MinerU](https://github.com/opendatalab/MinerU).
7373
- **2025.09.29**: We auto-update gradio demo on [Hugging Face](https://huggingface.co/spaces/chenzihong/GraphGen) and [ModelScope](https://modelscope.cn/studios/chenzihong/GraphGen).
7474
- **2025.08.14**: We have added support for community detection in knowledge graphs using the Leiden algorithm, enabling the synthesis of Chain-of-Thought (CoT) data.
7575
- **2025.07.31**: We have added Google, Bing, Wikipedia, and UniProt as search back-ends.
@@ -83,9 +83,10 @@ After data generation, you can use [LLaMA-Factory](https://github.com/hiyouga/LL
8383
We support various LLM inference servers, API servers, inference clients, input file formats, data modalities, output data formats, and output data types.
8484
Users can flexibly configure according to the needs of synthetic data.
8585

86-
| Inference Server | Api Server | Inference Client | Input File Format | Data Modal | Data Format | Data Type |
87-
|----------------------------------------------|--------------------------------------------------------------------------------|------------------------------------------------------------|------------------------------------|---------------|------------------------------|-------------------------------------------------|
88-
| [![hf-icon]HF][hf]<br>[![sg-icon]SGLang][sg] | [![sif-icon]Silicon][sif]<br>[![oai-icon]OpenAI][oai]<br>[![az-icon]Azure][az] | HTTP<br>[![ol-icon]Ollama][ol]<br>[![oai-icon]OpenAI][oai] | CSV<br>JSON<br>JSONL<br>PDF<br>TXT | TEXT<br>IMAGE | Alpaca<br>ChatML<br>Sharegpt | Aggregated<br>Atomic<br>CoT<br>Multi-hop<br>VQA |
86+
87+
| Inference Server | Api Server | Inference Client | Data Source | Data Modal | Data Type |
88+
|----------------------------------------------|--------------------------------------------------------------------------------|------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------------|-------------------------------------------------|
89+
| [![hf-icon]HF][hf]<br>[![sg-icon]SGLang][sg] | [![sif-icon]Silicon][sif]<br>[![oai-icon]OpenAI][oai]<br>[![az-icon]Azure][az] | HTTP<br>[![ol-icon]Ollama][ol]<br>[![oai-icon]OpenAI][oai] | Files(CSV, JSON, PDF, TXT, etc.)<br>Databases([![uniprot-icon]UniProt][uniprot], [![ncbi-icon]NCBI][ncbi], [![rnacentral-icon]RNAcentral][rnacentral])<br>Search Engines([![bing-icon]Bing][bing], [![google-icon]Google][google])<br>Knowledge Graphs([![wiki-icon]Wikipedia][wiki]) | TEXT<br>IMAGE | Aggregated<br>Atomic<br>CoT<br>Multi-hop<br>VQA |
8990

9091
<!-- links -->
9192
[hf]: https://huggingface.co/docs/transformers/index
@@ -94,6 +95,13 @@ Users can flexibly configure according to the needs of synthetic data.
9495
[oai]: https://openai.com
9596
[az]: https://azure.microsoft.com/en-us/services/cognitive-services/openai-service/
9697
[ol]: https://ollama.com
98+
[uniprot]: https://www.uniprot.org/
99+
[ncbi]: https://www.ncbi.nlm.nih.gov/
100+
[rnacentral]: https://rnacentral.org/
101+
[wiki]: https://www.wikipedia.org/
102+
[bing]: https://www.bing.com/
103+
[google]: https://www.google.com
104+
97105

98106
<!-- icons -->
99107
[hf-icon]: https://www.google.com/s2/favicons?domain=https://huggingface.co
@@ -103,11 +111,17 @@ Users can flexibly configure according to the needs of synthetic data.
103111
[az-icon]: https://www.google.com/s2/favicons?domain=https://azure.microsoft.com
104112
[ol-icon]: https://www.google.com/s2/favicons?domain=https://ollama.com
105113

114+
[uniprot-icon]: https://www.google.com/s2/favicons?domain=https://www.uniprot.org
115+
[ncbi-icon]: https://www.google.com/s2/favicons?domain=https://www.ncbi.nlm.nih.gov/
116+
[rnacentral-icon]: https://www.google.com/s2/favicons?domain=https://rnacentral.org/
117+
[wiki-icon]: https://www.google.com/s2/favicons?domain=https://www.wikipedia.org/
118+
[bing-icon]: https://www.google.com/s2/favicons?domain=https://www.bing.com/
119+
[google-icon]: https://www.google.com/s2/favicons?domain=https://www.google.com
106120

107121

108122
## 🚀 Quick Start
109123

110-
Experience GraphGen through [Web](https://g-app-center-120612-6433-jpdvmvp.openxlab.space) or [Backup Web Entrance](https://openxlab.org.cn/apps/detail/chenzihonga/GraphGen)
124+
Experience GraphGen Demo through [Huggingface](https://huggingface.co/spaces/chenzihong/GraphGen) or [Modelscope](https://modelscope.cn/studios/chenzihong/GraphGen).
111125

112126
For any questions, please check [FAQ](https://github.com/open-sciencelab/GraphGen/issues/10), open new [issue](https://github.com/open-sciencelab/GraphGen/issues) or join our [wechat group](https://cdn.vansin.top/internlm/dou.jpg) and ask.
113127

@@ -263,4 +277,4 @@ This project is licensed under the [Apache License 2.0](LICENSE).
263277

264278
## 📅 Star History
265279

266-
[![Star History Chart](https://api.star-history.com/svg?repos=open-sciencelab/GraphGen&type=Date)](https://www.star-history.com/#open-sciencelab/GraphGen&Date)
280+
[![Star History Chart](https://api.star-history.com/svg?repos=Intern-Science/GraphGen&type=Date)](https://www.star-history.com/#open-sciencelab/GraphGen&Date)

README_zh.md

Lines changed: 22 additions & 8 deletions
@@ -16,7 +16,6 @@
1616

1717
[![Hugging Face](https://img.shields.io/badge/Demo-on%20HF-blue?logo=huggingface&logoColor=yellow)](https://huggingface.co/spaces/chenzihong/GraphGen)
1818
[![Model Scope](https://img.shields.io/badge/%F0%9F%A4%96%20Demo-on%20MS-green)](https://modelscope.cn/studios/chenzihong/GraphGen)
19-
[![OpenXLab](https://img.shields.io/badge/Demo-on%20OpenXLab-blue?logo=openxlab&logoColor=yellow)](https://g-app-center-120612-6433-jpdvmvp.openxlab.space)
2019

2120
GraphGen: Enhancing Supervised Fine-Tuning for LLMs with Knowledge-Driven Synthetic Data Generation
2221

@@ -63,13 +62,14 @@ GraphGen 首先根据源文本构建细粒度的知识图谱,然后利用期
6362
After data generation, you can use [LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory) or [xtuner](https://github.com/InternLM/xtuner) to fine-tune your LLMs.
6463

6564
## 📌 Latest Updates
66-
- **2025.10.30** We support several new LLM clients and inference backends, including [Ollama_client]([Ollama_client](https://github.com/open-sciencelab/GraphGen/blob/main/graphgen/models/llm/api/ollama_client.py), [http_client](https://github.com/open-sciencelab/GraphGen/blob/main/graphgen/models/llm/api/http_client.py), [HuggingFace Transformers](https://github.com/open-sciencelab/GraphGen/blob/main/graphgen/models/llm/local/hf_wrapper.py) and [SGLang](https://github.com/open-sciencelab/GraphGen/blob/main/graphgen/models/llm/local/sglang_wrapper.py).
65+
- **2025.12.1**: Added search support for the [NCBI](https://www.ncbi.nlm.nih.gov/) and [RNAcentral](https://rnacentral.org/) databases; DNA and RNA data can now be extracted from these bioinformatics databases.
66+
- **2025.10.30**: We support several new LLM clients and inference backends, including [Ollama_client]([Ollama_client](https://github.com/open-sciencelab/GraphGen/blob/main/graphgen/models/llm/api/ollama_client.py), [http_client](https://github.com/open-sciencelab/GraphGen/blob/main/graphgen/models/llm/api/http_client.py), [HuggingFace Transformers](https://github.com/open-sciencelab/GraphGen/blob/main/graphgen/models/llm/local/hf_wrapper.py) and [SGLang](https://github.com/open-sciencelab/GraphGen/blob/main/graphgen/models/llm/local/sglang_wrapper.py)
6767
- **2025.10.23**: We now support visual question answering (VQA) data generation. Run script: `bash scripts/generate/generate_vqa.sh`
68-
- **2025.10.21**: We now support PDF as an input format for data generation via [MinerU](https://github.com/opendatalab/MinerU).
6968

7069
<details>
7170
<summary>History</summary>
7271

72+
- **2025.10.21**: We now support PDF as an input format for data generation via [MinerU](https://github.com/opendatalab/MinerU).
7373
- **2025.09.29**: We auto-update the Gradio app on [Hugging Face](https://huggingface.co/spaces/chenzihong/GraphGen) and [ModelScope](https://modelscope.cn/studios/chenzihong/GraphGen).
7474
- **2025.08.14**: Added support for partitioning knowledge graphs into communities with the Leiden community detection algorithm to synthesize CoT data.
7575
- **2025.07.31**: Added Google, Bing, Wikipedia, and UniProt as search back-ends to help fill data gaps.
@@ -82,9 +82,9 @@ GraphGen 首先根据源文本构建细粒度的知识图谱,然后利用期
8282
We support various LLM inference servers, API servers, inference clients, input file formats, data modalities, output data formats, and output data types.
8383
You can configure them flexibly according to your synthetic data needs.
8484

85-
| Inference Server | API Server | Inference Client | Input File Format | Data Modal | Output Data Format | Output Data Type |
86-
|----------------------------------------------|--------------------------------------------------------------------------------|------------------------------------------------------------|------------------------------------|--------------|------------------------------|-------------------------------------------------|
87-
| [![hf-icon]HF][hf]<br>[![sg-icon]SGLang][sg] | [![sif-icon]Silicon][sif]<br>[![oai-icon]OpenAI][oai]<br>[![az-icon]Azure][az] | HTTP<br>[![ol-icon]Ollama][ol]<br>[![oai-icon]OpenAI][oai] | CSV<br>JSON<br>JSONL<br>PDF<br>TXT | TEXT<br>TEXT | Alpaca<br>ChatML<br>Sharegpt | Aggregated<br>Atomic<br>CoT<br>Multi-hop<br>VQA |
85+
| Inference Server | API Server | Inference Client | Input File Format | Data Modal | Output Data Type |
86+
|----------------------------------------------|--------------------------------------------------------------------------------|------------------------------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------------|-------------------------------------------------|
87+
| [![hf-icon]HF][hf]<br>[![sg-icon]SGLang][sg] | [![sif-icon]Silicon][sif]<br>[![oai-icon]OpenAI][oai]<br>[![az-icon]Azure][az] | HTTP<br>[![ol-icon]Ollama][ol]<br>[![oai-icon]OpenAI][oai] | Files(CSV, JSON, JSONL, PDF, TXT, etc.)<br>Databases([![uniprot-icon]UniProt][uniprot], [![ncbi-icon]NCBI][ncbi], [![rnacentral-icon]RNAcentral][rnacentral])<br>Search Engines([![bing-icon]Bing][bing], [![google-icon]Google][google])<br>Knowledge Graphs([![wiki-icon]Wikipedia][wiki]) | TEXT<br>IMAGE | Aggregated<br>Atomic<br>CoT<br>Multi-hop<br>VQA |
8888

8989
<!-- links -->
9090
[hf]: https://huggingface.co/docs/transformers/index
@@ -93,6 +93,13 @@ GraphGen 首先根据源文本构建细粒度的知识图谱,然后利用期
9393
[oai]: https://openai.com
9494
[az]: https://azure.microsoft.com/en-us/services/cognitive-services/openai-service/
9595
[ol]: https://ollama.com
96+
[uniprot]: https://www.uniprot.org/
97+
[ncbi]: https://www.ncbi.nlm.nih.gov/
98+
[rnacentral]: https://rnacentral.org/
99+
[wiki]: https://www.wikipedia.org/
100+
[bing]: https://www.bing.com/
101+
[google]: https://www.google.com
102+
96103

97104
<!-- icons -->
98105
[hf-icon]: https://www.google.com/s2/favicons?domain=https://huggingface.co
@@ -102,10 +109,17 @@ GraphGen 首先根据源文本构建细粒度的知识图谱,然后利用期
102109
[az-icon]: https://www.google.com/s2/favicons?domain=https://azure.microsoft.com
103110
[ol-icon]: https://www.google.com/s2/favicons?domain=https://ollama.com
104111

112+
[uniprot-icon]: https://www.google.com/s2/favicons?domain=https://www.uniprot.org
113+
[ncbi-icon]: https://www.google.com/s2/favicons?domain=https://www.ncbi.nlm.nih.gov/
114+
[rnacentral-icon]: https://www.google.com/s2/favicons?domain=https://rnacentral.org/
115+
[wiki-icon]: https://www.google.com/s2/favicons?domain=https://www.wikipedia.org/
116+
[bing-icon]: https://www.google.com/s2/favicons?domain=https://www.bing.com/
117+
[google-icon]: https://www.google.com/s2/favicons?domain=https://www.google.com
118+
105119

106120
## 🚀 Quick Start
107121

108-
Experience GraphGen through [Web](https://g-app-center-120612-6433-jpdvmvp.openxlab.space) or [Backup Web Entrance](https://openxlab.org.cn/apps/detail/chenzihonga/GraphGen).
122+
Experience GraphGen through [Huggingface](https://huggingface.co/spaces/chenzihong/GraphGen) or [Modelscope](https://modelscope.cn/studios/chenzihong/GraphGen).
109123

110124
For any questions, please check the [FAQ](https://github.com/open-sciencelab/GraphGen/issues/10), open a new [issue](https://github.com/open-sciencelab/GraphGen/issues), or join our [WeChat group](https://cdn.vansin.top/internlm/dou.jpg) and ask.
111125

@@ -259,5 +273,5 @@ GraphGen 首先根据源文本构建细粒度的知识图谱,然后利用期
259273

260274
## 📅 Star History
261275

262-
[![Star History Chart](https://api.star-history.com/svg?repos=open-sciencelab/GraphGen&type=Date)](https://www.star-history.com/#open-sciencelab/GraphGen&Date)
276+
[![Star History Chart](https://api.star-history.com/svg?repos=Intern-Science/GraphGen&type=Date)](https://www.star-history.com/#open-sciencelab/GraphGen&Date)
263277

baselines/Genie/genie.py

Lines changed: 1 addition & 1 deletion
@@ -120,7 +120,7 @@ async def process_chunk(content: str):
120120
load_dotenv()
121121

122122
llm_client = OpenAIClient(
123-
model_name=os.getenv("SYNTHESIZER_MODEL"),
123+
model=os.getenv("SYNTHESIZER_MODEL"),
124124
api_key=os.getenv("SYNTHESIZER_API_KEY"),
125125
base_url=os.getenv("SYNTHESIZER_BASE_URL"),
126126
)

baselines/LongForm/longform.py

Lines changed: 1 addition & 1 deletion
@@ -86,7 +86,7 @@ async def process_chunk(content: str):
8686
load_dotenv()
8787

8888
llm_client = OpenAIClient(
89-
model_name=os.getenv("SYNTHESIZER_MODEL"),
89+
model=os.getenv("SYNTHESIZER_MODEL"),
9090
api_key=os.getenv("SYNTHESIZER_API_KEY"),
9191
base_url=os.getenv("SYNTHESIZER_BASE_URL"),
9292
)

baselines/SELF-QA/self-qa.py

Lines changed: 1 addition & 1 deletion
@@ -154,7 +154,7 @@ async def process_chunk(content: str):
154154
load_dotenv()
155155

156156
llm_client = OpenAIClient(
157-
model_name=os.getenv("SYNTHESIZER_MODEL"),
157+
model=os.getenv("SYNTHESIZER_MODEL"),
158158
api_key=os.getenv("SYNTHESIZER_API_KEY"),
159159
base_url=os.getenv("SYNTHESIZER_BASE_URL"),
160160
)

baselines/Wrap/wrap.py

Lines changed: 1 addition & 1 deletion
@@ -107,7 +107,7 @@ async def process_chunk(content: str):
107107
load_dotenv()
108108

109109
llm_client = OpenAIClient(
110-
model_name=os.getenv("SYNTHESIZER_MODEL"),
110+
model=os.getenv("SYNTHESIZER_MODEL"),
111111
api_key=os.getenv("SYNTHESIZER_API_KEY"),
112112
base_url=os.getenv("SYNTHESIZER_BASE_URL"),
113113
)
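
The four baseline scripts above make the same one-word change, replacing the `model_name=` keyword with `model=`. The sketch below illustrates why every call site has to follow such a rename: Python rejects keyword arguments that the constructor does not declare. `StubClient` is a hypothetical stand-in written for this example, not GraphGen's `OpenAIClient`, whose full signature is not shown in this diff.

```python
class StubClient:
    """Hypothetical stand-in that only mirrors the three keywords seen in the diff."""

    def __init__(self, *, model: str, api_key: str = "", base_url: str = ""):
        self.model = model
        self.api_key = api_key
        self.base_url = base_url


def try_old_keyword() -> str:
    """Call the stub with the pre-rename keyword and report what happens."""
    try:
        StubClient(model_name="gpt-4o-mini")
    except TypeError as exc:
        return f"TypeError: {exc}"
    return "no error"


if __name__ == "__main__":
    print(try_old_keyword())
    print(StubClient(model="gpt-4o-mini").model)
```

Because the failure only surfaces when the constructor is called, scripts that are rarely run, such as baselines, are easy to miss in a rename like this one.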

graphgen/bases/__init__.py

Lines changed: 2 additions & 0 deletions
@@ -1,8 +1,10 @@
1+
from .base_extractor import BaseExtractor
12
from .base_generator import BaseGenerator
23
from .base_kg_builder import BaseKGBuilder
34
from .base_llm_wrapper import BaseLLMWrapper
45
from .base_partitioner import BasePartitioner
56
from .base_reader import BaseReader
7+
from .base_searcher import BaseSearcher
68
from .base_splitter import BaseSplitter
79
from .base_storage import (
810
BaseGraphStorage,
