
Commit 4e8fbce

Merge branch 'main' of https://github.com/open-sciencelab/GraphGen into copilot/change-code-style-dataclass

2 parents: a5ed2f4 + 862e1d4

File tree

68 files changed: +1474 additions, −219 deletions

README.md

Lines changed: 10 additions & 1 deletion

@@ -21,7 +21,7 @@
 GraphGen: Enhancing Supervised Fine-Tuning for LLMs with Knowledge-Driven Synthetic Data Generation
 
-[English](README.md) | [中文](README_ZH.md)
+[English](README.md) | [中文](README_zh)
 
 <details close>
 <summary><b>📚 Table of Contents</b></summary>
@@ -62,11 +62,20 @@ After data generation, you can use [LLaMA-Factory](https://github.com/hiyouga/LL
 ## 📌 Latest Updates
 
+- **2025.10.23**: We support VQA(Visual Question Answering) data generation now. Run script: `bash scripts/generate/generate_vqa.sh`.
+- **2025.10.21**: We support PDF as input format for data generation now via [MinerU](https://github.com/opendatalab/MinerU).
 - **2025.09.29**: We auto-update gradio demo on [Hugging Face](https://huggingface.co/spaces/chenzihong/GraphGen) and [ModelScope](https://modelscope.cn/studios/chenzihong/GraphGen).
+
+<details>
+<summary>History</summary>
+
 - **2025.08.14**: We have added support for community detection in knowledge graphs using the Leiden algorithm, enabling the synthesis of Chain-of-Thought (CoT) data.
 - **2025.07.31**: We have added Google, Bing, Wikipedia, and UniProt as search back-ends.
 - **2025.04.21**: We have released the initial version of GraphGen.
 
+</details>
+
 ## 🚀 Quick Start
 
 Experience GraphGen through [Web](https://g-app-center-120612-6433-jpdvmvp.openxlab.space) or [Backup Web Entrance](https://openxlab.org.cn/apps/detail/chenzihonga/GraphGen)

README_ZH.md renamed to README_zh.md

Lines changed: 10 additions & 1 deletion

(Chinese content translated below; the file mirrors README.md.)

@@ -20,7 +20,7 @@
 GraphGen: Enhancing Supervised Fine-Tuning for LLMs with Knowledge-Driven Synthetic Data Generation
 
-[English](README.md) | [中文](README_ZH.md)
+[English](README.md) | [中文](README_zh)
 
 <details close>
 <summary><b>📚 Table of Contents</b></summary>
@@ -63,11 +63,20 @@ GraphGen first builds a fine-grained knowledge graph from the source text, then…
 ## 📌 Latest Updates
 
+- **2025.10.23**: We now support Visual Question Answering (VQA) data generation. Run script: `bash scripts/generate/generate_vqa.sh`
+- **2025.10.21**: We now support PDF as an input format for data generation via [MinerU](https://github.com/opendatalab/MinerU).
 - **2025.09.29**: We auto-update the Gradio app on [Hugging Face](https://huggingface.co/spaces/chenzihong/GraphGen) and [ModelScope](https://modelscope.cn/studios/chenzihong/GraphGen).
+
+<details>
+<summary>History</summary>
+
 - **2025.08.14**: Added support for partitioning knowledge graphs with the Leiden community-detection algorithm to synthesize CoT data.
 - **2025.07.31**: Added Google, Bing, Wikipedia, and UniProt as search back-ends to help fill data gaps.
 - **2025.04.21**: Released the initial version of GraphGen.
 
+</details>
+
 ## 🚀 Quick Start
 
 Experience GraphGen via the [Web](https://g-app-center-120612-6433-jpdvmvp.openxlab.space) or the [Backup Web Entrance](https://openxlab.org.cn/apps/detail/chenzihonga/GraphGen).

graphgen/bases/base_reader.py

Lines changed: 45 additions & 0 deletions

@@ -1,6 +1,9 @@
+import os
 from abc import ABC, abstractmethod
 from typing import Any, Dict, List
 
+import requests
+
 
 class BaseReader(ABC):
     """
@@ -18,3 +21,45 @@ def read(self, file_path: str) -> List[Dict[str, Any]]:
         :param file_path: Path to the input file.
         :return: List of dictionaries containing the data.
         """
+
+    @staticmethod
+    def filter(data: List[dict]) -> List[dict]:
+        """
+        Filter out entries with empty or missing text in the specified column.
+
+        :param data: List of dictionaries containing the data.
+        :return: Filtered list of dictionaries.
+        """
+
+        def _image_exists(path_or_url: str, timeout: int = 3) -> bool:
+            """
+            Check if an image exists at the given local path or URL.
+            :param path_or_url: Local file path or remote URL of the image.
+            :param timeout: Timeout for remote URL requests in seconds.
+            :return: True if the image exists, False otherwise.
+            """
+            if not path_or_url:
+                return False
+            if not path_or_url.startswith(("http://", "https://", "ftp://")):
+                path = path_or_url.replace("file://", "", 1)
+                path = os.path.abspath(path)
+                return os.path.isfile(path)
+            try:
+                resp = requests.head(path_or_url, allow_redirects=True, timeout=timeout)
+                return resp.status_code == 200
+            except requests.RequestException:
+                return False
+
+        filtered_data = []
+        for item in data:
+            if item.get("type") == "text":
+                content = item.get("content", "").strip()
+                if content:
+                    filtered_data.append(item)
+            elif item.get("type") in ("image", "table", "equation"):
+                img_path = item.get("img_path")
+                if _image_exists(img_path):
+                    filtered_data.append(item)
+            else:
+                filtered_data.append(item)
+        return filtered_data
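The new `filter` helper keeps text entries with non-empty content, keeps media entries only when their `img_path` resolves, and passes every other entry through untouched. A minimal standalone sketch of that logic (local paths only, omitting the `requests.head` probe for remote URLs):

```python
import os
from typing import Dict, List


def filter_entries(data: List[Dict]) -> List[Dict]:
    """Simplified version of BaseReader.filter: keep non-empty text,
    media entries whose local image file exists, and everything else."""
    kept = []
    for item in data:
        if item.get("type") == "text":
            if item.get("content", "").strip():
                kept.append(item)
        elif item.get("type") in ("image", "table", "equation"):
            path = item.get("img_path") or ""
            if os.path.isfile(path):
                kept.append(item)
        else:
            kept.append(item)  # unknown types are not dropped
    return kept


entries = [
    {"type": "text", "content": "GraphGen builds a knowledge graph."},
    {"type": "text", "content": "   "},            # dropped: blank after strip
    {"type": "image", "img_path": "/no/such.png"},  # dropped: file missing
    {"type": "metadata", "source": "demo.pdf"},     # kept: unknown type
]
print([e.get("type") for e in filter_entries(entries)])  # → ['text', 'metadata']
```

`filter_entries` and the sample `entries` are illustrative names, not part of the repository.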

graphgen/bases/datatypes.py

Lines changed: 10 additions & 0 deletions

@@ -7,8 +7,18 @@
 class Chunk:
     id: str
     content: str
+    type: str
     metadata: dict = field(default_factory=dict)
 
+    @staticmethod
+    def from_dict(key: str, data: dict) -> "Chunk":
+        return Chunk(
+            id=key,
+            content=data.get("content", ""),
+            type=data.get("type", "unknown"),
+            metadata={k: v for k, v in data.items() if k != "content"},
+        )
+
 
 @dataclass
 class QAPair:
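With the new `type` field and `from_dict` constructor, a storage record keyed by chunk id becomes a `Chunk` whose metadata preserves every field except the raw `content`. A self-contained sketch of the dataclass as changed in this commit (the sample record is hypothetical):

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    id: str
    content: str
    type: str
    metadata: dict = field(default_factory=dict)

    @staticmethod
    def from_dict(key: str, data: dict) -> "Chunk":
        # Only "content" is excluded from metadata; "type" is kept in both places.
        return Chunk(
            id=key,
            content=data.get("content", ""),
            type=data.get("type", "unknown"),
            metadata={k: v for k, v in data.items() if k != "content"},
        )


chunk = Chunk.from_dict("chunk-0", {"content": "some text", "type": "text", "page": 3})
print(chunk.type, chunk.metadata)  # → text {'type': 'text', 'page': 3}
```

Note that a record without a `type` key falls back to `"unknown"`, which matches the pass-through branch of the reader's `filter`.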

graphgen/configs/aggregated_config.yaml

Lines changed: 2 additions & 2 deletions

@@ -1,5 +1,5 @@
 read:
-  input_file: resources/input_examples/jsonl_demo.jsonl # input file path, support json, jsonl, txt. See resources/input_examples for examples
+  input_file: resources/input_examples/jsonl_demo.jsonl # input file path, support json, jsonl, txt, pdf. See resources/input_examples for examples
 split:
   chunk_size: 1024 # chunk size for text splitting
   chunk_overlap: 100 # chunk overlap for text splitting
@@ -18,5 +18,5 @@ partition: # graph partition configuration
   max_tokens_per_community: 10240 # max tokens per community
   unit_sampling: max_loss # unit sampling strategy, support: random, max_loss, min_loss
 generate:
-  mode: aggregated # atomic, aggregated, multi_hop, cot
+  mode: aggregated # atomic, aggregated, multi_hop, cot, vqa
   data_format: ChatML # Alpaca, Sharegpt, ChatML

graphgen/configs/atomic_config.yaml

Lines changed: 2 additions & 2 deletions

@@ -1,5 +1,5 @@
 read:
-  input_file: resources/input_examples/json_demo.json # input file path, support json, jsonl, txt, csv. See resources/input_examples for examples
+  input_file: resources/input_examples/json_demo.json # input file path, support json, jsonl, txt, csv, pdf. See resources/input_examples for examples
 split:
   chunk_size: 1024 # chunk size for text splitting
   chunk_overlap: 100 # chunk overlap for text splitting
@@ -15,5 +15,5 @@ partition: # graph partition configuration
   method_params:
     max_units_per_community: 1 # atomic partition, one node or edge per community
 generate:
-  mode: atomic # atomic, aggregated, multi_hop, cot
+  mode: atomic # atomic, aggregated, multi_hop, cot, vqa
   data_format: Alpaca # Alpaca, Sharegpt, ChatML

graphgen/configs/cot_config.yaml

Lines changed: 2 additions & 2 deletions

@@ -1,5 +1,5 @@
 read:
-  input_file: resources/input_examples/txt_demo.txt # input file path, support json, jsonl, txt. See resources/input_examples for examples
+  input_file: resources/input_examples/txt_demo.txt # input file path, support json, jsonl, txt, pdf. See resources/input_examples for examples
 split:
   chunk_size: 1024 # chunk size for text splitting
   chunk_overlap: 100 # chunk overlap for text splitting
@@ -15,5 +15,5 @@ partition: # graph partition configuration
   use_lcc: false # whether to use the largest connected component
   random_seed: 42 # random seed for partitioning
 generate:
-  mode: cot # atomic, aggregated, multi_hop, cot
+  mode: cot # atomic, aggregated, multi_hop, cot, vqa
   data_format: Sharegpt # Alpaca, Sharegpt, ChatML

graphgen/configs/multi_hop_config.yaml

Lines changed: 2 additions & 2 deletions

@@ -1,5 +1,5 @@
 read:
-  input_file: resources/input_examples/csv_demo.csv # input file path, support json, jsonl, txt. See resources/input_examples for examples
+  input_file: resources/input_examples/csv_demo.csv # input file path, support json, jsonl, txt, pdf. See resources/input_examples for examples
 split:
   chunk_size: 1024 # chunk size for text splitting
   chunk_overlap: 100 # chunk overlap for text splitting
@@ -18,5 +18,5 @@ partition: # graph partition configuration
   max_tokens_per_community: 10240 # max tokens per community
   unit_sampling: random # unit sampling strategy, support: random, max_loss, min_loss
 generate:
-  mode: multi_hop # strategy for generating multi-hop QA pairs
+  mode: multi_hop # atomic, aggregated, multi_hop, cot, vqa
   data_format: ChatML # Alpaca, Sharegpt, ChatML

graphgen/configs/vqa_config.yaml

Lines changed: 18 additions & 0 deletions (new file)

@@ -0,0 +1,18 @@
+read:
+  input_file: resources/input_examples/vqa_demo.json # input file path, support json, jsonl, txt, pdf. See resources/input_examples for examples
+split:
+  chunk_size: 1024 # chunk size for text splitting
+  chunk_overlap: 100 # chunk overlap for text splitting
+search: # web search configuration
+  enabled: false # whether to enable web search
+  search_types: ["google"] # search engine types, support: google, bing, uniprot, wikipedia
+quiz_and_judge: # quiz and test whether the LLM masters the knowledge points
+  enabled: false
+partition: # graph partition configuration
+  method: anchor_bfs # partition method
+  method_params:
+    anchor_type: image # node type to select anchor nodes
+    max_units_per_community: 10 # atomic partition, one node or edge per community
+generate:
+  mode: vqa # atomic, aggregated, multi_hop, cot, vqa
+  data_format: ChatML # Alpaca, Sharegpt, ChatML
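After `yaml.safe_load`, the new VQA config is a plain nested dict, and the comments in every config file enumerate the legal values for `mode` and `data_format`. A hypothetical validation sketch (the `ALLOWED_*` sets and the check itself are not part of the repository; the dict literal mirrors vqa_config.yaml above):

```python
# vqa_config.yaml as it would appear after yaml.safe_load
config = {
    "read": {"input_file": "resources/input_examples/vqa_demo.json"},
    "split": {"chunk_size": 1024, "chunk_overlap": 100},
    "search": {"enabled": False, "search_types": ["google"]},
    "quiz_and_judge": {"enabled": False},
    "partition": {
        "method": "anchor_bfs",
        "method_params": {"anchor_type": "image", "max_units_per_community": 10},
    },
    "generate": {"mode": "vqa", "data_format": "ChatML"},
}

# Legal values, taken from the inline comments in the config files.
ALLOWED_MODES = {"atomic", "aggregated", "multi_hop", "cot", "vqa"}
ALLOWED_FORMATS = {"Alpaca", "Sharegpt", "ChatML"}

mode = config["generate"]["mode"]
if mode not in ALLOWED_MODES:
    raise ValueError(f"Unsupported mode: {mode}")
if config["generate"]["data_format"] not in ALLOWED_FORMATS:
    raise ValueError("Unsupported data format")
print(mode)  # → vqa
```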

graphgen/generate.py

Lines changed: 5 additions & 18 deletions

@@ -72,24 +72,11 @@ def main():
 
     graph_gen.search(search_config=config["search"])
 
-    # Use pipeline according to the output data type
-    if mode in ["atomic", "aggregated", "multi_hop"]:
-        logger.info("Generation mode set to '%s'. Start generation.", mode)
-        if "quiz_and_judge" in config and config["quiz_and_judge"]["enabled"]:
-            graph_gen.quiz_and_judge(quiz_and_judge_config=config["quiz_and_judge"])
-        else:
-            logger.warning(
-                "Quiz and Judge strategy is disabled. Edge sampling falls back to random."
-            )
-            assert (
-                config["partition"]["method"] == "ece"
-                and "method_params" in config["partition"]
-            ), "Only ECE partition with edge sampling is supported."
-            config["partition"]["method_params"]["edge_sampling"] = "random"
-    elif mode == "cot":
-        logger.info("Generation mode set to 'cot'. Start generation.")
-    else:
-        raise ValueError(f"Unsupported output data type: {mode}")
+    if config.get("quiz_and_judge", {}).get("enabled"):
+        graph_gen.quiz_and_judge(quiz_and_judge_config=config["quiz_and_judge"])
+
+    # TODO: add data filtering step here in the future
+    # graph_gen.filter(filter_config=config["filter"])
 
     graph_gen.generate(
         partition_config=config["partition"],
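The rewritten branch uses chained `dict.get` calls, so a config with no `quiz_and_judge` section behaves the same as one where it is explicitly disabled, whereas the old `config["quiz_and_judge"]["enabled"]` lookup would raise `KeyError`. A small sketch of the difference (`quiz_enabled` is an illustrative helper, not a function in the repo):

```python
def quiz_enabled(config: dict) -> bool:
    # .get with a {} default never raises, unlike config["quiz_and_judge"]["enabled"]
    return bool(config.get("quiz_and_judge", {}).get("enabled"))


print(quiz_enabled({"quiz_and_judge": {"enabled": True}}))   # → True
print(quiz_enabled({"quiz_and_judge": {"enabled": False}}))  # → False
print(quiz_enabled({}))  # → False (section missing entirely, no KeyError)
```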
