Commit 7b89816

Merge pull request #43 from open-sciencelab/output_format
feat: support alpaca, sharegpt & chatml output format
2 parents 2d6f53d + 3b4eb75, commit 7b89816

26 files changed: +819 −1394 lines

README.md

Lines changed: 2 additions & 0 deletions

@@ -56,6 +56,8 @@ Here is post-training result which **over 50% SFT data** comes from GraphGen and
 It begins by constructing a fine-grained knowledge graph from the source text, then identifies knowledge gaps in LLMs using the expected calibration error metric, prioritizing the generation of QA pairs that target high-value, long-tail knowledge.
 Furthermore, GraphGen incorporates multi-hop neighborhood sampling to capture complex relational information and employs style-controlled generation to diversify the resulting QA data.
 
+After data generation, you can use [LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory) and [xtuner](https://github.com/InternLM/xtuner) to finetune your LLMs.
+
 ## 📌 Latest Updates
 
 - **2025.08.14**: We have added support for community detection in knowledge graphs using the Leiden algorithm, enabling the synthesis of Chain-of-Thought (CoT) data.

README_ZH.md

Lines changed: 2 additions & 0 deletions

@@ -57,6 +57,8 @@ GraphGen is a knowledge-graph-guided synthetic data generation framework. See
 GraphGen first builds a fine-grained knowledge graph from the source text, then uses the expected calibration error metric to identify knowledge gaps in the LLM, prioritizing the generation of QA pairs that target high-value, long-tail knowledge.
 In addition, GraphGen uses multi-hop neighborhood sampling to capture complex relational information and style-controlled generation to enrich the diversity of the QA data.
 
+After data generation, you can use [LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory) and [xtuner](https://github.com/InternLM/xtuner) to finetune your LLMs.
+
 ## 📌 Latest Updates
 
 - **2025.08.14**: Added support for partitioning knowledge graphs with the Leiden community detection algorithm to synthesize CoT data.

graphgen/configs/README.md

Lines changed: 1 addition & 0 deletions

@@ -0,0 +1 @@
+# Configs for GraphGen
Lines changed: 21 additions & 18 deletions

@@ -1,18 +1,21 @@
-input_data_type: raw
-input_file: resources/input_examples/raw_demo.jsonl
-output_data_type: aggregated
-tokenizer: cl100k_base
-quiz_samples: 2
-traverse_strategy:
-  bidirectional: true
-  edge_sampling: max_loss
-  expand_method: max_width
-  isolated_node_strategy: ignore
-  max_depth: 5
-  max_extra_edges: 20
-  max_tokens: 256
-  loss_strategy: only_edge
-search:
-  enabled: false
-  search_types: ["google"]
-re_judge: false
+input_data_type: raw  # raw, chunked
+input_file: resources/input_examples/raw_demo.jsonl  # input file path, support json, jsonl, txt. See resources/input_examples for examples
+output_data_type: aggregated  # atomic, aggregated, multi_hop, cot
+output_data_format: ChatML  # Alpaca, Sharegpt, ChatML
+tokenizer: cl100k_base  # tokenizer for counting tokens, support tiktoken tokenizer names and local tokenizer path
+search:  # web search configuration
+  enabled: false  # whether to enable web search
+  search_types: ["google"]  # search engine types, support: google, bing, uniprot, wikipedia
+quiz_and_judge_strategy:  # quiz and test whether the LLM masters the knowledge points
+  enabled: true
+  quiz_samples: 2  # number of quiz samples to generate
+  re_judge: false  # whether to re-judge the existing quiz samples
+traverse_strategy:  # strategy for clustering sub-graphs using comprehension loss
+  bidirectional: true  # whether to traverse the graph in both directions
+  edge_sampling: max_loss  # edge sampling strategy, support: random, max_loss, min_loss
+  expand_method: max_width  # expand method, support: max_width, max_depth
+  isolated_node_strategy: ignore  # strategy for isolated nodes, support: ignore, add
+  max_depth: 5  # maximum depth for graph traversal
+  max_extra_edges: 20  # max edges per direction (if expand_method="max_width")
+  max_tokens: 256  # restricts input length (if expand_method="max_tokens")
+  loss_strategy: only_edge  # defines loss computation focus, support: only_edge, both
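The new `output_data_format` key controls how generated QA pairs are serialized. As a rough illustration (the function names and exact field layout here are assumptions, not the repo's actual converters), the three formats named in the config differ only in record shape:

```python
# Hypothetical sketch: how one QA pair could map onto the three output formats.
# These helpers are illustrative; GraphGen's real converters may differ.

def to_alpaca(question: str, answer: str) -> dict:
    # Alpaca: flat instruction/input/output record
    return {"instruction": question, "input": "", "output": answer}

def to_sharegpt(question: str, answer: str) -> dict:
    # ShareGPT: a "conversations" list of from/value turns
    return {
        "conversations": [
            {"from": "human", "value": question},
            {"from": "gpt", "value": answer},
        ]
    }

def to_chatml(question: str, answer: str) -> dict:
    # ChatML: a role/content message list
    return {
        "messages": [
            {"role": "user", "content": question},
            {"role": "assistant", "content": answer},
        ]
    }

# Dispatch keyed by the config's `output_data_format` value
FORMATTERS = {"Alpaca": to_alpaca, "Sharegpt": to_sharegpt, "ChatML": to_chatml}
```

A pipeline could then call `FORMATTERS[config["output_data_format"]](q, a)` for each pair before writing the output file.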
Lines changed: 21 additions & 18 deletions

@@ -1,18 +1,21 @@
-input_data_type: raw
-input_file: resources/input_examples/raw_demo.jsonl
-output_data_type: atomic
-tokenizer: cl100k_base
-quiz_samples: 2
-traverse_strategy:
-  bidirectional: true
-  edge_sampling: max_loss
-  expand_method: max_width
-  isolated_node_strategy: ignore
-  max_depth: 3
-  max_extra_edges: 5
-  max_tokens: 256
-  loss_strategy: only_edge
-search:
-  enabled: false
-  search_types: ["google"]
-re_judge: false
+input_data_type: raw  # raw, chunked
+input_file: resources/input_examples/raw_demo.jsonl  # input file path, support json, jsonl, txt. See resources/input_examples for examples
+output_data_type: atomic  # atomic, aggregated, multi_hop, cot
+output_data_format: Alpaca  # Alpaca, Sharegpt, ChatML
+tokenizer: cl100k_base  # tokenizer for counting tokens, support tiktoken tokenizer names and local tokenizer path
+search:  # web search configuration
+  enabled: false  # whether to enable web search
+  search_types: ["google"]  # search engine types, support: google, bing, uniprot, wikipedia
+quiz_and_judge_strategy:  # quiz and test whether the LLM masters the knowledge points
+  enabled: true
+  quiz_samples: 2  # number of quiz samples to generate
+  re_judge: false  # whether to re-judge the existing quiz samples
+traverse_strategy:  # strategy for clustering sub-graphs using comprehension loss
+  bidirectional: true  # whether to traverse the graph in both directions
+  edge_sampling: max_loss  # edge sampling strategy, support: random, max_loss, min_loss
+  expand_method: max_width  # expand method, support: max_width, max_depth
+  isolated_node_strategy: ignore  # strategy for isolated nodes, support: ignore, add
+  max_depth: 3  # maximum depth for graph traversal
+  max_extra_edges: 5  # max edges per direction (if expand_method="max_width")
+  max_tokens: 256  # restricts input length (if expand_method="max_tokens")
+  loss_strategy: only_edge  # defines loss computation focus, support: only_edge, both

graphgen/configs/cot_config.yaml

Lines changed: 8 additions & 7 deletions

@@ -1,10 +1,11 @@
-input_data_type: raw
-input_file: resources/input_examples/raw_demo.jsonl
-output_data_type: cot
-tokenizer: cl100k_base
-search:
-  enabled: false
-  search_types: []
+input_data_type: raw  # raw, chunked
+input_file: resources/input_examples/raw_demo.jsonl  # input file path, support json, jsonl, txt. See resources/input_examples for examples
+output_data_type: cot  # atomic, aggregated, multi_hop, cot
+output_data_format: Sharegpt  # Alpaca, Sharegpt, ChatML
+tokenizer: cl100k_base  # tokenizer for counting tokens, support tiktoken tokenizer names and local tokenizer path
+search:  # web search configuration
+  enabled: false  # whether to enable web search
+  search_types: ["google"]  # search engine types, support: google, bing, uniprot, wikipedia
 method_params:
   method: leiden
   max_size: 20  # Maximum size of communities
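The `max_size` parameter caps how large a detected community may grow before it is used for CoT synthesis. Purely as an illustration of what such a cap implies (the helper below is an assumption, not GraphGen's Leiden implementation), oversized communities can be split into chunks no larger than the limit:

```python
# Illustrative only: enforce a max_size cap on community partitions.
# The function name and chunking strategy are assumptions for this sketch.

def cap_community_size(communities, max_size=20):
    """Split any community larger than max_size into chunks of at most max_size members."""
    capped = []
    for community in communities:
        members = list(community)
        # Slice the member list into consecutive chunks of max_size
        for start in range(0, len(members), max_size):
            capped.append(members[start : start + max_size])
    return capped
```

With `max_size: 20`, a 45-node community would yield chunks of 20, 20, and 5 nodes, keeping each CoT prompt's subgraph bounded.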
Lines changed: 21 additions & 18 deletions

@@ -1,18 +1,21 @@
-input_data_type: raw
-input_file: resources/input_examples/raw_demo.jsonl
-output_data_type: multi_hop
-tokenizer: cl100k_base
-quiz_samples: 2
-traverse_strategy:
-  bidirectional: true
-  edge_sampling: max_loss
-  expand_method: max_width
-  isolated_node_strategy: ignore
-  max_depth: 1
-  max_extra_edges: 2
-  max_tokens: 256
-  loss_strategy: only_edge
-search:
-  enabled: false
-  search_types: ["google"]
-re_judge: false
+input_data_type: raw  # raw, chunked
+input_file: resources/input_examples/raw_demo.jsonl  # input file path, support json, jsonl, txt. See resources/input_examples for examples
+output_data_type: multi_hop  # atomic, aggregated, multi_hop, cot
+output_data_format: ChatML  # Alpaca, Sharegpt, ChatML
+tokenizer: cl100k_base  # tokenizer for counting tokens, support tiktoken tokenizer names and local tokenizer path
+search:  # web search configuration
+  enabled: false  # whether to enable web search
+  search_types: ["google"]  # search engine types, support: google, bing, uniprot, wikipedia
+quiz_and_judge_strategy:  # quiz and test whether the LLM masters the knowledge points
+  enabled: true
+  quiz_samples: 2  # number of quiz samples to generate
+  re_judge: false  # whether to re-judge the existing quiz samples
+traverse_strategy:  # strategy for clustering sub-graphs using comprehension loss
+  bidirectional: true  # whether to traverse the graph in both directions
+  edge_sampling: max_loss  # edge sampling strategy, support: random, max_loss, min_loss
+  expand_method: max_width  # expand method, support: max_width, max_depth
+  isolated_node_strategy: ignore  # strategy for isolated nodes, support: ignore, add
+  max_depth: 1  # maximum depth for graph traversal
+  max_extra_edges: 2  # max edges per direction (if expand_method="max_width")
+  max_tokens: 256  # restricts input length (if expand_method="max_tokens")
+  loss_strategy: only_edge  # defines loss computation focus, support: only_edge, both

graphgen/generate.py

Lines changed: 15 additions & 33 deletions

@@ -7,8 +7,7 @@
 from dotenv import load_dotenv
 
 from .graphgen import GraphGen
-from .models import OpenAIModel, Tokenizer, TraverseStrategy
-from .utils import logger, read_file, set_logger
+from .utils import logger, set_logger
 
 sys_path = os.path.abspath(os.path.dirname(__file__))
 
@@ -53,10 +52,8 @@ def main():
 
     with open(args.config_file, "r", encoding="utf-8") as f:
         config = yaml.load(f, Loader=yaml.FullLoader)
-    input_file = config["input_file"]
-    data = read_file(input_file)
-    output_data_type = config["output_data_type"]
 
+    output_data_type = config["output_data_type"]
     unique_id = int(time.time())
     set_logger(
         os.path.join(
@@ -72,41 +69,26 @@ def main():
         ),
     )
 
-    tokenizer_instance = Tokenizer(model_name=config["tokenizer"])
-    synthesizer_llm_client = OpenAIModel(
-        model_name=os.getenv("SYNTHESIZER_MODEL"),
-        api_key=os.getenv("SYNTHESIZER_API_KEY"),
-        base_url=os.getenv("SYNTHESIZER_BASE_URL"),
-        tokenizer_instance=tokenizer_instance,
-    )
-    trainee_llm_client = OpenAIModel(
-        model_name=os.getenv("TRAINEE_MODEL"),
-        api_key=os.getenv("TRAINEE_API_KEY"),
-        base_url=os.getenv("TRAINEE_BASE_URL"),
-        tokenizer_instance=tokenizer_instance,
-    )
-
-    graph_gen = GraphGen(
-        working_dir=working_dir,
-        unique_id=unique_id,
-        synthesizer_llm_client=synthesizer_llm_client,
-        trainee_llm_client=trainee_llm_client,
-        search_config=config["search"],
-        tokenizer_instance=tokenizer_instance,
-    )
+    graph_gen = GraphGen(working_dir=working_dir, unique_id=unique_id, config=config)
 
-    graph_gen.insert(data, config["input_data_type"])
+    graph_gen.insert()
 
     if config["search"]["enabled"]:
         graph_gen.search()
 
     # Use pipeline according to the output data type
     if output_data_type in ["atomic", "aggregated", "multi_hop"]:
-        graph_gen.quiz(max_samples=config["quiz_samples"])
-        graph_gen.judge(re_judge=config["re_judge"])
-        traverse_strategy = TraverseStrategy(**config["traverse_strategy"])
-        traverse_strategy.qa_form = output_data_type
-        graph_gen.traverse(traverse_strategy=traverse_strategy)
+        if "quiz_and_judge_strategy" in config and config[
+            "quiz_and_judge_strategy"
+        ].get("enabled", False):
+            graph_gen.quiz()
+            graph_gen.judge()
+        else:
+            logger.warning(
+                "Quiz and Judge strategy is disabled. Edge sampling falls back to random."
+            )
+            graph_gen.traverse_strategy.edge_sampling = "random"
+        graph_gen.traverse()
     elif output_data_type == "cot":
         graph_gen.generate_reasoning(method_params=config["method_params"])
     else:
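The refactored `main()` now takes everything from the YAML config and quiz/judge becomes optional: when `quiz_and_judge_strategy` is missing or disabled, comprehension-loss values are never computed, so edge sampling falls back to `random`. That branch can be reproduced standalone with a plain dict in place of the parsed config (no GraphGen objects involved; the helper name is ours, not the repo's):

```python
# Standalone sketch of the fallback logic in generate.py above.

def pick_edge_sampling(config: dict) -> str:
    """Return the effective edge sampling mode given a parsed YAML config."""
    quiz_cfg = config.get("quiz_and_judge_strategy")
    if quiz_cfg and quiz_cfg.get("enabled", False):
        # quiz + judge run, so the loss-aware mode from traverse_strategy applies
        return config["traverse_strategy"]["edge_sampling"]
    # quiz/judge disabled or absent: no loss values exist, fall back to random
    return "random"

config = {
    "traverse_strategy": {"edge_sampling": "max_loss"},
    "quiz_and_judge_strategy": {"enabled": True},
}
mode = pick_edge_sampling(config)  # "max_loss"
```

Note that `dict.get("enabled", False)` mirrors the diff's defensive check: an old-style config without the `quiz_and_judge_strategy` section degrades gracefully instead of raising a `KeyError`.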
