Commit 7cc8212

Feature/improve evaluation prompts (#147)

* refine: improve accuracy evaluation prompt definitions
* feat: add English prompts for consistency evaluation
* fix: fix too long prompt line
* fix: change hardcoded language to auto-detect
* fix: fix too long prompt line

1 parent afaafff

File tree

3 files changed: +228 −46 lines

graphgen/models/evaluator/kg/consistency_evaluator.py (20 additions, 12 deletions)

@@ -6,12 +6,9 @@
 from graphgen.bases import BaseGraphStorage, BaseKVStorage, BaseLLMWrapper
 from graphgen.bases.datatypes import Chunk
 from graphgen.templates.evaluation.kg.consistency_evaluation import (
-    ENTITY_DESCRIPTION_CONFLICT_PROMPT,
-    ENTITY_EXTRACTION_PROMPT,
-    ENTITY_TYPE_CONFLICT_PROMPT,
-    RELATION_CONFLICT_PROMPT,
+    CONSISTENCY_EVALUATION_PROMPT,
 )
-from graphgen.utils import logger
+from graphgen.utils import detect_main_language, logger

@@ -194,7 +191,9 @@ def _extract_entity_from_chunk(
         # Clean entity_id: remove surrounding quotes if present
         clean_entity_id = self._clean_entity_id(entity_id)

-        prompt = ENTITY_EXTRACTION_PROMPT.format(
+        # Detect language and get appropriate prompt
+        lang = detect_main_language(chunk.content)
+        prompt = CONSISTENCY_EVALUATION_PROMPT[lang]["ENTITY_EXTRACTION"].format(
             entity_name=clean_entity_id,
             chunk_content=chunk.content[:2000]
             if chunk.content

@@ -270,8 +269,11 @@ def _check_entity_type_consistency(
             if entity_type
         ]

-        prompt = ENTITY_TYPE_CONFLICT_PROMPT.format(
-            entity_name=entity_id, type_extractions="\n".join(type_list)
+        # Detect language from type extraction text
+        type_text = "\n".join(type_list)
+        lang = detect_main_language(type_text)
+        prompt = CONSISTENCY_EVALUATION_PROMPT[lang]["ENTITY_TYPE_CONFLICT"].format(
+            entity_name=entity_id, type_extractions=type_text
         )

         response = asyncio.run(self.llm_client.generate_answer(prompt))

@@ -313,8 +315,11 @@ def _check_entity_description_consistency(
             for chunk_id, description in valid_descriptions.items()
         ]

-        prompt = ENTITY_DESCRIPTION_CONFLICT_PROMPT.format(
-            entity_name=entity_id, descriptions="\n".join(desc_list)
+        # Detect language from description text
+        desc_text = "\n".join(desc_list)
+        lang = detect_main_language(desc_text)
+        prompt = CONSISTENCY_EVALUATION_PROMPT[lang]["ENTITY_DESCRIPTION_CONFLICT"].format(
+            entity_name=entity_id, descriptions=desc_text
         )

         response = asyncio.run(self.llm_client.generate_answer(prompt))

@@ -351,10 +356,13 @@ def _check_relation_consistency(
             if relation
         ]

-        prompt = RELATION_CONFLICT_PROMPT.format(
+        # Detect language from relation description text
+        rel_text = "\n".join(rel_list)
+        lang = detect_main_language(rel_text)
+        prompt = CONSISTENCY_EVALUATION_PROMPT[lang]["RELATION_CONFLICT"].format(
             source_entity=src_id,
             target_entity=dst_id,
-            relation_descriptions="\n".join(rel_list),
+            relation_descriptions=rel_text,
         )

         response = asyncio.run(self.llm_client.generate_answer(prompt))
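The auto-detect pattern introduced in the diff above can be sketched in isolation. The CJK-ratio heuristic below is a hypothetical stand-in for graphgen's `detect_main_language` (whose real implementation is not shown in this commit); the only contract taken from the diff is that the function returns one of the keys ("zh"/"en") used to index CONSISTENCY_EVALUATION_PROMPT. The toy prompt dict here is likewise illustrative, not the real templates.

```python
# Sketch of the language-keyed prompt lookup. The detect function is a
# hypothetical stand-in; only the "zh"/"en" key contract comes from the diff.
PROMPTS = {
    "zh": {"ENTITY_EXTRACTION": "从文本块中提取实体 {entity_name}"},
    "en": {"ENTITY_EXTRACTION": "Extract entity {entity_name} from the text block"},
}

def detect_main_language(text: str) -> str:
    """Return "zh" if CJK characters dominate the text, else "en" (assumed contract)."""
    if not text:
        return "en"
    cjk = sum(1 for ch in text if "\u4e00" <= ch <= "\u9fff")
    return "zh" if cjk / len(text) > 0.5 else "en"

lang = detect_main_language("蛋白质A与蛋白质B相互作用")
prompt = PROMPTS[lang]["ENTITY_EXTRACTION"].format(entity_name="蛋白质A")
```

Detecting the language from the same text that is interpolated into the prompt (chunk content, joined descriptions, joined relation texts) is what lets each LLM call receive instructions in the language of its evidence.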

graphgen/templates/evaluation/kg/accuracy_evaluation.py (76 additions, 28 deletions)

@@ -1,15 +1,27 @@
 ENTITY_EVALUATION_PROMPT_ZH = """你是一个知识图谱质量评估专家。你的任务是从给定的文本块和提取的实体列表,评估实体提取的质量。
 
 评估维度:
-1. ACCURACY (准确性, 权重: 40%): 提取的实体是否正确,是否有误提取或错误识别
-2. COMPLETENESS (完整性, 权重: 40%): 是否遗漏了文本中的重要实体
-3. PRECISION (精确性, 权重: 20%): 提取的实体是否精确,命名是否准确
+1. ACCURACY (准确性, 权重: 40%): 提取的实体是否真实存在于文本中,是否存在误提取(False Positive)
+   - 检查:实体是否在文本中实际出现,是否将非实体文本误识别为实体
+   - 示例:文本提到"蛋白质A",但提取了文本中不存在的"蛋白质B" → 准确性低
+   - 示例:将"研究显示"这样的非实体短语提取为实体 → 准确性低
+
+2. COMPLETENESS (完整性, 权重: 40%): 是否遗漏了文本中的重要实体(Recall)
+   - 检查:文本中的重要实体是否都被提取,是否存在遗漏(False Negative)
+   - 示例:文本提到5个重要蛋白质,但只提取了3个 → 完整性低
+   - 示例:所有关键实体都被提取 → 完整性高
+
+3. PRECISION (精确性, 权重: 20%): 提取的实体命名是否精确、边界是否准确、类型是否正确
+   - 检查:实体名称是否完整准确,边界是否正确,实体类型分类是否正确
+   - 示例:应提取"人类胰岛素受体蛋白",但只提取了"胰岛素" → 精确性低(边界不准确)
+   - 示例:应分类为"蛋白质",但分类为"基因" → 精确性低(类型错误)
+   - 示例:应提取"COVID-19",但提取了"冠状病毒" → 精确性低(命名不够精确)
 
 评分标准(每个维度 0-1 分):
-- EXCELLENT (0.8-1.0): 高质量提取
-- GOOD (0.6-0.79): 良好质量,有少量问题
-- ACCEPTABLE (0.4-0.59): 可接受,有明显问题
-- POOR (0.0-0.39): 质量差,需要改进
+- EXCELLENT (0.8-1.0): 高质量提取,错误率 < 20%
+- GOOD (0.6-0.79): 良好质量,有少量问题,错误率 20-40%
+- ACCEPTABLE (0.4-0.59): 可接受,有明显问题,错误率 40-60%
+- POOR (0.0-0.39): 质量差,需要改进,错误率 > 60%
 
 综合评分 = 0.4 × Accuracy + 0.4 × Completeness + 0.2 × Precision

@@ -38,15 +50,27 @@
 Your task is to evaluate the quality of entity extraction from a given text block and extracted entity list.
 
 Evaluation Dimensions:
-1. ACCURACY (Weight: 40%): Whether the extracted entities are correct, and if there are any false extractions or misidentifications
-2. COMPLETENESS (Weight: 40%): Whether important entities from the text are missing
-3. PRECISION (Weight: 20%): Whether the extracted entities are precise and accurately named
+1. ACCURACY (Weight: 40%): Whether the extracted entities actually exist in the text, and if there are any false extractions (False Positives)
+   - Check: Do entities actually appear in the text? Are non-entity phrases incorrectly identified as entities?
+   - Example: Text mentions "Protein A", but "Protein B" (not in text) is extracted → Low accuracy
+   - Example: Phrases like "research shows" are extracted as entities → Low accuracy
+
+2. COMPLETENESS (Weight: 40%): Whether important entities from the text are missing (Recall, False Negatives)
+   - Check: Are all important entities from the text extracted? Are there any omissions?
+   - Example: Text mentions 5 important proteins, but only 3 are extracted → Low completeness
+   - Example: All key entities are extracted → High completeness
+
+3. PRECISION (Weight: 20%): Whether extracted entities are precisely named, have correct boundaries, and correct types
+   - Check: Are entity names complete and accurate? Are boundaries correct? Are entity types correctly classified?
+   - Example: Should extract "Human Insulin Receptor Protein", but only "Insulin" is extracted → Low precision (incorrect boundary)
+   - Example: Should be classified as "Protein", but classified as "Gene" → Low precision (incorrect type)
+   - Example: Should extract "COVID-19", but "Coronavirus" is extracted → Low precision (naming not precise enough)
 
 Scoring Criteria (0-1 scale for each dimension):
-- EXCELLENT (0.8-1.0): High-quality extraction
-- GOOD (0.6-0.79): Good quality with minor issues
-- ACCEPTABLE (0.4-0.59): Acceptable with noticeable issues
-- POOR (0.0-0.39): Poor quality, needs improvement
+- EXCELLENT (0.8-1.0): High-quality extraction, error rate < 20%
+- GOOD (0.6-0.79): Good quality with minor issues, error rate 20-40%
+- ACCEPTABLE (0.4-0.59): Acceptable with noticeable issues, error rate 40-60%
+- POOR (0.0-0.39): Poor quality, needs improvement, error rate > 60%
 
 Overall Score = 0.4 × Accuracy + 0.4 × Completeness + 0.2 × Precision

@@ -74,15 +98,27 @@
 RELATION_EVALUATION_PROMPT_ZH = """你是一个知识图谱质量评估专家。你的任务是从给定的文本块和提取的关系列表,评估关系抽取的质量。
 
 评估维度:
-1. ACCURACY (准确性, 权重: 40%): 提取的关系是否正确,关系描述是否准确
-2. COMPLETENESS (完整性, 权重: 40%): 是否遗漏了文本中的重要关系
-3. PRECISION (精确性, 权重: 20%): 关系描述是否精确,是否过于宽泛
+1. ACCURACY (准确性, 权重: 40%): 提取的关系是否真实存在于文本中,是否存在误提取(False Positive)
+   - 检查:关系是否在文本中实际表达,是否将不存在的关系误识别为关系
+   - 示例:文本中A和B没有关系,但提取了"A-作用于->B" → 准确性低
+   - 示例:将文本中的并列关系误识别为因果关系 → 准确性低
+
+2. COMPLETENESS (完整性, 权重: 40%): 是否遗漏了文本中的重要关系(Recall)
+   - 检查:文本中表达的重要关系是否都被提取,是否存在遗漏(False Negative)
+   - 示例:文本明确表达了5个关系,但只提取了3个 → 完整性低
+   - 示例:所有关键关系都被提取 → 完整性高
+
+3. PRECISION (精确性, 权重: 20%): 关系描述是否精确,关系类型是否正确,是否过于宽泛
+   - 检查:关系类型是否准确,关系描述是否具体,是否使用了过于宽泛的关系类型
+   - 示例:应提取"抑制"关系,但提取了"影响"关系 → 精确性低(类型不够精确)
+   - 示例:应提取"直接结合",但提取了"相关" → 精确性低(描述过于宽泛)
+   - 示例:关系方向是否正确(如"A激活B" vs "B被A激活")→ 精确性检查
 
 评分标准(每个维度 0-1 分):
-- EXCELLENT (0.8-1.0): 高质量提取
-- GOOD (0.6-0.79): 良好质量,有少量问题
-- ACCEPTABLE (0.4-0.59): 可接受,有明显问题
-- POOR (0.0-0.39): 质量差,需要改进
+- EXCELLENT (0.8-1.0): 高质量提取,错误率 < 20%
+- GOOD (0.6-0.79): 良好质量,有少量问题,错误率 20-40%
+- ACCEPTABLE (0.4-0.59): 可接受,有明显问题,错误率 40-60%
+- POOR (0.0-0.39): 质量差,需要改进,错误率 > 60%
 
 综合评分 = 0.4 × Accuracy + 0.4 × Completeness + 0.2 × Precision

@@ -111,15 +147,27 @@
 Your task is to evaluate the quality of relation extraction from a given text block and extracted relation list.
 
 Evaluation Dimensions:
-1. ACCURACY (Weight: 40%): Whether the extracted relations are correct and the relation descriptions are accurate
-2. COMPLETENESS (Weight: 40%): Whether important relations from the text are missing
-3. PRECISION (Weight: 20%): Whether the relation descriptions are precise and not overly broad
+1. ACCURACY (Weight: 40%): Whether the extracted relations actually exist in the text, and if there are any false extractions (False Positives)
+   - Check: Do relations actually appear in the text? Are non-existent relations incorrectly identified?
+   - Example: Text shows no relation between A and B, but "A-acts_on->B" is extracted → Low accuracy
+   - Example: A parallel relationship in text is misidentified as a causal relationship → Low accuracy
+
+2. COMPLETENESS (Weight: 40%): Whether important relations from the text are missing (Recall, False Negatives)
+   - Check: Are all important relations expressed in the text extracted? Are there any omissions?
+   - Example: Text explicitly expresses 5 relations, but only 3 are extracted → Low completeness
+   - Example: All key relations are extracted → High completeness
+
+3. PRECISION (Weight: 20%): Whether relation descriptions are precise, relation types are correct, and not overly broad
+   - Check: Are relation types accurate? Are relation descriptions specific? Are overly broad relation types used?
+   - Example: Should extract "inhibits" relation, but "affects" is extracted → Low precision (type not precise enough)
+   - Example: Should extract "directly binds", but "related" is extracted → Low precision (description too broad)
+   - Example: Is relation direction correct (e.g., "A activates B" vs "B is activated by A") → Precision check
 
 Scoring Criteria (0-1 scale for each dimension):
-- EXCELLENT (0.8-1.0): High-quality extraction
-- GOOD (0.6-0.79): Good quality with minor issues
-- ACCEPTABLE (0.4-0.59): Acceptable with noticeable issues
-- POOR (0.0-0.39): Poor quality, needs improvement
+- EXCELLENT (0.8-1.0): High-quality extraction, error rate < 20%
+- GOOD (0.6-0.79): Good quality with minor issues, error rate 20-40%
+- ACCEPTABLE (0.4-0.59): Acceptable with noticeable issues, error rate 40-60%
+- POOR (0.0-0.39): Poor quality, needs improvement, error rate > 60%
 
 Overall Score = 0.4 × Accuracy + 0.4 × Completeness + 0.2 × Precision
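The weighting and score banding shared by all four prompts above reduce to a few lines. The helper names below are illustrative; only the 0.4/0.4/0.2 weights and the EXCELLENT/GOOD/ACCEPTABLE/POOR bands come from the prompt text.

```python
# Illustrative helpers; the weights and bands are taken verbatim from the
# accuracy-evaluation prompts (Overall = 0.4*Accuracy + 0.4*Completeness + 0.2*Precision).
def overall_score(accuracy: float, completeness: float, precision: float) -> float:
    return 0.4 * accuracy + 0.4 * completeness + 0.2 * precision

def grade(score: float) -> str:
    # Bands from the prompts: 0.8-1.0, 0.6-0.79, 0.4-0.59, 0.0-0.39
    if score >= 0.8:
        return "EXCELLENT"
    if score >= 0.6:
        return "GOOD"
    if score >= 0.4:
        return "ACCEPTABLE"
    return "POOR"

s = overall_score(0.9, 0.7, 0.5)  # 0.36 + 0.28 + 0.10 = 0.74
```

Note that accuracy and completeness dominate by design: a highly precise extraction that misses half the entities still lands in a low band.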

graphgen/templates/evaluation/kg/consistency_evaluation.py (132 additions, 6 deletions)

@@ -1,4 +1,4 @@
-ENTITY_TYPE_CONFLICT_PROMPT = """你是一个知识图谱一致性评估专家。你的任务是判断同一个实体在不同文本块中被提取为不同的类型,是否存在语义冲突。
+ENTITY_TYPE_CONFLICT_PROMPT_ZH = """你是一个知识图谱一致性评估专家。你的任务是判断同一个实体在不同文本块中被提取为不同的类型,是否存在语义冲突。
 
 实体名称:{entity_name}
 

@@ -21,7 +21,38 @@
 }}
 """
 
-ENTITY_DESCRIPTION_CONFLICT_PROMPT = """你是一个知识图谱一致性评估专家。你的任务是判断同一个实体在不同文本块中的描述是否存在语义冲突。
+ENTITY_TYPE_CONFLICT_PROMPT_EN = (
+    """You are a Knowledge Graph Consistency Assessment Expert. """
+    """Your task is to determine whether there are semantic conflicts """
+    """when the same entity is extracted as different types in different text blocks.
+
+Entity Name: {entity_name}
+
+Type extraction results from different text blocks:
+{type_extractions}
+
+Preset entity type list (for reference):
+concept, date, location, keyword, organization, person, event, work, nature, """
+    """artificial, science, technology, mission, gene
+
+Please determine whether these types have semantic conflicts """
+    """(i.e., whether they describe the same category of things, """
+    """or if there are contradictions).
+Note: If types are just different expressions of the same concept """
+    """(such as concept and keyword), it may not be considered a serious conflict.
+
+Please return in JSON format:
+{{
+    "has_conflict": <true/false>,
+    "conflict_severity": <float between 0-1, where 0 means no conflict, 1 means severe conflict>,
+    "conflict_reasoning": "<reasoning for conflict judgment>",
+    "conflicting_types": ["<pairs of conflicting types>"],
+    "recommended_type": "<if there is a conflict, the recommended correct type (must be one of the preset types)>"
+}}
+"""
+)
+
+ENTITY_DESCRIPTION_CONFLICT_PROMPT_ZH = """你是一个知识图谱一致性评估专家。你的任务是判断同一个实体在不同文本块中的描述是否存在语义冲突。
 
 实体名称:{entity_name}
 

@@ -40,7 +71,32 @@
 }}
 """
 
-RELATION_CONFLICT_PROMPT = """你是一个知识图谱一致性评估专家。你的任务是判断同一对实体在不同文本块中的关系描述是否存在语义冲突。
+ENTITY_DESCRIPTION_CONFLICT_PROMPT_EN = (
+    """You are a Knowledge Graph Consistency Assessment Expert. """
+    """Your task is to determine whether there are semantic conflicts """
+    """in the descriptions of the same entity across different text blocks.
+
+Entity Name: {entity_name}
+
+Descriptions from different text blocks:
+{descriptions}
+
+Please determine whether these descriptions have semantic conflicts """
+    """(i.e., whether they describe the same entity, """
+    """or if there is contradictory information).
+
+Please return in JSON format:
+{{
+    "has_conflict": <true/false>,
+    "conflict_severity": <float between 0-1>,
+    "conflict_reasoning": "<reasoning for conflict judgment>",
+    "conflicting_descriptions": ["<pairs of conflicting descriptions>"],
+    "conflict_details": "<specific conflict content>"
+}}
+"""
+)
+
+RELATION_CONFLICT_PROMPT_ZH = """你是一个知识图谱一致性评估专家。你的任务是判断同一对实体在不同文本块中的关系描述是否存在语义冲突。
 
 实体对:{source_entity} -> {target_entity}
 

@@ -58,7 +114,29 @@
 }}
 """
 
-ENTITY_EXTRACTION_PROMPT = """从以下文本块中提取指定实体的类型和描述。
+RELATION_CONFLICT_PROMPT_EN = (
+    """You are a Knowledge Graph Consistency Assessment Expert. """
+    """Your task is to determine whether there are semantic conflicts """
+    """in the relation descriptions of the same entity pair across different text blocks.
+
+Entity Pair: {source_entity} -> {target_entity}
+
+Relation descriptions from different text blocks:
+{relation_descriptions}
+
+Please determine whether these relation descriptions have semantic conflicts.
+
+Please return in JSON format:
+{{
+    "has_conflict": <true/false>,
+    "conflict_severity": <float between 0-1>,
+    "conflict_reasoning": "<reasoning for conflict judgment>",
+    "conflicting_relations": ["<pairs of conflicting relation descriptions>"]
+}}
+"""
+)
+
+ENTITY_EXTRACTION_PROMPT_ZH = """从以下文本块中提取指定实体的类型和描述。
 
 **重要**:你只需要提取指定的实体,不要提取其他实体。
 

@@ -96,7 +174,55 @@
 }}
 """
 
+ENTITY_EXTRACTION_PROMPT_EN = """Extract the type and description of the specified entity from the following text block.
+
+**Important**: You should only extract the specified entity, do not extract other entities.
+
+Entity Name: {entity_name}
+
+Text Block:
+{chunk_content}
+
+Please find and extract the following information for **this entity only** (entity name: {entity_name}) from the text block:
+
+1. entity_type: Entity type, must be one of the following preset types (lowercase):
+   - concept: concept
+   - date: date
+   - location: location
+   - keyword: keyword
+   - organization: organization
+   - person: person
+   - event: event
+   - work: work
+   - nature: nature
+   - artificial: artificial
+   - science: science
+   - technology: technology
+   - mission: mission
+   - gene: gene
+
+   If the type cannot be determined, please use "concept" as the default value.
+
+2. description: Entity description (briefly describe the role and characteristics of this entity in the text)
+
+Please return in JSON format:
+{{
+    "entity_type": "<entity type (must be one of the preset types above)>",
+    "description": "<entity description>"
+}}
+"""
+
 CONSISTENCY_EVALUATION_PROMPT = {
-    "en": "",
-    "zh": ""
+    "zh": {
+        "ENTITY_TYPE_CONFLICT": ENTITY_TYPE_CONFLICT_PROMPT_ZH,
+        "ENTITY_DESCRIPTION_CONFLICT": ENTITY_DESCRIPTION_CONFLICT_PROMPT_ZH,
+        "RELATION_CONFLICT": RELATION_CONFLICT_PROMPT_ZH,
+        "ENTITY_EXTRACTION": ENTITY_EXTRACTION_PROMPT_ZH,
+    },
+    "en": {
+        "ENTITY_TYPE_CONFLICT": ENTITY_TYPE_CONFLICT_PROMPT_EN,
+        "ENTITY_DESCRIPTION_CONFLICT": ENTITY_DESCRIPTION_CONFLICT_PROMPT_EN,
+        "RELATION_CONFLICT": RELATION_CONFLICT_PROMPT_EN,
+        "ENTITY_EXTRACTION": ENTITY_EXTRACTION_PROMPT_EN,
+    },
 }
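One detail worth noting in the templates above: the JSON skeletons use doubled braces (`{{`, `}}`) so that `str.format` emits them as literal braces while still substituting the single-brace placeholders. A small self-contained check, using a shortened stand-in template (the real RELATION_CONFLICT_PROMPT_EN is much longer):

```python
# Shortened, hypothetical stand-in for a conflict prompt; only the
# brace-escaping behaviour of str.format is demonstrated here.
TEMPLATE = (
    "Entity Pair: {source_entity} -> {target_entity}\n"
    "Please return in JSON format:\n"
    '{{\n    "has_conflict": <true/false>\n}}'
)

# Doubled braces collapse to single literal braces; placeholders are filled.
prompt = TEMPLATE.format(source_entity="Protein A", target_entity="Protein B")
```

Without the doubling, `.format` would raise `KeyError` on the JSON keys, so any edit to these templates has to preserve the `{{`/`}}` escaping.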
