
Commit 146d604

e06084 and actions-user authored
feat: add Instruction Quality Evaluation (#313)
* feat: add Instruction Quality Evaluation
* 📚 Auto-update metrics documentation

Co-authored-by: GitHub Action <[email protected]>
1 parent 1de03d7 commit 146d604

File tree

7 files changed: +1540 -0 lines changed
Lines changed: 20 additions & 0 deletions
@@ -0,0 +1,20 @@
"""
Instruction Quality Evaluation Metrics

This module provides LLM-based evaluators for assessing instruction quality
in SFT (Supervised Fine-Tuning) datasets, specifically focusing on:

1. Instruction Clarity - Evaluates how clear and well-defined instructions are
2. Task Difficulty - Assesses the complexity and difficulty level of tasks

These metrics are based on recent research in instruction following and
LLM training data quality assessment.
"""

from dingo.model.llm.instruction_quality.llm_instruction_clarity import LLMInstructionClarity
from dingo.model.llm.instruction_quality.llm_task_difficulty import LLMTaskDifficulty

__all__ = [
    "LLMInstructionClarity",
    "LLMTaskDifficulty",
]
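The clarity evaluator exported above scores four dimensions worth 2.5 points each and compares the total against a default threshold of 6.0. As a minimal standalone sketch of that arithmetic (the helper name `clarity_verdict` is hypothetical and not part of the dingo API):

```python
# Hypothetical helper illustrating the 4 x 2.5-point scoring scheme and the
# default 6.0 pass threshold used by LLMInstructionClarity; not dingo API.
DIMENSIONS = ("self_descriptiveness", "consistency", "specificity", "completeness")

def clarity_verdict(dimensions: dict, threshold: float = 6.0) -> tuple[float, str]:
    """Sum the four dimension scores (each capped at 2.5) and apply the threshold."""
    score = sum(min(float(dimensions.get(d, 0)), 2.5) for d in DIMENSIONS)
    label = ("QUALITY_GOOD.INSTRUCTION_CLARITY_PASS" if score >= threshold
             else "QUALITY_BAD.INSTRUCTION_CLARITY_FAIL")
    return score, label

score, label = clarity_verdict(
    {"self_descriptiveness": 2.5, "consistency": 2.0,
     "specificity": 2.0, "completeness": 2.0}
)
print(score, label)  # 8.5 QUALITY_GOOD.INSTRUCTION_CLARITY_PASS
```

Capping each dimension at 2.5 keeps a malformed LLM response from inflating the total past the 0-10 scale the prompt defines.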
Lines changed: 306 additions & 0 deletions
@@ -0,0 +1,306 @@
"""
Instruction Clarity Evaluator

Based on recent research:
- IFEval: Instruction Following Evaluation (Google, 2023)
- Self-Instruct (University of Washington, 2023)
- Alpaca: A Strong, Replicable Instruction-Following Model (Stanford, 2023)

Evaluation dimensions:
1. Self-Descriptiveness: is the instruction self-contained, needing no extra context?
2. Consistency: is the instruction internally consistent, free of contradictions?
3. Specificity: is the instruction concrete and unambiguous?
4. Completeness: does the instruction include all necessary information?
"""

import json

from dingo.io.output.eval_detail import EvalDetail
from dingo.model import Model
from dingo.model.llm.base_openai import BaseOpenAI
from dingo.utils import log


@Model.llm_register("LLMInstructionClarity")
class LLMInstructionClarity(BaseOpenAI):
    """
    LLM-based instruction clarity evaluator.

    Evaluates instruction clarity along four dimensions:
    - Self-descriptiveness: does the instruction carry enough information?
    - Consistency: is it free of internal contradictions?
    - Specificity: is it explicit and concrete?
    - Completeness: does it include all necessary information?
    """

    # Metadata for documentation generation
    _metric_info = {
        "category": "SFT Data Assessment Metrics",
        "quality_dimension": "INSTRUCTION_CLARITY",
        "metric_name": "LLMInstructionClarity",
        "description": "Evaluates instruction clarity across four dimensions: self-descriptiveness, consistency, specificity, and completeness",
        "paper_source": "IFEval (Google, 2023), Self-Instruct (UW, 2023)",
        "evaluation_results": "Returns clarity score (0-10) and detailed analysis"
    }

    prompt = """
# Role
You are an expert in evaluating instruction quality for Large Language Model training data.

# Task
Evaluate the clarity of the given instruction across four dimensions.

# Evaluation Dimensions

## 1. Self-Descriptiveness
**Definition**: Does the instruction contain sufficient information to be understood without additional context?

**Scoring**:
- **High (2.5)**: Complete self-contained instruction with all necessary details
  - Example: "Write a Python function that takes a list of integers and returns the sum of all even numbers. Include docstring and type hints."
- **Medium (1.5)**: Mostly clear but may need minor assumptions
  - Example: "Write a function to sum even numbers in a list."
- **Low (0.5)**: Requires significant external context or assumptions
  - Example: "Do that thing with the numbers."

## 2. Consistency
**Definition**: Are all parts of the instruction aligned without contradictions?

**Scoring**:
- **High (2.5)**: Perfectly consistent throughout
  - Example: "Write a formal academic essay on climate change using APA citation style and maintain a professional tone."
- **Medium (1.5)**: Minor inconsistencies that don't fundamentally conflict
  - Example: "Write a casual blog post but use academic references."
- **Low (0.5)**: Major contradictions
  - Example: "Write a 500-word essay in under 100 words."

## 3. Specificity
**Definition**: Is the instruction concrete and unambiguous?

**Scoring**:
- **High (2.5)**: Very specific with clear success criteria
  - Example: "Generate exactly 5 creative product names for an eco-friendly water bottle. Each name should be 2-3 words and include at least one nature-related term."
- **Medium (1.5)**: Somewhat specific but allows interpretation
  - Example: "Generate some creative names for a water bottle."
- **Low (0.5)**: Vague and ambiguous
  - Example: "Make something cool."

## 4. Completeness
**Definition**: Does the instruction include all necessary information for task completion?

**Scoring**:
- **High (2.5)**: All required elements specified (input, output, constraints, format)
  - Example: "Given a JSON file with user data, extract all email addresses, validate them using regex, and output to a CSV file with columns: name, email, valid_status."
- **Medium (1.5)**: Most elements present but some details missing
  - Example: "Extract email addresses from a file and validate them."
- **Low (0.5)**: Critical information missing
  - Example: "Process the data."

# Scoring System
- **Total Score**: 0-10 (sum of all four dimensions, each worth 2.5 points)
- **Threshold**: Default 6.0 (instructions below this score are considered unclear)

# Output Format
Return JSON only:
```json
{
    "score": 8.5,
    "dimensions": {
        "self_descriptiveness": 2.5,
        "consistency": 2.0,
        "specificity": 2.0,
        "completeness": 2.0
    },
    "issues": [],
    "strengths": ["Clear task definition", "Well-specified output format"],
    "suggestions": ["Could specify tone/style more explicitly"],
    "reason": "High-quality instruction with clear task definition and well-specified constraints. Minor improvement: explicitly specify the desired tone."
}
```

# Important Rules
1. Be strict but fair - real-world instructions aren't always perfect
2. Focus on whether the instruction enables successful task completion
3. Consider the instruction type (creative tasks may be intentionally open-ended)
4. Empty or extremely vague instructions should score 0-2
5. Professional SFT-quality instructions should score 7+

# Examples

**Example 1 - Excellent Instruction (Score: 9.5)**
Input: "Write a Python function named `calculate_discount` that takes two parameters: original_price (float) and discount_percentage (float, 0-100). Return the final price after applying the discount, rounded to 2 decimal places. Include input validation to ensure prices are positive and discounts are between 0-100. Add comprehensive docstring with examples."

Output:
```json
{
    "score": 9.5,
    "dimensions": {
        "self_descriptiveness": 2.5,
        "consistency": 2.5,
        "specificity": 2.5,
        "completeness": 2.0
    },
    "issues": [],
    "strengths": [
        "Specific function name and parameters",
        "Clear input/output specifications",
        "Validation requirements specified",
        "Format requirements (rounding) included"
    ],
    "suggestions": [
        "Could specify return type for type hints"
    ],
    "reason": "Excellent instruction with comprehensive details. Only minor omission: explicit mention of return type annotation."
}
```

**Example 2 - Poor Instruction (Score: 2.0)**
Input: "写个代码" ("Write some code")

Output:
```json
{
    "score": 2.0,
    "dimensions": {
        "self_descriptiveness": 0.5,
        "consistency": 0.5,
        "specificity": 0.5,
        "completeness": 0.5
    },
    "issues": [
        "No programming language specified",
        "No task description",
        "No input/output specification",
        "No context provided"
    ],
    "strengths": [],
    "suggestions": [
        "Specify programming language",
        "Describe what the code should do",
        "Define expected inputs and outputs",
        "Add any constraints or requirements"
    ],
    "reason": "Extremely vague instruction lacking all essential information. Impossible to determine what code is needed without additional context."
}
```

**Example 3 - Moderate Instruction (Score: 6.5)**
Input: "Summarize this article in a few sentences."

Output:
```json
{
    "score": 6.5,
    "dimensions": {
        "self_descriptiveness": 1.5,
        "consistency": 2.0,
        "specificity": 1.5,
        "completeness": 1.5
    },
    "issues": [
        "'Few sentences' is ambiguous (2? 3? 5?)",
        "No article content provided (assumes context)",
        "No specification of summary style/focus"
    ],
    "strengths": [
        "Clear task (summarization)",
        "No internal contradictions"
    ],
    "suggestions": [
        "Specify exact number of sentences (e.g., '3-5 sentences')",
        "Include the article content or reference",
        "Optionally specify summary focus (key findings, main argument, etc.)"
    ],
    "reason": "Decent instruction with clear intent but lacks precision. Needs more specific constraints and assumes article context is available."
}
```

# Now evaluate this instruction:
"""

    @classmethod
    def process_response(cls, response: str) -> EvalDetail:
        """Parse the LLM response and build the evaluation result."""
        log.info(f"LLM Response: {response}")
        result = EvalDetail(metric=cls.__name__)

        try:
            # Parse the JSON response, stripping possible markdown code fences
            response = response.strip()
            if response.startswith("```json"):
                response = response[7:]
            if response.startswith("```"):
                response = response[3:]
            if response.endswith("```"):
                response = response[:-3]
            response = response.strip()

            parsed = json.loads(response)

            # Extract the score and per-dimension details
            score = float(parsed.get("score", 0))
            dimensions = parsed.get("dimensions", {})
            issues = parsed.get("issues", [])
            strengths = parsed.get("strengths", [])
            suggestions = parsed.get("suggestions", [])
            reason = parsed.get("reason", "")

            # Build a detailed reason string
            detailed_reason = f"Instruction clarity score: {score}/10\n\n"
            detailed_reason += "Dimension scores:\n"
            detailed_reason += f"  - Self-descriptiveness: {dimensions.get('self_descriptiveness', 0)}/2.5\n"
            detailed_reason += f"  - Consistency: {dimensions.get('consistency', 0)}/2.5\n"
            detailed_reason += f"  - Specificity: {dimensions.get('specificity', 0)}/2.5\n"
            detailed_reason += f"  - Completeness: {dimensions.get('completeness', 0)}/2.5\n\n"

            if strengths:
                detailed_reason += "Strengths:\n"
                for s in strengths:
                    detailed_reason += f"  ✓ {s}\n"
                detailed_reason += "\n"

            if issues:
                detailed_reason += "Issues:\n"
                for i in issues:
                    detailed_reason += f"  ✗ {i}\n"
                detailed_reason += "\n"

            if suggestions:
                detailed_reason += "Suggestions:\n"
                for s in suggestions:
                    detailed_reason += f"  → {s}\n"
                detailed_reason += "\n"

            detailed_reason += f"Summary: {reason}"

            # Populate the result
            result.score = score
            result.reason = [detailed_reason]

            # Pass/fail decision (default threshold: 6.0)
            threshold = 6.0
            if hasattr(cls, 'dynamic_config') and cls.dynamic_config.parameters:
                threshold = cls.dynamic_config.parameters.get('threshold', 6.0)

            if score >= threshold:
                result.status = False
                result.label = ["QUALITY_GOOD.INSTRUCTION_CLARITY_PASS"]
            else:
                result.status = True
                result.label = ["QUALITY_BAD.INSTRUCTION_CLARITY_FAIL"]

        except json.JSONDecodeError as e:
            log.error(f"Failed to parse JSON response: {e}")
            result.status = True
            result.score = 0
            result.label = ["QUALITY_BAD.INSTRUCTION_CLARITY_ERROR"]
            result.reason = [f"Evaluation failed: JSON parse error - {str(e)}"]
        except Exception as e:
            log.error(f"Error processing response: {e}")
            result.status = True
            result.score = 0
            result.label = ["QUALITY_BAD.INSTRUCTION_CLARITY_ERROR"]
            result.reason = [f"Evaluation failed: {str(e)}"]

        return result
