feat: add Instruction Quality Evaluation #313
Conversation
Summary of Changes

Hello @e06084, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request introduces a significant new feature for evaluating the quality of instructions, which is critical for preparing high-quality datasets for Supervised Fine-Tuning (SFT) of Large Language Models. It provides robust, LLM-based mechanisms to automatically assess both the clarity and the difficulty of instructions, ensuring that training data is well-formed and appropriately challenging. The addition is supported by detailed documentation and practical examples to facilitate adoption and effective use.
Code Review
This pull request introduces two new LLM-based evaluators for instruction quality in Supervised Fine-Tuning (SFT) datasets: LLMInstructionClarity and LLMTaskDifficulty. The LLMInstructionClarity evaluator assesses instructions across dimensions like self-descriptiveness, consistency, specificity, and completeness, while LLMTaskDifficulty evaluates cognitive complexity, step complexity, domain knowledge, and constraint density. Both evaluators include detailed LLM prompts, parse JSON responses, and provide comprehensive evaluation results. A new documentation guide (instruction_quality_guide.md) and an example script (evaluate_instruction_quality.py) are added to demonstrate their usage, covering SFT data preparation, quality standards, and difficulty distribution analysis. The metrics.md file is updated to include these new metrics. Review comments suggest refactoring duplicated markdown stripping logic in the evaluator response processing, correcting the 'Paper Source' for the new metrics in metrics.md to reference academic papers, and making the hardcoded difficulty distribution thresholds in the example script configurable and consistent with the new guide's recommendations.
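To make the evaluators' shape concrete, here is a minimal sketch of the prompt-and-parse flow described above. The dimension names and the 0-10 scale come from this summary; the function names, prompt wording, and JSON keys are illustrative assumptions, not the PR's actual API:

```python
import json

# Illustrative sketch only: the clarity dimensions and 0-10 scale come from
# the review summary; build_clarity_prompt, the prompt text, and the JSON
# keys are hypothetical stand-ins for the evaluator's real implementation.
CLARITY_DIMENSIONS = ["self-descriptiveness", "consistency", "specificity", "completeness"]


def build_clarity_prompt(instruction: str) -> str:
    """Ask a judge LLM to score an instruction's clarity and reply as JSON."""
    dims = ", ".join(CLARITY_DIMENSIONS)
    return (
        f"Rate the following instruction from 0 to 10 on each of these dimensions: {dims}. "
        'Reply with JSON containing per-dimension scores, an overall "clarity_score", '
        'and a short "analysis".\n\n'
        f"Instruction:\n{instruction}"
    )


def parse_clarity_response(response: str) -> dict:
    """Parse the judge's JSON reply (after any markdown code fences are stripped)."""
    return json.loads(response)
```

A task-difficulty sketch would follow the same pattern, swapping in the four difficulty dimensions and a 1-10 level.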
# Strip any markdown code block markers
response = response.strip()
if response.startswith("```json"):
    response = response[7:]
if response.startswith("```"):
    response = response[3:]
if response.endswith("```"):
    response = response[:-3]
response = response.strip()
The logic for stripping markdown code block markers is duplicated here and in llm_task_difficulty.py. Consider extracting it into a shared utility function or a method on BaseOpenAI to improve maintainability and reduce redundancy; a sketch of such a helper follows the suggested change below.
Suggested change:

response = cls._strip_json_markdown(response)
parsed = json.loads(response)
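For reference, a minimal sketch of what such a shared helper could look like. The method name follows the suggestion above; placing it as a classmethod on BaseOpenAI, and the exact signature, are assumptions rather than part of this PR:

```python
class BaseOpenAI:
    # Only the proposed helper is sketched here; the rest of the evaluator
    # base class is omitted. Its placement on BaseOpenAI is an assumption
    # taken from the review note above.

    @classmethod
    def _strip_json_markdown(cls, response: str) -> str:
        """Remove surrounding markdown code-fence markers from an LLM reply."""
        response = response.strip()
        if response.startswith("```json"):
            response = response[7:]
        elif response.startswith("```"):
            response = response[3:]
        if response.endswith("```"):
            response = response[:-3]
        return response.strip()
```

Both evaluators could then call cls._strip_json_markdown(response) immediately before json.loads, as in the suggested change above.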
The same stripping logic, duplicated in the second evaluator:

response = response.strip()
if response.startswith("```json"):
    response = response[7:]
if response.startswith("```"):
    response = response[3:]
if response.endswith("```"):
    response = response[:-3]
response = response.strip()
In metrics.md, the new rows read:

| `LLMInstructionClarity` | LLMInstructionClarity | Evaluates instruction clarity across four dimensions: self-descriptiveness, consistency, specificity, and completeness | Internal Implementation | [📊 See Results](Returns clarity score (0-10) and detailed analysis) |
| `LLMTaskDifficulty` | LLMTaskDifficulty | Evaluates task difficulty across cognitive complexity, step complexity, domain knowledge, and constraint density | Internal Implementation | [📊 See Results](Returns difficulty level (1-10) with detailed breakdown) |
The Paper Source for LLMInstructionClarity and LLMTaskDifficulty is listed as "Internal Implementation". However, the Python files for these evaluators (llm_instruction_clarity.py and llm_task_difficulty.py) explicitly reference academic papers in their _metric_info and docstrings. Please update this documentation to reflect the actual research papers that these metrics are based on for better transparency and academic rigor.
Suggested change:

| `LLMInstructionClarity` | LLMInstructionClarity | Evaluates instruction clarity across four dimensions: self-descriptiveness, consistency, specificity, and completeness | IFEval (Google, 2023), Self-Instruct (UW, 2023) | [📊 See Results](Returns clarity score (0-10) and detailed analysis) |
| `LLMTaskDifficulty` | LLMTaskDifficulty | Evaluates task difficulty across cognitive complexity, step complexity, domain knowledge, and constraint density | OpenAI Math Problem Difficulty (2024), Google DeepMind Task Complexity (2023) | [📊 See Results](Returns difficulty level (1-10) with detailed breakdown) |
In evaluate_instruction_quality.py:

if difficulty_counts["Easy (0-3)"] / len(good_list) > 0.3:
    print(" ⚠️ Too many easy tasks; consider raising the difficulty or filtering out some of the easy ones")
The condition difficulty_counts["Easy (0-3)"] / len(good_list) > 0.3 uses a hardcoded 0.3 (30%) for the threshold. It might be more robust to define this as a constant or make it configurable, especially since the recommended distribution for 'Easy' tasks is 15-20% in the instruction_quality_guide.md.
Suggested change:

# Ideal distribution: Easy 20%, Moderate 50%, Hard 25%, Expert 5%
easy_threshold = 0.2  # Or a configurable value
if difficulty_counts["Easy (0-3)"] / len(good_list) > easy_threshold:
    print(" ⚠️ Too many easy tasks; consider raising the difficulty or filtering out some of the easy ones")
if difficulty_counts["Moderate (4-6)"] / len(good_list) < 0.3:
    print(" ⚠️ Not enough moderate-difficulty tasks; these are the core of SFT data")
Similar to the previous comment, the condition difficulty_counts["Moderate (4-6)"] / len(good_list) < 0.3 uses a hardcoded 0.3 (30%) threshold. The instruction_quality_guide.md suggests 50-60% for 'Moderate' tasks. Using a constant or configurable value would improve clarity and maintainability.
Suggested change:

moderate_threshold = 0.5  # Or a configurable value
if difficulty_counts["Moderate (4-6)"] / len(good_list) < moderate_threshold:
    print(" ⚠️ Not enough moderate-difficulty tasks; these are the core of SFT data")
if difficulty_counts["Hard (7-8)"] / len(good_list) > 0.4:
    print(" ⚠️ Too many hard tasks, which may hurt training efficiency")
The condition difficulty_counts["Hard (7-8)"] / len(good_list) > 0.4 uses a hardcoded 0.4 (40%) threshold. The instruction_quality_guide.md suggests 20-25% for 'Hard' tasks. It would be better to use a constant or configurable value for this threshold; a consolidated sketch of configurable thresholds follows the suggested change below.
Suggested change:

hard_threshold = 0.25  # Or a configurable value
if difficulty_counts["Hard (7-8)"] / len(good_list) > hard_threshold:
    print(" ⚠️ Too many hard tasks, which may hurt training efficiency")
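Pulling the three threshold suggestions together, here is a minimal sketch of how the example script could centralize the distribution checks. The DIFFICULTY_THRESHOLDS table and the check_difficulty_distribution helper are hypothetical names, and the target shares follow the guide's recommended Easy 20% / Moderate 50% / Hard 25% / Expert 5% split quoted above:

```python
# Hypothetical consolidation of the three inline checks above. The names
# DIFFICULTY_THRESHOLDS and check_difficulty_distribution are illustrative;
# the target shares follow the guide's recommended distribution.
DIFFICULTY_THRESHOLDS = {
    "Easy (0-3)": ("max", 0.20, "Too many easy tasks; consider raising difficulty or filtering some out"),
    "Moderate (4-6)": ("min", 0.50, "Not enough moderate-difficulty tasks; these are the core of SFT data"),
    "Hard (7-8)": ("max", 0.25, "Too many hard tasks, which may hurt training efficiency"),
}


def check_difficulty_distribution(difficulty_counts: dict, total: int) -> None:
    """Warn when the observed difficulty mix drifts from the recommended shares."""
    for bucket, (kind, threshold, warning) in DIFFICULTY_THRESHOLDS.items():
        share = difficulty_counts.get(bucket, 0) / total
        if (kind == "max" and share > threshold) or (kind == "min" and share < threshold):
            print(f" ⚠️ {warning} ({bucket}: {share:.0%} vs. target {threshold:.0%})")
```

Used in place of the three inline checks, this keeps the thresholds in one place and makes them easy to keep consistent with instruction_quality_guide.md.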
* feat: add Instruction Quality Evaluation
* 📚 Auto-update metrics documentation

---------

Co-authored-by: GitHub Action <[email protected]>
No description provided.