
Conversation

@e06084 (Collaborator) commented Dec 23, 2025

No description provided.

@gemini-code-assist (Contributor) commented:

Summary of Changes

Hello @e06084, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a significant new feature for evaluating the quality of instructions, which is critical for preparing high-quality datasets for Supervised Fine-Tuning (SFT) of Large Language Models. It provides robust, LLM-based mechanisms to automatically assess both the clarity and the difficulty of instructions, ensuring that training data is well-formed and appropriately challenging. The addition is supported by detailed documentation and practical examples to facilitate adoption and effective use.

Highlights

  • New Instruction Quality Evaluation Metrics: Introduced two new LLM-based evaluators: LLMInstructionClarity and LLMTaskDifficulty to assess the quality of instructions in SFT (Supervised Fine-Tuning) datasets.
  • Instruction Clarity Evaluator: The LLMInstructionClarity metric evaluates instructions across four dimensions: self-descriptiveness, consistency, specificity, and completeness, providing a score from 0-10.
  • Task Difficulty Evaluator: The LLMTaskDifficulty metric assesses task complexity based on cognitive complexity, step complexity, domain knowledge requirements, and constraint density, also providing a score from 0-10 and categorizing tasks into difficulty levels (Easy, Moderate, Hard, Expert); a short score-to-level bucketing sketch follows this list.
  • Comprehensive Documentation: A new guide, instruction_quality_guide.md, has been added, detailing the new metrics, their scoring criteria, usage examples, practical advice for SFT data preparation, and academic references.
  • Usage Examples and Test Data: A new example script (evaluate_instruction_quality.py) demonstrates how to use these evaluators for clarity, difficulty, combined assessment, and difficulty distribution analysis, accompanied by new sample instruction data (instructions.jsonl).
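
To make the Task Difficulty Evaluator highlight concrete, here is a tiny illustrative sketch. It is not code from this PR; the function name is hypothetical, and the bucket edges are assumed to match the labels used in the example script's distribution analysis (Easy 0-3, Moderate 4-6, Hard 7-8, Expert 9-10).

# Hypothetical helper: map a 0-10 difficulty score to the level buckets
# used by the example script's distribution analysis (assumed edges).
def difficulty_level(score: float) -> str:
    if score <= 3:
        return "Easy (0-3)"
    if score <= 6:
        return "Moderate (4-6)"
    if score <= 8:
        return "Hard (7-8)"
    return "Expert (9-10)"

print(difficulty_level(5))  # -> Moderate (4-6)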


@gemini-code-assist bot left a comment


Code Review

This pull request introduces two new LLM-based evaluators for instruction quality in Supervised Fine-Tuning (SFT) datasets: LLMInstructionClarity and LLMTaskDifficulty. The LLMInstructionClarity evaluator assesses instructions across dimensions like self-descriptiveness, consistency, specificity, and completeness, while LLMTaskDifficulty evaluates cognitive complexity, step complexity, domain knowledge, and constraint density. Both evaluators include detailed LLM prompts, parse JSON responses, and provide comprehensive evaluation results. A new documentation guide (instruction_quality_guide.md) and an example script (evaluate_instruction_quality.py) are added to demonstrate their usage, covering SFT data preparation, quality standards, and difficulty distribution analysis. The metrics.md file is updated to include these new metrics. Review comments suggest refactoring duplicated markdown stripping logic in the evaluator response processing, correcting the 'Paper Source' for the new metrics in metrics.md to reference academic papers, and making the hardcoded difficulty distribution thresholds in the example script configurable and consistent with the new guide's recommendations.

Comment on lines +229 to +237
# Remove any markdown code block markers
response = response.strip()
if response.startswith("```json"):
    response = response[7:]
if response.startswith("```"):
    response = response[3:]
if response.endswith("```"):
    response = response[:-3]
response = response.strip()

Severity: medium

The logic for stripping markdown code block markers is duplicated here and in llm_task_difficulty.py. Consider extracting this into a shared utility function or a method in BaseOpenAI to improve maintainability and reduce redundancy.

Suggested change
# Remove any markdown code block markers
response = response.strip()
if response.startswith("```json"):
    response = response[7:]
if response.startswith("```"):
    response = response[3:]
if response.endswith("```"):
    response = response[:-3]
response = response.strip()
response = cls._strip_json_markdown(response)
parsed = json.loads(response)
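
For reference, a minimal sketch of what such a shared helper might look like. It is not part of this PR; the assumption (implied by the cls._strip_json_markdown(response) call in the suggestion) is that it would be exposed on BaseOpenAI, e.g. as a @staticmethod, so both evaluators can reuse it.

def _strip_json_markdown(response: str) -> str:
    """Strip optional ```json / ``` fences from an LLM JSON reply (illustrative sketch, not in the PR)."""
    response = response.strip()
    if response.startswith("```json"):
        response = response[7:]
    if response.startswith("```"):
        response = response[3:]
    if response.endswith("```"):
        response = response[:-3]
    return response.strip()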

Comment on lines +271 to +278
response = response.strip()
if response.startswith("```json"):
    response = response[7:]
if response.startswith("```"):
    response = response[3:]
if response.endswith("```"):
    response = response[:-3]
response = response.strip()

Severity: medium

This JSON markdown stripping logic is duplicated from llm_instruction_clarity.py. It would be more maintainable to extract this into a shared helper function or a method in the BaseOpenAI class.

Suggested change
            response = cls._strip_json_markdown(response)
            parsed = json.loads(response)

Comment on lines +35 to +36
| `LLMInstructionClarity` | LLMInstructionClarity | Evaluates instruction clarity across four dimensions: self-descriptiveness, consistency, specificity, and completeness | Internal Implementation | [📊 See Results](Returns clarity score (0-10) and detailed analysis) |
| `LLMTaskDifficulty` | LLMTaskDifficulty | Evaluates task difficulty across cognitive complexity, step complexity, domain knowledge, and constraint density | Internal Implementation | [📊 See Results](Returns difficulty level (1-10) with detailed breakdown) |

Severity: medium

The Paper Source for LLMInstructionClarity and LLMTaskDifficulty is listed as "Internal Implementation". However, the Python files for these evaluators (llm_instruction_clarity.py and llm_task_difficulty.py) explicitly reference academic papers in their _metric_info and docstrings. Please update this documentation to reflect the actual research papers that these metrics are based on for better transparency and academic rigor.

Suggested change
| `LLMInstructionClarity` | LLMInstructionClarity | Evaluates instruction clarity across four dimensions: self-descriptiveness, consistency, specificity, and completeness | Internal Implementation | [📊 See Results](Returns clarity score (0-10) and detailed analysis) |
| `LLMTaskDifficulty` | LLMTaskDifficulty | Evaluates task difficulty across cognitive complexity, step complexity, domain knowledge, and constraint density | Internal Implementation | [📊 See Results](Returns difficulty level (1-10) with detailed breakdown) |
| `LLMInstructionClarity` | LLMInstructionClarity | Evaluates instruction clarity across four dimensions: self-descriptiveness, consistency, specificity, and completeness | IFEval (Google, 2023), Self-Instruct (UW, 2023) | [📊 See Results](Returns clarity score (0-10) and detailed analysis) |
| `LLMTaskDifficulty` | LLMTaskDifficulty | Evaluates task difficulty across cognitive complexity, step complexity, domain knowledge, and constraint density | OpenAI Math Problem Difficulty (2024), Google DeepMind Task Complexity (2023) | [📊 See Results](Returns difficulty level (1-10) with detailed breakdown) |

Comment on lines +334 to +335
if difficulty_counts["Easy (0-3)"] / len(good_list) > 0.3:
    print(" ⚠️ Too many easy tasks; consider raising the difficulty or filtering out some of them")

Severity: medium

The condition difficulty_counts["Easy (0-3)"] / len(good_list) > 0.3 uses a hardcoded 0.3 (30%) for the threshold. It might be more robust to define this as a constant or make it configurable, especially since the recommended distribution for 'Easy' tasks is 15-20% in the instruction_quality_guide.md.

Suggested change
if difficulty_counts["Easy (0-3)"] / len(good_list) > 0.3:
    print(" ⚠️ Too many easy tasks; consider raising the difficulty or filtering out some of them")
# Ideal distribution: Easy 20%, Moderate 50%, Hard 25%, Expert 5%
easy_threshold = 0.2  # Or a configurable value
if difficulty_counts["Easy (0-3)"] / len(good_list) > easy_threshold:
    print(" ⚠️ Too many easy tasks; consider raising the difficulty or filtering out some of them")

Comment on lines +336 to +337
if difficulty_counts["Moderate (4-6)"] / len(good_list) < 0.3:
    print(" ⚠️ Not enough moderate-difficulty tasks; these are the core of SFT data")

Severity: medium

Similar to the previous comment, the condition difficulty_counts["Moderate (4-6)"] / len(good_list) < 0.3 uses a hardcoded 0.3 (30%) threshold. The instruction_quality_guide.md suggests 50-60% for 'Moderate' tasks. Using a constant or configurable value would improve clarity and maintainability.

Suggested change
if difficulty_counts["Moderate (4-6)"] / len(good_list) < 0.3:
    print(" ⚠️ Not enough moderate-difficulty tasks; these are the core of SFT data")
moderate_threshold = 0.5  # Or a configurable value
if difficulty_counts["Moderate (4-6)"] / len(good_list) < moderate_threshold:
    print(" ⚠️ Not enough moderate-difficulty tasks; these are the core of SFT data")

Comment on lines +338 to +339
if difficulty_counts["Hard (7-8)"] / len(good_list) > 0.4:
    print(" ⚠️ Too many hard tasks; this may affect training efficiency")

Severity: medium

The condition difficulty_counts["Hard (7-8)"] / len(good_list) > 0.4 uses a hardcoded 0.4 (40%) threshold. The instruction_quality_guide.md suggests 20-25% for 'Hard' tasks. It would be better to use a constant or configurable value for this threshold.

Suggested change
if difficulty_counts["Hard (7-8)"] / len(good_list) > 0.4:
    print(" ⚠️ Too many hard tasks; this may affect training efficiency")
hard_threshold = 0.25  # Or a configurable value
if difficulty_counts["Hard (7-8)"] / len(good_list) > hard_threshold:
    print(" ⚠️ Too many hard tasks; this may affect training efficiency")

@e06084 merged commit 146d604 into MigoXLab:dev on Dec 23, 2025
tenwanft pushed a commit to tenwanft/dingo that referenced this pull request Dec 24, 2025
* feat: add Instruction Quality Evaluation

* 📚 Auto-update metrics documentation

---------

Co-authored-by: GitHub Action <[email protected]>