feat: add Instruction Quality Evaluation #313
Conversation
Summary of Changes

Hello @e06084, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request introduces a significant new feature for evaluating the quality of instructions, which is critical for preparing high-quality datasets for Supervised Fine-Tuning (SFT) of Large Language Models. It provides robust, LLM-based mechanisms to automatically assess both the clarity and the difficulty of instructions, ensuring that training data is well-formed and appropriately challenging. The addition is supported by detailed documentation and practical examples to facilitate adoption and effective use.
Code Review
This pull request introduces two new LLM-based evaluators for instruction quality in Supervised Fine-Tuning (SFT) datasets: LLMInstructionClarity and LLMTaskDifficulty. The LLMInstructionClarity evaluator assesses instructions across dimensions like self-descriptiveness, consistency, specificity, and completeness, while LLMTaskDifficulty evaluates cognitive complexity, step complexity, domain knowledge, and constraint density. Both evaluators include detailed LLM prompts, parse JSON responses, and provide comprehensive evaluation results. A new documentation guide (instruction_quality_guide.md) and an example script (evaluate_instruction_quality.py) are added to demonstrate their usage, covering SFT data preparation, quality standards, and difficulty distribution analysis. The metrics.md file is updated to include these new metrics. Review comments suggest refactoring duplicated markdown stripping logic in the evaluator response processing, correcting the 'Paper Source' for the new metrics in metrics.md to reference academic papers, and making the hardcoded difficulty distribution thresholds in the example script configurable and consistent with the new guide's recommendations.
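To make the evaluators' shape concrete, here is a minimal sketch of the prompt-and-parse flow described above. The dimension names and the 0-10 scale come from this summary; the function names, prompt wording, and JSON keys are illustrative assumptions, not the PR's actual API:

```python
import json

# Illustrative sketch only: the clarity dimensions and 0-10 scale come from
# the review summary; build_clarity_prompt, the prompt text, and the JSON
# keys are hypothetical stand-ins for the evaluator's real implementation.
CLARITY_DIMENSIONS = ["self-descriptiveness", "consistency", "specificity", "completeness"]


def build_clarity_prompt(instruction: str) -> str:
    """Ask a judge LLM to score an instruction's clarity and reply as JSON."""
    dims = ", ".join(CLARITY_DIMENSIONS)
    return (
        f"Rate the following instruction from 0 to 10 on each of these dimensions: {dims}. "
        'Reply with JSON containing per-dimension scores, an overall "clarity_score", '
        'and a short "analysis".\n\n'
        f"Instruction:\n{instruction}"
    )


def parse_clarity_response(response: str) -> dict:
    """Parse the judge's JSON reply (after any markdown code fences are stripped)."""
    return json.loads(response)
```

A task-difficulty sketch would follow the same pattern, swapping in the four difficulty dimensions and a 1-10 level.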
# Strip any markdown code block markers
response = response.strip()
if response.startswith("```json"):
    response = response[7:]
if response.startswith("```"):
    response = response[3:]
if response.endswith("```"):
    response = response[:-3]
response = response.strip()
The logic for stripping markdown code block markers is duplicated here and in llm_task_difficulty.py. Consider extracting it into a shared utility function or a method on BaseOpenAI to improve maintainability and reduce redundancy; a sketch of such a helper follows the suggested change below.
Suggested change:

response = cls._strip_json_markdown(response)
parsed = json.loads(response)
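For reference, a minimal sketch of what such a shared helper could look like. The method name follows the suggestion above; placing it as a classmethod on BaseOpenAI, and the exact signature, are assumptions rather than part of this PR:

```python
class BaseOpenAI:
    # Only the proposed helper is sketched here; the rest of the evaluator
    # base class is omitted. Its placement on BaseOpenAI is an assumption
    # taken from the review note above.

    @classmethod
    def _strip_json_markdown(cls, response: str) -> str:
        """Remove surrounding markdown code-fence markers from an LLM reply."""
        response = response.strip()
        if response.startswith("```json"):
            response = response[7:]
        elif response.startswith("```"):
            response = response[3:]
        if response.endswith("```"):
            response = response[:-3]
        return response.strip()
```

Both evaluators could then call cls._strip_json_markdown(response) immediately before json.loads, as in the suggested change above.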
The same stripping logic, duplicated in the second evaluator:

response = response.strip()
if response.startswith("```json"):
    response = response[7:]
if response.startswith("```"):
    response = response[3:]
if response.endswith("```"):
    response = response[:-3]
response = response.strip()
In metrics.md, the new rows read:

| `LLMInstructionClarity` | LLMInstructionClarity | Evaluates instruction clarity across four dimensions: self-descriptiveness, consistency, specificity, and completeness | Internal Implementation | [📊 See Results](Returns clarity score (0-10) and detailed analysis) |
| `LLMTaskDifficulty` | LLMTaskDifficulty | Evaluates task difficulty across cognitive complexity, step complexity, domain knowledge, and constraint density | Internal Implementation | [📊 See Results](Returns difficulty level (1-10) with detailed breakdown) |
The Paper Source for LLMInstructionClarity and LLMTaskDifficulty is listed as "Internal Implementation". However, the Python files for these evaluators (llm_instruction_clarity.py and llm_task_difficulty.py) explicitly reference academic papers in their _metric_info and docstrings. Please update this documentation to reflect the actual research papers that these metrics are based on for better transparency and academic rigor.
Suggested change:

| `LLMInstructionClarity` | LLMInstructionClarity | Evaluates instruction clarity across four dimensions: self-descriptiveness, consistency, specificity, and completeness | IFEval (Google, 2023), Self-Instruct (UW, 2023) | [📊 See Results](Returns clarity score (0-10) and detailed analysis) |
| `LLMTaskDifficulty` | LLMTaskDifficulty | Evaluates task difficulty across cognitive complexity, step complexity, domain knowledge, and constraint density | OpenAI Math Problem Difficulty (2024), Google DeepMind Task Complexity (2023) | [📊 See Results](Returns difficulty level (1-10) with detailed breakdown) |
In evaluate_instruction_quality.py:

if difficulty_counts["Easy (0-3)"] / len(good_list) > 0.3:
    print(" ⚠️ Too many easy tasks; consider raising the difficulty or filtering out some of the easy ones")
The condition difficulty_counts["Easy (0-3)"] / len(good_list) > 0.3 uses a hardcoded 0.3 (30%) for the threshold. It might be more robust to define this as a constant or make it configurable, especially since the recommended distribution for 'Easy' tasks is 15-20% in the instruction_quality_guide.md.
Suggested change:

# Ideal distribution: Easy 20%, Moderate 50%, Hard 25%, Expert 5%
easy_threshold = 0.2  # Or a configurable value
if difficulty_counts["Easy (0-3)"] / len(good_list) > easy_threshold:
    print(" ⚠️ Too many easy tasks; consider raising the difficulty or filtering out some of the easy ones")
if difficulty_counts["Moderate (4-6)"] / len(good_list) < 0.3:
    print(" ⚠️ Not enough moderate-difficulty tasks; these are the core of SFT data")
Similar to the previous comment, the condition difficulty_counts["Moderate (4-6)"] / len(good_list) < 0.3 uses a hardcoded 0.3 (30%) threshold. The instruction_quality_guide.md suggests 50-60% for 'Moderate' tasks. Using a constant or configurable value would improve clarity and maintainability.
Suggested change:

moderate_threshold = 0.5  # Or a configurable value
if difficulty_counts["Moderate (4-6)"] / len(good_list) < moderate_threshold:
    print(" ⚠️ Not enough moderate-difficulty tasks; these are the core of SFT data")
if difficulty_counts["Hard (7-8)"] / len(good_list) > 0.4:
    print(" ⚠️ Too many hard tasks, which may hurt training efficiency")
The condition difficulty_counts["Hard (7-8)"] / len(good_list) > 0.4 uses a hardcoded 0.4 (40%) threshold. The instruction_quality_guide.md suggests 20-25% for 'Hard' tasks. It would be better to use a constant or configurable value for this threshold; a consolidated sketch of configurable thresholds follows the suggested change below.
Suggested change:

hard_threshold = 0.25  # Or a configurable value
if difficulty_counts["Hard (7-8)"] / len(good_list) > hard_threshold:
    print(" ⚠️ Too many hard tasks, which may hurt training efficiency")
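Pulling the three threshold suggestions together, here is a minimal sketch of how the example script could centralize the distribution checks. The DIFFICULTY_THRESHOLDS table and the check_difficulty_distribution helper are hypothetical names, and the target shares follow the guide's recommended Easy 20% / Moderate 50% / Hard 25% / Expert 5% split quoted above:

```python
# Hypothetical consolidation of the three inline checks above. The names
# DIFFICULTY_THRESHOLDS and check_difficulty_distribution are illustrative;
# the target shares follow the guide's recommended distribution.
DIFFICULTY_THRESHOLDS = {
    "Easy (0-3)": ("max", 0.20, "Too many easy tasks; consider raising difficulty or filtering some out"),
    "Moderate (4-6)": ("min", 0.50, "Not enough moderate-difficulty tasks; these are the core of SFT data"),
    "Hard (7-8)": ("max", 0.25, "Too many hard tasks, which may hurt training efficiency"),
}


def check_difficulty_distribution(difficulty_counts: dict, total: int) -> None:
    """Warn when the observed difficulty mix drifts from the recommended shares."""
    for bucket, (kind, threshold, warning) in DIFFICULTY_THRESHOLDS.items():
        share = difficulty_counts.get(bucket, 0) / total
        if (kind == "max" and share > threshold) or (kind == "min" and share < threshold):
            print(f" ⚠️ {warning} ({bucket}: {share:.0%} vs. target {threshold:.0%})")
```

Used in place of the three inline checks, this keeps the thresholds in one place and makes them easy to keep consistent with instruction_quality_guide.md.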
* feat: add Instruction Quality Evaluation
* 📚 Auto-update metrics documentation

---------

Co-authored-by: GitHub Action <[email protected]>
No description provided.