diff --git a/README.md b/README.md
index 147c0172..f9f97400 100644
--- a/README.md
+++ b/README.md
@@ -98,7 +98,7 @@ Access **50+ production-ready graders** featuring a comprehensive taxonomy, rigo
 ### 🛠️ Flexible Grader Building Methods
 Choose the build method that fits your requirements:
 * **Customization:** Easily extend or modify pre-defined graders to fit your specific needs. 👉 [Custom Grader Development Guide](https://modelscope.github.io/OpenJudge/building_graders/create_custom_graders/)
-* **Data-Driven Rubrics:** Have a few examples but no clear rules? Use our tools to automatically generate white-box evaluation criteria (Rubrics) based on your data.👉 [Automatic Rubric Generation Tutorial](https://modelscope.github.io/OpenJudge/building_graders/generate_graders_from_data/)
+* **Generate Rubrics:** Need evaluation criteria but don't want to write them manually? Use **Simple Rubric** (from a task description) or **Iterative Rubric** (from labeled data) to automatically generate white-box evaluation rubrics. 👉 [Generate Rubrics as Graders](https://modelscope.github.io/OpenJudge/building_graders/generate_rubrics_as_graders/)
 * **Training Judge Models (Coming Soon 🚀):** For high-scale and specialized scenarios, we are developing the capability to train dedicated Judge models. Support for SFT, Bradley-Terry models, and Reinforcement Learning workflows is on the way to help you build high-performance, domain-specific graders.
diff --git a/README_zh.md b/README_zh.md
index bd143334..accf7e28 100644
--- a/README_zh.md
+++ b/README_zh.md
@@ -98,7 +98,7 @@ OpenJudge unifies evaluation metrics and reward signals into a standardized **Grader** interface
 ### 🛠️ Flexible Grader Building Methods
 Choose the build method that fits your requirements:
 * **Customization:** Easily extend or modify pre-defined graders to fit your specific needs. 👉 [Custom Grader Development Guide](https://modelscope.github.io/OpenJudge/building_graders/create_custom_graders/)
-* **Data-Driven Rubrics:** Have a few examples but no clear rules? Use our tools to automatically generate white-box evaluation criteria (Rubrics) from your data. 👉 [Automatic Rubric Generation Tutorial](https://modelscope.github.io/OpenJudge/building_graders/generate_graders_from_data/)
+* **Generate Rubrics:** Need evaluation rubrics but don't want to write them by hand? Use **Simple Rubric** (from a task description) or **Iterative Rubric** (from labeled data) to automatically generate white-box evaluation rubrics. 👉 [Generate Rubrics as Graders](https://modelscope.github.io/OpenJudge/building_graders/generate_rubrics_as_graders/)
 * **Training Judge Models (Coming Soon 🚀):** For high-scale and specialized scenarios, we are developing the capability to train dedicated Judge models. Support for SFT, Bradley-Terry models, and Reinforcement Learning workflows is on the way to help you build high-performance, domain-specific graders.
diff --git a/docs/building_graders/create_custom_graders.md b/docs/building_graders/create_custom_graders.md
index 18f05530..b59b6403 100644
--- a/docs/building_graders/create_custom_graders.md
+++ b/docs/building_graders/create_custom_graders.md
@@ -301,7 +301,7 @@ When running graders, focus on configuring data mappers to connect your dataset
 
 ## Next Steps
 
-- [Generate Graders from Data](generate_graders_from_data.md) — Automate grader creation from labeled examples
+- [Generate Rubrics as Graders](generate_rubrics_as_graders.md) — Automatically generate graders from a task description or labeled data
 - [Run Grading Tasks](../running_graders/run_tasks.md) — Evaluate your models at scale
 - [Grader Analysis](../running_graders/grader_analysis.md) — Validate and analyze grader results
diff --git a/docs/building_graders/generate_graders_from_data.md b/docs/building_graders/generate_rubrics_as_graders.md
similarity index 70%
rename from docs/building_graders/generate_graders_from_data.md
rename to docs/building_graders/generate_rubrics_as_graders.md
index 9f076f98..839d51c7 100644
--- a/docs/building_graders/generate_graders_from_data.md
+++ b/docs/building_graders/generate_rubrics_as_graders.md
@@ -1,10 +1,15 @@
-# Generate Graders from Data
+# Generate Rubrics as Graders
-Automatically create evaluation graders from labeled data instead of manually designing criteria. The system learns evaluation rubrics by analyzing what makes responses good or bad in your dataset.
+Automatically create evaluation graders instead of manually designing criteria. OpenJudge provides two approaches:
+
+| Approach | Module | Data Required | Best For |
+|----------|--------|---------------|----------|
+| **Simple Rubric** | `simple_rubric` | Task description only | Quick prototyping, when you have no labeled data |
+| **Iterative Rubric** | `iterative_rubric` | Labeled preference data | Production quality, when you have training examples |
 
 !!! tip "Key Benefits"
     - **Save time** — Eliminate manual rubric design
-    - **Data-driven** — Learn criteria from actual examples
+    - **Flexible** — Learn criteria from labeled data (Iterative) or a task description (Simple)
     - **Consistent** — Produce reproducible evaluation standards
     - **Scalable** — Quickly prototype graders for new domains
 
@@ -29,42 +34,157 @@ Theme: Completeness
 - With rubrics, evaluations become reproducible and explainable
 - The challenge: manually writing good rubrics is time-consuming and requires domain expertise
 
-**The solution:** Auto-Rubric automatically extracts these criteria from your labeled data.
-
+**The solution:** Automatically extract these criteria from your task description (Simple Rubric) or labeled data (Iterative Rubric).
 
-## How It Works
 
-Auto-Rubric extracts evaluation rubrics from preference data without training. Based on [Auto-Rubric: Learning to Extract Generalizable Criteria for Reward Modeling](https://arxiv.org/abs/2510.17314).
+## When to Use Each Approach
 
-**Two-stage approach:**
+### Simple Rubric (Zero-Shot)
 
-1. **Infer query-specific rubrics** — For each labeled example, the system proposes criteria that explain why one response is better than another
-2. **Generalize to core set** — Similar rubrics are merged and organized into a compact, non-redundant "Theme-Tips" structure
+Use when you have a clear task description but **no labeled data**.
 
-**Data efficiency:** Using just 70 preference pairs, this method enables smaller models to match or outperform fully-trained reward models.
-
-<figure markdown>
- ![Auto-Rubric Pipeline Overview](../images/auto_rubric_overview.png){ width="100%" } -
-    <figcaption>Auto-Rubric Pipeline: From preference data to evaluation rubrics</figcaption>
-</figure>
+
+!!! tip "Use Simple Rubric When"
+    - You need to quickly prototype a grader
+    - You have no labeled preference or scored data
+    - Your task is well-defined and you can describe it clearly
+    - You want to get started immediately without data collection
+
+!!! warning "Limitations"
+    - Quality depends on task description clarity
+    - May not capture domain-specific nuances
+    - Less accurate than data-driven approaches
 
-## When to Use This Approach
+### Iterative Rubric (Data-Driven)
 
-Suppose you have a dataset of query-response pairs with quality labels (scores or rankings), and you want to create a grader that can evaluate new responses using the same criteria.
+Use when you have **labeled preference data** and want production-quality graders.
 
-!!! tip "Use Data-Driven Generation When"
+!!! tip "Use Iterative Rubric When"
     - You have labeled evaluation data (preference pairs or scored responses)
     - Manual rubric design is too time-consuming or subjective
     - Your evaluation criteria are implicit and hard to articulate
+    - You need high accuracy for production use
 
 !!! warning "Don't Use When"
-    - You have no labeled data
+    - You have no labeled data (use Simple Rubric instead)
     - Your criteria are already well-defined and documented
     - Simple Code-Based evaluation is sufficient
 
+## Simple Rubric: Zero-Shot Generation
+
+Generate evaluation rubrics from task descriptions without any labeled data. The system uses an LLM to create relevant evaluation criteria based on your task context.
+
+### How It Works
+
+1. **Provide task description** — Describe what your system does
+2. **Add context** — Optionally provide usage scenario and sample queries
+3. **Generate rubrics** — LLM creates evaluation criteria automatically
+4. **Create grader** — Rubrics are injected into an LLMGrader
+
+### Quick Example
+
+```python
+import asyncio
+from openjudge.generator.simple_rubric import (
+    SimpleRubricsGenerator,
+    SimpleRubricsGeneratorConfig
+)
+from openjudge.models import OpenAIChatModel
+from openjudge.graders.schema import GraderMode
+
+async def main():
+    config = SimpleRubricsGeneratorConfig(
+        grader_name="translation_quality_grader",
+        model=OpenAIChatModel(model="qwen3-32b"),
+        grader_mode=GraderMode.POINTWISE,
+        task_description="English to Chinese translation assistant for technical documents. Generate rubrics in English.",
+        scenario="Users need accurate, fluent translations of technical content. Please respond in English.",
+        min_score=0,
+        max_score=5,
+    )
+
+    generator = SimpleRubricsGenerator(config)
+    grader = await generator.generate(
+        dataset=[],
+        sample_queries=[
+            "Translate: 'Machine learning is a subset of AI.'",
+            "Translate: 'The API endpoint returned an error.'",
+        ]
+    )
+
+    return grader
+
+grader = asyncio.run(main())
+```
+
+### Inspect Generated Rubrics
+
+```python
+print(grader.kwargs.get("rubrics"))
+```
 
-## Choose Your Evaluation Mode
+**Output (Example):**
+
+```
+1. Accuracy: Whether the translation correctly conveys the technical meaning of the original English text
+2. Fluency: Whether the translated Chinese is grammatically correct and natural-sounding
+3. Technical Appropriateness: Whether the terminology used in the translation is appropriate for a technical context
+4. Consistency: Whether similar terms or phrases are consistently translated throughout the response
+```
+
+### Evaluate Responses
+
+```python
+# Run this inside an async context (e.g., the main() coroutine above);
+# a top-level `await` will not work in a plain script.
+result = await grader.aevaluate(
+    query="Translate: 'The database query returned an error.'",
+    response="数据库查询返回了一个错误。"
+)
+print(result)
+```
+
+**Output:**
+
+```python
+GraderScore(
+    name='translation_quality_grader',
+    reason="The translation is accurate and correctly conveys the technical meaning of the original English text. The Chinese sentence is grammatically correct and natural-sounding, making it fluent. The terminology used ('数据库查询' for 'database query', '返回了一个错误' for 'returned an error') is appropriate for a technical context. Additionally, the terms are consistently translated throughout the response.",
+    score=5.0
+)
+```
+
+### Simple Rubric Configuration
+
+| Parameter | Type | Default | Description |
+|-----------|------|---------|-------------|
+| `grader_name` | `str` | required | Name for the generated grader |
+| `model` | `BaseChatModel` | required | LLM for generation and evaluation |
+| `grader_mode` | `GraderMode` | `POINTWISE` | `POINTWISE` or `LISTWISE` |
+| `task_description` | `str` | `""` | Description of the task |
+| `scenario` | `str` | `None` | Optional usage context |
+| `language` | `LanguageEnum` | `EN` | Language for prompts (`EN` or `ZH`) |
+| `min_score` | `int` | `0` | Minimum score (pointwise only) |
+| `max_score` | `int` | `1` | Maximum score (pointwise only) |
+| `default_rubrics` | `List[str]` | `[]` | Fallback rubrics if generation fails |
+| `max_retries` | `int` | `3` | Retry attempts for LLM calls |
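+
+To see the optional parameters in action, here is a listwise variant of the quick example. It is a minimal sketch that uses only parameters documented in the table above (`grader_mode`, `default_rubrics`, `max_retries`); adapt the names and values to your setup:
+
+```python
+import asyncio
+from openjudge.generator.simple_rubric import (
+    SimpleRubricsGenerator,
+    SimpleRubricsGeneratorConfig
+)
+from openjudge.models import OpenAIChatModel
+from openjudge.graders.schema import GraderMode
+
+async def main():
+    config = SimpleRubricsGeneratorConfig(
+        grader_name="translation_ranker",
+        model=OpenAIChatModel(model="qwen3-32b"),
+        # LISTWISE ranks several candidate responses against each other,
+        # so the pointwise-only min_score/max_score are omitted here.
+        grader_mode=GraderMode.LISTWISE,
+        task_description="English to Chinese translation assistant for technical documents.",
+        # Fallback criteria used if rubric generation fails.
+        default_rubrics=["Accuracy: the translation preserves the meaning of the source text."],
+        max_retries=3,
+    )
+    generator = SimpleRubricsGenerator(config)
+    return await generator.generate(
+        dataset=[],
+        sample_queries=["Translate: 'The cache was invalidated after the deploy.'"],
+    )
+
+grader = asyncio.run(main())
+```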
+
+## Iterative Rubric: Data-Driven Generation
+
+Learn evaluation rubrics from labeled preference data. Based on [Auto-Rubric: Learning to Extract Generalizable Criteria for Reward Modeling](https://arxiv.org/abs/2510.17314).
+
+### How It Works
+
+**Two-stage approach:**
+
+1. **Infer query-specific rubrics** — For each labeled example, the system proposes criteria that explain why one response is better than another
+2. **Generalize to core set** — Similar rubrics are merged and organized into a compact, non-redundant "Theme-Tips" structure
+
+**Data efficiency:** Using just 70 preference pairs, this method enables smaller models to match or outperform fully-trained reward models.
+
+<figure markdown>
+    ![Auto-Rubric Pipeline Overview](../images/auto_rubric_overview.png){ width="100%" }
+    <figcaption>Auto-Rubric Pipeline: From preference data to evaluation rubrics</figcaption>
+</figure>
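+
+The sketch below condenses the two-stage pipeline into code. The config class name comes from the mode table below; the generator class name, its config fields, and the dataset shape are illustrative assumptions here — the complete examples that follow show the exact API:
+
+```python
+import asyncio
+from openjudge.generator.iterative_rubric import (
+    IterativeListwiseRubricsGenerator,        # assumed export name
+    IterativeListwiseRubricsGeneratorConfig,
+)
+from openjudge.models import OpenAIChatModel
+
+async def main():
+    # Stage 1 input: labeled preference examples (illustrative shape only).
+    # Around 70 pairs can already be enough, per the paper cited above.
+    dataset = [
+        {
+            "query": "Explain what a mutex is.",
+            "responses": ["A mutex is a lock that ...", "It's a thing."],
+            "label": 0,  # index of the preferred response (assumed convention)
+        },
+        # ... more labeled examples
+    ]
+
+    config = IterativeListwiseRubricsGeneratorConfig(
+        grader_name="code_explanation_grader",  # field names mirror the Simple Rubric config
+        model=OpenAIChatModel(model="qwen3-32b"),
+    )
+    generator = IterativeListwiseRubricsGenerator(config)
+    # Stage 1 infers per-query rubrics; stage 2 merges them into a compact
+    # "Theme-Tips" core set baked into the returned grader.
+    return await generator.generate(dataset=dataset)
+
+grader = asyncio.run(main())
+```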
+
+### Choose Your Evaluation Mode
 
 | Mode | Config Class | Use Case | Data Format | Output |
 |------|--------------|----------|-------------|--------|
@@ -76,7 +196,7 @@ Pairwise is a special case of Listwise with exactly 2 responses. Use the same `IterativeListwiseRubricsGeneratorConfig` for both.
 
-## Complete Example: Build a Code Review Grader (Pointwise)
+### Complete Example: Build a Code Review Grader (Pointwise)
 
 Let's walk through a complete example: building a grader that evaluates code explanation quality.
 
@@ -218,7 +338,7 @@ GraderScore(
 )
 ```
 
-## Complete Example: Build a Code Solution Comparator (Pairwise)
+### Complete Example: Build a Code Solution Comparator (Pairwise)
 
 Let's build a grader that compares two code implementations and determines which solution is better. This is useful for code review, interview assessment, or selecting the best implementation from multiple candidates.
 
@@ -394,7 +514,7 @@ GraderRank(
 ```
 
-## Configuration Reference
+## Iterative Rubric Configuration Reference
 
 ### Core Parameters
 
@@ -427,9 +547,29 @@ GraderRank(
 
 - `LISTWISE_EVALUATION_TEMPLATE` — for ranking
 
+---
+
+## Choosing Between Simple and Iterative Rubric
+
+| Scenario | Recommended Approach |
+|----------|---------------------|
+| Quick prototype, no data | **Simple Rubric** |
+| Production grader with labeled data | **Iterative Rubric** |
+| Well-defined task, need fast setup | **Simple Rubric** |
+| Complex domain, implicit criteria | **Iterative Rubric** |
+| < 50 labeled examples | **Simple Rubric** (or collect more data) |
+| 50-100+ labeled examples | **Iterative Rubric** |
+
+!!! tip "Workflow Recommendation"
+    1. Start with **Simple Rubric** for quick prototyping
+    2. Collect preference data during initial deployment
+    3. Upgrade to **Iterative Rubric** when you have 50+ labeled examples — see the sketch below
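+
+In code, the upgrade is a drop-in swap of the generator; the grader you get back is used the same way in both phases. A sketch, under the same naming assumptions as the iterative example above:
+
+```python
+import asyncio
+from openjudge.generator.simple_rubric import (
+    SimpleRubricsGenerator,
+    SimpleRubricsGeneratorConfig,
+)
+from openjudge.generator.iterative_rubric import (
+    IterativeListwiseRubricsGenerator,        # assumed export name
+    IterativeListwiseRubricsGeneratorConfig,
+)
+from openjudge.models import OpenAIChatModel
+
+async def build_grader(labeled_examples: list):
+    model = OpenAIChatModel(model="qwen3-32b")
+    if len(labeled_examples) < 50:
+        # Phase 1: not enough data yet — generate rubrics zero-shot.
+        config = SimpleRubricsGeneratorConfig(
+            grader_name="support_reply_grader",
+            model=model,
+            task_description="Customer-support reply assistant for a SaaS product.",
+        )
+        return await SimpleRubricsGenerator(config).generate(
+            dataset=[], sample_queries=["How do I reset my API key?"]
+        )
+    # Phase 3: enough preference data collected — learn rubrics from it.
+    config = IterativeListwiseRubricsGeneratorConfig(
+        grader_name="support_reply_grader", model=model
+    )
+    return await IterativeListwiseRubricsGenerator(config).generate(dataset=labeled_examples)
+
+grader = asyncio.run(build_grader([]))
+```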
+
+---
+
 ## Tips
 
-### Data Quality
+### Data Quality (Iterative Rubric)
 
 !!! tip "Good Practices"
     - Clear preference signals (good vs. bad is obvious)
 
 !!! warning "Avoid"
     - Ambiguous cases where labels are debatable
     - Noisy or contradictory labels
 
-### Parameter Tuning
+### Task Description Quality (Simple Rubric)
+
+!!! tip "Good Practices"
+    - Be specific about what your system does
+    - Include the target audience or use case
+    - Mention key quality dimensions you care about
+    - Provide representative sample queries
+
+!!! warning "Avoid"
+    - Vague descriptions like "chatbot" or "assistant"
+    - Missing context about the domain
+    - No sample queries (the LLM needs examples)
+
+### Parameter Tuning (Iterative Rubric)
 
 | Goal | Recommended Settings |
 |------|---------------------|
diff --git a/docs/building_graders/overview.md b/docs/building_graders/overview.md
index a7a60bb0..37c69ba6 100644
--- a/docs/building_graders/overview.md
+++ b/docs/building_graders/overview.md
@@ -68,11 +68,11 @@ Define evaluation logic using LLM judges or code-based functions with no trainin
 
 **Learn more:** [Create Custom Graders →](create_custom_graders.md) | [Built-in Graders →](../built_in_graders/overview.md)
 
-### Approach 2: Generate Graders from Data
+### Approach 2: Generate Rubrics as Graders
 
-Automatically analyze evaluation data to create structured scoring rubrics. Provide 50-500 labeled examples, and the generator extracts patterns to build interpretable criteria. Generated graders produce explicit rubrics that explain scoring decisions, ideal for scenarios requiring transparency and rapid refinement.
+Automatically generate evaluation rubrics and create graders. Two approaches are available: **Simple Rubric** generates rubrics from task descriptions (zero-shot, no data required), while **Iterative Rubric** learns from 50-500 labeled examples to extract patterns. Both produce explicit rubrics that explain scoring decisions, ideal for scenarios requiring transparency and rapid refinement.
 
-**Learn more:** [Generate Graders from Data →](generate_graders_from_data.md)
+**Learn more:** [Generate Rubrics as Graders →](generate_rubrics_as_graders.md)
 
 ### Approach 3: Train Reward Models
 
@@ -86,7 +86,7 @@ Train neural networks on preference data to learn evaluation criteria automatica
 ## Next Steps
 
 - [Create Custom Graders](create_custom_graders.md) — Build graders using LLM or code-based logic
-- [Generate Graders from Data](generate_graders_from_data.md) — Auto-generate rubrics from labeled data
+- [Generate Rubrics as Graders](generate_rubrics_as_graders.md) — Automatically generate graders from a task description or labeled data
 - [Train with GRPO](training_grpo.md) — Train generative judge models with reinforcement learning
 - [Built-in Graders](../built_in_graders/overview.md) — Explore pre-built graders to customize
 - [Run Grading Tasks](../running_graders/run_tasks.md) — Deploy graders at scale with batch workflows
diff --git a/docs/index.md b/docs/index.md
index a6260660..6be0b039 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -20,7 +20,7 @@ OpenJudge unifies evaluation metrics and reward signals into a single, standardi
 + **Flexible Grader Building**: Choose the build method that fits your requirements:
     - **Customization:** Easily extend or modify pre-defined graders to fit your specific needs. Custom Grader Development Guide
-    - **Data-Driven Rubrics:** Have a few examples but no clear rules? Use our tools to automatically generate white-box evaluation criteria (Rubrics) based on your data. Automatic Rubric Generation Tutorial
+    - **Generate Rubrics:** Need evaluation criteria but don't want to write them manually? Use **Simple Rubric** (from task description) or **Iterative Rubric** (from labeled data) to automatically generate white-box evaluation rubrics. Generate Rubrics as Graders
     - **Training Judge Models:** For high-scale and specialized scenarios, we are developing the capability to train dedicated Judge models. Support for SFT, Bradley-Terry models, and Reinforcement Learning workflows is on the way to help you build high-performance, domain-specific graders. 🚧 Coming Soon
 + **Easy Integration**: We're actively building seamless connectors for mainstream observability platforms and training frameworks. Stay tuned! 🚧 Coming Soon
 
@@ -139,7 +139,7 @@
       <td align="center" valign="top" width="25%">

-        <a href="https://modelscope.github.io/OpenJudge/building_graders/generate_graders_from_data/">
+        <a href="https://modelscope.github.io/OpenJudge/building_graders/generate_rubrics_as_graders/">
          Data-Driven Rubrics
         </a>
diff --git a/mkdocs.yml b/mkdocs.yml
index 3d3537e0..4b2346a0 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -34,7 +34,7 @@ nav:
   - Building Graders:
     - Overview: building_graders/overview.md
     - Create Custom Graders: building_graders/create_custom_graders.md
-    - Generate Graders from Data: building_graders/generate_graders_from_data.md
+    - Generate Rubrics as Graders: building_graders/generate_rubrics_as_graders.md
     # - Train Reward Models: building_graders/training/overview.md  # Coming soon
   - Running Graders: