diff --git a/README.md b/README.md
index 147c0172..f9f97400 100644
--- a/README.md
+++ b/README.md
@@ -98,7 +98,7 @@ Access **50+ production-ready graders** featuring a comprehensive taxonomy, rigo
### 🛠️ Flexible Grader Building Methods
Choose the build method that fits your requirements:
* **Customization:** Easily extend or modify pre-defined graders to fit your specific needs. 👉 [Custom Grader Development Guide](https://modelscope.github.io/OpenJudge/building_graders/create_custom_graders/)
-* **Data-Driven Rubrics:** Have a few examples but no clear rules? Use our tools to automatically generate white-box evaluation criteria (Rubrics) based on your data.👉 [Automatic Rubric Generation Tutorial](https://modelscope.github.io/OpenJudge/building_graders/generate_graders_from_data/)
+* **Generate Rubrics:** Need evaluation criteria but don't want to write them manually? Use **Simple Rubric** (from task description) or **Iterative Rubric** (from labeled data) to automatically generate white-box evaluation rubrics. 👉 [Generate Rubrics as Graders](https://modelscope.github.io/OpenJudge/building_graders/generate_rubrics_as_graders/)
* **Training Judge Models ( Coming Soon🚀):** For high-scale and specialized scenarios, we are developing the capability to train dedicated Judge models. Support for SFT, Bradley-Terry models, and Reinforcement Learning workflows is on the way to help you build high-performance, domain-specific graders.
diff --git a/README_zh.md b/README_zh.md
index bd143334..accf7e28 100644
--- a/README_zh.md
+++ b/README_zh.md
@@ -98,7 +98,7 @@ OpenJudge 将评估指标和奖励信号统一为标准化的 **Grader** 接口
### 🛠️ 灵活的评分器构建方法
选择适合您需求的构建方法:
* **自定义:** 轻松扩展或修改预定义的评分器以满足您的特定需求。👉 [自定义评分器开发指南](https://modelscope.github.io/OpenJudge/building_graders/create_custom_graders/)
-* **数据驱动的评分标准:** 有一些示例但没有明确规则?使用我们的工具根据您的数据自动生成白盒评估标准(Rubrics)。👉 [自动评分标准生成教程](https://modelscope.github.io/OpenJudge/building_graders/generate_graders_from_data/)
+* **生成评估标准:** 需要评估标准但不想手动编写?使用 **Simple Rubric**(基于任务描述)或 **Iterative Rubric**(基于标注数据)自动生成白盒评估标准。👉 [生成评估标准作为 Grader](https://modelscope.github.io/OpenJudge/building_graders/generate_rubrics_as_graders/)
* **训练评判模型(即将推出🚀):** 对于大规模和专业化场景,我们正在开发训练专用评判模型的能力。SFT、Bradley-Terry 模型和强化学习工作流的支持即将推出,帮助您构建高性能、领域特定的评分器。
diff --git a/docs/building_graders/create_custom_graders.md b/docs/building_graders/create_custom_graders.md
index 18f05530..b59b6403 100644
--- a/docs/building_graders/create_custom_graders.md
+++ b/docs/building_graders/create_custom_graders.md
@@ -301,7 +301,7 @@ When running graders, focus on configuring data mappers to connect your dataset
## Next Steps
-- [Generate Graders from Data](generate_graders_from_data.md) — Automate grader creation from labeled examples
+- [Generate Rubrics as Graders](generate_rubrics_as_graders.md) — Automatically generate graders from a task description or labeled data
- [Run Grading Tasks](../running_graders/run_tasks.md) — Evaluate your models at scale
- [Grader Analysis](../running_graders/grader_analysis.md) — Validate and analyze grader results
diff --git a/docs/building_graders/generate_graders_from_data.md b/docs/building_graders/generate_rubrics_as_graders.md
similarity index 70%
rename from docs/building_graders/generate_graders_from_data.md
rename to docs/building_graders/generate_rubrics_as_graders.md
index 9f076f98..839d51c7 100644
--- a/docs/building_graders/generate_graders_from_data.md
+++ b/docs/building_graders/generate_rubrics_as_graders.md
@@ -1,10 +1,15 @@
-# Generate Graders from Data
+# Generate Rubrics as Graders
-Automatically create evaluation graders from labeled data instead of manually designing criteria. The system learns evaluation rubrics by analyzing what makes responses good or bad in your dataset.
+Automatically create evaluation graders instead of manually designing criteria. OpenJudge provides two approaches:
+
+| Approach | Module | Data Required | Best For |
+|----------|--------|---------------|----------|
+| **Simple Rubric** | `simple_rubric` | Task description only | Quick prototyping, when you have no labeled data |
+| **Iterative Rubric** | `iterative_rubric` | Labeled preference data | Production quality, when you have training examples |
!!! tip "Key Benefits"
- **Save time** — Eliminate manual rubric design
- - **Data-driven** — Learn criteria from actual examples
+ - **Flexible** — Learn criteria from labeled data (Iterative Rubric) or from a task description (Simple Rubric)
- **Consistent** — Produce reproducible evaluation standards
- **Scalable** — Quickly prototype graders for new domains
@@ -29,42 +34,157 @@ Theme: Completeness
- With rubrics, evaluations become reproducible and explainable
- The challenge: manually writing good rubrics is time-consuming and requires domain expertise
-**The solution:** Auto-Rubric automatically extracts these criteria from your labeled data.
-
+**The solution:** Automatically extract these criteria from your task description (Simple Rubric) or labeled data (Iterative Rubric).
-## How It Works
-Auto-Rubric extracts evaluation rubrics from preference data without training. Based on [Auto-Rubric: Learning to Extract Generalizable Criteria for Reward Modeling](https://arxiv.org/abs/2510.17314).
+## When to Use Each Approach
-**Two-stage approach:**
+### Simple Rubric (Zero-Shot)
-1. **Infer query-specific rubrics** — For each labeled example, the system proposes criteria that explain why one response is better than another
-2. **Generalize to core set** — Similar rubrics are merged and organized into a compact, non-redundant "Theme-Tips" structure
+Use when you have a clear task description but **no labeled data**.
-**Data efficiency:** Using just 70 preference pairs, this method enables smaller models to match or outperform fully-trained reward models.
-
-
- { width="100%" }
- Auto-Rubric Pipeline: From preference data to evaluation rubrics
-
+!!! tip "Use Simple Rubric When"
+ - You need to quickly prototype a grader
+ - You have no labeled preference or scored data
+ - Your task is well-defined and you can describe it clearly
+ - You want to get started immediately without data collection
+!!! warning "Limitations"
+ - Quality depends on task description clarity
+ - May not capture domain-specific nuances
+ - Less accurate than data-driven approaches
-## When to Use This Approach
+### Iterative Rubric (Data-Driven)
-Suppose you have a dataset of query-response pairs with quality labels (scores or rankings), and you want to create a grader that can evaluate new responses using the same criteria.
+Use when you have **labeled preference data** and want production-quality graders.
-!!! tip "Use Data-Driven Generation When"
+!!! tip "Use Iterative Rubric When"
- You have labeled evaluation data (preference pairs or scored responses)
- Manual rubric design is too time-consuming or subjective
- Your evaluation criteria are implicit and hard to articulate
+ - You need high accuracy for production use
!!! warning "Don't Use When"
- - You have no labeled data
+ - You have no labeled data (use Simple Rubric instead)
- Your criteria are already well-defined and documented
- Simple Code-Based evaluation is sufficient
+## Simple Rubric: Zero-Shot Generation
+
+Generate evaluation rubrics from task descriptions without any labeled data. The system uses an LLM to create relevant evaluation criteria based on your task context.
+
+### How It Works
+
+1. **Provide task description** — Describe what your system does
+2. **Add context** — Optionally provide a usage scenario and sample queries
+3. **Generate rubrics** — LLM creates evaluation criteria automatically
+4. **Create grader** — Rubrics are injected into an LLMGrader
+
+### Quick Example
+
+```python
+import asyncio
+from openjudge.generator.simple_rubric import (
+ SimpleRubricsGenerator,
+ SimpleRubricsGeneratorConfig
+)
+from openjudge.models import OpenAIChatModel
+from openjudge.graders.schema import GraderMode
+
+async def main():
+ config = SimpleRubricsGeneratorConfig(
+ grader_name="translation_quality_grader",
+ model=OpenAIChatModel(model="qwen3-32b"),
+ grader_mode=GraderMode.POINTWISE,
+ task_description="English to Chinese translation assistant for technical documents. Generate rubrics in English.",
+ scenario="Users need accurate, fluent translations of technical content. Please respond in English.",
+ min_score=0,
+ max_score=5,
+ )
+
+ generator = SimpleRubricsGenerator(config)
+ grader = await generator.generate(
+ dataset=[],
+ sample_queries=[
+ "Translate: 'Machine learning is a subset of AI.'",
+ "Translate: 'The API endpoint returned an error.'",
+ ]
+ )
+
+ return grader
+
+grader = asyncio.run(main())
+```
+
+### Inspect Generated Rubrics
+
+```python
+print(grader.kwargs.get("rubrics"))
+```
-## Choose Your Evaluation Mode
+**Output (Example):**
+
+```
+1. Accuracy: Whether the translation correctly conveys the technical meaning of the original English text
+2. Fluency: Whether the translated Chinese is grammatically correct and natural-sounding
+3. Technical Appropriateness: Whether the terminology used in the translation is appropriate for a technical context
+4. Consistency: Whether similar terms or phrases are consistently translated throughout the response
+```
+
+### Evaluate Responses
+
+```python
+# Note: `await` at this level assumes an async context (e.g., a notebook cell or inside an async function)
+result = await grader.aevaluate(
+ query="Translate: 'The database query returned an error.'",
+ response="数据库查询返回了一个错误。"
+)
+print(result)
+```
+
+**Output:**
+
+```python
+GraderScore(
+ name='translation_quality_grader',
+ reason="The translation is accurate and correctly conveys the technical meaning of the original English text. The Chinese sentence is grammatically correct and natural-sounding, making it fluent. The terminology used ('数据库查询' for 'database query', '返回了一个错误' for 'returned an error') is appropriate for a technical context. Additionally, the terms are consistently translated throughout the response.",
+ score=5.0
+)
+```
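+
+If you need to score many responses, you can fan out `aevaluate` calls with standard `asyncio` tooling. The sketch below is only an illustration: it assumes the `grader` object from the Quick Example, that concurrent `aevaluate` calls are acceptable for your model client, and that `GraderScore` exposes `score` and `reason` as attributes (as its repr above suggests). The query and response pairs are placeholders.
+
+```python
+import asyncio
+
+async def evaluate_batch(grader, pairs):
+    """Score a list of (query, response) pairs concurrently."""
+    tasks = [
+        grader.aevaluate(query=query, response=response)
+        for query, response in pairs
+    ]
+    return await asyncio.gather(*tasks)
+
+pairs = [
+    ("Translate: 'The cache was invalidated.'", "缓存已失效。"),
+    ("Translate: 'The request timed out.'", "请求超时了。"),
+]
+results = asyncio.run(evaluate_batch(grader, pairs))
+for result in results:
+    print(result.score, result.reason)
+```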
+
+### Simple Rubric Configuration
+
+| Parameter | Type | Default | Description |
+|-----------|------|---------|-------------|
+| `grader_name` | `str` | required | Name for the generated grader |
+| `model` | `BaseChatModel` | required | LLM for generation and evaluation |
+| `grader_mode` | `GraderMode` | `POINTWISE` | `POINTWISE` or `LISTWISE` |
+| `task_description` | `str` | `""` | Description of the task |
+| `scenario` | `str` | `None` | Optional usage context |
+| `language` | `LanguageEnum` | `EN` | Language for prompts (`EN` or `ZH`) |
+| `min_score` | `int` | `0` | Minimum score (pointwise only) |
+| `max_score` | `int` | `1` | Maximum score (pointwise only) |
+| `default_rubrics` | `List[str]` | `[]` | Fallback rubrics if generation fails |
+| `max_retries` | `int` | `3` | Retry attempts for LLM calls |
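+
+For reference, here is a minimal configuration sketch that exercises a few parameters the Quick Example does not: `GraderMode.LISTWISE`, `default_rubrics`, and `max_retries`. The grader name, rubric strings, and other values are purely illustrative, and the call pattern simply mirrors the Quick Example above.
+
+```python
+from openjudge.generator.simple_rubric import (
+    SimpleRubricsGenerator,
+    SimpleRubricsGeneratorConfig,
+)
+from openjudge.models import OpenAIChatModel
+from openjudge.graders.schema import GraderMode
+
+# Illustrative values only; parameter names follow the table above.
+listwise_config = SimpleRubricsGeneratorConfig(
+    grader_name="translation_ranking_grader",  # hypothetical grader name
+    model=OpenAIChatModel(model="qwen3-32b"),
+    grader_mode=GraderMode.LISTWISE,           # rank candidate responses instead of scoring one
+    task_description="English to Chinese translation assistant for technical documents.",
+    default_rubrics=[                          # fallback rubrics if generation fails
+        "Accuracy: the translation preserves the technical meaning",
+        "Fluency: the Chinese output reads naturally",
+    ],
+    max_retries=3,
+)
+
+generator = SimpleRubricsGenerator(listwise_config)
+# generator.generate(...) is then awaited exactly as in the Quick Example above.
+```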
+
+## Iterative Rubric: Data-Driven Generation
+
+Learn evaluation rubrics from labeled preference data. Based on [Auto-Rubric: Learning to Extract Generalizable Criteria for Reward Modeling](https://arxiv.org/abs/2510.17314).
+
+### How It Works
+
+**Two-stage approach:**
+
+1. **Infer query-specific rubrics** — For each labeled example, the system proposes criteria that explain why one response is better than another
+2. **Generalize to core set** — Similar rubrics are merged and organized into a compact, non-redundant "Theme-Tips" structure
+
+**Data efficiency:** Using just 70 preference pairs, this method enables smaller models to match or outperform fully-trained reward models.
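+
+To make the "Theme-Tips" idea concrete, the snippet below shows one way such a generalized entry could be represented. It is a schematic illustration only; the field names and wording are assumptions, not OpenJudge's actual output format.
+
+```python
+# Schematic illustration only: a generalized "Theme-Tips" entry from stage 2.
+# Keys and wording are illustrative, not OpenJudge's actual output schema.
+theme_tips_example = {
+    "theme": "Completeness",
+    "tips": [
+        "Prefer responses that address every part of the query",
+        "Penalize responses that omit requested details or edge cases",
+    ],
+}
+```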
+
+
+*Auto-Rubric Pipeline: From preference data to evaluation rubrics*
+
+
+### Choose Your Evaluation Mode
| Mode | Config Class | Use Case | Data Format | Output |
|------|--------------|----------|-------------|--------|
@@ -76,7 +196,7 @@ Suppose you have a dataset of query-response pairs with quality labels (scores o
Pairwise is a special case of Listwise with exactly 2 responses. Use the same `IterativeListwiseRubricsGeneratorConfig` for both.
-## Complete Example: Build a Code Review Grader (Pointwise)
+### Complete Example: Build a Code Review Grader (Pointwise)
Let's walk through a complete example: building a grader that evaluates code explanation quality.
@@ -218,7 +338,7 @@ GraderScore(
)
```
-## Complete Example: Build a Code Solution Comparator (Pairwise)
+### Complete Example: Build a Code Solution Comparator (Pairwise)
Let's build a grader that compares two code implementations and determines which solution is better. This is useful for code review, interview assessment, or selecting the best implementation from multiple candidates.
@@ -394,7 +514,7 @@ GraderRank(
```
-## Configuration Reference
+## Iterative Rubric Configuration Reference
### Core Parameters
@@ -427,9 +547,29 @@ GraderRank(
- `LISTWISE_EVALUATION_TEMPLATE` — for ranking
+---
+
+## Choosing Between Simple and Iterative Rubric
+
+| Scenario | Recommended Approach |
+|----------|---------------------|
+| Quick prototype, no data | **Simple Rubric** |
+| Production grader with labeled data | **Iterative Rubric** |
+| Well-defined task, need fast setup | **Simple Rubric** |
+| Complex domain, implicit criteria | **Iterative Rubric** |
+| < 50 labeled examples | **Simple Rubric** (or collect more data) |
+| 50-100+ labeled examples | **Iterative Rubric** |
+
+!!! tip "Workflow Recommendation"
+ 1. Start with **Simple Rubric** for quick prototyping
+ 2. Collect preference data during initial deployment
+ 3. Upgrade to **Iterative Rubric** when you have 50+ labeled examples
+
+---
+
## Tips
-### Data Quality
+### Data Quality (Iterative Rubric)
!!! tip "Good Practices"
- Clear preference signals (good vs. bad is obvious)
@@ -440,7 +580,20 @@ GraderRank(
- Ambiguous cases where labels are debatable
- Noisy or contradictory labels
-### Parameter Tuning
+### Task Description Quality (Simple Rubric)
+
+!!! tip "Good Practices"
+ - Be specific about what your system does
+ - Include the target audience or use case
+ - Mention key quality dimensions you care about
+ - Provide representative sample queries
+
+!!! warning "Avoid"
+ - Vague descriptions like "chatbot" or "assistant"
+ - Missing context about the domain
+ - No sample queries (the LLM needs examples)
+
+### Parameter Tuning (Iterative Rubric)
| Goal | Recommended Settings |
|------|---------------------|
diff --git a/docs/building_graders/overview.md b/docs/building_graders/overview.md
index a7a60bb0..37c69ba6 100644
--- a/docs/building_graders/overview.md
+++ b/docs/building_graders/overview.md
@@ -68,11 +68,11 @@ Define evaluation logic using LLM judges or code-based functions with no trainin
**Learn more:** [Create Custom Graders →](create_custom_graders.md) | [Built-in Graders →](../built_in_graders/overview.md)
-### Approach 2: Generate Graders from Data
+### Approach 2: Generate Rubrics as Graders
-Automatically analyze evaluation data to create structured scoring rubrics. Provide 50-500 labeled examples, and the generator extracts patterns to build interpretable criteria. Generated graders produce explicit rubrics that explain scoring decisions, ideal for scenarios requiring transparency and rapid refinement.
+Automatically generate evaluation rubrics and create graders from them. Two approaches are available: **Simple Rubric** generates rubrics from a task description (zero-shot, no data required), while **Iterative Rubric** learns from 50-500 labeled examples to extract evaluation patterns. Both produce explicit rubrics that explain scoring decisions, making them ideal for scenarios requiring transparency and rapid refinement.
-**Learn more:** [Generate Graders from Data →](generate_graders_from_data.md)
+**Learn more:** [Generate Rubrics as Graders →](generate_rubrics_as_graders.md)
### Approach 3: Train Reward Models
@@ -86,7 +86,7 @@ Train neural networks on preference data to learn evaluation criteria automatica
## Next Steps
- [Create Custom Graders](create_custom_graders.md) — Build graders using LLM or code-based logic
-- [Generate Graders from Data](generate_graders_from_data.md) — Auto-generate rubrics from labeled data
+- [Generate Rubrics as Graders](generate_rubrics_as_graders.md) — Automatically generate graders from a task description or labeled data
- [Train with GRPO](training_grpo.md) — Train generative judge models with reinforcement learning
- [Built-in Graders](../built_in_graders/overview.md) — Explore pre-built graders to customize
- [Run Grading Tasks](../running_graders/run_tasks.md) — Deploy graders at scale with batch workflows
diff --git a/docs/index.md b/docs/index.md
index a6260660..6be0b039 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -20,7 +20,7 @@ OpenJudge unifies evaluation metrics and reward signals into a single, standardi
+ **Flexible Grader Building**: Choose the build method that fits your requirements:
- **Customization:** Easily extend or modify pre-defined graders to fit your specific needs. Custom Grader Development Guide →
- - **Data-Driven Rubrics:** Have a few examples but no clear rules? Use our tools to automatically generate white-box evaluation criteria (Rubrics) based on your data. Automatic Rubric Generation Tutorial →
+ - **Generate Rubrics:** Need evaluation criteria but don't want to write them manually? Use **Simple Rubric** (from task description) or **Iterative Rubric** (from labeled data) to automatically generate white-box evaluation rubrics. Generate Rubrics as Graders →
- **Training Judge Models:** For high-scale and specialized scenarios, we are developing the capability to train dedicated Judge models. Support for SFT, Bradley-Terry models, and Reinforcement Learning workflows is on the way to help you build high-performance, domain-specific graders. 🚧 Coming Soon
+ **Easy Integration**: We're actively building seamless connectors for mainstream observability platforms and training frameworks. Stay tuned!🚧 Coming Soon
@@ -139,7 +139,7 @@ OpenJudge unifies evaluation metrics and reward signals into a single, standardi