2 changes: 1 addition & 1 deletion README.md
@@ -98,7 +98,7 @@ Access **50+ production-ready graders** featuring a comprehensive taxonomy, rigo
### 🛠️ Flexible Grader Building Methods
Choose the build method that fits your requirements:
* **Customization:** Easily extend or modify pre-defined graders to fit your specific needs. 👉 [Custom Grader Development Guide](https://modelscope.github.io/OpenJudge/building_graders/create_custom_graders/)
* **Data-Driven Rubrics:** Have a few examples but no clear rules? Use our tools to automatically generate white-box evaluation criteria (Rubrics) based on your data.👉 [Automatic Rubric Generation Tutorial](https://modelscope.github.io/OpenJudge/building_graders/generate_graders_from_data/)
* **Generate Rubrics:** Need evaluation criteria but don't want to write them manually? Use **Simple Rubric** (from task description) or **Iterative Rubric** (from labeled data) to automatically generate white-box evaluation rubrics. 👉 [Generate Rubrics as Graders](https://modelscope.github.io/OpenJudge/building_graders/generate_rubrics_as_graders/)
* **Training Judge Models (Coming Soon 🚀):** For high-scale and specialized scenarios, we are developing the capability to train dedicated Judge models. Support for SFT, Bradley-Terry models, and Reinforcement Learning workflows is on the way to help you build high-performance, domain-specific graders.


2 changes: 1 addition & 1 deletion README_zh.md
@@ -98,7 +98,7 @@ OpenJudge unifies evaluation metrics and reward signals into a standardized **Grader** interface
### 🛠️ Flexible Grader Building Methods
Choose the build method that fits your requirements:
* **Customization:** Easily extend or modify pre-defined graders to fit your specific needs. 👉 [Custom Grader Development Guide](https://modelscope.github.io/OpenJudge/building_graders/create_custom_graders/)
* **Data-Driven Rubrics:** Have a few examples but no clear rules? Use our tools to automatically generate white-box evaluation criteria (Rubrics) based on your data. 👉 [Automatic Rubric Generation Tutorial](https://modelscope.github.io/OpenJudge/building_graders/generate_graders_from_data/)
* **Generate Rubrics:** Need evaluation criteria but don't want to write them manually? Use **Simple Rubric** (from a task description) or **Iterative Rubric** (from labeled data) to automatically generate white-box evaluation rubrics. 👉 [Generate Rubrics as Graders](https://modelscope.github.io/OpenJudge/building_graders/generate_rubrics_as_graders/)
* **Training Judge Models (Coming Soon 🚀):** For high-scale and specialized scenarios, we are developing the capability to train dedicated Judge models. Support for SFT, Bradley-Terry models, and Reinforcement Learning workflows is on the way to help you build high-performance, domain-specific graders.


2 changes: 1 addition & 1 deletion docs/building_graders/create_custom_graders.md
@@ -301,7 +301,7 @@ When running graders, focus on configuring data mappers to connect your dataset

## Next Steps

- [Generate Graders from Data](generate_graders_from_data.md) — Automate grader creation from labeled examples
- [Generate Rubrics as Graders](generate_rubrics_as_graders.md) — Automatically generate graders from task description or labeled data
- [Run Grading Tasks](../running_graders/run_tasks.md) — Evaluate your models at scale
- [Grader Analysis](../running_graders/grader_analysis.md) — Validate and analyze grader results

docs/building_graders/generate_graders_from_data.md → generate_rubrics_as_graders.md
@@ -1,10 +1,15 @@
# Generate Graders from Data
# Generate Rubrics as Graders

Automatically create evaluation graders from labeled data instead of manually designing criteria. The system learns evaluation rubrics by analyzing what makes responses good or bad in your dataset.
Automatically create evaluation graders instead of manually designing criteria. OpenJudge provides two approaches:

| Approach | Module | Data Required | Best For |
|----------|--------|---------------|----------|
| **Simple Rubric** | `simple_rubric` | Task description only | Quick prototyping, when you have no labeled data |
| **Iterative Rubric** | `iterative_rubric` | Labeled preference data | Production quality, when you have training examples |

!!! tip "Key Benefits"
    - **Save time** — Eliminate manual rubric design
    - **Data-driven** — Learn criteria from actual examples
    - **Intelligent** — Learn criteria from labeled data (Iterative) or a task description (Simple)
    - **Consistent** — Produce reproducible evaluation standards
    - **Scalable** — Quickly prototype graders for new domains

@@ -29,42 +34,157 @@ Theme: Completeness
- With rubrics, evaluations become reproducible and explainable
- The challenge: manually writing good rubrics is time-consuming and requires domain expertise

**The solution:** Auto-Rubric automatically extracts these criteria from your labeled data.

**The solution:** Automatically extract these criteria from your task description (Simple Rubric) or labeled data (Iterative Rubric).

## How It Works

Auto-Rubric extracts evaluation rubrics from preference data without training. Based on [Auto-Rubric: Learning to Extract Generalizable Criteria for Reward Modeling](https://arxiv.org/abs/2510.17314).
## When to Use Each Approach

**Two-stage approach:**
### Simple Rubric (Zero-Shot)

1. **Infer query-specific rubrics** — For each labeled example, the system proposes criteria that explain why one response is better than another
2. **Generalize to core set** — Similar rubrics are merged and organized into a compact, non-redundant "Theme-Tips" structure
Use when you have a clear task description but **no labeled data**.

**Data efficiency:** Using just 70 preference pairs, this method enables smaller models to match or outperform fully-trained reward models.

<figure markdown="span">
![Auto-Rubric Pipeline Overview](../images/auto_rubric_overview.png){ width="100%" }
<figcaption>Auto-Rubric Pipeline: From preference data to evaluation rubrics</figcaption>
</figure>
!!! tip "Use Simple Rubric When"
- You need to quickly prototype a grader
- You have no labeled preference or scored data
- Your task is well-defined and you can describe it clearly
- You want to get started immediately without data collection

!!! warning "Limitations"
- Quality depends on task description clarity
- May not capture domain-specific nuances
- Less accurate than data-driven approaches

## When to Use This Approach
### Iterative Rubric (Data-Driven)

Suppose you have a dataset of query-response pairs with quality labels (scores or rankings), and you want to create a grader that can evaluate new responses using the same criteria.
Use when you have **labeled preference data** and want production-quality graders.

!!! tip "Use Data-Driven Generation When"
!!! tip "Use Iterative Rubric When"
- You have labeled evaluation data (preference pairs or scored responses)
- Manual rubric design is too time-consuming or subjective
- Your evaluation criteria are implicit and hard to articulate
- You need high accuracy for production use

!!! warning "Don't Use When"
- You have no labeled data
- You have no labeled data (use Simple Rubric instead)
- Your criteria are already well-defined and documented
- Simple Code-Based evaluation is sufficient

## Simple Rubric: Zero-Shot Generation

Generate evaluation rubrics from task descriptions without any labeled data. The system uses an LLM to create relevant evaluation criteria based on your task context.

### How It Works

1. **Provide task description** — Describe what your system does
2. **Add context** — Optionally provide usage scenario and sample queries
3. **Generate rubrics** — LLM creates evaluation criteria automatically
4. **Create grader** — Rubrics are injected into an LLMGrader

### Quick Example

```python
import asyncio
from openjudge.generator.simple_rubric import (
    SimpleRubricsGenerator,
    SimpleRubricsGeneratorConfig,
)
from openjudge.models import OpenAIChatModel
from openjudge.graders.schema import GraderMode

async def main():
    config = SimpleRubricsGeneratorConfig(
        grader_name="translation_quality_grader",
        model=OpenAIChatModel(model="qwen3-32b"),
        grader_mode=GraderMode.POINTWISE,
        task_description="English to Chinese translation assistant for technical documents. Generate rubrics in English.",
        scenario="Users need accurate, fluent translations of technical content. Please respond in English.",
        min_score=0,
        max_score=5,
    )

    generator = SimpleRubricsGenerator(config)
    grader = await generator.generate(
        dataset=[],
        sample_queries=[
            "Translate: 'Machine learning is a subset of AI.'",
            "Translate: 'The API endpoint returned an error.'",
        ],
    )

    return grader

grader = asyncio.run(main())
```

### Inspect Generated Rubrics

```python
print(grader.kwargs.get("rubrics"))
```

## Choose Your Evaluation Mode
**Output (Example):**

```
1. Accuracy: Whether the translation correctly conveys the technical meaning of the original English text
2. Fluency: Whether the translated Chinese is grammatically correct and natural-sounding
3. Technical Appropriateness: Whether the terminology used in the translation is appropriate for a technical context
4. Consistency: Whether similar terms or phrases are consistently translated throughout the response
```

### Evaluate Responses

```python
# Run inside an async context (e.g. within main() above).
result = await grader.aevaluate(
    query="Translate: 'The database query returned an error.'",
    response="数据库查询返回了一个错误。",
)
print(result)
```

**Output:**

```python
GraderScore(
    name='translation_quality_grader',
    reason="The translation is accurate and correctly conveys the technical meaning of the original English text. The Chinese sentence is grammatically correct and natural-sounding, making it fluent. The terminology used ('数据库查询' for 'database query', '返回了一个错误' for 'returned an error') is appropriate for a technical context. Additionally, the terms are consistently translated throughout the response.",
    score=5.0
)
```

### Simple Rubric Configuration

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `grader_name` | `str` | required | Name for the generated grader |
| `model` | `BaseChatModel` | required | LLM for generation and evaluation |
| `grader_mode` | `GraderMode` | `POINTWISE` | `POINTWISE` or `LISTWISE` |
| `task_description` | `str` | `""` | Description of the task |
| `scenario` | `str` | `None` | Optional usage context |
| `language` | `LanguageEnum` | `EN` | Language for prompts (`EN` or `ZH`) |
| `min_score` | `int` | `0` | Minimum score (pointwise only) |
| `max_score` | `int` | `1` | Maximum score (pointwise only) |
| `default_rubrics` | `List[str]` | `[]` | Fallback rubrics if generation fails |
| `max_retries` | `int` | `3` | Retry attempts for LLM calls |
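
For reference, here is a configuration sketch that exercises the optional parameters above. The `LanguageEnum` import path and the support-reply task are illustrative assumptions rather than confirmed API details:

```python
from openjudge.generator.simple_rubric import (
    SimpleRubricsGeneratorConfig,
)
from openjudge.models import OpenAIChatModel
from openjudge.graders.schema import GraderMode
# Import path for LanguageEnum is an assumption; adjust to your install.
from openjudge.generator.simple_rubric import LanguageEnum

config = SimpleRubricsGeneratorConfig(
    grader_name="support_reply_grader",   # hypothetical task
    model=OpenAIChatModel(model="qwen3-32b"),
    grader_mode=GraderMode.POINTWISE,
    task_description="Customer-support reply assistant for a SaaS product.",
    scenario="Agents need polite, accurate, actionable replies.",
    language=LanguageEnum.EN,             # prompts in English
    min_score=0,
    max_score=5,
    default_rubrics=[                     # fallback if generation fails
        "Accuracy: the reply addresses the user's actual question",
        "Tone: the reply is professional and empathetic",
    ],
    max_retries=3,
)
```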

## Iterative Rubric: Data-Driven Generation

Learn evaluation rubrics from labeled preference data. Based on [Auto-Rubric: Learning to Extract Generalizable Criteria for Reward Modeling](https://arxiv.org/abs/2510.17314).

### How It Works

**Two-stage approach:**

1. **Infer query-specific rubrics** — For each labeled example, the system proposes criteria that explain why one response is better than another
2. **Generalize to core set** — Similar rubrics are merged and organized into a compact, non-redundant "Theme-Tips" structure

**Data efficiency:** Using just 70 preference pairs, this method enables smaller models to match or outperform fully-trained reward models.

<figure markdown="span">
![Auto-Rubric Pipeline Overview](../images/auto_rubric_overview.png){ width="100%" }
<figcaption>Auto-Rubric Pipeline: From preference data to evaluation rubrics</figcaption>
</figure>
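
As a mental model, the two stages can be sketched in plain Python. This is illustrative pseudocode of the Auto-Rubric idea, not the OpenJudge API:

```python
from collections import defaultdict

def stage1_infer(propose_rubrics, labeled_examples):
    """Stage 1: for each preference pair, ask an LLM (wrapped by
    propose_rubrics) why the chosen response beats the rejected one."""
    candidates = []
    for ex in labeled_examples:
        candidates.extend(
            propose_rubrics(ex["query"], ex["chosen"], ex["rejected"])
        )
    return candidates

def stage2_generalize(candidates, theme_of):
    """Stage 2: merge similar rubrics into a compact, non-redundant
    Theme-Tips structure (theme_of could use embedding clustering)."""
    themes = defaultdict(set)
    for rubric in candidates:
        themes[theme_of(rubric)].add(rubric)
    return {theme: sorted(tips) for theme, tips in themes.items()}
```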

### Choose Your Evaluation Mode

| Mode | Config Class | Use Case | Data Format | Output |
|------|--------------|----------|-------------|--------|
@@ -76,7 +196,7 @@ Suppose you have a dataset of query-response pairs with quality labels (scores o
Pairwise is a special case of Listwise with exactly 2 responses. Use the same `IterativeListwiseRubricsGeneratorConfig` for both.
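
Before the full walkthroughs below, a minimal sketch of that pairwise-as-listwise flow might look like this. It assumes the `iterative_rubric` module mirrors the `simple_rubric` API shown earlier; the generator class name and the dataset field names (`query`, `responses`, `ranking`) are assumptions to verify against the complete examples:

```python
import asyncio
# Module path and generator class are assumptions mirroring the
# simple_rubric API; only the config class name is documented above.
from openjudge.generator.iterative_rubric import (
    IterativeListwiseRubricsGenerator,
    IterativeListwiseRubricsGeneratorConfig,
)
from openjudge.models import OpenAIChatModel

async def main():
    config = IterativeListwiseRubricsGeneratorConfig(
        grader_name="code_solution_comparator",
        model=OpenAIChatModel(model="qwen3-32b"),
    )

    # Pairwise preference data expressed as 2-response listwise items.
    # Field names are illustrative; see the complete examples for the
    # exact dataset schema.
    dataset = [
        {
            "query": "Deduplicate a list while preserving order.",
            "responses": [
                "def dedup(xs): return list(dict.fromkeys(xs))",
                "def dedup(xs): return list(set(xs))",  # loses order
            ],
            "ranking": [1, 2],  # first response preferred
        },
        # ... more labeled pairs (~70 can already be effective)
    ]

    generator = IterativeListwiseRubricsGenerator(config)
    grader = await generator.generate(dataset=dataset)
    return grader

grader = asyncio.run(main())
```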


## Complete Example: Build a Code Review Grader (Pointwise)
### Complete Example: Build a Code Review Grader (Pointwise)

Let's walk through a complete example: building a grader that evaluates code explanation quality.

@@ -218,7 +338,7 @@ GraderScore(
)
```

## Complete Example: Build a Code Solution Comparator (Pairwise)
### Complete Example: Build a Code Solution Comparator (Pairwise)

Let's build a grader that compares two code implementations and determines which solution is better. This is useful for code review, interview assessment, or selecting the best implementation from multiple candidates.

@@ -394,7 +514,7 @@ GraderRank(
```


## Configuration Reference
## Iterative Rubric Configuration Reference

### Core Parameters

@@ -427,9 +547,29 @@ GraderRank(
- `LISTWISE_EVALUATION_TEMPLATE` — for ranking


---

## Choosing Between Simple and Iterative Rubric

| Scenario | Recommended Approach |
|----------|---------------------|
| Quick prototype, no data | **Simple Rubric** |
| Production grader with labeled data | **Iterative Rubric** |
| Well-defined task, need fast setup | **Simple Rubric** |
| Complex domain, implicit criteria | **Iterative Rubric** |
| < 50 labeled examples | **Simple Rubric** (or collect more data) |
| 50-100+ labeled examples | **Iterative Rubric** |

!!! tip "Workflow Recommendation"
1. Start with **Simple Rubric** for quick prototyping
2. Collect preference data during initial deployment
3. Upgrade to **Iterative Rubric** when you have 50+ labeled examples

---

## Tips

### Data Quality
### Data Quality (Iterative Rubric)

!!! tip "Good Practices"
- Clear preference signals (good vs. bad is obvious)
@@ -440,7 +580,20 @@ GraderRank(
    - Ambiguous cases where labels are debatable
    - Noisy or contradictory labels

### Parameter Tuning
### Task Description Quality (Simple Rubric)

!!! tip "Good Practices"
- Be specific about what your system does
- Include the target audience or use case
- Mention key quality dimensions you care about
- Provide representative sample queries

!!! warning "Avoid"
- Vague descriptions like "chatbot" or "assistant"
- Missing context about the domain
- No sample queries (the LLM needs examples)

### Parameter Tuning (Iterative Rubric)

| Goal | Recommended Settings |
|------|---------------------|
8 changes: 4 additions & 4 deletions docs/building_graders/overview.md
@@ -68,11 +68,11 @@ Define evaluation logic using LLM judges or code-based functions with no trainin
**Learn more:** [Create Custom Graders →](create_custom_graders.md) | [Built-in Graders →](../built_in_graders/overview.md)


### Approach 2: Generate Graders from Data
### Approach 2: Generate Rubrics as Graders

Automatically analyze evaluation data to create structured scoring rubrics. Provide 50-500 labeled examples, and the generator extracts patterns to build interpretable criteria. Generated graders produce explicit rubrics that explain scoring decisions, ideal for scenarios requiring transparency and rapid refinement.
Automatically generate evaluation rubrics and create graders. Two approaches are available: **Simple Rubric** generates rubrics from task descriptions (zero-shot, no data required), while **Iterative Rubric** learns from 50-500 labeled examples to extract patterns. Both produce explicit rubrics that explain scoring decisions, ideal for scenarios requiring transparency and rapid refinement.

**Learn more:** [Generate Graders from Data →](generate_graders_from_data.md)
**Learn more:** [Generate Rubrics as Graders →](generate_rubrics_as_graders.md)


### Approach 3: Train Reward Models
@@ -86,7 +86,7 @@ Train neural networks on preference data to learn evaluation criteria automatica
## Next Steps

- [Create Custom Graders](create_custom_graders.md) — Build graders using LLM or code-based logic
- [Generate Graders from Data](generate_graders_from_data.md) — Auto-generate rubrics from labeled data
- [Generate Rubrics as Graders](generate_rubrics_as_graders.md) — Automatically generate graders from task description or labeled data
- [Train with GRPO](training_grpo.md) — Train generative judge models with reinforcement learning
- [Built-in Graders](../built_in_graders/overview.md) — Explore pre-built graders to customize
- [Run Grading Tasks](../running_graders/run_tasks.md) — Deploy graders at scale with batch workflows
4 changes: 2 additions & 2 deletions docs/index.md
@@ -20,7 +20,7 @@ OpenJudge unifies evaluation metrics and reward signals into a single, standardi

+ **Flexible Grader Building**: Choose the build method that fits your requirements:
- **Customization:** Easily extend or modify pre-defined graders to fit your specific needs. <a href="building_graders/create_custom_graders/" class="feature-link">Custom Grader Development Guide <span class="link-arrow">→</span></a>
- **Data-Driven Rubrics:** Have a few examples but no clear rules? Use our tools to automatically generate white-box evaluation criteria (Rubrics) based on your data. <a href="building_graders/generate_graders_from_data/" class="feature-link">Automatic Rubric Generation Tutorial <span class="link-arrow">→</span></a>
- **Generate Rubrics:** Need evaluation criteria but don't want to write them manually? Use **Simple Rubric** (from task description) or **Iterative Rubric** (from labeled data) to automatically generate white-box evaluation rubrics. <a href="building_graders/generate_rubrics_as_graders/" class="feature-link">Generate Rubrics as Graders <span class="link-arrow">→</span></a>
- **Training Judge Models:** For high-scale and specialized scenarios, we are developing the capability to train dedicated Judge models. Support for SFT, Bradley-Terry models, and Reinforcement Learning workflows is on the way to help you build high-performance, domain-specific graders. <span class="badge-wip">🚧 Coming Soon</span>

+ **Easy Integration**: We're actively building seamless connectors for mainstream observability platforms and training frameworks. Stay tuned! <span class="badge-wip">🚧 Coming Soon</span>
@@ -139,7 +139,7 @@ OpenJudge unifies evaluation metrics and reward signals into a single, standardi
</p>
</a>

<a href="building_graders/generate_graders_from_data/" class="feature-card-sm">
<a href="building_graders/generate_rubrics_as_graders/" class="feature-card-sm">
<div class="card-header">
<img src="https://unpkg.com/lucide-static@latest/icons/database.svg" class="card-icon card-icon-data">
<h3>Data-Driven Rubrics</h3>
2 changes: 1 addition & 1 deletion mkdocs.yml
@@ -34,7 +34,7 @@ nav:
- Building Graders:
  - Overview: building_graders/overview.md
  - Create Custom Graders: building_graders/create_custom_graders.md
  - Generate Graders from Data: building_graders/generate_graders_from_data.md
  - Generate Rubrics as Graders: building_graders/generate_rubrics_as_graders.md
  # - Train Reward Models: building_graders/training/overview.md # Coming soon

- Running Graders: