`docs/building_graders/overview.md` (6 additions, 6 deletions)

```diff
@@ -1,6 +1,6 @@
 # Building Custom Graders
 
-Extend OpenJudge beyond built-in evaluators by creating custom graders or training reward models. Build domain-specific evaluation logic that seamlessly integrates with OpenJudge's evaluation pipeline.
+Extend OpenJudge beyond built-in evaluators by creating custom graders or training judge models. Build domain-specific evaluation logic that seamlessly integrates with OpenJudge's evaluation pipeline.
 
 
 ## Why Build Custom Graders?
@@ -17,7 +17,7 @@ OpenJudge supports three paths for creating custom graders, each optimized for d
 | **Generate from Data** | 1-4 hours | 50-500 examples | Iterative refinement, transparent rubrics | Medium setup + pay-per-query |
-| **Train Reward Models** | 1-3 days | 1K-100K pairs | High-volume production (>1M queries/month) | High upfront, 10x lower per-query |
+| **Train Judge Models** | 1-3 days | 1K-100K pairs | High-volume production (>1M queries/month) | High upfront, 10x lower per-query |
 
 
 Use this decision tree to choose the right approach based on your data availability and requirements:
@@ -57,7 +57,7 @@ Use this decision tree to choose the right approach based on your data availabil
 
 **Choose based on your situation:**
 
-- **Have labeled data + need automation?** → Train a reward model
+- **Have labeled data + need automation?** → Train a judge model
 - **Have data + need fast iteration?** → Generate rubrics from data
 - **No data + need immediate results?** → Create custom graders
@@ -75,19 +75,19 @@ Automatically generate evaluation rubrics and create graders. Two approaches ava
 **Learn more:** [Generate Rubrics as Graders →](generate_rubrics_as_graders.md)
 
 
-### Approach 3: Train Reward Models
+### Approach 3: Train Judge Models
 
 Train neural networks on preference data to learn evaluation criteria automatically. Supports Bradley-Terry (preference pairs), Generative Pointwise (absolute scores), and Generative Pairwise (comparison decisions). Requires 1K-100K examples and 1-3 days but delivers highly consistent evaluation at 10x lower per-query cost—ideal for high-volume scenarios exceeding 1M queries per month.
```
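For intuition on the Bradley-Terry objective named in that last paragraph, here is a minimal sketch of the preference-pair loss in PyTorch. It is an illustration under assumed scalar reward scores, not OpenJudge's actual training code, and the tensor names and toy values are my own.

```python
import torch
import torch.nn.functional as F

def bradley_terry_loss(chosen_scores: torch.Tensor,
                       rejected_scores: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry objective: P(chosen > rejected) = sigmoid(s_c - s_r),
    minimized as the negative log-likelihood over preference pairs."""
    # logsigmoid is the numerically stable form of log(sigmoid(x))
    return -F.logsigmoid(chosen_scores - rejected_scores).mean()

# Toy batch of four preference pairs: scalar scores from a reward head
chosen = torch.tensor([1.2, 0.7, 2.1, 0.3])
rejected = torch.tensor([0.4, 0.9, 1.5, -0.2])
print(bradley_terry_loss(chosen, rejected))  # shrinks as chosen pulls ahead
```

The generative pointwise and pairwise variants mentioned alongside it emit scores or comparison decisions as text rather than through a scalar head, but the training data shapes (absolute scores vs. pairs) follow the same pattern.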
`docs/get_started/build_reward.md` (1 addition, 1 deletion)

```diff
@@ -259,7 +259,7 @@ asyncio.run(main())
 
 Running this code evaluates both responses across three quality dimensions and produces a training reward for each. These rewards can then feed into RLHF or DPO algorithms to optimize your chatbot. The output shows individual dimension scores alongside the final aggregated reward, helping you understand what drives the training signal.
 
-You now have a foundation for building reward models. Start with a single grader to validate your setup, then progressively add more dimensions as needed. The key is choosing graders that align with your application's requirements and weighting them appropriately based on what matters most for your use case.
+You now have a foundation for building judge models. Start with a single grader to validate your setup, then progressively add more dimensions as needed. The key is choosing graders that align with your application's requirements and weighting them appropriately based on what matters most for your use case.
```
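As a reminder of what that tutorial's aggregation step is doing, here is a minimal sketch of combining per-dimension grader scores into one training reward. The dimension names and weights are illustrative assumptions, not the tutorial's exact values.

```python
# Hypothetical per-dimension scores in [0, 1] from three graders
scores = {"helpfulness": 0.82, "harmlessness": 0.97, "fluency": 0.74}

# Weights encode what matters most for the use case; keep them summing to 1
weights = {"helpfulness": 0.5, "harmlessness": 0.3, "fluency": 0.2}

# The final reward is the weighted sum that feeds RLHF/DPO as the signal
reward = sum(weights[dim] * score for dim, score in scores.items())
print(f"reward = {reward:.3f}")
```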
```diff
-| **Trajectory** | Multi-step reasoning paths and efficiency | Cost optimization, training reward models |
+| **Trajectory** | Multi-step reasoning paths and efficiency | Cost optimization, training judge models |
 
 !!! tip "Evaluation Strategy"
     Start with **Final Response** evaluation to establish baseline success rates. When failures occur, use **Single Step** evaluation to pinpoint root causes. Use **Trajectory** evaluation to detect systemic issues like loops or inefficiencies.
```
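That tip's tiered strategy reduces to a simple control flow. The sketch below is hypothetical throughout (the grader callables, threshold, and function names are stand-ins, not OpenJudge's API), but it shows the drill-down order: final response first, individual steps only on failure.

```python
import asyncio
from typing import Awaitable, Callable

# Hypothetical grader type: maps response text to a score in [0, 1]
Grader = Callable[[str], Awaitable[float]]

async def diagnose(final_response: str, steps: list[str],
                   grade_final: Grader, grade_step: Grader,
                   pass_threshold: float = 0.7) -> dict:
    """Final-response evaluation first; single-step only when it fails."""
    report = {"final": await grade_final(final_response), "steps": None}
    if report["final"] < pass_threshold:
        # Failure: score each step to pinpoint the root cause
        report["steps"] = [await grade_step(step) for step in steps]
    return report

async def main():
    async def dummy_grader(text: str) -> float:  # stand-in for the sketch
        return 0.5
    print(await diagnose("final answer", ["step 1", "step 2"],
                         dummy_grader, dummy_grader))

asyncio.run(main())
```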