`docs/building_graders/overview.md` (6 additions, 6 deletions)

```diff
@@ -1,6 +1,6 @@
 # Building Custom Graders
 
-Extend OpenJudge beyond built-in evaluators by creating custom graders or training reward models. Build domain-specific evaluation logic that seamlessly integrates with OpenJudge's evaluation pipeline.
+Extend OpenJudge beyond built-in evaluators by creating custom graders or training judge models. Build domain-specific evaluation logic that seamlessly integrates with OpenJudge's evaluation pipeline.
 
 
 ## Why Build Custom Graders?
@@ -17,7 +17,7 @@ OpenJudge supports three paths for creating custom graders, each optimized for d
 | **Generate from Data** | 1-4 hours | 50-500 examples | Iterative refinement, transparent rubrics | Medium setup + pay-per-query |
-| **Train Reward Models** | 1-3 days | 1K-100K pairs | High-volume production (>1M queries/month) | High upfront, 10x lower per-query |
+| **Train Judge Models** | 1-3 days | 1K-100K pairs | High-volume production (>1M queries/month) | High upfront, 10x lower per-query |
 
 
 Use this decision tree to choose the right approach based on your data availability and requirements:
@@ -57,7 +57,7 @@ Use this decision tree to choose the right approach based on your data availabil
 
 **Choose based on your situation:**
 
-- **Have labeled data + need automation?** → Train a reward model
+- **Have labeled data + need automation?** → Train a judge model
 - **Have data + need fast iteration?** → Generate rubrics from data
 - **No data + need immediate results?** → Create custom graders
@@ -75,19 +75,19 @@ Automatically generate evaluation rubrics and create graders. Two approaches ava
 **Learn more:** [Generate Rubrics as Graders →](generate_rubrics_as_graders.md)
 
 
-### Approach 3: Train Reward Models
+### Approach 3: Train Judge Models
 
 Train neural networks on preference data to learn evaluation criteria automatically. Supports Bradley-Terry (preference pairs), Generative Pointwise (absolute scores), and Generative Pairwise (comparison decisions). Requires 1K-100K examples and 1-3 days but delivers highly consistent evaluation at 10x lower per-query cost—ideal for high-volume scenarios exceeding 1M queries per month.
```
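For intuition on the Bradley-Terry objective named in that last paragraph, here is a minimal sketch of the preference-pair loss in PyTorch. It is an illustration under assumed scalar reward scores, not OpenJudge's actual training code, and the tensor names and toy values are my own.

```python
import torch
import torch.nn.functional as F

def bradley_terry_loss(chosen_scores: torch.Tensor,
                       rejected_scores: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry objective: P(chosen > rejected) = sigmoid(s_c - s_r),
    minimized as the negative log-likelihood over preference pairs."""
    # logsigmoid is the numerically stable form of log(sigmoid(x))
    return -F.logsigmoid(chosen_scores - rejected_scores).mean()

# Toy batch of four preference pairs: scalar scores from a reward head
chosen = torch.tensor([1.2, 0.7, 2.1, 0.3])
rejected = torch.tensor([0.4, 0.9, 1.5, -0.2])
print(bradley_terry_loss(chosen, rejected))  # shrinks as chosen pulls ahead
```

The generative pointwise and pairwise variants mentioned alongside it emit scores or comparison decisions as text rather than through a scalar head, but the training data shapes (absolute scores vs. pairs) follow the same pattern.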
`docs/get_started/build_reward.md` (1 addition, 1 deletion)

```diff
@@ -259,7 +259,7 @@ asyncio.run(main())
 
 Running this code evaluates both responses across three quality dimensions and produces a training reward for each. These rewards can then feed into RLHF or DPO algorithms to optimize your chatbot. The output shows individual dimension scores alongside the final aggregated reward, helping you understand what drives the training signal.
 
-You now have a foundation for building reward models. Start with a single grader to validate your setup, then progressively add more dimensions as needed. The key is choosing graders that align with your application's requirements and weighting them appropriately based on what matters most for your use case.
+You now have a foundation for building judge models. Start with a single grader to validate your setup, then progressively add more dimensions as needed. The key is choosing graders that align with your application's requirements and weighting them appropriately based on what matters most for your use case.
```
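As a reminder of what that tutorial's aggregation step is doing, here is a minimal sketch of combining per-dimension grader scores into one training reward. The dimension names and weights are illustrative assumptions, not the tutorial's exact values.

```python
# Hypothetical per-dimension scores in [0, 1] from three graders
scores = {"helpfulness": 0.82, "harmlessness": 0.97, "fluency": 0.74}

# Weights encode what matters most for the use case; keep them summing to 1
weights = {"helpfulness": 0.5, "harmlessness": 0.3, "fluency": 0.2}

# The final reward is the weighted sum that feeds RLHF/DPO as the signal
reward = sum(weights[dim] * score for dim, score in scores.items())
print(f"reward = {reward:.3f}")
```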
```diff
-| **Trajectory** | Multi-step reasoning paths and efficiency | Cost optimization, training reward models |
+| **Trajectory** | Multi-step reasoning paths and efficiency | Cost optimization, training judge models |
 
 !!! tip "Evaluation Strategy"
     Start with **Final Response** evaluation to establish baseline success rates. When failures occur, use **Single Step** evaluation to pinpoint root causes. Use **Trajectory** evaluation to detect systemic issues like loops or inefficiencies.
```
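That tip's tiered strategy reduces to a simple control flow. The sketch below is hypothetical throughout (the grader callables, threshold, and function names are stand-ins, not OpenJudge's API), but it shows the drill-down order: final response first, individual steps only on failure.

```python
import asyncio
from typing import Awaitable, Callable

# Hypothetical grader type: maps response text to a score in [0, 1]
Grader = Callable[[str], Awaitable[float]]

async def diagnose(final_response: str, steps: list[str],
                   grade_final: Grader, grade_step: Grader,
                   pass_threshold: float = 0.7) -> dict:
    """Final-response evaluation first; single-step only when it fails."""
    report = {"final": await grade_final(final_response), "steps": None}
    if report["final"] < pass_threshold:
        # Failure: score each step to pinpoint the root cause
        report["steps"] = [await grade_step(step) for step in steps]
    return report

async def main():
    async def dummy_grader(text: str) -> float:  # stand-in for the sketch
        return 0.5
    print(await diagnose("final answer", ["step 1", "step 2"],
                         dummy_grader, dummy_grader))

asyncio.run(main())
```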