* docs: update index page with new features and integrations
- Update Train Judge Models card (remove WIP badge, add link)
- Update Generate Rubrics card to cover both Zero-Shot and Data-Driven approaches
- Add LangSmith and Langfuse integration cards (remove WIP badges)
- Add Zero-Shot Evaluation to Quick Tutorials section
- Reduce card header font size for better text fitting
- Update Training Frameworks description
* fix pre-commit error
Choose the build method that fits your requirements:
- * **Customization:** Easily extend or modify pre-defined graders to fit your specific needs. 👉 [Custom Grader Development Guide](https://modelscope.github.io/OpenJudge/building_graders/create_custom_graders/)
- * **Generate Rubrics:** Need evaluation criteria but don't want to write them manually? Use **Simple Rubric** (from task description) or **Iterative Rubric** (from labeled data) to automatically generate white-box evaluation rubrics. 👉 [Generate Rubrics as Graders](https://modelscope.github.io/OpenJudge/building_graders/generate_rubrics_as_graders/)
- * **Training Judge Models (Coming Soon 🚀):** For high-scale and specialized scenarios, we are developing the capability to train dedicated Judge models. Support for SFT, Bradley-Terry models, and Reinforcement Learning workflows is on the way to help you build high-performance, domain-specific graders.
+ * **Customization:** Clear requirements, but no existing grader? If you have explicit rules or logic, use our Python interfaces or prompt templates to quickly define your own grader. 👉 [Custom Grader Development Guide](https://modelscope.github.io/OpenJudge/building_graders/create_custom_graders/)
+ * **Zero-shot Rubrics Generation:** Not sure what criteria to use, and no labeled data yet? Just provide a task description and optional sample queries—the LLM will automatically generate evaluation rubrics for you. Ideal for rapid prototyping when you want to get started immediately. 👉 [Zero-shot Rubrics Generation Guide](https://modelscope.github.io/OpenJudge/building_graders/generate_rubrics_as_graders/#simple-rubric-zero-shot-generation)
+ * **Data-driven Rubrics Generation:** Ambiguous requirements, but have a few examples? Use the GraderGenerator to automatically summarize evaluation rubrics from your annotated data and generate an LLM-based grader. 👉 [Data-driven Rubrics Generation Guide](https://modelscope.github.io/OpenJudge/building_graders/generate_rubrics_as_graders/#iterative-rubric-data-driven-generation)
+ * **Training Judge Models:** Massive data and need peak performance? Use our training pipeline to train a dedicated Judge Model. This is ideal for complex scenarios where prompt-based grading falls short. 👉 [Train Judge Models](https://modelscope.github.io/OpenJudge/building_graders/training_judge_models/)
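To make the **Customization** bullet above concrete, here is a minimal sketch of a rule-based grader. OpenJudge's real base class and result type are not shown in this diff, so `BaseGrader` and `GraderResult` below are local stand-ins rather than the library's confirmed API:

```python
# Hypothetical sketch of a rule-based custom grader. BaseGrader and
# GraderResult are local stand-ins; OpenJudge's actual interfaces may
# differ, so follow the Custom Grader Development Guide for real code.
from dataclasses import dataclass


@dataclass
class GraderResult:
    score: float  # normalized to [0, 1]
    reason: str   # human-readable explanation


class BaseGrader:
    def grade(self, query: str, response: str) -> GraderResult:
        raise NotImplementedError


class MaxLengthGrader(BaseGrader):
    """Rule-based grader: penalize responses that exceed a length budget."""

    def __init__(self, max_chars: int = 500):
        self.max_chars = max_chars

    def grade(self, query: str, response: str) -> GraderResult:
        if len(response) <= self.max_chars:
            return GraderResult(1.0, "within length budget")
        overshoot = len(response) - self.max_chars
        # Linear penalty for overshooting the budget, floored at 0.
        score = max(0.0, 1.0 - overshoot / self.max_chars)
        return GraderResult(score, f"{overshoot} chars over budget")


if __name__ == "__main__":
    grader = MaxLengthGrader(max_chars=100)
    print(grader.grade("Summarize the report.", "A short summary."))
```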
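The **Zero-shot Rubrics Generation** bullet reduces to one prompt pattern: hand an LLM the task description plus optional sample queries and ask for rubrics. OpenJudge ships its own generator for this, so the sketch below only reproduces the pattern with a direct OpenAI-compatible call; the model name is a placeholder:

```python
# Zero-shot rubric generation sketch. The prompt pattern is the point;
# OpenJudge's own generator wraps this step, and "gpt-4o-mini" is just
# a placeholder model name.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

task_description = "Judge chatbot answers to customer-support questions."
sample_queries = ["How do I reset my password?", "Why was I charged twice?"]

prompt = (
    "You are designing an evaluation rubric.\n"
    f"Task: {task_description}\n"
    f"Example queries: {sample_queries}\n"
    "Produce 3-5 numbered, checkable rubric criteria for grading responses."
)

completion = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": prompt}],
)
print(completion.choices[0].message.content)  # the generated rubrics
```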
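For the **Data-driven Rubrics Generation** bullet, `GraderGenerator` is the only name this diff gives us and its signature is not shown, so the sketch below reproduces the underlying loop by hand: feed labeled preference pairs to an LLM and ask it to induce the rubrics that separate chosen from rejected responses.

```python
# Data-driven rubric induction sketch. OpenJudge's GraderGenerator wraps
# this kind of loop; since its real interface is not shown in the diff,
# we call an OpenAI-compatible endpoint directly with a placeholder model.
import json

from openai import OpenAI

client = OpenAI()

# Tiny annotated dataset: each record has a query plus a preferred
# (chosen) and a dispreferred (rejected) response.
labeled_data = [
    {
        "query": "How do I reset my password?",
        "chosen": "Go to Settings > Security > Reset password, then follow the email link.",
        "rejected": "Just google it.",
    },
    {
        "query": "Why was I charged twice?",
        "chosen": "Duplicate holds usually clear in 3-5 days; contact us if this one does not.",
        "rejected": "Charges happen sometimes.",
    },
]

prompt = (
    "From the labeled examples below, summarize the evaluation rubrics "
    "that separate the chosen from the rejected responses. Return a "
    "numbered list of checkable criteria.\n\n" + json.dumps(labeled_data, indent=2)
)

completion = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": prompt}],
)
print(completion.choices[0].message.content)  # induced rubrics
```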
- ### 🔌 Easy Integration (🚧 Coming Soon)
+ ### 🔌 Easy Integration
- We're actively building seamless connectors for mainstream observability platforms and training frameworks. Stay tuned! → See [Integrations](#-integrations)
+ Using mainstream observability platforms like **LangSmith** or **Langfuse**? We offer seamless integration to enhance their evaluators and automated evaluation capabilities. We're also building integrations with training frameworks like **verl**. 👉 See [Integrations](#-integrations) for details.
----
## News
@@ -163,17 +165,18 @@ if __name__ == "__main__":
## 🔗 Integrations
- Seamlessly connect OpenJudge with mainstream observability and training platforms, with more integrations on the way:
+ Seamlessly connect OpenJudge with mainstream observability and training platforms:
- | Category | Status | Platforms |
- |:---------|:------:|:----------|
- | **Observability** | 🟡 In Progress | [LangSmith](https://smith.langchain.com/), [LangFuse](https://langfuse.com/), [Arize Phoenix](https://github.com/Arize-ai/phoenix) |
docs/index.md: 48 additions & 27 deletions
@@ -19,11 +19,12 @@ OpenJudge unifies evaluation metrics and reward signals into a single, standardi
- **Quality Assurance:** Built for reliability. Every grader comes with benchmark datasets and pytest integration for immediate quality validation. <a href="https://huggingface.co/datasets/agentscope-ai/OpenJudge" class="feature-link" target="_blank">View Benchmark Datasets <span class="link-arrow">→</span></a>
+ - **Flexible Grader Building**: Choose the build method that fits your requirements:
- - **Customization:** Easily extend or modify pre-defined graders to fit your specific needs. <a href="building_graders/create_custom_graders/" class="feature-link">Custom Grader Development Guide <span class="link-arrow">→</span></a>
- - **Generate Rubrics:** Need evaluation criteria but don't want to write them manually? Use **Simple Rubric** (from task description) or **Iterative Rubric** (from labeled data) to automatically generate white-box evaluation rubrics. <a href="building_graders/generate_rubrics_as_graders/" class="feature-link">Generate Rubrics as Graders <span class="link-arrow">→</span></a>
- - **Training Judge Models:** For high-scale and specialized scenarios, we are developing the capability to train dedicated Judge models. Support for SFT, Bradley-Terry models, and Reinforcement Learning workflows is on the way to help you build high-performance, domain-specific graders. <span class="badge-wip">🚧 Coming Soon</span>
+ - **Customization:** Clear requirements, but no existing grader? If you have explicit rules or logic, use our Python interfaces or prompt templates to quickly define your own grader. <a href="building_graders/create_custom_graders/" class="feature-link">Custom Grader Development Guide <span class="link-arrow">→</span></a>
+ - **Zero-shot Rubrics Generation:** Not sure what criteria to use, and no labeled data yet? Just provide a task description and optional sample queries—the LLM will automatically generate evaluation rubrics for you. Ideal for rapid prototyping. <a href="building_graders/generate_rubrics_as_graders/#simple-rubric-zero-shot-generation" class="feature-link">Zero-shot Rubrics Generation Guide <span class="link-arrow">→</span></a>
+ - **Data-driven Rubrics Generation:** Ambiguous requirements, but have a few examples? Use the GraderGenerator to automatically summarize evaluation rubrics from your annotated data and generate an LLM-based grader. <a href="building_graders/generate_rubrics_as_graders/#iterative-rubric-data-driven-generation" class="feature-link">Data-driven Rubrics Generation Guide <span class="link-arrow">→</span></a>
+ - **Training Judge Models:** Massive data and need peak performance? Use our training pipeline to train a dedicated Judge Model. This is ideal for complex scenarios where prompt-based grading falls short. <a href="building_graders/training_judge_models/" class="feature-link">Train Judge Models <span class="link-arrow">→</span></a>
- - **Easy Integration**: We're actively building seamless connectors for mainstream observability platforms and training frameworks. Stay tuned! <span class="badge-wip">🚧 Coming Soon</span>
+ - **Easy Integration**: Using mainstream observability platforms like **LangSmith** or **Langfuse**? We offer seamless integration to enhance their evaluators and automated evaluation capabilities. We're also building integrations with training frameworks like **verl**.
</div>
@@ -33,26 +34,38 @@ OpenJudge unifies evaluation metrics and reward signals into a single, standardi
<b>Comprehensive evaluation for AI Agents:</b> Learn to evaluate the full lifecycle—including final response, trajectory, tool usage, plan, memory, reflection, observation—using OpenJudge Graders.
<b>Construct High-Quality Reward Signals:</b> Create robust reward functions for model and agent alignment by aggregating diverse graders with custom weighting and high-concurrency support.
+ <p class="card-desc">
+ <b>Quality reward signals:</b> Aggregate graders with custom weighting for model alignment.
</p>
</a>
+
+
</div>
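The card above compresses the reward-signal story into one line: aggregate graders with custom weighting. Here is a dependency-free sketch of that aggregation step; OpenJudge's own runner also adds high-concurrency execution, which this toy version omits.

```python
# Weighted aggregation sketch: combine several graders' scores into one
# reward signal. Plain-Python stand-in for OpenJudge's aggregation; the
# real library also runs graders concurrently.
from typing import Callable, List, Tuple

# A "grader" here is just a function (query, response) -> float in [0, 1].
Grader = Callable[[str, str], float]


def length_grader(query: str, response: str) -> float:
    return 1.0 if len(response) <= 500 else 0.5


def politeness_grader(query: str, response: str) -> float:
    return 1.0 if "please" in response.lower() else 0.7


def aggregate(
    graders: List[Tuple[Grader, float]], query: str, response: str
) -> float:
    """Weighted average of grader scores; weights need not sum to 1."""
    total_weight = sum(weight for _, weight in graders)
    weighted_sum = sum(g(query, response) * weight for g, weight in graders)
    return weighted_sum / total_weight


reward = aggregate(
    [(length_grader, 0.3), (politeness_grader, 0.7)],
    "How do I reset my password?",
    "Please follow the reset link in Settings.",
)
print(f"aggregated reward: {reward:.3f}")
```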
@@ -141,41 +154,49 @@ OpenJudge unifies evaluation metrics and reward signals into a single, standardi
- <b>Ambiguous requirements, but have few examples?</b> Use the GraderGenerator to automatically summarize evaluation Rubrics from your annotated data, and generate a llm-based grader.
+ <b>Auto-generate evaluation criteria.</b> Use Zero-Shot generation from task descriptions, or Data-Driven generation to learn rubrics from labeled preference data.
- <b>Massive data and need peak performance?</b> Use our training pipeline to train a dedicated Judge Model. This is ideal for complex scenarios where prompt-based grading falls short.
+ <b>Massive data and need peak performance?</b> Train dedicated judge models using SFT, Bradley-Terry, or GRPO. Supports both scalar rewards and generative evaluation with reasoning.
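The new card text names Bradley-Terry among the training objectives. For reference, the standard Bradley-Terry pairwise loss on preference pairs looks like this; the score tensors are toy placeholders, and OpenJudge's actual training pipeline is not shown in this diff.

```python
# Bradley-Terry pairwise objective sketch: the standard way to train a
# scalar judge/reward model from preference pairs. Random tensors stand
# in for the judge model's scores on chosen/rejected responses.
import torch
import torch.nn.functional as F

# Scores for a batch of 8 preference pairs (normally produced by the
# judge model being trained; random placeholders here).
chosen_scores = torch.randn(8, requires_grad=True)
rejected_scores = torch.randn(8)

# Bradley-Terry models P(chosen beats rejected) = sigmoid(s_c - s_r),
# so training minimizes -log sigmoid(s_c - s_r) over the batch.
loss = -F.logsigmoid(chosen_scores - rejected_scores).mean()
loss.backward()  # gradients would flow into the judge model's parameters
print(float(loss))
```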
- Seamlessly connect with mainstream platforms like <strong>LangSmith</strong> and <strong>LangFuse</strong>. Streamline your evaluation pipelines and monitor agent performance with flexible APIs.
+ Build external evaluation pipelines for LangSmith. Wrap OpenJudge graders as LangSmith evaluators and run batch evaluations with GradingRunner.
+ Fetch traces from Langfuse, evaluate with OpenJudge graders, and push scores back. Supports batch processing and score aggregation.
+ </p>
+ </a>
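A sketch of the LangSmith direction described above: wrap a grader as a custom evaluator and hand it to LangSmith's `evaluate` helper. The grader, target function, and dataset name are placeholders, `GradingRunner` (which would batch the grading) is omitted, and the evaluator signature should be checked against the current LangSmith SDK.

```python
# Sketch: wrap an OpenJudge-style grader as a LangSmith evaluator.
# The evaluator shape (run, example) -> {"key", "score"} follows the
# LangSmith SDK as of this writing; verify against current docs.
from langsmith.evaluation import evaluate


def my_grader(query: str, response: str) -> float:
    """Stand-in for an OpenJudge grader: returns a score in [0, 1]."""
    return 1.0 if response.strip() else 0.0


def openjudge_evaluator(run, example) -> dict:
    # run.outputs / example.inputs shapes depend on your target function.
    query = example.inputs.get("question", "")
    response = (run.outputs or {}).get("answer", "")
    return {"key": "openjudge_quality", "score": my_grader(query, response)}


def target(inputs: dict) -> dict:
    return {"answer": f"Echo: {inputs['question']}"}  # toy system under test


results = evaluate(
    target,
    data="my-dataset",  # placeholder: an existing LangSmith dataset name
    evaluators=[openjudge_evaluator],
)
```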
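The Langfuse loop is the mirror image: fetch traces, grade them, and push scores back. Method names follow Langfuse's v2 Python SDK (`fetch_traces`, `score`) and may differ in newer versions; the grader is again a stand-in.

```python
# Sketch: pull traces from Langfuse, grade them, push scores back.
# fetch_traces/score are Langfuse v2 Python SDK methods; newer SDK
# versions may rename them.
from langfuse import Langfuse

langfuse = Langfuse()  # reads LANGFUSE_* keys from the environment


def my_grader(query: str, response: str) -> float:
    """Stand-in for an OpenJudge grader: returns a score in [0, 1]."""
    return 1.0 if response else 0.0


for trace in langfuse.fetch_traces(limit=20).data:
    value = my_grader(str(trace.input or ""), str(trace.output or ""))
    langfuse.score(
        trace_id=trace.id,
        name="openjudge_quality",
        value=value,
    )

langfuse.flush()  # ensure queued scores are actually sent
```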
<div class="feature-card-wip">
<div class="card-header">
@@ -184,7 +205,7 @@ OpenJudge unifies evaluation metrics and reward signals into a single, standardi
<span class="badge-wip">🚧 Work in Progress</span>
</div>
<p class="card-desc">
- Directly integrate into training loops such as <strong>VERL</strong>. Use Graders as high-quality reward functions for RLHF/RLAIF to align models effectively.
+ Directly integrate into training loops such as <strong>VERL</strong>. Use Graders as high-quality reward functions for fine-tuning to align models effectively.
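Finally, a sketch of what the verl hookup could look like. In recent verl versions a custom reward is commonly a `compute_score(data_source, solution_str, ground_truth, extra_info)` function registered through the `custom_reward_function` config; verify both against your verl version, and note the grader here is a stand-in.

```python
# Sketch: expose an OpenJudge-style grader as a verl custom reward.
# The compute_score signature and the custom_reward_function config key
# are assumptions about verl; check them against your installed version.


def my_grader(query: str, response: str) -> float:
    """Stand-in for an OpenJudge grader: returns a score in [0, 1]."""
    return min(1.0, len(response) / 200.0)


def compute_score(data_source, solution_str, ground_truth, extra_info=None):
    """Reward for one rollout: grade the generated solution string."""
    query = (extra_info or {}).get("question", "")
    return my_grader(query, solution_str)
```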
0 commit comments