
Commit 2f6beee

docs: update index page with new features (#51)
* docs: update index page with new features and integrations
  - Update Train Judge Models card (remove WIP badge, add link)
  - Update Generate Rubrics card to cover both Zero-Shot and Data-Driven approaches
  - Add LangSmith and Langfuse integration cards (remove WIP badges)
  - Add Zero-Shot Evaluation to Quick Tutorials section
  - Reduce card header font size for better text fitting
  - Update Training Frameworks description
* update to fix pre-commit error
1 parent d9cdc75 commit 2f6beee

4 files changed (+79, -51 lines)


README.md

Lines changed: 15 additions & 12 deletions
@@ -97,14 +97,16 @@ Access **50+ production-ready graders** featuring a comprehensive taxonomy, rigo
 
 ### 🛠️ Flexible Grader Building Methods
 Choose the build method that fits your requirements:
-* **Customization:** Easily extend or modify pre-defined graders to fit your specific needs. 👉 [Custom Grader Development Guide](https://modelscope.github.io/OpenJudge/building_graders/create_custom_graders/)
-* **Generate Rubrics:** Need evaluation criteria but don't want to write them manually? Use **Simple Rubric** (from task description) or **Iterative Rubric** (from labeled data) to automatically generate white-box evaluation rubrics. 👉 [Generate Rubrics as Graders](https://modelscope.github.io/OpenJudge/building_graders/generate_rubrics_as_graders/)
-* **Training Judge Models ( Coming Soon🚀):** For high-scale and specialized scenarios, we are developing the capability to train dedicated Judge models. Support for SFT, Bradley-Terry models, and Reinforcement Learning workflows is on the way to help you build high-performance, domain-specific graders.
+* **Customization:** Clear requirements, but no existing grader? If you have explicit rules or logic, use our Python interfaces or Prompt templates to quickly define your own grader. 👉 [Custom Grader Development Guide](https://modelscope.github.io/OpenJudge/building_graders/create_custom_graders/)
+* **Zero-shot Rubrics Generation:** Not sure what criteria to use, and no labeled data yet? Just provide a task description and optional sample queries—the LLM will automatically generate evaluation rubrics for you. Ideal for rapid prototyping when you want to get started immediately. 👉 [Zero-shot Rubrics Generation Guide](https://modelscope.github.io/OpenJudge/building_graders/generate_rubrics_as_graders/#simple-rubric-zero-shot-generation)
+* **Data-driven Rubrics Generation:** Ambiguous requirements, but have a few examples? Use the GraderGenerator to automatically
+summarize evaluation rubrics from your annotated data and generate an LLM-based grader. 👉 [Data-driven Rubrics Generation Guide](https://modelscope.github.io/OpenJudge/building_graders/generate_rubrics_as_graders/#iterative-rubric-data-driven-generation)
+* **Training Judge Models:** Massive data and need peak performance? Use our training pipeline to train a dedicated Judge Model. This is ideal for complex scenarios where prompt-based grading falls short. 👉 [Train Judge Models](https://modelscope.github.io/OpenJudge/building_graders/training_judge_models/)
 
 
-### 🔌 Easy Integration (🚧 Coming Soon)
+### 🔌 Easy Integration
 
-We're actively building seamless connectors for mainstream observability platforms and training frameworks. Stay tuned! → See [Integrations](#-integrations)
+Using mainstream observability platforms like **LangSmith** or **Langfuse**? We offer seamless integration to enhance their evaluators and automated evaluation capabilities. We're also building integrations with training frameworks like **verl**. 👉 See [Integrations](#-integrations) for details
 
 ----
 ## News
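
The "Data-driven Rubrics Generation" bullet added above is built around a `GraderGenerator`. As a rough sketch of what that workflow might look like, assuming a hypothetical import path, constructor arguments, and method names (only the `GraderGenerator` name itself appears in this diff):

```python
# Hypothetical sketch only: the import path, constructor arguments, and
# method names below are assumptions for illustration, not the documented
# OpenJudge API. Only the GraderGenerator name comes from the diff above.
from openjudge.generator import GraderGenerator  # assumed import path

# A few annotated examples the generator can distill rubrics from:
# each pairs a query/response with a human judgment it should reproduce.
annotated_data = [
    {"query": "Summarize the article.", "response": "Concise, faithful summary...", "label": 1},
    {"query": "Summarize the article.", "response": "Off-topic rambling...", "label": 0},
]

generator = GraderGenerator(model="qwen-max")         # assumed parameter
grader = generator.generate(samples=annotated_data)   # assumed method

# The resulting grader would then be used like any other OpenJudge grader.
print(grader.grade(query="Summarize the article.", response="A new candidate answer."))  # assumed method
```
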
@@ -163,17 +165,18 @@ if __name__ == "__main__":
 
 ## 🔗 Integrations
 
-Seamlessly connect OpenJudge with mainstream observability and training platforms, with more integrations on the way:
+Seamlessly connect OpenJudge with mainstream observability and training platforms:
 
-| Category | Status | Platforms |
-|:---------|:------:|:----------|
-| **Observability** | 🟡 In Progress | [LangSmith](https://smith.langchain.com/), [LangFuse](https://langfuse.com/), [Arize Phoenix](https://github.com/Arize-ai/phoenix) |
-| **Training** | 🔵 Planned | [verl](https://github.com/volcengine/verl), [Trinity-RFT](https://github.com/modelscope/Trinity-RFT) |
+| Category | Platform | Status | Documentation |
+|:---------|:---------|:------:|:--------------|
+| **Observability** | [LangSmith](https://smith.langchain.com/) | ✅ Available | 👉 [LangSmith Integration Guide](https://modelscope.github.io/OpenJudge/integrations/langsmith/) |
+| | [Langfuse](https://langfuse.com/) | ✅ Available | 👉 [Langfuse Integration Guide](https://modelscope.github.io/OpenJudge/integrations/langfuse/) |
+| | Other frameworks | 🔵 Planned ||
+| **Training** | [verl](https://github.com/volcengine/verl) | 🟡 In Progress ||
+| | [Trinity-RFT](https://github.com/modelscope/Trinity-RFT) | 🔵 Planned ||
 
 > 💬 Have a framework you'd like us to prioritize? [Open an Issue](https://github.com/modelscope/OpenJudge/issues)!
 
-
-
 ---
 
 ## 🤝 Contributing
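
The Integrations table above marks LangSmith as available; per the docs/index.md card later in this commit, the integration wraps OpenJudge graders as LangSmith evaluators. A rough sketch of that shape, using the langsmith SDK's custom-evaluator convention (a callable over run/example returning a key/score dict); the OpenJudge grader call is left as a placeholder since its exact interface isn't shown in this diff:

```python
# Sketch of wrapping a grader as a LangSmith evaluator. The langsmith
# `evaluate` helper and evaluator signature follow the SDK's documented
# custom-evaluator convention; the grading logic is a placeholder for an
# OpenJudge grader call (interface not shown in this diff).
from langsmith.evaluation import evaluate

def relevance_evaluator(run, example) -> dict:
    """Score one run's output against its example with a custom grader."""
    answer = run.outputs.get("answer", "") if run.outputs else ""
    # score = my_openjudge_grader.grade(query=example.inputs["question"], response=answer)
    score = 1.0 if answer else 0.0  # placeholder for the real grader score
    return {"key": "relevance", "score": score}

results = evaluate(
    lambda inputs: {"answer": "..."},  # the target system being evaluated
    data="my-dataset",                 # an existing LangSmith dataset name (assumed)
    evaluators=[relevance_evaluator],
)
```
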

README_zh.md

Lines changed: 14 additions & 10 deletions
@@ -97,14 +97,15 @@ OpenJudge 将评估指标和奖励信号统一为标准化的 **Grader** 接口
 
 ### 🛠️ 灵活的评分器构建方法
 选择适合您需求的构建方法:
-* **自定义:** 轻松扩展或修改预定义的评分器以满足您的特定需求。👉 [自定义评分器开发指南](https://modelscope.github.io/OpenJudge/building_graders/create_custom_graders/)
-* **生成评估标准:** 需要评估标准但不想手动编写?使用 **Simple Rubric**(基于任务描述)或 **Iterative Rubric**(基于标注数据)自动生成白盒评估标准。👉 [生成评估标准作为 Grader](https://modelscope.github.io/OpenJudge/building_graders/generate_rubrics_as_graders/)
-* **训练评判模型(即将推出🚀):** 对于大规模和专业化场景,我们正在开发训练专用评判模型的能力。SFT、Bradley-Terry 模型和强化学习工作流的支持即将推出,帮助您构建高性能、领域特定的评分器。
+* **自定义:** 需求明确但没有现成的评分器?如果您有明确的规则或逻辑,使用我们的 Python 接口或 Prompt 模板快速定义您自己的评分器。👉 [自定义评分器开发指南](https://modelscope.github.io/OpenJudge/building_graders/create_custom_graders/)
+* **零样本评估标准生成:** 不确定使用什么标准,也没有标注数据?只需提供任务描述和可选的示例查询,LLM 将自动为您生成评估标准。非常适合快速原型开发。👉 [零样本评估标准生成指南](https://modelscope.github.io/OpenJudge/building_graders/generate_rubrics_as_graders/#simple-rubric-zero-shot-generation)
+* **数据驱动的评估标准生成:** 需求模糊但有少量样例?使用 GraderGenerator 从您的标注数据中自动总结评估标准,并生成基于 LLM 的评分器。👉 [数据驱动评估标准生成指南](https://modelscope.github.io/OpenJudge/building_graders/generate_rubrics_as_graders/#iterative-rubric-data-driven-generation)
+* **训练评判模型:** 拥有大量数据且需要极致性能?使用我们的训练流程来训练专用的评判模型。适用于基于 Prompt 的评分无法满足的复杂场景。👉 [训练评判模型](https://modelscope.github.io/OpenJudge/building_graders/training_judge_models/)
 
 
-### 🔌 轻松集成(🚧 即将推出)
+### 🔌 轻松集成
 
-我们正在积极构建与主流可观测性平台和训练框架的无缝连接器。敬请期待!→ 查看 [集成](#-集成)
+如果您正在使用主流可观测性平台(如 **LangSmith** 或 **Langfuse**),我们提供无缝集成方案,可增强平台的评测器和自动评测能力。我们也正在构建与训练框架(如 **verl**)的集成方案。👉 查看 [集成](#-集成) 了解详情
 
 ----
 ## 最新动态
@@ -163,12 +164,15 @@ if __name__ == "__main__":
 
 ## 🔗 集成
 
-无缝连接 OpenJudge 与主流可观测性和训练平台,更多集成即将推出
+无缝连接 OpenJudge 与主流可观测性和训练平台:
 
-| 类别 | 状态 | 平台 |
-|:---------|:------:|:----------|
-| **可观测性** | 🟡 进行中 | [LangSmith](https://smith.langchain.com/)、[LangFuse](https://langfuse.com/)、[Arize Phoenix](https://github.com/Arize-ai/phoenix) |
-| **训练** | 🔵 计划中 | [verl](https://github.com/volcengine/verl)、[Trinity-RFT](https://github.com/modelscope/Trinity-RFT) |
+| 类别 | 平台 | 状态 | 文档 |
+|:---------|:---------|:------:|:--------------|
+| **可观测性** | [LangSmith](https://smith.langchain.com/) | ✅ 可用 | 👉 [LangSmith 集成指南](https://modelscope.github.io/OpenJudge/integrations/langsmith/) |
+| | [Langfuse](https://langfuse.com/) | ✅ 可用 | 👉 [Langfuse 集成指南](https://modelscope.github.io/OpenJudge/integrations/langfuse/) |
+| | 其他框架 | 🔵 计划中 ||
+| **训练** | [verl](https://github.com/volcengine/verl) | 🟡 进行中 ||
+| | [Trinity-RFT](https://github.com/modelscope/Trinity-RFT) | 🔵 计划中 ||
 
 > 💬 有您希望我们优先支持的框架吗?[提交 Issue](https://github.com/modelscope/OpenJudge/issues)
 

docs/index.md

Lines changed: 48 additions & 27 deletions
@@ -19,11 +19,12 @@ OpenJudge unifies evaluation metrics and reward signals into a single, standardi
 - **Quality Assurance:** Built for reliability. Every grader comes with benchmark datasets and pytest integration for immediate quality validation. <a href="https://huggingface.co/datasets/agentscope-ai/OpenJudge" class="feature-link" target="_blank"> View Benchmark Datasets<span class="link-arrow">→</span></a>
 
 + **Flexible Grader Building**: Choose the build method that fits your requirements:
-- **Customization:** Easily extend or modify pre-defined graders to fit your specific needs. <a href="building_graders/create_custom_graders/" class="feature-link">Custom Grader Development Guide <span class="link-arrow">→</span></a>
-- **Generate Rubrics:** Need evaluation criteria but don't want to write them manually? Use **Simple Rubric** (from task description) or **Iterative Rubric** (from labeled data) to automatically generate white-box evaluation rubrics. <a href="building_graders/generate_rubrics_as_graders/" class="feature-link">Generate Rubrics as Graders <span class="link-arrow">→</span></a>
-- **Training Judge Models:** For high-scale and specialized scenarios, we are developing the capability to train dedicated Judge models. Support for SFT, Bradley-Terry models, and Reinforcement Learning workflows is on the way to help you build high-performance, domain-specific graders. <span class="badge-wip">🚧 Coming Soon</span>
+- **Customization:** Clear requirements, but no existing grader? If you have explicit rules or logic, use our Python interfaces or Prompt templates to quickly define your own grader. <a href="building_graders/create_custom_graders/" class="feature-link">Custom Grader Development Guide <span class="link-arrow">→</span></a>
+- **Zero-shot Rubrics Generation:** Not sure what criteria to use, and no labeled data yet? Just provide a task description and optional sample queries—the LLM will automatically generate evaluation rubrics for you. Ideal for rapid prototyping. <a href="building_graders/generate_rubrics_as_graders/#simple-rubric-zero-shot-generation" class="feature-link">Zero-shot Rubrics Generation Guide <span class="link-arrow">→</span></a>
+- **Data-driven Rubrics Generation:** Ambiguous requirements, but have a few examples? Use the GraderGenerator to automatically summarize evaluation rubrics from your annotated data and generate an LLM-based grader. <a href="building_graders/generate_rubrics_as_graders/#iterative-rubric-data-driven-generation" class="feature-link">Data-driven Rubrics Generation Guide <span class="link-arrow">→</span></a>
+- **Training Judge Models:** Massive data and need peak performance? Use our training pipeline to train a dedicated Judge Model. This is ideal for complex scenarios where prompt-based grading falls short. <a href="building_graders/training_judge_models/" class="feature-link">Train Judge Models <span class="link-arrow">→</span></a>
 
-+ **Easy Integration**: We're actively building seamless connectors for mainstream observability platforms and training frameworks. Stay tuned!<span class="badge-wip">🚧 Coming Soon</span>
++ **Easy Integration**: Using mainstream observability platforms like **LangSmith** or **Langfuse**? We offer seamless integration to enhance their evaluators and automated evaluation capabilities. We're also building integrations with training frameworks like **verl**.
 
 </div>
 
@@ -33,26 +34,38 @@ OpenJudge unifies evaluation metrics and reward signals into a single, standardi
 
 <div class="card-grid">
 
-<a href="get_started/evaluate_ai_agents/" class="feature-card">
-<div class="card-header card-header-lg">
+<a href="applications/zero_shot_evaluation/" class="feature-card-sm">
+<div class="card-header">
+<img src="https://unpkg.com/lucide-static@latest/icons/zap.svg" class="card-icon card-icon-general">
+<h3>Zero-Shot Evaluation</h3>
+</div>
+<p class="card-desc">
+<b>Compare models without test data:</b> Generate queries, collect responses, and rank via pairwise evaluation.
+</p>
+</a>
+
+<a href="get_started/evaluate_ai_agents/" class="feature-card-sm">
+<div class="card-header">
 <img src="https://unpkg.com/lucide-static@latest/icons/bot.svg" class="card-icon card-icon-agent">
 <h3>Evaluate An AI Agent</h3>
 </div>
-<p class="card-desc card-desc-lg">
-<b>Comprehensive evaluation for AI Agents:</b> Learn to evaluate the full lifecycle—including final response, trajectory, tool usage, plan, memory, reflection, observation—using OpenJudge Graders.
+<p class="card-desc">
+<b>Agent lifecycle evaluation:</b> Assess response, trajectory, tool usage, planning, memory, and reflection.
 </p>
 </a>
 
-<a href="get_started/build_reward/" class="feature-card">
-<div class="card-header card-header-lg">
+<a href="get_started/build_reward/" class="feature-card-sm">
+<div class="card-header">
 <img src="https://unpkg.com/lucide-static@latest/icons/brain-circuit.svg" class="card-icon card-icon-tool">
 <h3>Build Rewards for Training</h3>
 </div>
-<p class="card-desc card-desc-lg">
-<b>Construct High-Quality Reward Signals:</b> Create robust reward functions for model and agent alignment by aggregating diverse graders with custom weighting and high-concurrency support.
+<p class="card-desc">
+<b>Quality reward signals:</b> Aggregate graders with custom weighting for model alignment.
 </p>
 </a>
 
+
+
 </div>
 
 
@@ -141,41 +154,49 @@ OpenJudge unifies evaluation metrics and reward signals into a single, standardi
 
 <a href="building_graders/generate_rubrics_as_graders/" class="feature-card-sm">
 <div class="card-header">
-<img src="https://unpkg.com/lucide-static@latest/icons/database.svg" class="card-icon card-icon-data">
-<h3>Data-Driven Rubrics</h3>
+<img src="https://unpkg.com/lucide-static@latest/icons/sparkles.svg" class="card-icon card-icon-data">
+<h3>Generate Rubrics</h3>
 </div>
 <p class="card-desc">
-<b>Ambiguous requirements, but have few examples?</b> Use the GraderGenerator to automatically summarize evaluation Rubrics from your annotated data, and generate a llm-based grader.
+<b>Auto-generate evaluation criteria.</b> Use Zero-Shot generation from task descriptions, or Data-Driven generation to learn rubrics from labeled preference data.
 </p>
 </a>
 
-<div class="feature-card-wip">
+<a href="building_graders/training_judge_models/" class="feature-card-sm">
 <div class="card-header">
 <img src="https://unpkg.com/lucide-static@latest/icons/scale.svg" class="card-icon card-icon-integration">
-<h3>Trainable Judge Model</h3>
+<h3>Train Judge Models</h3>
 </div>
-<span class="badge-wip">🚧 Work in Progress</span>
 <p class="card-desc">
-<b>Massive data and need peak performance?</b> Use our training pipeline to train a dedicated Judge Model. This is ideal for complex scenarios where prompt-based grading falls short.
+<b>Massive data and need peak performance?</b> Train dedicated judge models using SFT, Bradley-Terry, or GRPO. Supports both scalar rewards and generative evaluation with reasoning.
 </p>
-</div>
+</a>
 
 </div>
 
 ### Integrations
 
 <div class="card-grid">
 
-<div class="feature-card-wip">
+<a href="integrations/langsmith/" class="feature-card">
 <div class="card-header">
-<img src="https://unpkg.com/lucide-static@latest/icons/bar-chart-3.svg" class="card-icon card-icon-integration">
-<h3>Evaluation Frameworks</h3>
-<span class="badge-wip">🚧 Work in Progress</span>
+<img src="https://unpkg.com/lucide-static@latest/icons/telescope.svg" class="card-icon card-icon-integration">
+<h3>LangSmith</h3>
 </div>
 <p class="card-desc">
-Seamlessly connect with mainstream platforms like <strong>LangSmith</strong> and <strong>LangFuse</strong>. Streamline your evaluation pipelines and monitor agent performance with flexible APIs.
+Build external evaluation pipelines for LangSmith. Wrap OpenJudge graders as LangSmith evaluators and run batch evaluations with GradingRunner.
 </p>
-</div>
+</a>
+
+<a href="integrations/langfuse/" class="feature-card">
+<div class="card-header">
+<img src="https://unpkg.com/lucide-static@latest/icons/activity.svg" class="card-icon card-icon-data">
+<h3>Langfuse</h3>
+</div>
+<p class="card-desc">
+Fetch traces from Langfuse, evaluate with OpenJudge graders, and push scores back. Supports batch processing and score aggregation.
+</p>
+</a>
 
 <div class="feature-card-wip">
 <div class="card-header">
@@ -184,7 +205,7 @@ OpenJudge unifies evaluation metrics and reward signals into a single, standardi
 <span class="badge-wip">🚧 Work in Progress</span>
 </div>
 <p class="card-desc">
-Directly integrate into training loops such as <strong>VERL</strong>. Use Graders as high-quality reward functions for RLHF/RLAIF to align models effectively.
+Directly integrate into training loops such as <strong>VERL</strong>. Use Graders as high-quality reward functions for fine-tuning to align models effectively.
 </p>
 </div>
 
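
The Langfuse card added above describes a fetch-evaluate-push-back loop. A rough sketch of that loop, assuming the v2-style Langfuse Python client (`fetch_traces` / `score`); the grading step is a placeholder where an OpenJudge grader would run, since its interface isn't shown in this diff:

```python
# Sketch of the Langfuse round trip described on the card: fetch traces,
# score them, push the scores back. Client methods follow the v2-style
# Langfuse Python SDK; the grading step is a placeholder for an OpenJudge
# grader whose interface is not shown in this diff.
from langfuse import Langfuse

langfuse = Langfuse()  # reads LANGFUSE_PUBLIC_KEY / LANGFUSE_SECRET_KEY / LANGFUSE_HOST from env

traces = langfuse.fetch_traces(limit=20).data
for trace in traces:
    # score_value = my_openjudge_grader.grade(...)  # placeholder for the real grader call
    score_value = 0.5  # stand-in score in [0, 1]
    langfuse.score(
        trace_id=trace.id,
        name="openjudge_quality",
        value=score_value,
    )

langfuse.flush()  # make sure queued scores are sent before exiting
```
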

docs/stylesheets/feature-cards.css

Lines changed: 2 additions & 2 deletions
@@ -44,7 +44,7 @@
 /* Three column cards */
 .feature-card-sm {
 flex: 1 1 30%;
-min-width: 250px;
+min-width: 280px;
 text-decoration: none;
 color: inherit;
 border: 1px solid var(--md-default-fg-color--lightest, #e0e0e0);
@@ -101,7 +101,7 @@
 
 .card-header h3 {
 margin: 0 !important;
-font-size: 16px;
+font-size: 15px;
 font-weight: 600;
 white-space: nowrap !important;
 display: inline !important;
