From 180682ab5b450b8cb3d469306db4a9bd9208558e Mon Sep 17 00:00:00 2001
From: zhuohua
Date: Fri, 9 Jan 2026 14:17:04 +0800
Subject: [PATCH 1/2] docs: update index page with new features and integrations

- Update Train Judge Models card (remove WIP badge, add link)
- Update Generate Rubrics card to cover both Zero-Shot and Data-Driven approaches
- Add LangSmith and Langfuse integration cards (remove WIP badges)
- Add Zero-Shot Evaluation to Quick Tutorials section
- Reduce card header font size for better text fitting
- Update Training Frameworks description
---
 README.md | 27 ++++++-----
 README_zh.md | 24 ++++++----
 docs/index.md | 75 +++++++++++++++++++-----------
 docs/stylesheets/feature-cards.css | 4 +-
 4 files changed, 79 insertions(+), 51 deletions(-)

diff --git a/README.md b/README.md
index f9f97400..01ecdaf1 100644
--- a/README.md
+++ b/README.md
@@ -97,14 +97,16 @@ Access **50+ production-ready graders** featuring a comprehensive taxonomy, rigo
 ### 🛠️ Flexible Grader Building Methods
 Choose the build method that fits your requirements:
-* **Customization:** Easily extend or modify pre-defined graders to fit your specific needs. 👉 [Custom Grader Development Guide](https://modelscope.github.io/OpenJudge/building_graders/create_custom_graders/)
-* **Generate Rubrics:** Need evaluation criteria but don't want to write them manually? Use **Simple Rubric** (from task description) or **Iterative Rubric** (from labeled data) to automatically generate white-box evaluation rubrics. 👉 [Generate Rubrics as Graders](https://modelscope.github.io/OpenJudge/building_graders/generate_rubrics_as_graders/)
-* **Training Judge Models ( Coming Soon🚀):** For high-scale and specialized scenarios, we are developing the capability to train dedicated Judge models. Support for SFT, Bradley-Terry models, and Reinforcement Learning workflows is on the way to help you build high-performance, domain-specific graders.
+* **Customization:** Clear requirements, but no existing grader? If you have explicit rules or logic, use our Python interfaces or Prompt templates to quickly define your own grader. 👉 [Custom Grader Development Guide](https://modelscope.github.io/OpenJudge/building_graders/create_custom_graders/)
+* **Zero-shot Rubrics Generation:** Not sure what criteria to use, and no labeled data yet? Just provide a task description and optional sample queries—the LLM will automatically generate evaluation rubrics for you. Ideal for rapid prototyping when you want to get started immediately. 👉 [Zero-shot Rubrics Generation Guide](https://modelscope.github.io/OpenJudge/building_graders/generate_rubrics_as_graders/#simple-rubric-zero-shot-generation)
+* **Data-driven Rubrics Generation:** Ambiguous requirements, but have a few examples? Use the GraderGenerator to automatically
+summarize evaluation rubrics from your annotated data, and generate an LLM-based grader. 👉 [Data-driven Rubrics Generation Guide](https://modelscope.github.io/OpenJudge/building_graders/generate_rubrics_as_graders/#iterative-rubric-data-driven-generation)
+* **Training Judge Models:** Massive data and need peak performance? Use our training pipeline to train a dedicated Judge Model. This is ideal for complex scenarios where prompt-based grading falls short. 👉 [Train Judge Models](https://modelscope.github.io/OpenJudge/building_graders/training_judge_models/)

-### 🔌 Easy Integration (🚧 Coming Soon)
+### 🔌 Easy Integration

-We're actively building seamless connectors for mainstream observability platforms and training frameworks. Stay tuned!
→ See [Integrations](#-integrations) +Using mainstream observability platforms like **LangSmith** or **Langfuse**? We offer seamless integration to enhance their evaluators and automated evaluation capabilities. We're also building integrations with training frameworks like **verl**. 👉 See [Integrations](#-integrations) for details ---- ## News @@ -163,17 +165,18 @@ if __name__ == "__main__": ## 🔗 Integrations -Seamlessly connect OpenJudge with mainstream observability and training platforms, with more integrations on the way: +Seamlessly connect OpenJudge with mainstream observability and training platforms: -| Category | Status | Platforms | -|:---------|:------:|:----------| -| **Observability** | 🟡 In Progress | [LangSmith](https://smith.langchain.com/), [LangFuse](https://langfuse.com/), [Arize Phoenix](https://github.com/Arize-ai/phoenix) | -| **Training** | 🔵 Planned | [verl](https://github.com/volcengine/verl), [Trinity-RFT](https://github.com/modelscope/Trinity-RFT) | +| Category | Platform | Status | Documentation | +|:---------|:---------|:------:|:--------------| +| **Observability** | [LangSmith](https://smith.langchain.com/) | ✅ Available | 👉 [LangSmith Integration Guide](https://modelscope.github.io/OpenJudge/integrations/langsmith/) | +| | [Langfuse](https://langfuse.com/) | ✅ Available | 👉 [Langfuse Integration Guide](https://modelscope.github.io/OpenJudge/integrations/langfuse/) | +| | Other frameworks | 🔵 Planned | — | +| **Training** | [verl](https://github.com/volcengine/verl) | 🟡 In Progress | — | +| | [Trinity-RFT](https://github.com/modelscope/Trinity-RFT) | 🔵 Planned | — | > 💬 Have a framework you'd like us to prioritize? [Open an Issue](https://github.com/modelscope/OpenJudge/issues)! - - --- ## 🤝 Contributing diff --git a/README_zh.md b/README_zh.md index accf7e28..83cf2905 100644 --- a/README_zh.md +++ b/README_zh.md @@ -97,14 +97,15 @@ OpenJudge 将评估指标和奖励信号统一为标准化的 **Grader** 接口 ### 🛠️ 灵活的评分器构建方法 选择适合您需求的构建方法: -* **自定义:** 轻松扩展或修改预定义的评分器以满足您的特定需求。👉 [自定义评分器开发指南](https://modelscope.github.io/OpenJudge/building_graders/create_custom_graders/) -* **生成评估标准:** 需要评估标准但不想手动编写?使用 **Simple Rubric**(基于任务描述)或 **Iterative Rubric**(基于标注数据)自动生成白盒评估标准。👉 [生成评估标准作为 Grader](https://modelscope.github.io/OpenJudge/building_graders/generate_rubrics_as_graders/) -* **训练评判模型(即将推出🚀):** 对于大规模和专业化场景,我们正在开发训练专用评判模型的能力。SFT、Bradley-Terry 模型和强化学习工作流的支持即将推出,帮助您构建高性能、领域特定的评分器。 +* **自定义:** 需求明确但没有现成的评分器?如果您有明确的规则或逻辑,使用我们的 Python 接口或 Prompt 模板快速定义您自己的评分器。👉 [自定义评分器开发指南](https://modelscope.github.io/OpenJudge/building_graders/create_custom_graders/) +* **零样本评估标准生成:** 不确定使用什么标准,也没有标注数据?只需提供任务描述和可选的示例查询,LLM 将自动为您生成评估标准。非常适合快速原型开发。👉 [零样本评估标准生成指南](https://modelscope.github.io/OpenJudge/building_graders/generate_rubrics_as_graders/#simple-rubric-zero-shot-generation) +* **数据驱动的评估标准生成:** 需求模糊但有少量样例?使用 GraderGenerator 从您的标注数据中自动总结评估标准,并生成基于 LLM 的评分器。👉 [数据驱动评估标准生成指南](https://modelscope.github.io/OpenJudge/building_graders/generate_rubrics_as_graders/#iterative-rubric-data-driven-generation) +* **训练评判模型:** 拥有大量数据且需要极致性能?使用我们的训练流程来训练专用的评判模型。适用于基于 Prompt 的评分无法满足的复杂场景。👉 [训练评判模型](https://modelscope.github.io/OpenJudge/building_graders/training_judge_models/) -### 🔌 轻松集成(🚧 即将推出) +### 🔌 轻松集成 -我们正在积极构建与主流可观测性平台和训练框架的无缝连接器。敬请期待!→ 查看 [集成](#-集成) +如果您正在使用主流可观测性平台(如 **LangSmith** 或 **Langfuse**),我们提供无缝集成方案,可增强平台的评测器和自动评测能力。我们也正在构建与训练框架(如 **verl**)的集成方案。👉 查看 [集成](#-集成) 了解详情 ---- ## 最新动态 @@ -163,12 +164,15 @@ if __name__ == "__main__": ## 🔗 集成 -无缝连接 OpenJudge 与主流可观测性和训练平台,更多集成即将推出: +无缝连接 OpenJudge 与主流可观测性和训练平台: -| 类别 | 状态 
| 平台 |
|:---------|:------:|:----------|
-| **可观测性** | 🟡 进行中 | [LangSmith](https://smith.langchain.com/)、[LangFuse](https://langfuse.com/)、[Arize Phoenix](https://github.com/Arize-ai/phoenix) |
-| **训练** | 🔵 计划中 | [verl](https://github.com/volcengine/verl)、[Trinity-RFT](https://github.com/modelscope/Trinity-RFT) |
+| 类别 | 平台 | 状态 | 文档 |
+|:---------|:---------|:------:|:--------------|
+| **可观测性** | [LangSmith](https://smith.langchain.com/) | ✅ 可用 | 👉 [LangSmith 集成指南](https://modelscope.github.io/OpenJudge/integrations/langsmith/) |
+| | [Langfuse](https://langfuse.com/) | ✅ 可用 | 👉 [Langfuse 集成指南](https://modelscope.github.io/OpenJudge/integrations/langfuse/) |
+| | 其他框架 | 🔵 计划中 | — |
+| **训练** | [verl](https://github.com/volcengine/verl) | 🟡 进行中 | — |
+| | [Trinity-RFT](https://github.com/modelscope/Trinity-RFT) | 🔵 计划中 | — |

> 💬 有您希望我们优先支持的框架吗?[提交 Issue](https://github.com/modelscope/OpenJudge/issues)!

diff --git a/docs/index.md b/docs/index.md
index 6be0b039..b6716caa 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -19,11 +19,12 @@ OpenJudge unifies evaluation metrics and reward signals into a single, standardi
- **Quality Assurance:** Built for reliability. Every grader comes with benchmark datasets and pytest integration for immediate quality validation. View Benchmark Datasets
+ **Flexible Grader Building**: Choose the build method that fits your requirements:
- - **Customization:** Easily extend or modify pre-defined graders to fit your specific needs. Custom Grader Development Guide
- - **Generate Rubrics:** Need evaluation criteria but don't want to write them manually? Use **Simple Rubric** (from task description) or **Iterative Rubric** (from labeled data) to automatically generate white-box evaluation rubrics. Generate Rubrics as Graders
- - **Training Judge Models:** For high-scale and specialized scenarios, we are developing the capability to train dedicated Judge models. Support for SFT, Bradley-Terry models, and Reinforcement Learning workflows is on the way to help you build high-performance, domain-specific graders. 🚧 Coming Soon
+ - **Customization:** Clear requirements, but no existing grader? If you have explicit rules or logic, use our Python interfaces or Prompt templates to quickly define your own grader. Custom Grader Development Guide
+ - **Zero-shot Rubrics Generation:** Not sure what criteria to use, and no labeled data yet? Just provide a task description and optional sample queries—the LLM will automatically generate evaluation rubrics for you. Ideal for rapid prototyping. Zero-shot Rubrics Generation Guide
+ - **Data-driven Rubrics Generation:** Ambiguous requirements, but have a few examples? Use the GraderGenerator to automatically summarize evaluation rubrics from your annotated data, and generate an LLM-based grader. Data-driven Rubrics Generation Guide
+ - **Training Judge Models:** Massive data and need peak performance? Use our training pipeline to train a dedicated Judge Model. This is ideal for complex scenarios where prompt-based grading falls short. Train Judge Models

-+ **Easy Integration**: We're actively building seamless connectors for mainstream observability platforms and training frameworks. Stay tuned!🚧 Coming Soon
++ **Easy Integration**: Using mainstream observability platforms like **LangSmith** or **Langfuse**? We offer seamless integration to enhance their evaluators and automated evaluation capabilities. We're also building integrations with training frameworks like **verl**.
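The two rubric-generation modes described above lend themselves to a short illustration. Below is a minimal sketch: `GraderGenerator` is the name used in the text, but the import path, constructor arguments, and call signatures are assumptions made for illustration, not OpenJudge's documented API; consult the linked guides for the real interface.

```python
# Illustrative sketch only: every import path, parameter, and method below
# is a hypothetical stand-in for OpenJudge's real rubric-generation API.
from openjudge.generator import GraderGenerator  # assumed import path

generator = GraderGenerator(model="qwen-max")  # any judge-capable LLM

# Zero-shot: rubrics from a task description plus optional sample queries.
zero_shot_grader = generator.generate(
    task_description="Judge whether a support reply is polite and actually resolves the issue.",
    sample_queries=["My order arrived broken, what should I do?"],
)

# Data-driven: rubrics summarized from a handful of annotated examples.
data_driven_grader = generator.generate(
    examples=[
        {"query": "Where is my refund?", "response": "It was issued yesterday.", "label": "good"},
        {"query": "Where is my refund?", "response": "Not my problem.", "label": "bad"},
    ],
)
```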
@@ -33,26 +34,38 @@ OpenJudge unifies evaluation metrics and reward signals into a single, standardi
- -
+ +
+ +

Zero-Shot Evaluation

+
+

+ Compare models without test data: Generate queries, collect responses, and rank via pairwise evaluation. +
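A minimal sketch of that three-step loop follows; every helper name in it is hypothetical rather than a confirmed OpenJudge API.

```python
# Hypothetical sketch of the generate -> collect -> pairwise-rank loop;
# generate_queries and pairwise are illustrative names, not documented APIs.
def zero_shot_compare(judge, model_a, model_b, task: str, n: int = 20) -> dict:
    """Rank two models on auto-generated queries via pairwise judging."""
    queries = judge.generate_queries(task, n=n)       # 1. generate test queries
    wins = {"a": 0, "b": 0, "tie": 0}
    for q in queries:
        resp_a, resp_b = model_a(q), model_b(q)       # 2. collect responses
        wins[judge.pairwise(q, resp_a, resp_b)] += 1  # 3. verdict: "a" / "b" / "tie"
    return wins
```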

+
+ + +

Evaluate An AI Agent

-

- Comprehensive evaluation for AI Agents: Learn to evaluate the full lifecycle—including final response, trajectory, tool usage, plan, memory, reflection, observation—using OpenJudge Graders. +

+ Agent lifecycle evaluation: Assess response, trajectory, tool usage, planning, memory, and reflection.

- - @@ -141,24 +154,23 @@ OpenJudge unifies evaluation metrics and reward signals into a single, standardi
- -

Data-Driven Rubrics

+ +

Generate Rubrics

- Ambiguous requirements, but have few examples? Use the GraderGenerator to automatically summarize evaluation Rubrics from your annotated data, and generate a llm-based grader. + Auto-generate evaluation criteria. Use Zero-Shot generation from task descriptions, or Data-Driven generation to learn rubrics from labeled preference data.

- +
@@ -166,16 +178,25 @@ OpenJudge unifies evaluation metrics and reward signals into a single, standardi
- + + + +
+ +

Langfuse

+
+

+ Fetch traces from Langfuse, evaluate with OpenJudge graders, and push scores back. Supports batch processing and score aggregation. +
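For concreteness, here is a minimal sketch of that loop. It assumes the v2-style Langfuse Python SDK (`fetch_traces`, `score`); the grader object and its result fields are illustrative stand-ins, not a specific documented OpenJudge API.

```python
# Sketch of the fetch -> evaluate -> push-back loop, assuming Langfuse's
# v2-style Python SDK; the grader call and result fields are placeholders.
from langfuse import Langfuse

def push_scores(grader, limit: int = 50) -> None:
    """Score recent Langfuse traces with a grader and write scores back."""
    langfuse = Langfuse()  # reads LANGFUSE_PUBLIC_KEY / SECRET_KEY / HOST
    for trace in langfuse.fetch_traces(limit=limit).data:
        result = grader.evaluate(query=trace.input, response=trace.output)  # assumed grader API
        langfuse.score(
            trace_id=trace.id,
            name="openjudge_quality",
            value=float(result.score),        # assumed result field
            comment=str(result.explanation),  # assumed result field
        )
```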

+
@@ -184,7 +205,7 @@ OpenJudge unifies evaluation metrics and reward signals into a single, standardi 🚧 Work in Progress

- Directly integrate into training loops such as VERL. Use Graders as high-quality reward functions for RLHF/RLAIF to align models effectively.
+ Directly integrate into training loops such as verl. Use Graders as high-quality reward functions for fine-tuning to align models effectively.
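To illustrate, a grader can be exposed through the reward-function hook such trainers load. The `compute_score` signature below follows verl's custom reward-function convention; the grader object and its `evaluate()` call are hypothetical placeholders for whichever OpenJudge grader you train with.

```python
# Sketch: plugging a grader into verl's custom reward hook. The signature
# follows verl's compute_score convention; the grader is a placeholder.

GRADER = None  # replace with an OpenJudge grader instance before training


def compute_score(data_source, solution_str, ground_truth, extra_info=None):
    """Return a scalar reward for one sampled completion."""
    query = (extra_info or {}).get("query", "")
    result = GRADER.evaluate(  # assumed grader API
        query=query,
        response=solution_str,
        reference=ground_truth,
    )
    return float(result.score)  # assumed result field
```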

diff --git a/docs/stylesheets/feature-cards.css b/docs/stylesheets/feature-cards.css
index 63280f17..845643b0 100644
--- a/docs/stylesheets/feature-cards.css
+++ b/docs/stylesheets/feature-cards.css
@@ -44,7 +44,7 @@
 /* Three column cards */
 .feature-card-sm {
   flex: 1 1 30%;
-  min-width: 250px;
+  min-width: 280px;
   text-decoration: none;
   color: inherit;
   border: 1px solid var(--md-default-fg-color--lightest, #e0e0e0);
@@ -101,7 +101,7 @@
 .card-header h3 {
   margin: 0 !important;
-  font-size: 16px;
+  font-size: 15px;
   font-weight: 600;
   white-space: nowrap !important;
   display: inline !important;

From 73952b4b834d71c9b23cae33b637948c6a035e23 Mon Sep 17 00:00:00 2001
From: zhuohua
Date: Fri, 9 Jan 2026 14:22:19 +0800
Subject: [PATCH 2/2] docs: fix pre-commit error

---
 docs/index.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/index.md b/docs/index.md
index b6716caa..c80ddc67 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -64,7 +64,7 @@ OpenJudge unifies evaluation metrics and reward signals into a single, standardi

- +