* docs: update index page with new features and integrations
- Update Train Judge Models card (remove WIP badge, add link)
- Update Generate Rubrics card to cover both Zero-Shot and Data-Driven approaches
- Add LangSmith and Langfuse integration cards (remove WIP badges)
- Add Zero-Shot Evaluation to Quick Tutorials section
- Reduce card header font size for better text fitting
- Update Training Frameworks description
* fix pre-commit error
Choose the build method that fits your requirements:
- * **Customization:** Easily extend or modify pre-defined graders to fit your specific needs. 👉 [Custom Grader Development Guide](https://modelscope.github.io/OpenJudge/building_graders/create_custom_graders/)
- * **Generate Rubrics:** Need evaluation criteria but don't want to write them manually? Use **Simple Rubric** (from task description) or **Iterative Rubric** (from labeled data) to automatically generate white-box evaluation rubrics. 👉 [Generate Rubrics as Graders](https://modelscope.github.io/OpenJudge/building_graders/generate_rubrics_as_graders/)
- * **Training Judge Models (Coming Soon 🚀):** For high-scale and specialized scenarios, we are developing the capability to train dedicated Judge models. Support for SFT, Bradley-Terry models, and Reinforcement Learning workflows is on the way to help you build high-performance, domain-specific graders.
+ * **Customization:** Clear requirements, but no existing grader? If you have explicit rules or logic, use our Python interfaces or prompt templates to quickly define your own grader. 👉 [Custom Grader Development Guide](https://modelscope.github.io/OpenJudge/building_graders/create_custom_graders/)
+ * **Zero-shot Rubrics Generation:** Not sure what criteria to use, and no labeled data yet? Just provide a task description and optional sample queries—the LLM will automatically generate evaluation rubrics for you. Ideal for rapid prototyping when you want to get started immediately. 👉 [Zero-shot Rubrics Generation Guide](https://modelscope.github.io/OpenJudge/building_graders/generate_rubrics_as_graders/#simple-rubric-zero-shot-generation)
+ * **Data-driven Rubrics Generation:** Ambiguous requirements, but have a few examples? Use the GraderGenerator to automatically summarize evaluation rubrics from your annotated data and generate an LLM-based grader. 👉 [Data-driven Rubrics Generation Guide](https://modelscope.github.io/OpenJudge/building_graders/generate_rubrics_as_graders/#iterative-rubric-data-driven-generation)
+ * **Training Judge Models:** Massive data and need peak performance? Use our training pipeline to train a dedicated Judge Model. This is ideal for complex scenarios where prompt-based grading falls short. 👉 [Train Judge Models](https://modelscope.github.io/OpenJudge/building_graders/training_judge_models/)
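To make the **Customization** bullet above concrete, here is a minimal sketch of a rule-based grader. OpenJudge's real base class and result type are not shown in this diff, so `BaseGrader` and `GraderResult` below are local stand-ins rather than the library's confirmed API:

```python
# Hypothetical sketch of a rule-based custom grader. BaseGrader and
# GraderResult are local stand-ins; OpenJudge's actual interfaces may
# differ, so follow the Custom Grader Development Guide for real code.
from dataclasses import dataclass


@dataclass
class GraderResult:
    score: float  # normalized to [0, 1]
    reason: str   # human-readable explanation


class BaseGrader:
    def grade(self, query: str, response: str) -> GraderResult:
        raise NotImplementedError


class MaxLengthGrader(BaseGrader):
    """Rule-based grader: penalize responses that exceed a length budget."""

    def __init__(self, max_chars: int = 500):
        self.max_chars = max_chars

    def grade(self, query: str, response: str) -> GraderResult:
        if len(response) <= self.max_chars:
            return GraderResult(1.0, "within length budget")
        overshoot = len(response) - self.max_chars
        # Linear penalty for overshooting the budget, floored at 0.
        score = max(0.0, 1.0 - overshoot / self.max_chars)
        return GraderResult(score, f"{overshoot} chars over budget")


if __name__ == "__main__":
    grader = MaxLengthGrader(max_chars=100)
    print(grader.grade("Summarize the report.", "A short summary."))
```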
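The **Zero-shot Rubrics Generation** bullet reduces to one prompt pattern: hand an LLM the task description plus optional sample queries and ask for rubrics. OpenJudge ships its own generator for this, so the sketch below only reproduces the pattern with a direct OpenAI-compatible call; the model name is a placeholder:

```python
# Zero-shot rubric generation sketch. The prompt pattern is the point;
# OpenJudge's own generator wraps this step, and "gpt-4o-mini" is just
# a placeholder model name.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

task_description = "Judge chatbot answers to customer-support questions."
sample_queries = ["How do I reset my password?", "Why was I charged twice?"]

prompt = (
    "You are designing an evaluation rubric.\n"
    f"Task: {task_description}\n"
    f"Example queries: {sample_queries}\n"
    "Produce 3-5 numbered, checkable rubric criteria for grading responses."
)

completion = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": prompt}],
)
print(completion.choices[0].message.content)  # the generated rubrics
```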
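For the **Data-driven Rubrics Generation** bullet, `GraderGenerator` is the only name this diff gives us and its signature is not shown, so the sketch below reproduces the underlying loop by hand: feed labeled preference pairs to an LLM and ask it to induce the rubrics that separate chosen from rejected responses.

```python
# Data-driven rubric induction sketch. OpenJudge's GraderGenerator wraps
# this kind of loop; since its real interface is not shown in the diff,
# we call an OpenAI-compatible endpoint directly with a placeholder model.
import json

from openai import OpenAI

client = OpenAI()

# Tiny annotated dataset: each record has a query plus a preferred
# (chosen) and a dispreferred (rejected) response.
labeled_data = [
    {
        "query": "How do I reset my password?",
        "chosen": "Go to Settings > Security > Reset password, then follow the email link.",
        "rejected": "Just google it.",
    },
    {
        "query": "Why was I charged twice?",
        "chosen": "Duplicate holds usually clear in 3-5 days; contact us if this one does not.",
        "rejected": "Charges happen sometimes.",
    },
]

prompt = (
    "From the labeled examples below, summarize the evaluation rubrics "
    "that separate the chosen from the rejected responses. Return a "
    "numbered list of checkable criteria.\n\n" + json.dumps(labeled_data, indent=2)
)

completion = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": prompt}],
)
print(completion.choices[0].message.content)  # induced rubrics
```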
- ### 🔌 Easy Integration (🚧 Coming Soon)
+ ### 🔌 Easy Integration
- We're actively building seamless connectors for mainstream observability platforms and training frameworks. Stay tuned! → See [Integrations](#-integrations)
+ Using mainstream observability platforms like **LangSmith** or **Langfuse**? We offer seamless integration to enhance their evaluators and automated evaluation capabilities. We're also building integrations with training frameworks like **verl**. 👉 See [Integrations](#-integrations) for details.
----
## News
@@ -163,17 +165,18 @@ if __name__ == "__main__":
## 🔗 Integrations
- Seamlessly connect OpenJudge with mainstream observability and training platforms, with more integrations on the way:
+ Seamlessly connect OpenJudge with mainstream observability and training platforms:
- | Category | Status | Platforms |
- |:---------|:------:|:----------|
- | **Observability** | 🟡 In Progress | [LangSmith](https://smith.langchain.com/), [LangFuse](https://langfuse.com/), [Arize Phoenix](https://github.com/Arize-ai/phoenix) |
docs/index.md: 48 additions & 27 deletions
@@ -19,11 +19,12 @@ OpenJudge unifies evaluation metrics and reward signals into a single, standardi
- **Quality Assurance:** Built for reliability. Every grader comes with benchmark datasets and pytest integration for immediate quality validation. <a href="https://huggingface.co/datasets/agentscope-ai/OpenJudge" class="feature-link" target="_blank">View Benchmark Datasets <span class="link-arrow">→</span></a>
+ - **Flexible Grader Building**: Choose the build method that fits your requirements:
- - **Customization:** Easily extend or modify pre-defined graders to fit your specific needs. <a href="building_graders/create_custom_graders/" class="feature-link">Custom Grader Development Guide <span class="link-arrow">→</span></a>
- - **Generate Rubrics:** Need evaluation criteria but don't want to write them manually? Use **Simple Rubric** (from task description) or **Iterative Rubric** (from labeled data) to automatically generate white-box evaluation rubrics. <a href="building_graders/generate_rubrics_as_graders/" class="feature-link">Generate Rubrics as Graders <span class="link-arrow">→</span></a>
- - **Training Judge Models:** For high-scale and specialized scenarios, we are developing the capability to train dedicated Judge models. Support for SFT, Bradley-Terry models, and Reinforcement Learning workflows is on the way to help you build high-performance, domain-specific graders. <span class="badge-wip">🚧 Coming Soon</span>
+ - **Customization:** Clear requirements, but no existing grader? If you have explicit rules or logic, use our Python interfaces or prompt templates to quickly define your own grader. <a href="building_graders/create_custom_graders/" class="feature-link">Custom Grader Development Guide <span class="link-arrow">→</span></a>
+ - **Zero-shot Rubrics Generation:** Not sure what criteria to use, and no labeled data yet? Just provide a task description and optional sample queries—the LLM will automatically generate evaluation rubrics for you. Ideal for rapid prototyping. <a href="building_graders/generate_rubrics_as_graders/#simple-rubric-zero-shot-generation" class="feature-link">Zero-shot Rubrics Generation Guide <span class="link-arrow">→</span></a>
+ - **Data-driven Rubrics Generation:** Ambiguous requirements, but have a few examples? Use the GraderGenerator to automatically summarize evaluation rubrics from your annotated data and generate an LLM-based grader. <a href="building_graders/generate_rubrics_as_graders/#iterative-rubric-data-driven-generation" class="feature-link">Data-driven Rubrics Generation Guide <span class="link-arrow">→</span></a>
+ - **Training Judge Models:** Massive data and need peak performance? Use our training pipeline to train a dedicated Judge Model. This is ideal for complex scenarios where prompt-based grading falls short. <a href="building_graders/training_judge_models/" class="feature-link">Train Judge Models <span class="link-arrow">→</span></a>
- - **Easy Integration**: We're actively building seamless connectors for mainstream observability platforms and training frameworks. Stay tuned! <span class="badge-wip">🚧 Coming Soon</span>
+ - **Easy Integration**: Using mainstream observability platforms like **LangSmith** or **Langfuse**? We offer seamless integration to enhance their evaluators and automated evaluation capabilities. We're also building integrations with training frameworks like **verl**.
</div>
@@ -33,26 +34,38 @@ OpenJudge unifies evaluation metrics and reward signals into a single, standardi
<b>Comprehensive evaluation for AI Agents:</b> Learn to evaluate the full lifecycle—including final response, trajectory, tool usage, plan, memory, reflection, observation—using OpenJudge Graders.
<b>Construct High-Quality Reward Signals:</b> Create robust reward functions for model and agent alignment by aggregating diverse graders with custom weighting and high-concurrency support.
+ <p class="card-desc">
+ <b>Quality reward signals:</b> Aggregate graders with custom weighting for model alignment.
</p>
</a>
+
+
</div>
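The card above compresses the reward-signal story into one line: aggregate graders with custom weighting. Here is a dependency-free sketch of that aggregation step; OpenJudge's own runner also adds high-concurrency execution, which this toy version omits.

```python
# Weighted aggregation sketch: combine several graders' scores into one
# reward signal. Plain-Python stand-in for OpenJudge's aggregation; the
# real library also runs graders concurrently.
from typing import Callable, List, Tuple

# A "grader" here is just a function (query, response) -> float in [0, 1].
Grader = Callable[[str, str], float]


def length_grader(query: str, response: str) -> float:
    return 1.0 if len(response) <= 500 else 0.5


def politeness_grader(query: str, response: str) -> float:
    return 1.0 if "please" in response.lower() else 0.7


def aggregate(
    graders: List[Tuple[Grader, float]], query: str, response: str
) -> float:
    """Weighted average of grader scores; weights need not sum to 1."""
    total_weight = sum(weight for _, weight in graders)
    weighted_sum = sum(g(query, response) * weight for g, weight in graders)
    return weighted_sum / total_weight


reward = aggregate(
    [(length_grader, 0.3), (politeness_grader, 0.7)],
    "How do I reset my password?",
    "Please follow the reset link in Settings.",
)
print(f"aggregated reward: {reward:.3f}")
```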
@@ -141,41 +154,49 @@ OpenJudge unifies evaluation metrics and reward signals into a single, standardi
- <b>Ambiguous requirements, but have few examples?</b> Use the GraderGenerator to automatically summarize evaluation Rubrics from your annotated data, and generate a llm-based grader.
+ <b>Auto-generate evaluation criteria.</b> Use Zero-Shot generation from task descriptions, or Data-Driven generation to learn rubrics from labeled preference data.
- <b>Massive data and need peak performance?</b> Use our training pipeline to train a dedicated Judge Model. This is ideal for complex scenarios where prompt-based grading falls short.
+ <b>Massive data and need peak performance?</b> Train dedicated judge models using SFT, Bradley-Terry, or GRPO. Supports both scalar rewards and generative evaluation with reasoning.
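The new card text names Bradley-Terry among the training objectives. For reference, the standard Bradley-Terry pairwise loss on preference pairs looks like this; the score tensors are toy placeholders, and OpenJudge's actual training pipeline is not shown in this diff.

```python
# Bradley-Terry pairwise objective sketch: the standard way to train a
# scalar judge/reward model from preference pairs. Random tensors stand
# in for the judge model's scores on chosen/rejected responses.
import torch
import torch.nn.functional as F

# Scores for a batch of 8 preference pairs (normally produced by the
# judge model being trained; random placeholders here).
chosen_scores = torch.randn(8, requires_grad=True)
rejected_scores = torch.randn(8)

# Bradley-Terry models P(chosen beats rejected) = sigmoid(s_c - s_r),
# so training minimizes -log sigmoid(s_c - s_r) over the batch.
loss = -F.logsigmoid(chosen_scores - rejected_scores).mean()
loss.backward()  # gradients would flow into the judge model's parameters
print(float(loss))
```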
- Seamlessly connect with mainstream platforms like <strong>LangSmith</strong> and <strong>LangFuse</strong>. Streamline your evaluation pipelines and monitor agent performance with flexible APIs.
+ Build external evaluation pipelines for LangSmith. Wrap OpenJudge graders as LangSmith evaluators and run batch evaluations with GradingRunner.
+ Fetch traces from Langfuse, evaluate with OpenJudge graders, and push scores back. Supports batch processing and score aggregation.
+ </p>
+ </a>
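A sketch of the LangSmith direction described above: wrap a grader as a custom evaluator and hand it to LangSmith's `evaluate` helper. The grader, target function, and dataset name are placeholders, `GradingRunner` (which would batch the grading) is omitted, and the evaluator signature should be checked against the current LangSmith SDK.

```python
# Sketch: wrap an OpenJudge-style grader as a LangSmith evaluator.
# The evaluator shape (run, example) -> {"key", "score"} follows the
# LangSmith SDK as of this writing; verify against current docs.
from langsmith.evaluation import evaluate


def my_grader(query: str, response: str) -> float:
    """Stand-in for an OpenJudge grader: returns a score in [0, 1]."""
    return 1.0 if response.strip() else 0.0


def openjudge_evaluator(run, example) -> dict:
    # run.outputs / example.inputs shapes depend on your target function.
    query = example.inputs.get("question", "")
    response = (run.outputs or {}).get("answer", "")
    return {"key": "openjudge_quality", "score": my_grader(query, response)}


def target(inputs: dict) -> dict:
    return {"answer": f"Echo: {inputs['question']}"}  # toy system under test


results = evaluate(
    target,
    data="my-dataset",  # placeholder: an existing LangSmith dataset name
    evaluators=[openjudge_evaluator],
)
```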
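The Langfuse loop is the mirror image: fetch traces, grade them, and push scores back. Method names follow Langfuse's v2 Python SDK (`fetch_traces`, `score`) and may differ in newer versions; the grader is again a stand-in.

```python
# Sketch: pull traces from Langfuse, grade them, push scores back.
# fetch_traces/score are Langfuse v2 Python SDK methods; newer SDK
# versions may rename them.
from langfuse import Langfuse

langfuse = Langfuse()  # reads LANGFUSE_* keys from the environment


def my_grader(query: str, response: str) -> float:
    """Stand-in for an OpenJudge grader: returns a score in [0, 1]."""
    return 1.0 if response else 0.0


for trace in langfuse.fetch_traces(limit=20).data:
    value = my_grader(str(trace.input or ""), str(trace.output or ""))
    langfuse.score(
        trace_id=trace.id,
        name="openjudge_quality",
        value=value,
    )

langfuse.flush()  # ensure queued scores are actually sent
```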
<div class="feature-card-wip">
<div class="card-header">
@@ -184,7 +205,7 @@ OpenJudge unifies evaluation metrics and reward signals into a single, standardi
<span class="badge-wip">🚧 Work in Progress</span>
</div>
<p class="card-desc">
- Directly integrate into training loops such as <strong>VERL</strong>. Use Graders as high-quality reward functions for RLHF/RLAIF to align models effectively.
+ Directly integrate into training loops such as <strong>VERL</strong>. Use Graders as high-quality reward functions for fine-tuning to align models effectively.
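Finally, a sketch of what the verl hookup could look like. In recent verl versions a custom reward is commonly a `compute_score(data_source, solution_str, ground_truth, extra_info)` function registered through the `custom_reward_function` config; verify both against your verl version, and note the grader here is a stand-in.

```python
# Sketch: expose an OpenJudge-style grader as a verl custom reward.
# The compute_score signature and the custom_reward_function config key
# are assumptions about verl; check them against your installed version.


def my_grader(query: str, response: str) -> float:
    """Stand-in for an OpenJudge grader: returns a score in [0, 1]."""
    return min(1.0, len(response) / 200.0)


def compute_score(data_source, solution_str, ground_truth, extra_info=None):
    """Reward for one rollout: grade the generated solution string."""
    query = (extra_info or {}).get("question", "")
    return my_grader(query, solution_str)
```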
0 commit comments