docs: update index page with new features #51
@@ -97,14 +97,15 @@ OpenJudge unifies evaluation metrics and reward signals into a standardized **Grader** interface
### 🛠️ Flexible Grader Building Methods
Choose the build method that fits your needs:
* **Customization:** Easily extend or modify pre-defined graders to fit your specific needs. 👉 [Custom Grader Development Guide](https://modelscope.github.io/OpenJudge/building_graders/create_custom_graders/)
* **Generate Rubrics:** Need evaluation criteria but don't want to write them manually? Use **Simple Rubric** (from task descriptions) or **Iterative Rubric** (from labeled data) to automatically generate white-box evaluation rubrics. 👉 [Generate Rubrics as Graders](https://modelscope.github.io/OpenJudge/building_graders/generate_rubrics_as_graders/)
* **Training Judge Models (Coming Soon 🚀):** For high-scale and specialized scenarios, we are developing the capability to train dedicated judge models. Support for SFT, Bradley-Terry models, and reinforcement learning workflows is on the way to help you build high-performance, domain-specific graders.
* **Customization:** Clear requirements, but no existing grader? If you have explicit rules or logic, use our Python interfaces or Prompt templates to quickly define your own grader. 👉 [Custom Grader Development Guide](https://modelscope.github.io/OpenJudge/building_graders/create_custom_graders/)
* **Zero-shot Rubric Generation:** Not sure what criteria to use, and no labeled data yet? Just provide a task description and optional sample queries, and the LLM will automatically generate evaluation rubrics for you. Ideal for rapid prototyping. 👉 [Zero-shot Rubric Generation Guide](https://modelscope.github.io/OpenJudge/building_graders/generate_rubrics_as_graders/#simple-rubric-zero-shot-generation)
* **Data-driven Rubric Generation:** Ambiguous requirements, but have a few examples? Use the GraderGenerator to automatically summarize evaluation rubrics from your annotated data and generate an LLM-based grader. 👉 [Data-driven Rubric Generation Guide](https://modelscope.github.io/OpenJudge/building_graders/generate_rubrics_as_graders/#iterative-rubric-data-driven-generation)
* **Training Judge Models:** Have massive data and need peak performance? Use our training pipeline to train a dedicated judge model, ideal for complex scenarios where prompt-based grading falls short. 👉 [Train Judge Models](https://modelscope.github.io/OpenJudge/building_graders/training_judge_models/)
Contributor: For better readability and consistency, it's good practice to add a space before the arrow emoji (👉).
Suggested change
### 🔌 Easy Integration (🚧 Coming Soon)
### 🔌 Easy Integration

We're actively building seamless connectors for mainstream observability platforms and training frameworks. Stay tuned! → See [Integrations](#-集成)
Using mainstream observability platforms like **LangSmith** or **Langfuse**? We offer seamless integrations that enhance their evaluators and automated evaluation capabilities. We're also building integrations with training frameworks like **verl**. 👉 See [Integrations](#-集成) for details
Contributor: For better readability and consistency, it's good practice to add a space before the arrow emoji (👉).
Suggested change
----
## What's New
@@ -163,12 +164,15 @@ if __name__ == "__main__":
## 🔗 Integrations

Seamlessly connect OpenJudge with mainstream observability and training platforms, with more integrations on the way:
Seamlessly connect OpenJudge with mainstream observability and training platforms:
| Category | Status | Platforms |
|:---------|:------:|:----------|
| **Observability** | 🟡 In Progress | [LangSmith](https://smith.langchain.com/), [LangFuse](https://langfuse.com/), [Arize Phoenix](https://github.com/Arize-ai/phoenix) |
| **Training** | 🔵 Planned | [verl](https://github.com/volcengine/verl), [Trinity-RFT](https://github.com/modelscope/Trinity-RFT) |

| Category | Platform | Status | Docs |
|:---------|:---------|:------:|:--------------|
| **Observability** | [LangSmith](https://smith.langchain.com/) | ✅ Available | 👉 [LangSmith Integration Guide](https://modelscope.github.io/OpenJudge/integrations/langsmith/) |
| | [Langfuse](https://langfuse.com/) | ✅ Available | 👉 [Langfuse Integration Guide](https://modelscope.github.io/OpenJudge/integrations/langfuse/) |
| | Other frameworks | 🔵 Planned | — |
| **Training** | [verl](https://github.com/volcengine/verl) | 🟡 In Progress | — |
| | [Trinity-RFT](https://github.com/modelscope/Trinity-RFT) | 🔵 Planned | — |
> 💬 Is there a framework you'd like us to prioritize? [Open an Issue](https://github.com/modelscope/OpenJudge/issues)!
@@ -19,11 +19,12 @@ OpenJudge unifies evaluation metrics and reward signals into a single, standardi

- **Quality Assurance:** Built for reliability. Every grader comes with benchmark datasets and pytest integration for immediate quality validation. <a href="https://huggingface.co/datasets/agentscope-ai/OpenJudge" class="feature-link" target="_blank"> View Benchmark Datasets<span class="link-arrow">→</span></a>
+ **Flexible Grader Building**: Choose the build method that fits your requirements:
    - **Customization:** Easily extend or modify pre-defined graders to fit your specific needs. <a href="building_graders/create_custom_graders/" class="feature-link">Custom Grader Development Guide <span class="link-arrow">→</span></a>
    - **Generate Rubrics:** Need evaluation criteria but don't want to write them manually? Use **Simple Rubric** (from task description) or **Iterative Rubric** (from labeled data) to automatically generate white-box evaluation rubrics. <a href="building_graders/generate_rubrics_as_graders/" class="feature-link">Generate Rubrics as Graders <span class="link-arrow">→</span></a>
    - **Training Judge Models:** For high-scale and specialized scenarios, we are developing the capability to train dedicated Judge models. Support for SFT, Bradley-Terry models, and Reinforcement Learning workflows is on the way to help you build high-performance, domain-specific graders. <span class="badge-wip">🚧 Coming Soon</span>
    - **Customization:** Clear requirements, but no existing grader? If you have explicit rules or logic, use our Python interfaces or Prompt templates to quickly define your own grader. <a href="building_graders/create_custom_graders/" class="feature-link">Custom Grader Development Guide <span class="link-arrow">→</span></a>
    - **Zero-shot Rubrics Generation:** Not sure what criteria to use, and no labeled data yet? Just provide a task description and optional sample queries, and the LLM will automatically generate evaluation rubrics for you. Ideal for rapid prototyping. <a href="building_graders/generate_rubrics_as_graders/#simple-rubric-zero-shot-generation" class="feature-link">Zero-shot Rubrics Generation Guide <span class="link-arrow">→</span></a>
    - **Data-driven Rubrics Generation:** Ambiguous requirements, but have a few examples? Use the GraderGenerator to automatically summarize evaluation rubrics from your annotated data and generate an LLM-based grader. <a href="building_graders/generate_rubrics_as_graders/#iterative-rubric-data-driven-generation" class="feature-link">Data-driven Rubrics Generation Guide <span class="link-arrow">→</span></a>
    - **Training Judge Models:** Have massive data and need peak performance? Use our training pipeline to train a dedicated Judge Model. This is ideal for complex scenarios where prompt-based grading falls short. <a href="building_graders/training_judge_models/" class="feature-link">Train Judge Models <span class="link-arrow">→</span></a>
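The "Customization" bullet above says graders can be defined through Python interfaces. As a rough sketch of that idea only (the names `KeywordCoverageGrader`, `GraderResult`, and `evaluate` are assumptions for illustration, not OpenJudge's documented API), a simple rule-based grader might look like this:

```python
# Hypothetical sketch of a rule-based custom grader.
# NOTE: class and method names here are assumptions, not OpenJudge's documented API;
# see the Custom Grader Development Guide for the real interface.

from dataclasses import dataclass


@dataclass
class GraderResult:
    score: float  # normalized score in [0, 1]
    reason: str   # human-readable explanation


class KeywordCoverageGrader:
    """Scores a response by how many required keywords it contains."""

    def __init__(self, keywords: list[str]):
        self.keywords = [k.lower() for k in keywords]

    def evaluate(self, query: str, response: str) -> GraderResult:
        hits = [k for k in self.keywords if k in response.lower()]
        score = len(hits) / len(self.keywords) if self.keywords else 0.0
        return GraderResult(score=score, reason=f"matched {hits} of {self.keywords}")


if __name__ == "__main__":
    grader = KeywordCoverageGrader(["refund", "apology"])
    print(grader.evaluate("Handle the complaint", "We are sorry; a refund is on its way."))
```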
+ **Easy Integration**: We're actively building seamless connectors for mainstream observability platforms and training frameworks. Stay tuned! <span class="badge-wip">🚧 Coming Soon</span>
+ **Easy Integration**: Using mainstream observability platforms like **LangSmith** or **Langfuse**? We offer seamless integration to enhance their evaluators and automated evaluation capabilities. We're also building integrations with training frameworks like **verl**.

</div>
@@ -33,26 +34,38 @@ OpenJudge unifies evaluation metrics and reward signals into a single, standardi

<div class="card-grid">
<a href="get_started/evaluate_ai_agents/" class="feature-card">
<div class="card-header card-header-lg">
<a href="applications/zero_shot_evaluation/" class="feature-card-sm">
<div class="card-header">
<img src="https://unpkg.com/lucide-static@latest/icons/zap.svg" class="card-icon card-icon-general">
<h3>Zero-Shot Evaluation</h3>
</div>
<p class="card-desc">
<b>Compare models without test data:</b> Generate queries, collect responses, and rank via pairwise evaluation.
</p>
</a>
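The Zero-Shot Evaluation card above describes ranking models by pairwise comparison of their responses. A minimal sketch of that loop, with a placeholder `judge` standing in for a real pairwise grader (the actual OpenJudge call is not shown here):

```python
# Illustrative-only pairwise ranking loop; judge() is a stand-in for an
# OpenJudge pairwise grader, not a real API.

from collections import defaultdict
from itertools import combinations


def judge(query: str, answer_a: str, answer_b: str) -> str:
    """Placeholder pairwise judge: returns 'a' or 'b'. Replace with a real grader call."""
    return "a" if len(answer_a) >= len(answer_b) else "b"


def rank_models(queries, responses_by_model):
    """Count pairwise wins per model across all queries and sort by win count."""
    wins = defaultdict(int)
    for q in queries:
        for m1, m2 in combinations(responses_by_model, 2):
            verdict = judge(q, responses_by_model[m1][q], responses_by_model[m2][q])
            wins[m1 if verdict == "a" else m2] += 1
    return sorted(wins.items(), key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    models = {"model_a": {"q1": "short"}, "model_b": {"q1": "a slightly longer answer"}}
    print(rank_models(["q1"], models))
```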
<a href="get_started/evaluate_ai_agents/" class="feature-card-sm">
<div class="card-header">
<img src="https://unpkg.com/lucide-static@latest/icons/bot.svg" class="card-icon card-icon-agent">
<h3>Evaluate An AI Agent</h3>
</div>
<p class="card-desc card-desc-lg">
<b>Comprehensive evaluation for AI Agents:</b> Learn to evaluate the full lifecycle—including final response, trajectory, tool usage, plan, memory, reflection, observation—using OpenJudge Graders.
<p class="card-desc">
<b>Agent lifecycle evaluation:</b> Assess response, trajectory, tool usage, planning, memory, and reflection.
</p>
</a>
<a href="get_started/build_reward/" class="feature-card">
<div class="card-header card-header-lg">
<a href="get_started/build_reward/" class="feature-card-sm">
<div class="card-header">
<img src="https://unpkg.com/lucide-static@latest/icons/brain-circuit.svg" class="card-icon card-icon-tool">
<h3>Build Rewards for Training</h3>
</div>
<p class="card-desc card-desc-lg">
<b>Construct High-Quality Reward Signals:</b> Create robust reward functions for model and agent alignment by aggregating diverse graders with custom weighting and high-concurrency support.
<p class="card-desc">
<b>Quality reward signals:</b> Aggregate graders with custom weighting for model alignment.
</p>
</a>
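The reward card above mentions aggregating diverse graders with custom weighting. A minimal sketch of that aggregation, assuming simple callable graders rather than OpenJudge's real classes:

```python
# Sketch of weighted grader aggregation into a single reward signal.
# The grader callables are stand-ins; OpenJudge's real aggregation utilities
# may expose this differently.

from typing import Callable

GraderFn = Callable[[str, str], float]  # (query, response) -> score in [0, 1]


def combined_reward(query: str, response: str,
                    graders: dict[str, GraderFn],
                    weights: dict[str, float]) -> float:
    """Weighted average of grader scores, usable as a training reward."""
    total_w = sum(weights.values())
    return sum(weights[name] * graders[name](query, response) for name in graders) / total_w


# Example with two toy graders (assumed shapes, not OpenJudge built-ins):
reward = combined_reward(
    "Summarize the ticket",
    "The user requests a refund.",
    graders={"relevance": lambda q, r: 0.9,
             "brevity": lambda q, r: 1.0 if len(r) < 200 else 0.5},
    weights={"relevance": 0.7, "brevity": 0.3},
)
```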
Comment on lines +67 to +68 (Contributor)
</div>

@@ -141,41 +154,49 @@ OpenJudge unifies evaluation metrics and reward signals into a single, standardi
<a href="building_graders/generate_rubrics_as_graders/" class="feature-card-sm">
<div class="card-header">
<img src="https://unpkg.com/lucide-static@latest/icons/database.svg" class="card-icon card-icon-data">
<h3>Data-Driven Rubrics</h3>
<img src="https://unpkg.com/lucide-static@latest/icons/sparkles.svg" class="card-icon card-icon-data">
<h3>Generate Rubrics</h3>
</div>
<p class="card-desc">
<b>Ambiguous requirements, but have a few examples?</b> Use the GraderGenerator to automatically summarize evaluation rubrics from your annotated data and generate an LLM-based grader.
<b>Auto-generate evaluation criteria.</b> Use Zero-Shot generation from task descriptions, or Data-Driven generation to learn rubrics from labeled preference data.
</p>
</a>
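The Generate Rubrics card above covers data-driven generation with the GraderGenerator. Since its exact signature is not shown here, the sketch below only illustrates the underlying idea: summarize rubrics from preference-labeled examples with an LLM.

```python
# Conceptual sketch of data-driven rubric generation, not the GraderGenerator API.
# The prompt wording and the call_llm() helper are illustrative stand-ins.

def build_rubric_prompt(task: str, labeled_pairs: list[dict]) -> str:
    """Turn preference-labeled examples into a rubric-summarization prompt."""
    examples = "\n\n".join(
        f"Query: {p['query']}\nChosen: {p['chosen']}\nRejected: {p['rejected']}"
        for p in labeled_pairs
    )
    return (
        f"Task: {task}\n"
        f"Below are preference-labeled examples.\n\n{examples}\n\n"
        "Summarize 3-5 evaluation rubrics that explain why the chosen responses "
        "are preferred, phrased as checkable criteria."
    )


def call_llm(prompt: str) -> str:
    """Placeholder for whatever LLM client the project configures."""
    raise NotImplementedError
```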
<div class="feature-card-wip">
<a href="building_graders/training_judge_models/" class="feature-card-sm">
<div class="card-header">
<img src="https://unpkg.com/lucide-static@latest/icons/scale.svg" class="card-icon card-icon-integration">
<h3>Trainable Judge Model</h3>
<h3>Train Judge Models</h3>
</div>
<span class="badge-wip">🚧 Work in Progress</span>
<p class="card-desc">
<b>Have massive data and need peak performance?</b> Use our training pipeline to train a dedicated Judge Model. This is ideal for complex scenarios where prompt-based grading falls short.
<b>Have massive data and need peak performance?</b> Train dedicated judge models using SFT, Bradley-Terry, or GRPO. Supports both scalar rewards and generative evaluation with reasoning.
</p>
</div>
</a>
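The training card above lists Bradley-Terry among the supported objectives. For reference, the pairwise Bradley-Terry loss commonly used for reward and judge models looks like this in PyTorch (a generic illustration, not OpenJudge's training pipeline):

```python
# Bradley-Terry pairwise loss: maximize the probability that the chosen response
# scores higher than the rejected one. Generic illustration only.

import torch
import torch.nn.functional as F


def bradley_terry_loss(chosen_scores: torch.Tensor, rejected_scores: torch.Tensor) -> torch.Tensor:
    """-log sigmoid(r_chosen - r_rejected), averaged over the batch."""
    return -F.logsigmoid(chosen_scores - rejected_scores).mean()


# chosen_scores / rejected_scores would come from the judge model's scalar head
loss = bradley_terry_loss(torch.tensor([1.2, 0.3]), torch.tensor([0.4, 0.9]))
```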
</div>

### Integrations

<div class="card-grid">
<div class="feature-card-wip">
<a href="integrations/langsmith/" class="feature-card">
<div class="card-header">
<img src="https://unpkg.com/lucide-static@latest/icons/bar-chart-3.svg" class="card-icon card-icon-integration">
<h3>Evaluation Frameworks</h3>
<span class="badge-wip">🚧 Work in Progress</span>
<img src="https://unpkg.com/lucide-static@latest/icons/telescope.svg" class="card-icon card-icon-integration">
<h3>LangSmith</h3>
</div>
<p class="card-desc">
Seamlessly connect with mainstream platforms like <strong>LangSmith</strong> and <strong>LangFuse</strong>. Streamline your evaluation pipelines and monitor agent performance with flexible APIs.
Build external evaluation pipelines for LangSmith. Wrap OpenJudge graders as LangSmith evaluators and run batch evaluations with GradingRunner.
</p>
</div>
</a>
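The LangSmith card describes pulling runs, grading them, and attaching scores. A minimal sketch of such a loop, assuming the LangSmith Python SDK's `Client.list_runs` and `create_feedback` (verify against the current SDK) and a hypothetical grader call:

```python
# Sketch of an external evaluation loop over LangSmith runs.
# Assumes Client.list_runs / create_feedback from the LangSmith Python SDK;
# stub_grade() is a stand-in for an OpenJudge grader, not a documented API.

from langsmith import Client


def stub_grade(query: str, answer: str) -> tuple[float, str]:
    """Stand-in for an OpenJudge grader call (real grader API not shown here)."""
    return (1.0 if answer else 0.0, "non-empty answer")


client = Client()  # reads LANGSMITH_API_KEY from the environment

for run in client.list_runs(project_name="my-agent", limit=20):
    score, reason = stub_grade(str(run.inputs), str(run.outputs))
    client.create_feedback(run.id, key="openjudge_score", score=score, comment=reason)
```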
<a href="integrations/langfuse/" class="feature-card">
<div class="card-header">
<img src="https://unpkg.com/lucide-static@latest/icons/activity.svg" class="card-icon card-icon-data">
<h3>Langfuse</h3>
</div>
<p class="card-desc">
Fetch traces from Langfuse, evaluate with OpenJudge graders, and push scores back. Supports batch processing and score aggregation.
</p>
</a>
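The Langfuse card describes a fetch, evaluate, and push-back loop. A minimal sketch, assuming the v2-style Langfuse Python SDK (`fetch_traces` and `score`; newer SDK versions rename some of these) and a stand-in grading function:

```python
# Sketch of a fetch-evaluate-push loop for Langfuse traces.
# Assumes the v2-style Python SDK; stub_grade() stands in for an OpenJudge grader.

from langfuse import Langfuse

langfuse = Langfuse()  # reads LANGFUSE_PUBLIC_KEY / LANGFUSE_SECRET_KEY from env


def stub_grade(query, answer) -> float:
    """Stand-in for an OpenJudge grader call."""
    return 1.0 if answer else 0.0


for trace in langfuse.fetch_traces(limit=50).data:
    score = stub_grade(trace.input, trace.output)
    langfuse.score(trace_id=trace.id, name="openjudge_score", value=score)
```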
<div class="feature-card-wip">
<div class="card-header">

@@ -184,7 +205,7 @@ OpenJudge unifies evaluation metrics and reward signals into a single, standardi

<span class="badge-wip">🚧 Work in Progress</span>
</div>
<p class="card-desc">
Directly integrate into training loops such as <strong>VERL</strong>. Use Graders as high-quality reward functions for RLHF/RLAIF to align models effectively.
Directly integrate into training loops such as <strong>VERL</strong>. Use Graders as high-quality reward functions for fine-tuning to align models effectively.
</p>
</div>