
Releases: agentscope-ai/OpenJudge

v0.2.2

12 Feb 06:46
33db7c4


🚀 OpenJudge v0.2.2 Changelog

This release introduces major UI enhancements, expanded evaluation capabilities for vertical domains (Finance and Academia), and a significant architectural refactor built on Executor and Strategy patterns. Prompt templates have also been standardized across all graders.


✨ New Features

🌐 Online Playground

Explore OpenJudge without writing a single line of code. Our online platform at openjudge.me/app lets you:

  • Test graders interactively: Select a built-in grader, input your data, and see results instantly.
  • Build custom rubrics: Use the zero-shot generator to create graders from task descriptions.
  • View leaderboards: Compare model performance across evaluation benchmarks at openjudge.me/leaderboard.

📊 Domain-Specific & Advanced Evaluation

  • Finance Domain: Added specialized graders for stock, event, industry, and macro-economic analysis (#117).
  • Academic Paper Review: Introduced a Paper Review cookbook and a dedicated UI for systematic academic analysis (#87, #101).
  • Agentic Evaluation: Added AgenticGrader, SearchCorrectnessGrader (with tool-call support), and Trajectory Accuracy graders (#82, #102).
  • Multi-turn & Arena: Support for comprehensive multi-turn conversation evaluation and a new Reference Hallucination Arena (#114, #120).

🖥️ UI & User Experience

  • Streamlit-based Grader UI: A new interactive interface for evaluating and managing graders (#71).
  • Auto Rubric: Automatically generate grading criteria to simplify the evaluation setup (#92).
  • UX Improvements: Redesigned sidebar layout, simplified theme styles, and improved the Grader Generator workflow (#103, #109).
  • Analytics: Added single evaluation logging for deeper data insights (#110).

⚙️ Core Architecture

  • Executor & Strategy Patterns: Implemented a more flexible backend architecture to handle diverse evaluation workflows (#97, #96).
  • Experimentation: Added support for running grader evaluation experiments directly on datasets (#95).
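The Executor/Strategy split can be pictured with a minimal sketch (all names below are illustrative, not OpenJudge's actual API): a Strategy decides *how* a sample is judged, while an Executor decides *how* the work is run, so the two can vary independently.

```python
from abc import ABC, abstractmethod
from concurrent.futures import ThreadPoolExecutor

class EvalStrategy(ABC):
    """Decides HOW a sample is scored (pointwise, pairwise, ...)."""
    @abstractmethod
    def evaluate(self, sample: dict) -> float: ...

class PointwiseStrategy(EvalStrategy):
    def evaluate(self, sample: dict) -> float:
        # Toy exact-match rule standing in for an LLM-based grader.
        return 1.0 if sample["answer"] == sample["reference"] else 0.0

class ThreadedExecutor:
    """Decides HOW the work runs (serial, threaded, async, ...)."""
    def __init__(self, strategy: EvalStrategy, workers: int = 4):
        self.strategy = strategy
        self.workers = workers

    def run(self, samples: list[dict]) -> list[float]:
        with ThreadPoolExecutor(max_workers=self.workers) as pool:
            return list(pool.map(self.strategy.evaluate, samples))

scores = ThreadedExecutor(PointwiseStrategy()).run(
    [{"answer": "A", "reference": "A"}, {"answer": "B", "reference": "A"}]
)
# scores == [1.0, 0.0]
```

Swapping in a different strategy (e.g. pairwise comparison) or executor (e.g. serial for debugging) then requires no change to the other half of the pipeline.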

🛠️ Refactoring & Optimization

  • XML Prompt Standardization: Standardized prompt templates into XML tag formats for General, Agent, and Multimodal graders to improve LLM parsing reliability (#105, #108, #112).
  • Reasoning-First Scoring: Reordered JSON schemas so models provide Reasoning before the Score, improving scoring stability and Chain-of-Thought (CoT) performance (#113).
  • Pipeline Decoupling: Refactored Graders and Runners to align with the new Strategy/Executor framework (#99, #100).
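A hedged sketch of what these two conventions look like together (the tag names and schema fields here are assumptions for illustration, not OpenJudge's actual templates): input sections are wrapped in XML tags for reliable parsing, and the output schema declares reasoning before score so the model commits to its chain of thought first.

```python
# Hypothetical XML-tagged prompt: each input section gets its own tag,
# which is easier for an LLM to delimit than free-form prose.
prompt = (
    "<instruction>Rate the response on a 1-5 scale.</instruction>\n"
    "<query>What is the capital of France?</query>\n"
    "<response>Paris.</response>"
)

# Reasoning-first output schema: "reasoning" is declared before "score",
# nudging the model to explain itself before committing to a number.
output_schema = {
    "type": "object",
    "properties": {
        "reasoning": {"type": "string"},
        "score": {"type": "integer", "minimum": 1, "maximum": 5},
    },
    "required": ["reasoning", "score"],
}

print(list(output_schema["properties"]))  # ['reasoning', 'score']
```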

🐞 Bug Fixes

  • Environment Stability: Fixed and pinned version requirements for openai and streamlit (#79, #90).
  • Paper Review Pipeline: Fixed dynamic date injection, score display issues, and added GraderError handling (#88, #91).
  • UI Rendering: Resolved issues with history result rendering (#94).
  • Logic Fixes: Corrected and formatted the Trajectory Accuracy grader logic (#107, #104).

📝 Documentation & Deployment

  • Docker: Added comprehensive Docker installation guides for both OpenJudge and the training environment (#84, #111).
  • Technical Docs: Updated documentation regarding Strategy, Executor, and system integration (#98).
  • Project News: Updated README with the latest features including Paper Review and Auto Arena (#89).

Detailed Changelog

  • Fix openai version requirement. by @weizhang25 in #79
  • feat: add paper review cookbook for academic paper analysis by @XiaoBoAI in #87
  • fix(paper_review): add dynamic date injection and fix score display by @XiaoBoAI in #88
  • feat(ui): Add Streamlit-based grader evaluation UI by @XiaoBoAI in #71
  • Fix/pin streamlit version by @XiaoBoAI in #90
  • docs: add Docker installation guide for OpenJudge and training enviro… by @XieLipeng0830 in #84
  • fix(paper_review): add GraderError handling in pipeline by @XiaoBoAI in #91
  • docs(readme): add news for Paper Review, OpenJudge UI, and Auto Arena by @XiaoBoAI in #89
  • feat: Update LLMGrader and schema by @jc200808 in #80
  • fix: history result ui render by @weidankong in #94
  • feat(ui): Add Auto Rubric feature for automatic grading criteria gene… by @XiaoBoAI in #92
  • Feat/UI paper review by @XiaoBoAI in #101
  • Feature/add trajectory accuracy grader by @helloml0326 in #102
  • feature (executor): implement executor patterns by @ployts in #97
  • format trajectory_accuracy_grader by @helloml0326 in #104
  • feat(ui): improve Grader Generator feature with better UX by @XieLipeng0830 in #103
  • refactor(graders): standardize prompt template format for common graders by @XiaoBoAI in #105
  • fix trajectory_accuracy_grader by @helloml0326 in #107
  • Chore/update dependencies by @XiaoBoAI in #106
  • refactor(graders/agent): standardize prompts to XML tag format by @XieLipeng0830 in #108
  • refactor(ui): simplify theme styles and improve sidebar layout by @XiaoBoAI in #109
  • feat(ui): add single evaluation logging for analytics by @XiaoBoAI in #110
  • feature(strategy): implement evaluation strategy by @ployts in #96
  • feat: update docker file for judge model post training by @jc200808 in #111
  • refactor(grader): refactor graders for strategy/executor by @ployts in #99
  • refactor(runner): runner for grader&executor by @ployts in #100
  • docs (strategy/executor/integration): update related docs for new version by @ployts in #98
  • feat: add grader evaluation experiments on datasets by @jc200808 in #95
  • refactor(graders): reorder JSON schema to reason before score by @XiaoBoAI in #113
  • feat: add finance graders for stock/event/industry/macro analysis by @XieLipeng0830 in #117
  • feat: Add multi-turn conversation graders with comprehensive evaluati… by @XieLipeng0830 in #114
  • refactor(graders/multimodal): standardize prompts to XML tag format by @XiaoBoAI in #112
  • Feat/ref hallucination arena by @XiaoBoAI in #120
  • feat: add AgenticGrader and SearchCorrectnessGrader with tool support by @XieLipeng0830 in #82

Full Changelog: v0.2.1...v0.2.2

v0.2.1

21 Jan 06:34
595f1fd


Changelog

Integrations & Ecosystem

  • VERL Integration: Added an integration guide for VERL with async reward evaluation support.

  • Observability: Added documentation and cookbooks for Langfuse and LangSmith integrations.

Grader Improvements

  • Standardized Scoring: Adjusted the score range for multimodal graders from [0, 1] to a 1-5 scale for better granularity.

  • Tool Call Evaluation:

    • Added ToolCallPrecisionRecallMatchGrader.

    • Renamed ToolCallSequenceMatchGrader to ToolCallStepSequenceMatchGrader and added metric_type parameters.

  • Terminology Alignment: Renamed "Reward Model" to "Judge Model" across documentation and code to better reflect its role in the evaluation ecosystem.

  • Logic Refinements: Standardized parameter naming and response parsing, and improved streaming support across various graders.
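For pipelines that previously consumed [0, 1] scores, the new 1-5 scale can be mapped back linearly; a small sketch (the helper name is ours, not part of OpenJudge):

```python
def to_unit_interval(score: int) -> float:
    """Linearly map a 1-5 grader score onto [0, 1]."""
    if not 1 <= score <= 5:
        raise ValueError(f"expected a score in [1, 5], got {score}")
    return (score - 1) / 4

assert to_unit_interval(1) == 0.0
assert to_unit_interval(3) == 0.5
assert to_unit_interval(5) == 1.0
```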

Evaluation Pipeline

  • Zero-shot Evaluation Pipeline: Added a comprehensive zero-shot evaluation workflow, including win rate chart generators, report generators, and rerun-judge capabilities.

  • GRPO Enhancements: Improved code style for GRPO, added a dedicated report generator, and updated training documentation for judge models.

  • Rubric System: Introduced and enhanced Simple Rubric and Iterative Rubric generation and evaluation modules.

  • Dataset Updates: Updated ChatRLDataset class and associated logic to better support reinforcement learning workflows.

What's Changed

  • Feature/zero shot evaluation by @XiaoBoAI in #28
  • docs: add DingTalk community group QR code to README by @helloml0326 in #30
  • docs: add docs/integration/langfuse.md by @helloml0326 in #31
  • refactor(grpo): improve code style and add report generator by @XiaoBoAI in #37
  • feat(integration): add LangSmith integration cookbook and documentation by @ployts in #33
  • Minor code refactoring, including None checks and argument name changes. by @weizhang25 in #34
  • feat(template): add GitHub PR template and issue templates by @ployts in #36
  • Docs/sample reports by @XiaoBoAI in #38
  • Rename grader validator file name from base.py to grader_validator.py by @weizhang25 in #41
  • A common util method of formatting history for agent graders. by @weizhang25 in #42
  • feat: update agent graders by @jc200808 in #35
  • docs: add Simple Rubric documentation and rename to Generate Rubrics … by @XieLipeng0830 in #45
  • feat: add report generator and update zero-shot evaluation pipeline by @XiaoBoAI in #32
  • docs(building_graders): add training reward models guide and update integrations by @XiaoBoAI in #46
  • chore(test): remove redundant multimodal graders syntax test by @XiaoBoAI in #43
  • feat: update example code and template processing in agent and common graders by @jc200808 in #49
  • Use compiled regex objects instead of raw pattern strings. by @weizhang25 in #48
  • Docs/rename reward to judge model by @XiaoBoAI in #47
  • feat(security): add pre-commit hooks for secret detection by @XiaoBoAI in #50
  • refactor(graders): standardize parameter naming and response parsing by @ployts in #44
  • docs: update index page with new features by @helloml0326 in #51
  • fix(graders): fix typo in code_execution filename and imports by @XiaoBoAI in #54
  • Set the name of a customized llm grader. by @weizhang25 in #52
  • refactor(graders,models): cleanup and improve code quality by @XiaoBoAI in #55
  • Add OpenJudge integration guide for VERL with async reward evaluation. by @chr6192 in #53
  • refactor(graders): fix deprecated parameters and improve input validation by @XiaoBoAI in #56
  • feat(zero_shot): add win rate chart generator by @XiaoBoAI in #57
  • A utility function that collects all grader information. by @weizhang25 in #58
  • fix(graders): modify threshold value by @weidankong in #59
  • feat: update more graders including multimodal graders by @jc200808 in #60
  • refactor(analyzer): update consistency analyzer and agent grader tests by @ployts in #61
  • refactor(graders): improve parameter validation and streaming support by @XiaoBoAI in #62
  • feat(grader): add metric_type parameter to ToolCallSequenceMatchGrader by @helloml0326 in #64
  • fix: fix a typo in a comment by @Wangzy455 in #66
  • feat: update ChatRLDataset class and judge model grpo training document by @jc200808 in #67
  • Update get_all_grader_info logic. by @weizhang25 in #65
  • feat(zero_shot): enhance evaluation pipeline with rerun-judge and cha… by @XiaoBoAI in #63
  • chore: migrate links from modelscope to agentscope-ai org by @XieLipeng0830 in #73
  • refactor(multimodal): change score range from [0,1] to 1-5 scale by @XiaoBoAI in #75
  • feat(iterative_rubric): enhance rubric generation and evaluation by @XieLipeng0830 in #72
  • Update grader info util. by @weizhang25 in #74
  • docs: improve README with comprehensive examples and update simple ru… by @helloml0326 in #76
  • Docs/update verl integration status by @helloml0326 in #77
  • feat: update base/openai chat model module by @jc200808 in #70
  • Feature/add ToolCallPrecisionRecallMatchGrader & rename: ToolCallSequenceMatchGrader ->ToolCallStepSequenceMatchGrader by @helloml0326 in #68

New Contributors

Full Changelog: https://github.com/agentscope-ai/OpenJudge/commits/v0.2.1