Releases: agentscope-ai/OpenJudge
v0.2.2
🚀 OpenJudge v0.2.2 Changelog
This release introduces major UI enhancements, expanded evaluation capabilities for vertical domains (Finance and Academia), and a significant architectural refactor using Executor/Strategy patterns. We have also standardized prompt templates across the board.
✨ New Features
🌐 Online Playground
Explore OpenJudge without writing a single line of code. Our online platform at openjudge.me/app lets you:
- Test graders interactively: Select a built-in grader, input your data, and see results instantly.
- Build custom rubrics: Use the zero-shot generator to create graders from task descriptions.
- View leaderboards: Compare model performance across evaluation benchmarks at openjudge.me/leaderboard.
📊 Domain-Specific & Advanced Evaluation
- Finance Domain: Added specialized graders for stock, event, industry, and macro-economic analysis (#117).
- Academic Paper Review: Introduced a Paper Review cookbook and a dedicated UI for systematic academic analysis (#87, #101).
- Agentic Evaluation: Added AgenticGrader, SearchCorrectnessGrader (with tool-call support), and Trajectory Accuracy graders (#82, #102).
- Multi-turn & Arena: Support for comprehensive multi-turn conversation evaluation and a new Reference Hallucination Arena (#114, #120).
🖥️ UI & User Experience
- Streamlit-based Grader UI: A new interactive interface for evaluating and managing graders (#71).
- Auto Rubric: Automatically generate grading criteria to simplify the evaluation setup (#92).
- UX Improvements: Redesigned sidebar layout, simplified theme styles, and improved the Grader Generator workflow (#103, #109).
- Analytics: Added single evaluation logging for deeper data insights (#110).
⚙️ Core Architecture
- Executor & Strategy Patterns: Implemented a more flexible backend architecture to handle diverse evaluation workflows (#97, #96).
- Experimentation: Added support for running grader evaluation experiments directly on datasets (#95).
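As a rough illustration of what an Executor/Strategy split can look like, here is a minimal sketch. It is not taken from the OpenJudge codebase: the names `EvaluationStrategy`, `DirectStrategy`, `AverageOfNStrategy`, `LocalExecutor`, and `length_grader` are all hypothetical. It only shows the general idea that a strategy decides how a grader is applied to one sample, while an executor decides where and how concurrently that work runs.

```python
from concurrent.futures import ThreadPoolExecutor
from statistics import mean
from typing import Callable, Protocol, Sequence

# Hypothetical: a grader is modelled as a plain function from a sample to a score.
Grader = Callable[[str], float]


class EvaluationStrategy(Protocol):
    """Decides HOW a grader is applied to a single sample."""

    def run(self, grader: Grader, sample: str) -> float: ...


class DirectStrategy:
    """Call the grader once and return its score."""

    def run(self, grader: Grader, sample: str) -> float:
        return grader(sample)


class AverageOfNStrategy:
    """Call the grader n times and average, e.g. to smooth a noisy judge."""

    def __init__(self, n: int) -> None:
        self.n = n

    def run(self, grader: Grader, sample: str) -> float:
        return mean(grader(sample) for _ in range(self.n))


class LocalExecutor:
    """Decides WHERE the work runs: here, a local thread pool."""

    def __init__(self, max_workers: int = 4) -> None:
        self.max_workers = max_workers

    def map(
        self, strategy: EvaluationStrategy, grader: Grader, samples: Sequence[str]
    ) -> list[float]:
        with ThreadPoolExecutor(max_workers=self.max_workers) as pool:
            # pool.map preserves the order of the input samples.
            return list(pool.map(lambda s: strategy.run(grader, s), samples))


def length_grader(sample: str) -> float:
    """Toy grader: longer answers score higher, capped at 1.0."""
    return min(len(sample) / 10, 1.0)


scores = LocalExecutor().map(
    DirectStrategy(), length_grader, ["short", "a much longer answer"]
)
print(scores)  # [0.5, 1.0]
```

With this kind of split, swapping `DirectStrategy()` for `AverageOfNStrategy(3)` changes how each sample is scored without touching the executor, and vice versa.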
🛠️ Refactoring & Optimization
- XML Prompt Standardization: Standardized prompt templates into XML tag formats for General, Agent, and Multimodal graders to improve LLM parsing reliability (#105, #108, #112).
- Reasoning-First Scoring: Reordered JSON schemas so that models output the Reasoning before the Score, which improves scoring stability and Chain-of-Thought (CoT) performance (#113).
- Pipeline Decoupling: Refactored Graders and Runners to align with the new Strategy/Executor framework (#99, #100).
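The two prompt-related changes above can be pictured with a small sketch. Everything in it is an assumption for illustration: the tag names, `PROMPT_TEMPLATE`, `RESPONSE_SCHEMA`, `build_prompt`, and `parse_judgement` are hypothetical and are not the templates or schemas shipped in OpenJudge. The sketch wraps each prompt section in XML-style tags and lists `reasoning` before `score` in the output schema, so a model that generates keys in order writes its justification first.

```python
import json

# Hypothetical XML-tagged prompt: each section is delimited by its own tag pair.
PROMPT_TEMPLATE = """<task>
Rate how well the response answers the query on a 1-5 scale.
</task>
<query>
{query}
</query>
<response>
{response}
</response>
<output_format>
Return JSON with the keys in this order: "reasoning", then "score".
</output_format>"""

# Hypothetical output schema. Python dicts keep insertion order, so
# "reasoning" is serialized (and shown to the model) ahead of "score".
RESPONSE_SCHEMA = {
    "type": "object",
    "properties": {
        "reasoning": {"type": "string"},
        "score": {"type": "integer", "minimum": 1, "maximum": 5},
    },
    "required": ["reasoning", "score"],
}


def build_prompt(query: str, response: str) -> str:
    """Fill the tagged template with one query/response pair."""
    return PROMPT_TEMPLATE.format(query=query, response=response)


def parse_judgement(raw: str) -> tuple[str, int]:
    """Parse the judge's JSON reply into (reasoning, score)."""
    data = json.loads(raw)
    return data["reasoning"], int(data["score"])


print(build_prompt("What is 2+2?", "4"))
print(json.dumps(RESPONSE_SCHEMA, indent=2))
print(parse_judgement('{"reasoning": "Correct.", "score": 5}'))
```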
🐞 Bug Fixes
- Environment Stability: Fixed and pinned version requirements for openai and streamlit (#79, #90).
- Paper Review Pipeline: Added dynamic date injection, fixed score display issues, and added GraderError handling (#88, #91).
- UI Rendering: Resolved issues with history result rendering (#94).
- Logic Fixes: Corrected and formatted the Trajectory Accuracy grader logic (#107, #104).
📝 Documentation & Deployment
- Docker: Added comprehensive Docker installation guides for both OpenJudge and the training environment (#84, #111).
- Technical Docs: Updated documentation regarding Strategy, Executor, and system integration (#98).
- Project News: Updated README with the latest features including Paper Review and Auto Arena (#89).
Detailed Changelog
- Fix openai version requirement. by @weizhang25 in #79
- feat: add paper review cookbook for academic paper analysis by @XiaoBoAI in #87
- fix(paper_review): add dynamic date injection and fix score display by @XiaoBoAI in #88
- feat(ui): Add Streamlit-based grader evaluation UI by @XiaoBoAI in #71
- Fix/pin streamlit version by @XiaoBoAI in #90
- docs: add Docker installation guide for OpenJudge and training enviro… by @XieLipeng0830 in #84
- fix(paper_review): add GraderError handling in pipeline by @XiaoBoAI in #91
- docs(readme): add news for Paper Review, OpenJudge UI, and Auto Arena by @XiaoBoAI in #89
- feat: Update LLMGrader and schema by @jc200808 in #80
- fix: history result ui render by @weidankong in #94
- feat(ui): Add Auto Rubric feature for automatic grading criteria gene… by @XiaoBoAI in #92
- Feat/UI paper review by @XiaoBoAI in #101
- Feature/add trajectory accuracy grader by @helloml0326 in #102
- feature (executor): implement executor patterns by @ployts in #97
- format trajectory_accuracy_grader by @helloml0326 in #104
- feat(ui): improve Grader Generator feature with better UX by @XieLipeng0830 in #103
- refactor(graders): standardize prompt template format for common graders by @XiaoBoAI in #105
- fix trajectory_accuracy_grader by @helloml0326 in #107
- Chore/update dependencies by @XiaoBoAI in #106
- refactor(graders/agent): standardize prompts to XML tag format by @XieLipeng0830 in #108
- refactor(ui): simplify theme styles and improve sidebar layout by @XiaoBoAI in #109
- feat(ui): add single evaluation logging for analytics by @XiaoBoAI in #110
- feature(strategy): implement evaluation strategy by @ployts in #96
- feat: update docker file for judge model post training by @jc200808 in #111
- refactor(grader): refactor graders for strategy/executor by @ployts in #99
- refactor(runner): runner for grader&executor by @ployts in #100
- docs (strategy/executor/integration): update related docs for new version by @ployts in #98
- feat: add grader evaluation experiments on datasets by @jc200808 in #95
- refactor(graders): reorder JSON schema to reason before score by @XiaoBoAI in #113
- feat: add finance graders for stock/event/industry/macro analysis by @XieLipeng0830 in #117
- feat: Add multi-turn conversation graders with comprehensive evaluati… by @XieLipeng0830 in #114
- refactor(graders/multimodal): standardize prompts to XML tag format by @XiaoBoAI in #112
- Feat/ref hallucination arena by @XiaoBoAI in #120
- feat: add AgenticGrader and SearchCorrectnessGrader with tool support by @XieLipeng0830 in #82
Full Changelog: v0.2.1...v0.2.2
v0.2.1
Changelog
Integrations & Ecosystem
- VERL Integration: Added an integration guide for VERL with async reward evaluation support.
- Observability: Added documentation and cookbooks for Langfuse and LangSmith integrations.
Grader Improvements
- Standardized Scoring: Adjusted the score range for multimodal graders from [0, 1] to a 1-5 scale for better granularity.
- Tool Call Evaluation:
  - Added ToolCallPrecisionRecallMatchGrader.
  - Renamed ToolCallSequenceMatchGrader to ToolCallStepSequenceMatchGrader and added a metric_type parameter.
- Terminology Alignment: Renamed "Reward Model" to "Judge Model" across documentation and code to better reflect its role in the evaluation ecosystem.
- Refining Logic: Standardized parameter naming and response parsing, and improved streaming support across various graders.
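These notes do not describe how ToolCallPrecisionRecallMatchGrader computes its metrics. As a point of reference only, the sketch below shows the standard definitions of precision and recall applied to tool calls, treating each call as a hashable `(name, arguments)` pair. The function `tool_call_precision_recall` and the input format are hypothetical and may differ from the grader's actual matching rules.

```python
from collections import Counter

# Hypothetical representation: a tool call is (tool name, serialized arguments).
ToolCall = tuple[str, str]


def tool_call_precision_recall(
    predicted: list[ToolCall], expected: list[ToolCall]
) -> tuple[float, float]:
    """Return (precision, recall) over multisets of tool calls.

    precision = matched calls / predicted calls
    recall    = matched calls / expected calls
    Duplicates count separately; call order is ignored.
    """
    pred, exp = Counter(predicted), Counter(expected)
    # Multiset intersection: each call matches at most as often as it
    # appears on both sides.
    matched = sum((pred & exp).values())
    precision = matched / len(predicted) if predicted else 0.0
    recall = matched / len(expected) if expected else 0.0
    return precision, recall


predicted = [("search", '{"q": "weather"}'), ("calc", '{"expr": "1+1"}')]
expected = [
    ("search", '{"q": "weather"}'),
    ("fetch", '{"url": "example.com"}'),
    ("calc", '{"expr": "1+1"}'),
]
precision, recall = tool_call_precision_recall(predicted, expected)
print(f"precision={precision:.2f} recall={recall:.2f}")  # precision=1.00 recall=0.67
```

Here both predicted calls are correct (precision 1.0), but one expected call is missing (recall 2/3).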
Evaluation Pipeline
- Zero-shot Evaluation Pipeline: Added a comprehensive zero-shot evaluation workflow, including win rate chart generators, report generators, and rerun-judge capabilities.
- GRPO Enhancements: Improved code style for GRPO, added a dedicated report generator, and updated training documentation for judge models.
- Rubric System: Introduced and enhanced Simple Rubric and Iterative Rubric generation and evaluation modules.
- Dataset Updates: Updated ChatRLDataset class and associated logic to better support reinforcement learning workflows.
What's Changed
- Feature/zero shot evaluation by @XiaoBoAI in #28
- docs: add DingTalk community group QR code to README by @helloml0326 in #30
- docs: add docs/integration/langfuse.md by @helloml0326 in #31
- refactor(grpo): improve code style and add report generator by @XiaoBoAI in #37
- feat(integration): add LangSmith integration cookbook and documentation by @ployts in #33
- Minor code refactoring, including None checks and argument name changes. by @weizhang25 in #34
- feat(template): add GitHub PR template and issue templates by @ployts in #36
- Docs/sample reports by @XiaoBoAI in #38
- Rename grader validator file name from base.py to grader_validator.py by @weizhang25 in #41
- A common util method of formatting history for agent graders. by @weizhang25 in #42
- feat: update agent graders by @jc200808 in #35
- docs: add Simple Rubric documentation and rename to Generate Rubrics … by @XieLipeng0830 in #45
- feat: add report generator and update zero-shot evaluation pipeline by @XiaoBoAI in #32
- docs(building_graders): add training reward models guide and update integrations by @XiaoBoAI in #46
- chore(test): remove redundant multimodal graders syntax test by @XiaoBoAI in #43
- feat: update example code and template processing in agent and common graders by @jc200808 in #49
- Use compiled regex objects instead of raw pattern strings. by @weizhang25 in #48
- Docs/rename reward to judge model by @XiaoBoAI in #47
- feat(security): add pre-commit hooks for secret detection by @XiaoBoAI in #50
- refactor(graders): standardize parameter naming and response parsing by @ployts in #44
- docs: update index page with new features by @helloml0326 in #51
- fix(graders): fix typo in code_execution filename and imports by @XiaoBoAI in #54
- Set the name of a customized llm grader. by @weizhang25 in #52
- refactor(graders,models): cleanup and improve code quality by @XiaoBoAI in #55
- Add OpenJudge integration guide for VERL with async reward evaluation. by @chr6192 in #53
- refactor(graders): fix deprecated parameters and improve input validation by @XiaoBoAI in #56
- feat(zero_shot): add win rate chart generator by @XiaoBoAI in #57
- A utility function that collects all grader information. by @weizhang25 in #58
- fix(graders): modify threshold value by @weidankong in #59
- feat: update more graders including multimodal graders by @jc200808 in #60
- refactor(analyzer): update consistency analyzer and agent grader tests by @ployts in #61
- refactor(graders): improve parameter validation and streaming support by @XiaoBoAI in #62
- feat(grader): add metric_type parameter to ToolCallSequenceMatchGrader by @helloml0326 in #64
- fix: fix a typo in a comment by @Wangzy455 in #66
- feat: update ChatRLDataset class and judge model grpo training document by @jc200808 in #67
- Update get_all_grader_info logic. by @weizhang25 in #65
- feat(zero_shot): enhance evaluation pipeline with rerun-judge and cha… by @XiaoBoAI in #63
- chore: migrate links from modelscope to agentscope-ai org by @XieLipeng0830 in #73
- refactor(multimodal): change score range from [0,1] to 1-5 scale by @XiaoBoAI in #75
- feat(iterative_rubric): enhance rubric generation and evaluation by @XieLipeng0830 in #72
- Update grader info util. by @weizhang25 in #74
- docs: improve README with comprehensive examples and update simple ru… by @helloml0326 in #76
- Docs/update verl integration status by @helloml0326 in #77
- feat: update base/openai chat model module by @jc200808 in #70
- Feature/add ToolCallPrecisionRecallMatchGrader & rename: ToolCallSequenceMatchGrader ->ToolCallStepSequenceMatchGrader by @helloml0326 in #68
New Contributors
- @Wangzy455 made their first contribution in #66
Full Changelog: https://github.com/agentscope-ai/OpenJudge/commits/v0.2.1