12 Feb 06:46

33db7c4

v0.2.2 Latest


🚀 OpenJudge v0.2.2 Changelog

This release introduces major UI enhancements, expanded evaluation capabilities for vertical domains (Finance and Academia), and a significant architectural refactor based on the Executor and Strategy patterns. Prompt templates for General, Agent, and Multimodal graders are now standardized on an XML tag format.

✨ New Features

🌐 Online Playground

Explore OpenJudge without writing a single line of code. Our online platform at openjudge.me/app lets you:

Test graders interactively: Select a built-in grader, input your data, and see results instantly.
Build custom rubrics: Use the zero-shot generator to create graders from task descriptions.
View leaderboards: Compare model performance across evaluation benchmarks at openjudge.me/leaderboard.

📊 Domain-Specific & Advanced Evaluation

Finance Domain: Added specialized graders for stock, event, industry, and macro-economic analysis (#117).
Academic Paper Review: Introduced a Paper Review cookbook and a dedicated UI for systematic academic analysis (#87, #101).
Agentic Evaluation: Added AgenticGrader, SearchCorrectnessGrader (with tool-call support), and Trajectory Accuracy graders (#82, #102).
Multi-turn & Arena: Added graders for comprehensive multi-turn conversation evaluation and a new Reference Hallucination Arena (#114, #120).

🖥️ UI & User Experience

Streamlit-based Grader UI: A new interactive interface for evaluating and managing graders (#71).
Auto Rubric: Automatically generate grading criteria to simplify the evaluation setup (#92).
UX Improvements: Redesigned sidebar layout, simplified theme styles, and improved the Grader Generator workflow (#103, #109).
Analytics: Added single evaluation logging for deeper data insights (#110).

⚙️ Core Architecture

Executor & Strategy Patterns: Implemented a more flexible backend architecture to handle diverse evaluation workflows (#97, #96).
Experimentation: Added support for running grader evaluation experiments directly on datasets (#95).
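
As a rough illustration of the Executor/Strategy split described above, the sketch below separates how a single sample is evaluated (a strategy) from how evaluations are run over a dataset (an executor). All names here (`EvaluationStrategy`, `DirectStrategy`, `AverageStrategy`, `SequentialExecutor`, `length_grader`) are hypothetical and are not taken from the OpenJudge codebase; the updated docs (#98) describe the real interfaces.

```python
from statistics import mean
from typing import Callable, Protocol

# Assumption: a grader is just a function from a sample to a numeric score.
Grader = Callable[[dict], float]


class EvaluationStrategy(Protocol):
    """Hypothetical interface: decides how a grader is applied to one sample."""

    def evaluate(self, grader: Grader, sample: dict) -> float: ...


class DirectStrategy:
    """Call the grader once and return its score."""

    def evaluate(self, grader: Grader, sample: dict) -> float:
        return grader(sample)


class AverageStrategy:
    """Call the grader several times and average the scores."""

    def __init__(self, repeats: int = 3) -> None:
        self.repeats = repeats

    def evaluate(self, grader: Grader, sample: dict) -> float:
        return mean(grader(sample) for _ in range(self.repeats))


class SequentialExecutor:
    """Hypothetical executor: runs a strategy over a dataset, one sample at a time."""

    def __init__(self, strategy: EvaluationStrategy) -> None:
        self.strategy = strategy

    def run(self, grader: Grader, dataset: list[dict]) -> list[float]:
        return [self.strategy.evaluate(grader, sample) for sample in dataset]


def length_grader(sample: dict) -> float:
    """Toy grader: the score is the number of words in the response."""
    return float(len(sample["response"].split()))


dataset = [{"response": "a b c"}, {"response": "hello"}]
direct_scores = SequentialExecutor(DirectStrategy()).run(length_grader, dataset)
averaged_scores = SequentialExecutor(AverageStrategy(repeats=3)).run(length_grader, dataset)
```

With this kind of split, swapping `DirectStrategy` for `AverageStrategy` changes how each sample is scored without touching the executor or the grader.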

🛠️ Refactoring & Optimization

XML Prompt Standardization: Standardized prompt templates into XML tag formats for General, Agent, and Multimodal graders to improve LLM parsing reliability (#105, #108, #112).
Reasoning-First Scoring: Reordered JSON schemas so that models provide the reasoning before the score, improving scoring stability and Chain-of-Thought (CoT) performance (#113).
Pipeline Decoupling: Refactored Graders and Runners to align with the new Strategy/Executor framework (#99, #100).
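
To illustrate the two prompt-level changes in this list, here is a minimal sketch of an XML-tagged prompt template paired with a response schema that lists the reasoning field before the score. The tag names, the field names (`reason`, `score`), and the `build_prompt` helper are illustrative assumptions, not the templates or schemas shipped in OpenJudge.

```python
import json
from xml.sax.saxutils import escape

# Hypothetical XML-tagged template; the real OpenJudge templates may use different tags.
PROMPT_TEMPLATE = """<task>
Rate how well the response answers the query on a 1-5 scale.
</task>
<query>
{query}
</query>
<response>
{response}
</response>
<output_format>
{schema}
</output_format>"""

# "reason" is declared before "score" so the model writes its reasoning first.
# Python dicts keep insertion order, and json.dumps preserves it.
RESPONSE_SCHEMA = {
    "type": "object",
    "properties": {
        "reason": {"type": "string"},
        "score": {"type": "integer", "minimum": 1, "maximum": 5},
    },
    "required": ["reason", "score"],
}


def build_prompt(query: str, response: str) -> str:
    """Fill the template, escaping user text so it cannot close a tag early."""
    return PROMPT_TEMPLATE.format(
        query=escape(query),
        response=escape(response),
        schema=json.dumps(RESPONSE_SCHEMA, indent=2),
    )


prompt = build_prompt("What is 2 + 2?", "4 </response>")
```

Escaping the inserted text is a design choice of this sketch: it keeps the tag boundaries unambiguous even when a response contains angle brackets.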

🐞 Bug Fixes

Environment Stability: Fixed and pinned version requirements for openai and streamlit (#79, #90).
Paper Review Pipeline: Added dynamic date injection, fixed score display issues, and added GraderError handling (#88, #91).
UI Rendering: Resolved issues with history result rendering (#94).
Logic Fixes: Corrected and formatted the Trajectory Accuracy grader logic (#107, #104).

📝 Documentation & Deployment

Docker: Added a Docker installation guide for OpenJudge and the training environment (#84) and updated the Dockerfile for judge model post-training (#111).
Technical Docs: Updated documentation regarding Strategy, Executor, and system integration (#98).
Project News: Updated the README with news on the latest features, including Paper Review, OpenJudge UI, and Auto Arena (#89).

Detailed Changelog

Fix openai version requirement. by @weizhang25 in #79
feat: add paper review cookbook for academic paper analysis by @XiaoBoAI in #87
fix(paper_review): add dynamic date injection and fix score display by @XiaoBoAI in #88
feat(ui): Add Streamlit-based grader evaluation UI by @XiaoBoAI in #71
Fix/pin streamlit version by @XiaoBoAI in #90
docs: add Docker installation guide for OpenJudge and training enviro… by @XieLipeng0830 in #84
fix(paper_review): add GraderError handling in pipeline by @XiaoBoAI in #91
docs(readme): add news for Paper Review, OpenJudge UI, and Auto Arena by @XiaoBoAI in #89
feat: Update LLMGrader and schema by @jc200808 in #80
fix: history result ui render by @weidankong in #94
feat(ui): Add Auto Rubric feature for automatic grading criteria gene… by @XiaoBoAI in #92
Feat/UI paper review by @XiaoBoAI in #101
Feature/add trajectory accuracy grader by @helloml0326 in #102
feature (executor): implement executor patterns by @ployts in #97
format trajectory_accuracy_grader by @helloml0326 in #104
feat(ui): improve Grader Generator feature with better UX by @XieLipeng0830 in #103
refactor(graders): standardize prompt template format for common graders by @XiaoBoAI in #105
fix trajectory_accuracy_grader by @helloml0326 in #107
Chore/update dependencies by @XiaoBoAI in #106
refactor(graders/agent): standardize prompts to XML tag format by @XieLipeng0830 in #108
refactor(ui): simplify theme styles and improve sidebar layout by @XiaoBoAI in #109
feat(ui): add single evaluation logging for analytics by @XiaoBoAI in #110
feature(strategy): implement evaluation strategy by @ployts in #96
feat: update docker file for judge model post training by @jc200808 in #111
refactor(grader): refactor graders for strategy/executor by @ployts in #99
refactor(runner): runner for grader&executor by @ployts in #100
docs (strategy/executor/integration): update related docs for new version by @ployts in #98
feat: add grader evaluation experiments on datasets by @jc200808 in #95
refactor(graders): reorder JSON schema to reason before score by @XiaoBoAI in #113
feat: add finance graders for stock/event/industry/macro analysis by @XieLipeng0830 in #117
feat: Add multi-turn conversation graders with comprehensive evaluati… by @XieLipeng0830 in #114
refactor(graders/multimodal): standardize prompts to XML tag format by @XiaoBoAI in #112
Feat/ref hallucination arena by @XiaoBoAI in #120
feat: add AgenticGrader and SearchCorrectnessGrader with tool support by @XieLipeng0830 in #82

Full Changelog: v0.2.1...v0.2.2

Contributors

helloml0326, weidankong, and 5 other contributors


21 Jan 06:34

helloml0326

v0.2.1

595f1fd


Changelog

Integrations & Ecosystem

VERL Integration: Added an integration guide for VERL with async reward evaluation support.
Observability: Added documentation and cookbooks for Langfuse and LangSmith integrations.
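
The idea behind asynchronous reward evaluation can be sketched with plain `asyncio`: score a batch of responses concurrently while capping the number of in-flight judge calls. This is not the VERL or OpenJudge API; `judge_score`, `compute_rewards`, and `max_concurrency` are hypothetical names, and the judge call is simulated with a short sleep.

```python
import asyncio


async def judge_score(prompt: str, response: str) -> float:
    """Placeholder for a call to a judge model (hypothetical, not an OpenJudge API)."""
    await asyncio.sleep(0.01)  # simulate network latency
    return 1.0 if response.strip() else 0.0


async def compute_rewards(batch: list[tuple[str, str]], max_concurrency: int = 4) -> list[float]:
    """Score a batch concurrently while capping the number of in-flight requests."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded(prompt: str, response: str) -> float:
        async with semaphore:
            return await judge_score(prompt, response)

    # gather returns results in the same order as the inputs.
    return await asyncio.gather(*(bounded(p, r) for p, r in batch))


batch = [("q1", "an answer"), ("q2", ""), ("q3", "another answer")]
rewards = asyncio.run(compute_rewards(batch))
```

The integration guide itself remains the reference for how rewards are actually wired into VERL training.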

Grader Improvements

Standardized Scoring: Adjusted the score range for multimodal graders from [0, 1] to a 1-5 scale for better granularity.
Tool Call Evaluation:
- Added ToolCallPrecisionRecallMatchGrader.
- Renamed ToolCallSequenceMatchGrader to ToolCallStepSequenceMatchGrader and added a metric_type parameter.
Terminology Alignment: Renamed "Reward Model" to "Judge Model" across documentation and code to better reflect its role in the evaluation ecosystem.
Logic Refinements: Standardized parameter naming and response parsing, and improved streaming support across various graders.
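
For readers unfamiliar with precision/recall matching over tool calls, the sketch below shows one plausible way to compute it. It compares tool names as multisets and ignores call order and arguments; that matching rule and the `tool_call_precision_recall` function are assumptions for illustration and may differ from what ToolCallPrecisionRecallMatchGrader actually does.

```python
from collections import Counter


def tool_call_precision_recall(predicted: list[str], reference: list[str]) -> tuple[float, float]:
    """Compare tool names as multisets, ignoring order and arguments (an assumption)."""
    pred_counts, ref_counts = Counter(predicted), Counter(reference)
    # Multiset intersection: each reference call can be matched at most once.
    matched = sum((pred_counts & ref_counts).values())
    precision = matched / len(predicted) if predicted else 0.0
    recall = matched / len(reference) if reference else 0.0
    return precision, recall


precision, recall = tool_call_precision_recall(
    predicted=["search", "search", "calculator"],
    reference=["search", "calculator", "summarize", "translate"],
)
```

Here two of the three predicted calls match the reference (precision 2/3), and two of the four reference calls are covered (recall 0.5).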

Evaluation Pipeline

Zero-shot Evaluation Pipeline: Added a comprehensive zero-shot evaluation workflow, including win rate chart generators, report generators, and rerun-judge capabilities.
GRPO Enhancements: Improved code style for GRPO, added a dedicated report generator, and updated training documentation for judge models.
Rubric System: Introduced and enhanced Simple Rubric and Iterative Rubric generation and evaluation modules.
Dataset Updates: Updated ChatRLDataset class and associated logic to better support reinforcement learning workflows.
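
As a small illustration of the win rates that a zero-shot pairwise comparison produces, the sketch below aggregates judgments of the form (model_a, model_b, winner). The `win_rates` function, the judgment format, and the choice to count a tie as half a win for each side are assumptions, not the behavior of the OpenJudge pipeline or its chart generator.

```python
from collections import defaultdict


def win_rates(judgments: list[tuple[str, str, str]]) -> dict[str, float]:
    """Each judgment is (model_a, model_b, winner); winner is a model name or "tie"."""
    wins: dict[str, float] = defaultdict(float)
    games: dict[str, int] = defaultdict(int)
    for model_a, model_b, winner in judgments:
        games[model_a] += 1
        games[model_b] += 1
        if winner == "tie":
            # Assumption: a tie counts as half a win for each side.
            wins[model_a] += 0.5
            wins[model_b] += 0.5
        else:
            wins[winner] += 1.0
    return {model: wins[model] / games[model] for model in games}


judgments = [
    ("model-x", "model-y", "model-x"),
    ("model-x", "model-y", "model-y"),
    ("model-x", "model-z", "model-x"),
    ("model-y", "model-z", "tie"),
]
rates = win_rates(judgments)
```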

What's Changed

Feature/zero shot evaluation by @XiaoBoAI in #28
docs: add DingTalk community group QR code to README by @helloml0326 in #30
docs: add docs/integration/langfuse.md by @helloml0326 in #31
refactor(grpo): improve code style and add report generator by @XiaoBoAI in #37
feat(integration): add LangSmith integration cookbook and documentation by @ployts in #33
Minor code refactoring, including None checks and argument name changes. by @weizhang25 in #34
feat(template): add GitHub PR template and issue templates by @ployts in #36
Docs/sample reports by @XiaoBoAI in #38
Rename grader validator file name from base.py to grader_validator.py by @weizhang25 in #41
A common util method of formatting history for agent graders. by @weizhang25 in #42
feat: update agent graders by @jc200808 in #35
docs: add Simple Rubric documentation and rename to Generate Rubrics … by @XieLipeng0830 in #45
feat: add report generator and update zero-shot evaluation pipeline by @XiaoBoAI in #32
docs(building_graders): add training reward models guide and update integrations by @XiaoBoAI in #46
chore(test): remove redundant multimodal graders syntax test by @XiaoBoAI in #43
feat: update example code and template processing in agent and common graders by @jc200808 in #49
Use compiled regex objects instead of raw pattern strings. by @weizhang25 in #48
Docs/rename reward to judge model by @XiaoBoAI in #47
feat(security): add pre-commit hooks for secret detection by @XiaoBoAI in #50
refactor(graders): standardize parameter naming and response parsing by @ployts in #44
docs: update index page with new features by @helloml0326 in #51
fix(graders): fix typo in code_execution filename and imports by @XiaoBoAI in #54
Set the name of a customized llm grader. by @weizhang25 in #52
refactor(graders,models): cleanup and improve code quality by @XiaoBoAI in #55
Add OpenJudge integration guide for VERL with async reward evaluation. by @chr6192 in #53
refactor(graders): fix deprecated parameters and improve input validation by @XiaoBoAI in #56
feat(zero_shot): add win rate chart generator by @XiaoBoAI in #57
A utility function that collects all grader information. by @weizhang25 in #58
fix(graders): modify threshold value by @weidankong in #59
feat: update more graders including multimodal graders by @jc200808 in #60
refactor(analyzer): update consistency analyzer and agent grader tests by @ployts in #61
refactor(graders): improve parameter validation and streaming support by @XiaoBoAI in #62
feat(grader): add metric_type parameter to ToolCallSequenceMatchGrader by @helloml0326 in #64
fix: fix a typo in a comment by @Wangzy455 in #66
feat: update ChatRLDataset class and judge model grpo training document by @jc200808 in #67
Update get_all_grader_info logic. by @weizhang25 in #65
feat(zero_shot): enhance evaluation pipeline with rerun-judge and cha… by @XiaoBoAI in #63
chore: migrate links from modelscope to agentscope-ai org by @XieLipeng0830 in #73
refactor(multimodal): change score range from [0,1] to 1-5 scale by @XiaoBoAI in #75
feat(iterative_rubric): enhance rubric generation and evaluation by @XieLipeng0830 in #72
Update grader info util. by @weizhang25 in #74
docs: improve README with comprehensive examples and update simple ru… by @helloml0326 in #76
Docs/update verl integration status by @helloml0326 in #77
feat: update base/openai chat model module by @jc200808 in #70
Feature/add ToolCallPrecisionRecallMatchGrader & rename: ToolCallSequenceMatchGrader ->ToolCallStepSequenceMatchGrader by @helloml0326 in #68

New Contributors

@Wangzy455 made their first contribution in #66

Full Changelog: https://github.com/agentscope-ai/OpenJudge/commits/v0.2.1

Contributors

chr6192, helloml0326, and 7 other contributors
