Commit d20cce0
Deep Finance Update with New Judge (#7)
* feat(finworld): Added AgentScope learning protocol and OpenJudge evaluation functionality to the FinWorld task.
- Added the ExampleAgentScopeLearnProtocol class to implement the AgentScope execution flow for multi-turn interactions.
- Integrated semaphore control to manage the parallelism of environment calls, improving environment stepping performance.
- Implemented a mechanism for detecting context overflows and quickly terminating during environment interactions to prevent blocking.
- Added a finworld.yaml configuration file to define project training and rollout parameters.
- Added the FinWorldJudgeByOpenJudge class, integrating multiple evaluators including RM Gallery and OpenJudge (@Haoran).
- Implemented task output conversion, asynchronous calls, and retries to keep evaluation stable.
- Normalized evaluator weights to balance each evaluator's contribution, then merged the scores to compute the final reward and the success determination.
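
The weight normalization described in the last bullet can be sketched as below. This is a minimal illustration only, not the code in FinWorldJudgeByOpenJudge: the function name `merge_evaluator_scores`, the `success_threshold` parameter, and the assumption that every evaluator score lies in [0, 1] are all hypothetical.

```python
from typing import Dict, Tuple


def merge_evaluator_scores(
    scores: Dict[str, float],
    weights: Dict[str, float],
    success_threshold: float = 0.5,  # hypothetical cutoff, not taken from the commit
) -> Tuple[float, bool]:
    """Merge per-evaluator scores into one reward using normalized weights."""
    # Keep only evaluators that produced a score and carry a positive weight.
    active = {name: w for name, w in weights.items() if name in scores and w > 0}
    total = sum(active.values())
    if total == 0:
        # No usable evaluator: fall back to a zero reward.
        return 0.0, False
    # Dividing by the total makes the active weights sum to 1.
    reward = sum(scores[name] * (w / total) for name, w in active.items())
    return reward, reward >= success_threshold
```

Normalizing over the active evaluators only means a disabled or failed evaluator does not silently drag the reward down.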
* Precommit fix (#4)
* fix end of files
* autoflake import fix
* add mypy check
* fix test bench import
* refactor(finworld): Replace agent protocol and unify configuration updates
- Renamed ExampleAgentScopeLearnProtocol to ExampleDeepResearchProtocol and modified the execute method signature.
- Unified the parameter name of the model tuner to `tuner` and its related attribute references.
- Optimized the multi-turn interaction step configuration, changing it to use `tuner.config.ajet.rollout.multi_turn.max_steps`.
- Modified the context overflow judgment logic to prevent tool call blocking.
- Updated the finworld.yaml configuration, replacing astune with ajet-related configurations, and adjusted the workflow protocol and environment parameters.
- Modified the default environment variable values and log saving paths in finworld_judge.py.
- Added and improved multi-machine and single-machine startup scripts, supporting dynamic generation of MCP configuration and environment variable loading.
- Added the finworld_single.yaml template to adapt to single-machine training configurations.
- Adjusted the key reference for multi-turn step configuration in ma_deepresearch.py, using the ajet configuration path.
* feat(finworld): Added FinWorld training environment configuration scripts and templates
- Added bash startup scripts for multi-machine, multi-GPU training, supporting dynamic configuration generation and environment variable import.
- Implemented training configuration file templates, supporting automatic injection of various weight parameters and model paths.
- Adjusted the default request timeout of EnvClient from 30 seconds to 300 seconds to accommodate long training requests.
- Added a new finworld example directory and related documentation, improving the example project structure.
* refactor(utils): Remove unused extract and compute functions `extract_tool_stats_from_cmts`
* refactor(finworld): Replace the old model with OpenJudge, update evaluation configuration and scripts
- Replaced model initialization in FinWorldJudgeByOpenJudge with the `_init_openjudge_model` method
- Read Judge model parameters from the configuration file first, using environment variables as a fallback
- Optimized RM Gallery initialization, using configuration-first logic, and improved exception stack trace printing
- Cleaned up and removed the old `_init_model` singleton method and related code
- Updated the example startup script `ajet_finworld.sh`, adding OPENJUDGE_LLM and RM_LLM configurations
- Modified YAML templates and configuration files to unify the structure and field naming of Judge configuration items
- Deleted the outdated `cc_rm4_res2cit2fai2_30b.sh` script
- Adjusted the `env_service` startup path to improve environment activation compatibility
- Adjusted script log output format and content to enhance the clarity of configuration parameter printing
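
The configuration-first lookup with an environment variable fallback, as described in the bullets above, might look like the following sketch. The helper name `resolve_judge_setting` and the flat dictionary layout are assumptions made for illustration; only the `OPENJUDGE_LLM` and `RM_LLM` names come from this commit message.

```python
import os
from typing import Any, Mapping, Optional


def resolve_judge_setting(
    config: Mapping[str, Any],
    key: str,
    env_var: str,
    default: Optional[str] = None,
) -> Optional[str]:
    """Return the value from the config if set, else from the environment."""
    value = config.get(key)
    if value not in (None, ""):
        # The configuration file wins whenever it provides a non-empty value.
        return str(value)
    # Otherwise fall back to the environment variable, then to the default.
    return os.environ.get(env_var, default)


# Hypothetical usage: the judge section of the YAML loaded as a plain dict.
judge_cfg = {"openjudge_llm": "", "rm_llm": "my-rm-model"}
openjudge_model = resolve_judge_setting(judge_cfg, "openjudge_llm", "OPENJUDGE_LLM")
rm_model = resolve_judge_setting(judge_cfg, "rm_llm", "RM_LLM")
```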
* feat(task_reader): Support data reading of type jsonl_with_env_service
- Added the jsonl_with_env_service type, which allows loading data from jsonl files while calling tools via env_service.
- Extended ResourceKeeper to handle the creation and release logic of environment instances for jsonl_with_env_service.
- Maintained the env_service type logic, calling create_instance to register instances and initializing them using init_messages from the jsonl file.
- Added an example protocol, ExampleDeepResearchProtocol, to implement multi-turn interaction and environment call coordination.
- Provided training scripts and YAML configuration templates for finworld, supporting the jsonl_with_env_service mode training environment.
- Optimized scripts to support multi-node multi-GPU training, including environment variables and Ray cluster configuration.
* feat(core): add finworld task reader support to framework
* feat(finworld): implement specialized data reader and openjudge-based grading logic
* refactor(finworld): optimize configuration templates and prompt engineering
* chore(finworld): update launch scripts and add variant experiment scripts
* feat(finworld): Added support for multi-machine, multi-GPU training scripts and configuration templates
* chore(git): ignore finworld/yaml/*
* fix(metrics): Fix and enhance the compatibility and debugging output of the metrics update logic
- Modified the `update_metrics` function, adding a `prefix` parameter to distinguish between training and validation metrics.
- Adjusted the data source for extracting `reward_stats` and `tool_stats`, migrating from `workflow_metadata` to `log_metrics`.
- Added debug printing to output the `log_metrics` content and metric key names at key steps for easier troubleshooting.
- Used the appropriate prefix when calling `update_metrics` in `trainer_verl.py`, and added multiple debug prints.
- Modified `WorkflowOutput` to place `tool_stats` and `reward_stats` into the `log_metrics` field.
- Removed redundant and deprecated code for extracting `reward_stats` and calculation functions.
- Added debug information output to the `finworld` and `finworld_judge` modules to track log metrics and scoring data.
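
The idea of a `prefix` argument that separates training and validation metrics, with `reward_stats` and `tool_stats` read from `log_metrics`, can be illustrated as below. This is a sketch under stated assumptions, not the real `update_metrics`: the function name `update_metrics_sketch`, the `prefix + group + "/" + name` key format, and averaging across samples are all hypothetical choices.

```python
from typing import Any, Dict, List


def update_metrics_sketch(
    log_metrics_list: List[Dict[str, Any]],
    metrics: Dict[str, float],
    prefix: str = "",
) -> Dict[str, float]:
    """Average numeric reward/tool stats over samples and store them under a prefix."""
    sums: Dict[str, float] = {}
    counts: Dict[str, int] = {}
    for log_metrics in log_metrics_list:
        for group in ("reward_stats", "tool_stats"):
            for name, value in log_metrics.get(group, {}).items():
                # Skip booleans and non-numeric entries.
                if isinstance(value, bool) or not isinstance(value, (int, float)):
                    continue
                key = f"{prefix}{group}/{name}"
                sums[key] = sums.get(key, 0.0) + float(value)
                counts[key] = counts.get(key, 0) + 1
    for key, total in sums.items():
        metrics[key] = total / counts[key]
    return metrics
```

Calling it once with `prefix="train/"` and once with `prefix="val/"` keeps the two sets of keys from overwriting each other in the same metrics dictionary.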
* fix(metrics): Remove debug prints and synchronize reward statistics
- Removed debug print statements before and after the `update_metrics` call in `trainer_verl.py`
- Removed debug print statements related to the `log_metrics` key in `finworld.py`
- Removed debug print statements before updating `metadata_stats` in `finworld_judge.py`
- Added logic in `general_runner.py` to synchronize `reward_stats` from `metadata` to `log_metrics` after the judge calculation
- Cleaned up debug print statements within `update_metrics` in `metric_helper`, improving code readability.
* chore: "Stop tracking existing yaml files in tutorial directory"
* fix(task_runner): Synchronize reward_stats to log_metrics
* feat(tutorial): Added FinWorld multi-machine multi-GPU training startup script
* refactor(script): Refactored the finworld training script, integrating configuration and startup processes.
* Refactor(deep_finance): Replace and remove finworld-related implementations
- Switched the example directory from example_finworld to example_deep_finance
- Modified startup parameters and logic to support deep_finance, replacing the finworld option
- Replaced finworld_reader with deep_finance_reader in the task reader
- Adjusted environment client configuration in resource management, using deep_finance instead of finworld-related checks
- Updated reward metric tool documentation to support deep_finance
- Deleted finworld-related configuration files, scripts, code, and evaluation modules, cleaning up leftover files and scripts
- Replaced the keyword "finworld" with "deep_finance" in comments and logs
* refactor(deepfinance): Rename and unify DeepFinance module and config references
- Replace all "finworld" and "deep_finance" names with the unified "deepfinance" format.
- Modify command-line arguments to `--with-deepfinance` for consistency.
- Adjust the class name in `task_reader` from `deep_financeReader` to `DeepFinanceReader`.
- Update the documentation description and file name of the `metric_helper` module to DeepFinance.
- Modify environment variables and configuration paths in the example script `deep_finance.sh` to use the `DEEPFINANCE` prefix.
- Update `judge_protocol` to `DeepFinanceJudgeByOpenJudge` in the `deep_finance.yaml` configuration.
- Refactor the `FinWorldJudgeByOpenJudge` class in `deep_finance_judge.py` to `DeepFinanceJudgeByOpenJudge`.
- Rename the `FinworldReader` class in `deep_finance_reader.py` to `DeepFinanceReader`.
- Modify the debug log identifier and corresponding environment variable name to `DEEPFINANCE_DEBUG`.
- Update the evaluation protocol in the `deep_finance_template.yaml` template to `DeepFinanceJudgeByOpenJudge`.
- Ensure that internal references and comments in all modules are updated to use DeepFinance and deepfinance-related names.
* refactor(tutorial): Optimize dynamic generation logic for configuration file paths
* fix(deep_finance): argparse: with-deepfinance
* fix(tutorial): Fixed issues with multi-machine training environment variable settings
* fix(env): Corrected the assignment logic for reward and info when returning environment state
- Corrected the `env_output` return value structure in `BaseGymEnv` to ensure correct assignment of `reward` and `info` fields.
- Removed `RefJudge` and `StructureJudge` related metric calculations and statistics from `reward_metric_helper`.
- Cleaned up redundant code in `reward_metric_helper`, removing invalid comments and statistical items.
- Modified `save_trajectory_as_json` to always print trajectory saving confirmation information.
- Corrected log comments in `example_deep_finance` to avoid meaningless log output.
- Added the `save_trajectory_as_json_file` configuration item to `deep_finance_template.yaml` to support trajectory saving functionality.
* chore(config): Update example_deep_finance configuration and clean up files
- Added a new ignore rule for config file paths in .gitignore
- Deleted the automatically generated mcp_finance_tool_generated.json file in example_deep_finance
- Refactored the deep_finance.yaml configuration file, adjusting project and experiment names
- Reorganized Judge configuration, clarifying openjudge_llm and rm_llm models
- Optimized model paths and training parameter configurations, adding parallel and batch processing settings
- Adjusted data reading methods and training/validation set path placeholders
- Reduced GPU memory usage ratio for rollout to 0.8
- Updated the default save directory path for the trainer to a placeholder variable
- Cleaned up unused and commented-out code to improve configuration file conciseness
* Refactor(metric): Optimize tool metric calculation and data saving logic
- Corrected the data source field for timeline data used during trajectory saving.
- Removed redundant fields in tool execution time, cache hit rate, and error rate statistics.
- Updated .gitignore to add ignore rules for the example script directory.
- Removed unnecessary debugging information from logs to reduce log noise.
- Adjusted log printing in the multi-round interaction execution process to simplify output content.
- Streamlined log code for environment observation and termination checks to improve code readability.
* fix(metric_helper): fix tool cache metric
* fix a minor bug
* fix(utils): Suppress httpx AsyncClient.aclose() exception warnings
* translate comments to English
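* feat: support a service name prefix
- Add --prefix argument support to the launcher
- Implement the prefix logic in the pty_launch function
- Update the deep_finance.sh script to use the prefix feature
- Allow multiple service instances to run in the same environment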
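* fix: improve MultiAgent message content parsing logic
- Support message content blocks in the tool_result format
- Improve handling of non-text content: continue processing the remaining items instead of skipping the whole message
- Add handling for the tool_use type (skipped, because it is already handled through the tool_calls field)
- Tidy code structure and comments to improve readability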
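
A rough sketch of this parsing behavior follows. It is not the MultiAgent implementation: the function name `extract_text` is hypothetical, and the block layout (a `type` key, a `text` field on text blocks, an `output` field on tool_result blocks) is an assumption about the message format.

```python
from typing import Any, List


def extract_text(content: Any) -> str:
    """Flatten message content (a string or a list of blocks) into plain text."""
    if isinstance(content, str):
        return content
    if not isinstance(content, list):
        return ""
    parts: List[str] = []
    for block in content:
        if isinstance(block, str):
            parts.append(block)
            continue
        if not isinstance(block, dict):
            continue  # unknown item: skip it, keep processing the rest
        block_type = block.get("type")
        if block_type == "text":
            parts.append(str(block.get("text", "")))
        elif block_type == "tool_result":
            # Assumed layout: the result sits in "output", as text or nested blocks.
            output = block.get("output", "")
            parts.append(extract_text(output) if isinstance(output, (str, list)) else str(output))
        elif block_type == "tool_use":
            continue  # already represented through the tool_calls field
        # Any other non-text block (e.g. an image) is skipped on its own.
    return "\n".join(part for part in parts if part)
```

The point of the per-item `continue` is that one unsupported block no longer causes the whole message to be dropped.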
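* fix: improve DeepFinance judge logic and configuration
- Fix the tool_stats extraction logic so the data is read correctly from log_metrics
- Add debug output for penalty terms
- Enable tool calls (force_disable_toolcalls: False)
- Ensure reward calculation accuracy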
* chore(deps): bump agentscope from 1.0.7 to 1.0.8
* fix(metric_helper): correct trajectory save path and add tool call metric
- Change trajectory save directory from "ctx_trackers" to "trajectory" to organize files better
- Add recording of tool call counts alongside error rates in tool metrics
- Update experiment suffix in deep finance example script for clearer naming convention
* revise message parsing
* fix(metric_helper): update openjudge graders list in reward metric helper
* feat(deep_finance): replace OpenJudge graders with PresentationQualityGrader
- Remove legacy graders and integrate PresentationQualityGrader and GroundingGrader
- Update grader weights and disable unused graders in config and code
- Simplify grader configuration creation with new mappers for report content and traj
- Refactor DeepFinanceJudgeByOpenJudge to support new grading scheme
- Add PresentationQualityGrader implementation with strict JSON output format
- Include utilities for JSON parsing and validation in presentation quality grader
- Add prompt templates for presentation quality grading criteria and instructions
- Provide example script to run PresentationQualityGrader with OpenAIChatModel
- Add traj_adapter utilities to normalize and extract user query and final report
- Update YAML template to replace old grader weights with presentation quality weight
- Create init files to expose PresentationQualityGrader in judge package
* feat(grounding): implement grounding grader for citation compliance evaluation
- add GroundingGrader class to evaluate citation coverage and truthfulness based on dialogue traj
- provide default OpenAIChatModel creation with deterministic options
- implement prompt construction and JSON parsing utilities for model interaction
- calculate scores including coverage, grounding, and invalid citation penalties
- add detailed json_utils module for strict JSON extraction and validation
- introduce prompt templates defining citation auditing rules and user prompts
- supply reference.py with related grounding evaluation logic and RefJudgeEvaluator class
- create __init__.py to expose GroundingGrader module
- add presentation_quality module __init__.py with PresentationQualityGrader export
* fix(deep_finance_judge): add debug logging for OpenJudge evaluation process
* feat(deep_finance): enhance reward metadata and zero score debugging
- Add populate_reward_metadata_from_stats to copy reward stats into reward metadata
- Populate reward metadata in GeneralRunner if reward_stats present in workflow output
- Refine compute_reward_metrics with updated OpenJudge graders: presentation_quality, grounding, planning
- Add _save_zero_score_debug method in DeepFinanceJudgeByOpenJudge to save debug info for zero grader scores
- Remove deprecated RewardStats usage in deep_finance_judge
- Update judge __init__ to export GroundingGrader alongside PresentationQualityGrader
- Clean up debug print statements and logging in deep_finance_judge.py
- Update .gitignore to exclude prepare_data and judge/analytical_sufficiency folders in example_deep_finance tutorial
* feat(presentation_quality): upgrade grading to 1/3/5 scoring system with markdown cleanup
- Add function to strip markdown code block fences in grounding and presentation_quality modules
- Change presentation quality grader to score each of 8 criteria on a 1/3/5 scale instead of pass/fail
- Normalize total score by dividing sum of item scores by max (40), improving granularity
- Update reasoning output to list lowest scoring items with notes for focused feedback
- Revise presentation quality prompt to reflect new 1/3/5 scoring rubric with detailed instructions
- Adjust JSON output schema accordingly, replacing boolean pass with numeric score fields
- Add get_score utility in JSON utils to extract and validate scores from graded items
- Clean report input by removing markdown fences before grading to avoid markup noise
- Add grounding weight configuration in YAML template for improved modular judge weighting
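
The 1/3/5 scoring and the fence cleanup described above can be sketched as follows. The 8 criteria, the 1/3/5 scale, and the division by 40 come from the bullets; the function names, the criterion keys, and the choice to drop only the fence lines while keeping the fenced content are assumptions, not the PresentationQualityGrader code.

```python
from typing import Dict

FENCE = "`" * 3           # markdown code fence marker
ALLOWED_SCORES = {1, 3, 5}
NUM_CRITERIA = 8
MAX_TOTAL = NUM_CRITERIA * 5  # 40


def strip_markdown_fences(text: str) -> str:
    """Drop fence lines (with or without a language tag), keeping the content inside."""
    kept = [line for line in text.splitlines() if not line.strip().startswith(FENCE)]
    return "\n".join(kept)


def normalize_presentation_score(item_scores: Dict[str, int]) -> float:
    """Sum the eight 1/3/5 item scores and divide by the maximum of 40."""
    if len(item_scores) != NUM_CRITERIA:
        raise ValueError(f"expected {NUM_CRITERIA} criteria, got {len(item_scores)}")
    for name, score in item_scores.items():
        if score not in ALLOWED_SCORES:
            raise ValueError(f"criterion {name!r} has invalid score {score!r}")
    return sum(item_scores.values()) / MAX_TOTAL
```

With this scheme the lowest possible normalized score is 8/40 = 0.2 rather than 0, which is worth keeping in mind when setting weights or thresholds.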
* chore(config): update experiment suffix, prefix and reward weights in deep_finance.sh
* fix(deep_finance): update environment variables and training launch options
* chore(config): parameterize deep finance training configuration
* chore(config): update experiment suffix, prefix, and weight parameters
* fix(example_deep_finance): update dynamic config file generation path
* refactor(judge): remove deprecated presentation quality script
---------
Co-authored-by: binary-husky <[email protected]>
Co-authored-by: Qingxu Fu <[email protected]>
Co-authored-by: qingxu.fu <[email protected]>
1 parent df4a593 commit d20cce0
File tree
21 files changed, +1686 -210 lines changed
- ajet
  - context_tracker
  - task_runner
  - utils/metric_helper
- tutorial/example_deep_finance
  - judge
    - grounding
    - presentation_quality
  - yaml_template