Commit 1778b8c

fix: refine prompt; runner focus on low hanging fruit (#1076)
* fix: refine prompts formatting and logic across data science scenarios
* Make runner evaluator more detailed
* feat: switch to system_debugger prompt and print early stopping stats
1 parent 0c9f193 commit 1778b8c

File tree

5 files changed: +29 / -15 lines changed


rdagent/scenarios/data_science/dev/runner/__init__.py

Lines changed: 1 addition & 1 deletion

```diff
@@ -56,7 +56,7 @@ def implement_one_task(
         else:
             task_information_str = target_task.get_task_information()
         # 1. code
-        system_prompt = T(".prompts:DSCoSTEER.system_refine").r(
+        system_prompt = T(".prompts:DSCoSTEER.system_debugger").r(
             task_desc=task_information_str,
             out_spec=PythonBatchEditOut.get_spec(with_del=False),
         )
```
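The change above only swaps the template key passed to `T(...)`, which looks a prompt up by dotted path and renders it with the keyword arguments given to `.r(...)`. A minimal stdlib-only sketch of that lookup-and-render pattern (the `PROMPTS` dict and `render_prompt` helper are illustrative stand-ins, not rdagent's actual implementation):

```python
import re

# Illustrative stand-in for prompts.yaml contents (hypothetical text).
PROMPTS = {
    "DSCoSTEER": {
        "system_debugger": (
            "You are debugging the task: {{ task_desc }}\n"
            "Output spec: {{ out_spec }}"
        ),
    },
}

def render_prompt(key: str, **variables) -> str:
    """Walk a dotted key into the prompt tree, then substitute
    {{ placeholder }} occurrences, mimicking T(key).r(**variables)."""
    node = PROMPTS
    for part in key.split("."):
        node = node[part]
    return re.sub(
        r"\{\{\s*(\w+)\s*\}\}",
        lambda m: str(variables[m.group(1)]),
        node,
    )

print(render_prompt(
    "DSCoSTEER.system_debugger",
    task_desc="fit a baseline model",
    out_spec="batch-edit JSON",
))
```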

rdagent/scenarios/data_science/dev/runner/prompts.yaml

Lines changed: 20 additions & 5 deletions

```diff
@@ -2,7 +2,7 @@ DSCoSTEER_eval:
   system: |-
     You are a data scientist responsible for evaluating all the code.

-    ## Task Description
+    ## Target Task Description
     The user is trying to build a data science solution in the following scenario:
     {{ scenario }}

@@ -18,18 +18,33 @@ DSCoSTEER_eval:
     - Model training
     - Ensembling

+    ## You'll be provided with the following information about a solution to the Target Task
+    `code base`: The code base of the solution
+    `the stdout of code execution and testing`: The generated stdout when executing the code base and corresponding testing
+    `the time spent on code execution`: The time spent on the code execution
+    `the timeout of code execution`: the time limitation of the code execution
+    `the percent of timeout used`: the percentage of the time limitation used
+
+    ## Your task is to provide feedback on the solution to the Target Task
+    In the feedback response,
+    Evaluate the code base based on several aspects, including execution, return checking, and code quality. After your evaluation, make a clear decision to either accept or reject the solution in the `final_decision` section.
+
     The user will provide you the time spent on the whole code execution and the timeout of the code execution. You should decide whether the hyperparameter is reasonable based on the time.
-    For example, if the code uses only a small portion of the allowed time, and hyperparameters like `n_estimators` or `epochs` have low values, with early stopping not being triggered and possible signs of underfitting, you should suggest increasing these hyperparameters.
+    For example, if the code uses only a very small portion of the allowed time, and hyperparameters like `n_estimators` or `epochs` have low values, with early stopping not being triggered and possible signs of underfitting, you should suggest increasing these hyperparameters.

     You should also notice other resources utilization hyper-parameters,
     For example, if you are using a GPU with large memory, and the batch size is set very low, you should suggest increasing the batch size if it is not reasonable.

     Please provide your feedback in two key-value pairs:
     "hyperparameter_tuning_decision": <true/false>
     "hyperparameter_tuning_suggestion": <suggestion in plain text for hyperparameter tuning, e.g., increase n_estimators to 1000, increase epochs to 100, increase batch size to 64, give an empty string if decide not to tune the hyperparameter>
-    Notice: You should only suggest the hyperparameter tuning if the code applies early stopping strategy because increasing the training time blindly may lead to overfitting. Once you found the code didn't apply early stopping strategy, you should not suggest to tune the hyperparameter.
-    Your suggestion should be reasonable and include not only the target hyperparameter but also the hyperparameter sets.
-    Once you decide to tune the hyperparameter you should set "final_decision" to false.
+    [Notice]
+    - You should only suggest the hyperparameter tuning if the code applies early stopping strategy because increasing the training time blindly may lead to overfitting. Once you found the code didn't apply early stopping strategy, you should not suggest to tune the hyperparameter.
+    - Your suggestion should be reasonable and include not only the target hyperparameter but also the hyperparameter sets.
+    - Your suggestion should have a strong chance of improving the model's performance. Focus on the most obvious and impactful opportunities for quick improvement by leveraging more training time. Don't explore hyperparameters with low confidence. If there are no obvious and impactful opportunities and the code runs well, please accept it.
+    - Once you decide to tune the hyperparameter you should set "final_decision" to false.
+    [Format]
+    - "hyperparameter_tuning_suggestion" should begin with a clear observation, followed by your suggestion. For example: "[Observation] The maximum number of epochs was reached, but the validation loss is still going down and early stopping was not activated. Only 15% of the allowed time was used. [Suggestion] We recommend increasing epochs to 100 to avoid underfitting and further improve model performance."

     {% if is_sub_enabled %}
     The user will provide you the whole code base, some logs generated during the execution of the whole workflow. Your evaluation scope includes whether the workflow code:
```
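The prompt above pins down a small response contract (`hyperparameter_tuning_decision`, `hyperparameter_tuning_suggestion`, `final_decision`). A hedged sketch of how a caller might sanity-check a parsed feedback dict against that contract (`check_feedback` is a hypothetical helper, not part of rdagent):

```python
def check_feedback(fb: dict) -> list[str]:
    """Return a list of consistency problems in an evaluator feedback dict.

    Contract from the prompt: if tuning is suggested, the suggestion text
    must be non-empty and final_decision must be false; otherwise the
    suggestion should be an empty string.
    """
    problems = []
    if fb.get("hyperparameter_tuning_decision"):
        if not fb.get("hyperparameter_tuning_suggestion"):
            problems.append("tuning requested but suggestion is empty")
        if fb.get("final_decision") is not False:
            problems.append("tuning requested but final_decision is not false")
    elif fb.get("hyperparameter_tuning_suggestion"):
        problems.append("suggestion given although tuning decision is false")
    return problems

# A well-formed tuning response raises no problems.
ok = {
    "hyperparameter_tuning_decision": True,
    "hyperparameter_tuning_suggestion": "[Observation] ... [Suggestion] increase epochs to 100",
    "final_decision": False,
}
print(check_feedback(ok))  # -> []
```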

rdagent/scenarios/data_science/proposal/exp_gen/prompts_v2.yaml

Lines changed: 2 additions & 3 deletions

```diff
@@ -114,6 +114,7 @@ scenario_description: |-
   {% else %}
   ====== Background ======
   {{ background }}
+  {% endif %}

   {% if eda_output is not none %}The following is the output of the exploratory data analysis (EDA) performed on the dataset, You should carefully analyze it to better craft your feature engineering and model training strategies.
   ====== Data Overview (EDA) ======
@@ -130,10 +131,8 @@ scenario_description: |-
   - Ensure your submission is genuine.
   - Do not manipulate data or return values solely to pass preliminary tests, as this will not lead to successful final evaluation.

-  {% endif %}
-
   ====== Evaluation ======
-  {% if not use_raw_description and metric_name %}
+  {% if metric_name %}
   The primary evaluation metric for this task is: **{{ metric_name }}**.
   {% endif %}
   This metric is considered better when it is **{% if metric_direction %}larger{% else %}smaller{% endif %}**.
```

rdagent/scenarios/data_science/scen/prompts.yaml

Lines changed: 4 additions & 6 deletions

```diff
@@ -3,22 +3,20 @@ scenario_description: |-
   ------Background of the scenario------
   {{ raw_description }}

-  The evaluation metrics used is directed as:
-  The metric is better when it is {% if metric_direction %}bigger{% else %}smaller{% endif %}.
-
   {% else %}
-  ------Background of the scenario------
+  ------Background of the scenario------
   {{ background }}

+  {% endif %}
+
   ------ Guidelines for participating in the competition ----
   Before submitting your results, we have numerous tests ready to check your code. Please ensure your submission is genuine and do not manipulate data or return values just to pass the tests, as this will not lead to successful final results.

   ------The expected output & submission format specifications------
   {{ submission_specifications }}

   ------The name of the evaluation metric used------
-  {{ metric_name }}
-  {% endif %}
+  `{{ metric_name }}`

   {% if time_limit %}------The time limit to your code------
   You code running is limit to {{ time_limit }}, after this time limit, your code will be terminated. But remember your main target is to achieve the best performance and you have several times to modify your code. So please be bold to make the best use of all the time limit and don't be too conservative.
```

rdagent/scenarios/data_science/share.yaml

Lines changed: 2 additions & 0 deletions

```diff
@@ -340,3 +340,5 @@ spec:
      - The validation loss (or metric) reaches a predefined threshold indicating sufficient model performance.
      - The validation loss (or metric) remains stable (i.e., does not improve) for a set number of consecutive epochs.
      - Clearly document the early stopping criteria and ensure they are configurable via hyperparameters.
+  5. Print necessary information to stdout to support future optimization and hyperparameter tuning.
+     - If validation data are used, print the early stopping round/step, as well as the training and validation losses during training.
```
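The two new spec lines ask generated training code to print its early-stopping statistics. A minimal sketch of a patience-based loop that emits what the spec asks for, using synthetic per-epoch losses in place of a real training run (all names and values here are illustrative):

```python
def train_with_early_stopping(train_losses, val_losses, patience=3):
    """Stop when the validation loss fails to improve for `patience`
    consecutive epochs; print per-epoch losses and the stopping round,
    as the spec requests. Returns the best epoch index."""
    best_val, best_epoch, waited = float("inf"), -1, 0
    for epoch, (tr, vl) in enumerate(zip(train_losses, val_losses)):
        print(f"epoch={epoch} train_loss={tr:.4f} val_loss={vl:.4f}")
        if vl < best_val:
            best_val, best_epoch, waited = vl, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                print(f"early stopping at epoch {epoch}; "
                      f"best epoch {best_epoch} (val_loss={best_val:.4f})")
                return best_epoch
    print("early stopping not triggered")
    return best_epoch

# Synthetic losses: validation stops improving after epoch 2, so with
# patience=3 the loop stops at epoch 5 and reports best epoch 2.
train_with_early_stopping(
    [0.9, 0.7, 0.5, 0.4, 0.3, 0.25, 0.2],
    [0.8, 0.7, 0.6, 0.62, 0.63, 0.64, 0.65],
)
```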
