Commit 1778b8c

fix: refine prompt; runner focus on low hanging fruit (#1076)
* fix: refine prompts formatting and logic across data science scenarios
* Make runner evaluator more detailed
* feat: switch to system_debugger prompt and print early stopping stats
1 parent 0c9f193 commit 1778b8c

File tree

5 files changed: +29 / -15 lines changed


rdagent/scenarios/data_science/dev/runner/__init__.py

Lines changed: 1 addition & 1 deletion

```diff
@@ -56,7 +56,7 @@ def implement_one_task(
         else:
             task_information_str = target_task.get_task_information()
         # 1. code
-        system_prompt = T(".prompts:DSCoSTEER.system_refine").r(
+        system_prompt = T(".prompts:DSCoSTEER.system_debugger").r(
             task_desc=task_information_str,
             out_spec=PythonBatchEditOut.get_spec(with_del=False),
         )
```
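The change above only swaps the template key passed to `T(...)`, which looks a prompt up by dotted path and renders it with the keyword arguments given to `.r(...)`. A minimal stdlib-only sketch of that lookup-and-render pattern (the `PROMPTS` dict and `render_prompt` helper are illustrative stand-ins, not rdagent's actual implementation):

```python
import re

# Illustrative stand-in for prompts.yaml contents (hypothetical text).
PROMPTS = {
    "DSCoSTEER": {
        "system_debugger": (
            "You are debugging the task: {{ task_desc }}\n"
            "Output spec: {{ out_spec }}"
        ),
    },
}

def render_prompt(key: str, **variables) -> str:
    """Walk a dotted key into the prompt tree, then substitute
    {{ placeholder }} occurrences, mimicking T(key).r(**variables)."""
    node = PROMPTS
    for part in key.split("."):
        node = node[part]
    return re.sub(
        r"\{\{\s*(\w+)\s*\}\}",
        lambda m: str(variables[m.group(1)]),
        node,
    )

print(render_prompt(
    "DSCoSTEER.system_debugger",
    task_desc="fit a baseline model",
    out_spec="batch-edit JSON",
))
```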

rdagent/scenarios/data_science/dev/runner/prompts.yaml

Lines changed: 20 additions & 5 deletions

```diff
@@ -2,7 +2,7 @@ DSCoSTEER_eval:
   system: |-
     You are a data scientist responsible for evaluating all the code.

-    ## Task Description
+    ## Target Task Description
     The user is trying to build a data science solution in the following scenario:
     {{ scenario }}

@@ -18,18 +18,33 @@ DSCoSTEER_eval:
     - Model training
     - Ensembling

+    ## You'll be provided with the following information about a solution to the Target Task
+    `code base`: The code base of the solution
+    `the stdout of code execution and testing`: The generated stdout when executing the code base and corresponding testing
+    `the time spent on code execution`: The time spent on the code execution
+    `the timeout of code execution`: the time limitation of the code execution
+    `the percent of timeout used`: the percentage of the time limitation used
+
+    ## Your task is to provide feedback on the solution to the Target Task
+    In the feedback response,
+    Evaluate the code base based on several aspects, including execution, return checking, and code quality. After your evaluation, make a clear decision to either accept or reject the solution in the `final_decision` section.
+
     The user will provide you the time spent on the whole code execution and the timeout of the code execution. You should decide whether the hyperparameter is reasonable based on the time.
-    For example, if the code uses only a small portion of the allowed time, and hyperparameters like `n_estimators` or `epochs` have low values, with early stopping not being triggered and possible signs of underfitting, you should suggest increasing these hyperparameters.
+    For example, if the code uses only a very small portion of the allowed time, and hyperparameters like `n_estimators` or `epochs` have low values, with early stopping not being triggered and possible signs of underfitting, you should suggest increasing these hyperparameters.

     You should also notice other resources utilization hyper-parameters,
     For example, if you are using a GPU with large memory, and the batch size is set very low, you should suggest increasing the batch size if it is not reasonable.

     Please provide your feedback in two key-value pairs:
     "hyperparameter_tuning_decision": <true/false>
     "hyperparameter_tuning_suggestion": <suggestion in plain text for hyperparameter tuning, e.g., increase n_estimators to 1000, increase epochs to 100, increase batch size to 64, give an empty string if decide not to tune the hyperparameter>
-    Notice: You should only suggest the hyperparameter tuning if the code applies early stopping strategy because increasing the training time blindly may lead to overfitting. Once you found the code didn't apply early stopping strategy, you should not suggest to tune the hyperparameter.
-    Your suggestion should be reasonable and include not only the target hyperparameter but also the hyperparameter sets.
-    Once you decide to tune the hyperparameter you should set "final_decision" to false.
+    [Notice]
+    - You should only suggest the hyperparameter tuning if the code applies early stopping strategy because increasing the training time blindly may lead to overfitting. Once you found the code didn't apply early stopping strategy, you should not suggest to tune the hyperparameter.
+    - Your suggestion should be reasonable and include not only the target hyperparameter but also the hyperparameter sets.
+    - Your suggestion should have a strong chance of improving the model's performance. Focus on the most obvious and impactful opportunities for quick improvement by leveraging more training time. Don't explore hyperparameters with low confidence. If there are no obvious and impactful opportunities and the code runs well, please accept it.
+    - Once you decide to tune the hyperparameter you should set "final_decision" to false.
+    [Format]
+    - "hyperparameter_tuning_suggestion" should begin with a clear observation, followed by your suggestion. For example: "[Observation] The maximum number of epochs was reached, but the validation loss is still going down and early stopping was not activated. Only 15% of the allowed time was used. [Suggestion] We recommend increasing epochs to 100 to avoid underfitting and further improve model performance."

     {% if is_sub_enabled %}
     The user will provide you the whole code base, some logs generated during the execution of the whole workflow. Your evaluation scope includes whether the workflow code:
```
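The prompt above pins down a small response contract (`hyperparameter_tuning_decision`, `hyperparameter_tuning_suggestion`, `final_decision`). A hedged sketch of how a caller might sanity-check a parsed feedback dict against that contract (`check_feedback` is a hypothetical helper, not part of rdagent):

```python
def check_feedback(fb: dict) -> list[str]:
    """Return a list of consistency problems in an evaluator feedback dict.

    Contract from the prompt: if tuning is suggested, the suggestion text
    must be non-empty and final_decision must be false; otherwise the
    suggestion should be an empty string.
    """
    problems = []
    if fb.get("hyperparameter_tuning_decision"):
        if not fb.get("hyperparameter_tuning_suggestion"):
            problems.append("tuning requested but suggestion is empty")
        if fb.get("final_decision") is not False:
            problems.append("tuning requested but final_decision is not false")
    elif fb.get("hyperparameter_tuning_suggestion"):
        problems.append("suggestion given although tuning decision is false")
    return problems

# A well-formed tuning response raises no problems.
ok = {
    "hyperparameter_tuning_decision": True,
    "hyperparameter_tuning_suggestion": "[Observation] ... [Suggestion] increase epochs to 100",
    "final_decision": False,
}
print(check_feedback(ok))  # -> []
```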

rdagent/scenarios/data_science/proposal/exp_gen/prompts_v2.yaml

Lines changed: 2 additions & 3 deletions

```diff
@@ -114,6 +114,7 @@ scenario_description: |-
   {% else %}
   ====== Background ======
   {{ background }}
+  {% endif %}

   {% if eda_output is not none %}The following is the output of the exploratory data analysis (EDA) performed on the dataset, You should carefully analyze it to better craft your feature engineering and model training strategies.
   ====== Data Overview (EDA) ======
@@ -130,10 +131,8 @@ scenario_description: |-
   - Ensure your submission is genuine.
   - Do not manipulate data or return values solely to pass preliminary tests, as this will not lead to successful final evaluation.

-  {% endif %}
-
   ====== Evaluation ======
-  {% if not use_raw_description and metric_name %}
+  {% if metric_name %}
   The primary evaluation metric for this task is: **{{ metric_name }}**.
   {% endif %}
   This metric is considered better when it is **{% if metric_direction %}larger{% else %}smaller{% endif %}**.
```

rdagent/scenarios/data_science/scen/prompts.yaml

Lines changed: 4 additions & 6 deletions

```diff
@@ -3,22 +3,20 @@ scenario_description: |-
   ------Background of the scenario------
   {{ raw_description }}

-  The evaluation metrics used is directed as:
-  The metric is better when it is {% if metric_direction %}bigger{% else %}smaller{% endif %}.
-
   {% else %}
-  ------Background of the scenario------
+  ------Background of the scenario------
   {{ background }}

+  {% endif %}
+
   ------ Guidelines for participating in the competition ----
   Before submitting your results, we have numerous tests ready to check your code. Please ensure your submission is genuine and do not manipulate data or return values just to pass the tests, as this will not lead to successful final results.

   ------The expected output & submission format specifications------
   {{ submission_specifications }}

   ------The name of the evaluation metric used------
-  {{ metric_name }}
-  {% endif %}
+  `{{ metric_name }}`

   {% if time_limit %}------The time limit to your code------
   You code running is limit to {{ time_limit }}, after this time limit, your code will be terminated. But remember your main target is to achieve the best performance and you have several times to modify your code. So please be bold to make the best use of all the time limit and don't be too conservative.
```

rdagent/scenarios/data_science/share.yaml

Lines changed: 2 additions & 0 deletions

```diff
@@ -340,3 +340,5 @@ spec:
      - The validation loss (or metric) reaches a predefined threshold indicating sufficient model performance.
      - The validation loss (or metric) remains stable (i.e., does not improve) for a set number of consecutive epochs.
      - Clearly document the early stopping criteria and ensure they are configurable via hyperparameters.
+  5. Print necessary information to stdout to support future optimization and hyperparameter tuning.
+     - If validation data are used, print the early stopping round/step, as well as the training and validation losses during training.
```
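The two new spec lines ask generated training code to print its early-stopping statistics. A minimal sketch of a patience-based loop that emits what the spec asks for, using synthetic per-epoch losses in place of a real training run (all names and values here are illustrative):

```python
def train_with_early_stopping(train_losses, val_losses, patience=3):
    """Stop when the validation loss fails to improve for `patience`
    consecutive epochs; print per-epoch losses and the stopping round,
    as the spec requests. Returns the best epoch index."""
    best_val, best_epoch, waited = float("inf"), -1, 0
    for epoch, (tr, vl) in enumerate(zip(train_losses, val_losses)):
        print(f"epoch={epoch} train_loss={tr:.4f} val_loss={vl:.4f}")
        if vl < best_val:
            best_val, best_epoch, waited = vl, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                print(f"early stopping at epoch {epoch}; "
                      f"best epoch {best_epoch} (val_loss={best_val:.4f})")
                return best_epoch
    print("early stopping not triggered")
    return best_epoch

# Synthetic losses: validation stops improving after epoch 2, so with
# patience=3 the loop stops at epoch 5 and reports best epoch 2.
train_with_early_stopping(
    [0.9, 0.7, 0.5, 0.4, 0.3, 0.25, 0.2],
    [0.8, 0.7, 0.6, 0.62, 0.63, 0.64, 0.65],
)
```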
