fix: refine prompt; runner focus on low hanging fruit (#1076)
* fix: refine prompt formatting and logic across data science scenarios
* Make runner evaluator more detailed
* feat: switch to system_debugger prompt and print early stopping stats
rdagent/scenarios/data_science/dev/runner/prompts.yaml (20 additions, 5 deletions)

@@ -2,7 +2,7 @@ DSCoSTEER_eval:
   system: |-
     You are a data scientist responsible for evaluating all the code.

-    ## Task Description
+    ## Target Task Description
     The user is trying to build a data science solution in the following scenario:
     {{ scenario }}

@@ -18,18 +18,33 @@ DSCoSTEER_eval:
     - Model training
     - Ensembling

+    ## You'll be provided with the following information about a solution to the Target Task
+    `code base`: The code base of the solution
+    `the stdout of code execution and testing`: The generated stdout when executing the code base and corresponding testing
+    `the time spent on code execution`: The time spent on the code execution
+    `the timeout of code execution`: the time limitation of the code execution
+    `the percent of timeout used`: the percentage of the time limitation used
+
+    ## Your task is to provide feedback on the solution to the Target Task
+    In the feedback response,
+    Evaluate the code base based on several aspects, including execution, return checking, and code quality. After your evaluation, make a clear decision to either accept or reject the solution in the `final_decision` section.
+
     The user will provide you the time spent on the whole code execution and the timeout of the code execution. You should decide whether the hyperparameter is reasonable based on the time.
-    For example, if the code uses only a small portion of the allowed time, and hyperparameters like `n_estimators` or `epochs` have low values, with early stopping not being triggered and possible signs of underfitting, you should suggest increasing these hyperparameters.
+    For example, if the code uses only a very small portion of the allowed time, and hyperparameters like `n_estimators` or `epochs` have low values, with early stopping not being triggered and possible signs of underfitting, you should suggest increasing these hyperparameters.

     You should also notice other resources utilization hyper-parameters,
     For example, if you are using a GPU with large memory, and the batch size is set very low, you should suggest increasing the batch size if it is not reasonable.

     Please provide your feedback in two key-value pairs:
     "hyperparameter_tuning_decision": <true/false>
     "hyperparameter_tuning_suggestion": <suggestion in plain text for hyperparameter tuning, e.g., increase n_estimators to 1000, increase epochs to 100, increase batch size to 64, give an empty string if decide not to tune the hyperparameter>
-    Notice: You should only suggest the hyperparameter tuning if the code applies early stopping strategy because increasing the training time blindly may lead to overfitting. Once you found the code didn't apply early stopping strategy, you should not suggest to tune the hyperparameter.
-    Your suggestion should be reasonable and include not only the target hyperparameter but also the hyperparameter sets.
-    Once you decide to tune the hyperparameter you should set "final_decision" to false.
+    [Notice]
+    - You should only suggest the hyperparameter tuning if the code applies early stopping strategy because increasing the training time blindly may lead to overfitting. Once you found the code didn't apply early stopping strategy, you should not suggest to tune the hyperparameter.
+    - Your suggestion should be reasonable and include not only the target hyperparameter but also the hyperparameter sets.
+    - Your suggestion should have a strong chance of improving the model's performance. Focus on the most obvious and impactful opportunities for quick improvement by leveraging more training time. Don't explore hyperparameters with low confidence. If there are no obvious and impactful opportunities and the code runs well, please accept it.
+    - Once you decide to tune the hyperparameter you should set "final_decision" to false.
+    [Format]
+    - "hyperparameter_tuning_suggestion" should begin with a clear observation, followed by your suggestion. For example: "[Observation] The maximum number of epochs was reached, but the validation loss is still going down and early stopping was not activated. Only 15% of the allowed time was used. [Suggestion] We recommend increasing epochs to 100 to avoid underfitting and further improve model performance."

     {% if is_sub_enabled %}
     The user will provide you the whole code base, some logs generated during the execution of the whole workflow. Your evaluation scope includes whether the workflow code:
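For context on how these feedback keys fit together, here is a minimal sketch of a runner consuming them; `handle_eval_feedback` and the flat JSON shape are illustrative assumptions, not rdagent's actual interface.

```python
# Hypothetical sketch of a runner acting on the evaluator's feedback; the
# function name and feedback shape mirror the prompt above but are NOT
# rdagent's actual API.
import json

def handle_eval_feedback(raw_response: str) -> bool:
    """Return True to accept the solution, False to run another tuning round."""
    feedback = json.loads(raw_response)

    if feedback.get("hyperparameter_tuning_decision"):
        # Per the [Notice] rules, a tuning suggestion forces final_decision to
        # false, so the runner feeds the suggestion into the next iteration.
        print("Tuning requested:", feedback["hyperparameter_tuning_suggestion"])
        return False

    return bool(feedback.get("final_decision"))
```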
rdagent/scenarios/data_science/proposal/exp_gen/prompts_v2.yaml (2 additions, 3 deletions)

@@ -114,6 +114,7 @@ scenario_description: |-
   {% else %}
   ====== Background ======
   {{ background }}
+  {% endif %}

   {% if eda_output is not none %}The following is the output of the exploratory data analysis (EDA) performed on the dataset, You should carefully analyze it to better craft your feature engineering and model training strategies.
   ====== Data Overview (EDA) ======
@@ -130,10 +131,8 @@ scenario_description: |-
   - Ensure your submission is genuine.
   - Do not manipulate data or return values solely to pass preliminary tests, as this will not lead to successful final evaluation.

-  {% endif %}
-
   ====== Evaluation ======
-  {% if not use_raw_description and metric_name %}
+  {% if metric_name %}
   The primary evaluation metric for this task is: **{{ metric_name }}**.
   {% endif %}
   This metric is considered better when it is **{% if metric_direction %}larger{% else %}smaller{% endif %}**.
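A quick way to see the effect of the simplified guard (a sanity check added here, not part of the commit): render the trimmed template with jinja2 and vary the inputs.

```python
# Hypothetical sanity check: the simplified `{% if metric_name %}` guard now
# shows the metric section whenever metric_name is set, independent of
# use_raw_description.
from jinja2 import Template

template = Template(
    "{% if metric_name %}"
    "The primary evaluation metric for this task is: **{{ metric_name }}**."
    "{% endif %}"
)

print(template.render(metric_name="AUC"))        # section rendered
print(repr(template.render(metric_name=None)))   # '' -- section omitted
print(repr(template.render(metric_name="AUC", use_raw_description=True)))  # still rendered
```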
rdagent/scenarios/data_science/scen/prompts.yaml (4 additions, 6 deletions)

@@ -3,22 +3,20 @@ scenario_description: |-
   ------Background of the scenario------
   {{ raw_description }}

-  The evaluation metrics used is directed as:
-  The metric is better when it is {% if metric_direction %}bigger{% else %}smaller{% endif %}.
-
   {% else %}
-  ------Background of the scenario------
+  ------Background of the scenario------
   {{ background }}

+  {% endif %}
+
   ------ Guidelines for participating in the competition ----
   Before submitting your results, we have numerous tests ready to check your code. Please ensure your submission is genuine and do not manipulate data or return values just to pass the tests, as this will not lead to successful final results.

   ------The expected output & submission format specifications------
   {{ submission_specifications }}

   ------The name of the evaluation metric used------
-  {{ metric_name }}
-  {% endif %}
+  `{{ metric_name }}`

   {% if time_limit %}------The time limit to your code------
   You code running is limit to {{ time_limit }}, after this time limit, your code will be terminated. But remember your main target is to achieve the best performance and you have several times to modify your code. So please be bold to make the best use of all the time limit and don't be too conservative.
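Since the evaluator now only proposes tuning when early stopping is in play, and one of the commit's bullets is to print early-stopping stats, here is a generic sketch of the kind of loop and stdout the prompt expects; `run_one_epoch` and all numbers are illustrative stand-ins, not rdagent code.

```python
# Generic illustration (not rdagent's training code) of an early-stopping loop
# that prints the stats the refined evaluator prompt looks for in stdout:
# whether stopping triggered, epochs used, and best validation loss.
import random

def run_one_epoch(epoch: int) -> float:
    # Stand-in for a real train/validate step: the loss plateaus around epoch 20.
    return max(1.0 - 0.05 * epoch, 0.0) + random.uniform(0.0, 0.02)

def train_with_early_stopping(max_epochs: int = 100, patience: int = 5) -> None:
    best_loss, bad_epochs, stopped_at = float("inf"), 0, None
    for epoch in range(1, max_epochs + 1):
        val_loss = run_one_epoch(epoch)
        if val_loss < best_loss:
            best_loss, bad_epochs = val_loss, 0
        else:
            bad_epochs += 1
        if bad_epochs >= patience:
            stopped_at = epoch
            break
    print(
        f"early stopping triggered: {stopped_at is not None}, "
        f"epochs used: {stopped_at or max_epochs}/{max_epochs}, "
        f"best val loss: {best_loss:.4f}"
    )

if __name__ == "__main__":
    train_with_early_stopping()
```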