* change DSCoSTEER_eval prompts
* fallback to better exp only
* fix fallback
* fix and reformat
* fix bug when base_fb is None
* add reasoning to hyperparameter evaluation
* feat: add acceptable assessment in exp_feedback (#1159)
* add time
* refine eval prompt and make the logic of tuning check more clear
* some refinement
* fix CI
* fix a small bug, only consider score in runner
* refine comment
* simplify compare function
---------
Co-authored-by: jingyuanlm <[email protected]>
Co-authored-by: Xu <[email protected]>
Co-authored-by: Jensen Lee <[email protected]>
Co-authored-by: Xu Yang <[email protected]>
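
Several commits above ("fallback to better exp only", "fix bug when base_fb is None", "fix a small bug, only consider score in runner", "simplify compare function") describe a score-only comparison with a safe fallback. The following Python sketch illustrates what such logic might look like; the `Exp` type, `pick_better` name, and the higher-is-better assumption are all hypothetical stand-ins, not RD-Agent's actual API:

    from dataclasses import dataclass
    from typing import Optional

    # Illustrative stand-in; RD-Agent's real Experiment/Feedback types differ.
    @dataclass
    class Exp:
        score: Optional[float] = None

    def pick_better(new: Exp, base: Optional[Exp]) -> Exp:
        # "fix bug when base_fb is None": tolerate a missing baseline
        if base is None or base.score is None:
            return new
        if new.score is None:
            return base
        # "only consider score in runner" / "fallback to better exp only";
        # assumes higher is better, which depends on the competition metric
        return new if new.score > base.score else base

    print(pick_better(Exp(score=0.82), Exp(score=0.79)).score)  # 0.82
    print(pick_better(Exp(score=0.82), None).score)             # 0.82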
rdagent/scenarios/data_science/dev/prompts.yaml (+12, -1)
@@ -64,15 +64,26 @@ exp_feedback:
   - You should provide your feedback based on the current code and SOTA code. Especially focus on the feature engineering part.
   - For example, if the code truncate the line with N words, you can suggest to print the mean, median or quantile of the length of the line for better understanding of the data in the next rounds of experiments.
 
+  Step 6: Overall Acceptability Assessment
+  - Determine the overall acceptability of the experiment based on the comprehensive evaluation from previous steps:
+  - Set `"Acceptable": "yes"` ONLY if ALL of the following conditions are met:
+    * Step 1: Submission format is valid
+    * Step 2: Evaluation methodology is aligned with competition requirements
+    * Step 4: Current code demonstrates clear improvements over SOTA (better practices, efficiency, or interpretability)
+  - Set `"Acceptable": "no"` if ANY of the above conditions fail
+  - This acceptability assessment serves as a final quality gate to ensure only truly valuable experiments are accepted
+
   Provide detailed and constructive feedback structured as follows without anything else:
   {
     "Submission Format Check": "yes or no",
     "First Valid Submission": "yes or no",
     "Code Change Summary": "Clearly summarize the changes made to the code (please cover the most important changes while being concise); during development, extra modifications may be made beyond the intent of the hypothesis, so these changes should also be included to provide complete information",
     "Observations": "Clearly summarize current and SOTA ensemble results with exact scores and notable patterns. Limit to no more than three concise, data-focused sentences. Your observation must be grounded by explicit evidence from scenario description or code implementation, not just validation scores.",
-    "Feedback for Hypothesis": Explicitly confirm or refute the hypothesis based on specific data points or performance trends. Limit to two sentences.",
+    "Feedback for Hypothesis": "Explicitly confirm or refute the hypothesis based on specific data points or performance trends. Limit to two sentences.",
     "Evaluation Aligned With Task": "yes or no",
     "Replace Best Result": "yes or no",
+    "Acceptable": "yes or no",
     "Reasoning": "Clearly explain the reason for success or failure of the experiment. Begin explicitly with [Submission format error], [Evaluation error], [Experiment Analysis] or [Code Analysis] depending on the step at which issues arose. Reference specific scores and methodological differences with SOTA. Limit to three sentences.",
     "EDA Improvement": "improvement suggestion for EDA code, if needed, otherwise set to 'no'. If there is no EDA code, set to 'no'."
stdout+=f"\n### Submission check:\n{submission_check_out}\nIf Submission check returns a 'Submission is valid' or similar message, despite some warning messages, you should still consider the submission as valid and give a positive final decision. "
f"Time spent ratio {time_spent_ratio:.2f} exceeds the limit {DS_RD_SETTING.time_ratio_limit_to_enable_hyperparameter_tuning}, hyperparameter tuning is disabled."
rdagent/scenarios/data_science/dev/runner/prompts.yaml (+33, -13)
@@ -18,35 +18,52 @@ DSCoSTEER_eval:
   The code is focusing on the following task
   {{ task_desc }}
 
-  ## Evaluation Guidelines
+  ## Evaluation Criteria
   1. Evaluate the code base based on several aspects, including execution correctness, return checking, and code quality.
   2. Ensure the code does not contain any incorrect, fabricated, or deceptive operations, such as mocking data, scores, or results.
   3. Confirm that the prediction file (`submission.csv`) is generated using only the test dataset, and its format matches the sample submission. Please refer to Submission check section including the format check to the submission.
-  If the code does not satisfy the requirements:
+  If the code does not satisfy any of the criteria:
   - Set "acceptable" to false.
-  If the code satisfy the requirements:
+  If the code satisfy all the criteria:
   - Set "acceptable" to true.
 
   {% if enable_hyperparameter_tuning_check %}
   # Evaluation 2: Hyperparameter
-  ## Evaluation Description
   The user will provide you the time spent on the whole code execution and the timeout of the code execution. You should decide whether the hyperparameter is reasonable based on the time.
   For example, if the code uses only a very small portion of the allowed time, and hyperparameters like `n_estimators` or `epochs` have low values, with early stopping not being triggered and possible signs of underfitting, you should suggest increasing these hyperparameters.
   You should also notice other resources utilization hyper-parameters.
   For example, if you are using a GPU with large memory, and the batch size is set very low, you should suggest increasing the batch size if it is not reasonable.
 
-  ## Evaluation Guidelines
-  1. The code execution time or resource utilization suggest that there is room for improvement in the hyperparameters.
-  2. The code must apply early stopping strategy already (in order to prevent overfitting).
+  ## Evaluation Criteria
+  1. The code execution time or resource utilization is under-utilized, which suggests that there is room for improvement in the hyperparameter
+  2. The code must already applied early stopping strategy to prevent overfitting and the early stopping was not triggered (otherwise, increasing epochs will be wasted).
   3. Your suggestion should have a strong chance of improving the model's performance. Focus on the most obvious and impactful opportunities for quick improvement by leveraging more training time. Don't explore hyperparameters with low confidence. If there are no obvious and impactful opportunities and the code runs well, please accept it.
   4. Only include the suggestions in your response without leak any time limit information because the user might over-fit the model to the time limit.
   5. Never make your judgment only based on the time spent, you should also consider the code and the stdout.
-  If the code satisfy the requirements:
-  - Set "hyperparameter_tuning_decision" to true.
-  - In "hyperparameter_tuning_suggestion", provide a clear, specific, and actionable suggestion. Begin with a concrete observation, then state a direct action to take. Do not use vague language, options, or uncertainty (avoid words like "A or B"). For example: "[Observation] The maximum number of epochs was reached, but the validation loss is still decreasing and early stopping was not activated. Only small portion of the allowed time was used. [Suggestion] Increase epochs to 100 to avoid underfitting and further improve model performance."
-  If the code does not satisfy the requirements:
+
+  In the "reasoning", provide clear, step-by-step reasoning for your hyperparameter tuning evaluation. Explicitly reference the code, stdout, and resource usage to justify your assessment. Ensure your reasoning checks whether all evaluation criteria are satisfied, and highlight any specific observations that support your decision.
+  If the code does not satisfy any of the criteria:
   - Set "hyperparameter_tuning_decision" to false.
   - Set "hyperparameter_tuning_suggestion" to an empty string.
+  If the code satisfy all the criteria:
+  - Set "hyperparameter_tuning_decision" to true.
+  - In "hyperparameter_tuning_suggestion", provide a clear, specific, and actionable suggestion. Begin with a concrete observation, then state a direct action to take. Do not use vague language, options, or uncertainty (avoid words like "A or B"). For example: "[Observation] The maximum number of epochs was reached, but the validation loss is still decreasing and early stopping was not activated. Only small portion of the allowed time was used. [Suggestion] Increase epochs to 100 to avoid underfitting and further improve model performance."
+
+  ## Hyperparameter Tuning Guidelines
+  1. Task-specific Hyperparameters
+    - NLP: Check `max_len`, model size, learning rate, batch size. Suggest increases only if underfitting or low resource usage.
+    - CV: Check `image_size`, backbone size, batch size, learning rate, augmentation. Suggest increases if results are poor and resources under-used.
+  2. Model Size
+    - If validation accuracy is low or loss is high, suggest increasing model size or layers if resources allow. Add regularization if overfitting.
+  3. Epochs
+    - If early stopping triggered, do not increase epochs. If not triggered and validation improves, suggest more epochs.
+  4. Batch Size
+    - If memory allows and batch size is low, suggest increasing. If OOM errors, suggest reducing.
+  5. Learning Rate
+    - If training is slow/underfitting, suggest increasing. If unstable, suggest decreasing.
+  6. Data Augmentation
+    - For CV/NLP, suggest tuning augmentation if overfitting or poor generalization.
   {% endif %}
 
   ## Output format
@@ -57,8 +74,11 @@ DSCoSTEER_eval:
     "return_checking": "Verify the generated files, particularly the submission file. Ensure that its format is valid",
     "code": "Provide feedback on code quality, readability, and adherence to the given specifications.",
     "acceptable": <true/false: if the solution has passed execution, return_checking, and code verification, then it is a valid solution and acceptable. Otherwise it is not acceptable.>,
-    {% if enable_hyperparameter_tuning_check %}"hyperparameter_tuning_decision": <true/false>,
-    "hyperparameter_tuning_suggestion": <suggestion in plain text for hyperparameter tuning>,{% endif %}
+    {% if enable_hyperparameter_tuning_check %}
+    "reasoning": "Provide step-by-step reasoning for hyperparameter tuning evaluation.",
+    "hyperparameter_tuning_suggestion": <suggestion in plain text for hyperparameter tuning>,