fix: add metric in scores.csv and avoid reading sample_submission.csv (#1152)
* add scores.csv metric name in both task_gen and coder
* a little fix to column names
* small fix
* avoid sample submission read in task_gen
* avoid sample_submission reading in coding
* code change summary bug fix
* little update
* little refinement to eval
* refine coder and runner eval prompts
---------
Co-authored-by: Xu Yang <[email protected]>
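The first commit records the competition metric's name inside `scores.csv` so downstream evaluation knows which metric the validation score refers to. As a rough illustration only (the exact column/index conventions the PR enforces live in the prompt files below; the metric name, score value, and `ensemble` row here are assumptions for the sketch), a pipeline script might write the file like this:

```python
# Hypothetical sketch, not code from this PR: write scores.csv with the
# competition metric as the column name instead of a generic "score" header.
import pandas as pd

metric_name = "rmse"        # assumed: taken from the competition/task description
validation_score = 0.4213   # assumed: computed on the held-out validation split

scores = pd.DataFrame(
    {metric_name: [validation_score]},  # column named after the metric
    index=["ensemble"],                 # assumed index convention for the sketch
)
scores.to_csv("scores.csv")
```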
rdagent/components/coder/data_science/pipeline/prompts.yaml (28 additions, 47 deletions)
@@ -26,7 +26,7 @@ pipeline_coder:
 {% include "scenarios.data_science.share:spec.hyperparameter" %}

 # Specification your code should follow
-{% include "scenarios.data_science.share:component_spec.Pipeline" %}
+{{ spec }}

 {% if queried_former_failed_knowledge|length != 0 %}
 ## Previous Failed Attempts
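This hunk swaps a fixed `{% include %}` target for a `{{ spec }}` template variable, so the caller can inject a task-specific specification. A minimal sketch of the difference, using plain Jinja2 (the repository's own prompt-loading helpers may differ, and `spec_text` is a placeholder):

```python
# Hypothetical illustration: the specification text is now passed in by the
# caller as a template variable rather than pulled from a fixed include.
from jinja2 import Template

prompt_template = Template(
    "# Specification your code should follow\n{{ spec }}\n"
)

spec_text = "..."  # assumed: specification assembled upstream (e.g. by task_gen)
rendered = prompt_template.render(spec=spec_text)
print(rendered)
```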
@@ -112,10 +112,10 @@ pipeline_coder:
 ```
 In debug mode, your code should run faster, so the environment will set a shorter time limit than the standard time limit for your code.
 For example, you can sample ten percent of the training data and run for one epoch, then the full run with ten epochs will take one hundred times the time taken for the debug run. The scale is calculated by yourself depending on the data sampling and epoch number you choose. If your full run enables early stopping, the scale should be smaller considering the early stopping will stop the training earlier than the full epochs.
-Be careful about the train-valid split strategy. StratifiedShuffleSplit is highly risk since the data has some categories with only one sample. If you use StratifiedShuffleSplit, you should consider using a try-except block to catch the error and use a different split strategy if the error occurs. Example code:
+Be careful about the train-valid split strategy. Stratified related split is highly risk since the data has some categories with only one sample. If you use Stratified related split, you should consider using a try-except block to catch the error and use a different split strategy if the error occurs. Example code:
 ```python
 try:
-    fold_indices = StratifiedKFold(...).split(train_X, train_y) or StratifiedShuffleSplit(...).split(train_X, train_y)
+    fold_indices = StratifiedKFold(...).split(train_X, train_y) or StratifiedShuffleSplit or StratifiedSubsetSampler etc.
 except Exception as e:
     fold_indices = KFold(...).split(train_X, train_y) or other split strategy
 ```
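The snippet inside the prompt is pseudocode. A runnable version of the fallback it describes could look like the sketch below; the data shapes, class counts, and fold settings are assumptions chosen so the stratified split genuinely fails (scikit-learn's `StratifiedShuffleSplit` raises when a class has fewer than two members):

```python
# Minimal runnable sketch of the split fallback described in the prompt:
# try a stratified split first, fall back to a plain KFold when a class is
# too small to stratify. Dummy data is used purely for illustration.
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit, KFold

train_X = np.random.rand(101, 5)
train_y = np.array([0] * 100 + [1])  # class 1 has a single sample

try:
    splitter = StratifiedShuffleSplit(n_splits=5, test_size=0.2, random_state=42)
    fold_indices = list(splitter.split(train_X, train_y))  # raises ValueError here
except ValueError as e:
    print(f"Stratified split failed ({e}); falling back to KFold.")
    splitter = KFold(n_splits=5, shuffle=True, random_state=42)
    fold_indices = list(splitter.split(train_X, train_y))

print(f"Prepared {len(fold_indices)} train/validation index pairs.")
```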
@@ -206,10 +206,9 @@ pipeline_eval:
 3. A code implementation and its execution output.
 Your task is to rigorously evaluate the code implementation against the provided scenario and task description, ensuring it meets all requirements, adheres to the specified structure, and executes successfully.

-{% if is_sub_enabled %}
-## Evaluation Steps
+## Evaluation Aspects

-### Step 1: Execution Success
+### Execution Success
 - Goal: Ensure the code executes successfully without any errors.
 - Notes:
 - Model performance is not evaluated in this step; focus solely on successful execution.
@@ -219,22 +218,7 @@ pipeline_eval:
 - If the code does not execute successfully:
 - Set the "final_decision" to false and write complete analysis in the "execution" field.

-### Step 2: Submission File Authenticity and Format
-- Goal: Verify that the code correctly generates the final submission in the expected format and that the submission is authentic.
-- Guidelines:
-- The submission file must strictly match the required structure (correct columns, index format, data types). The index names and column names must be identical to the sample submission.
-- Rigorously verify that the submission file was produced by genuine model inference and successful code execution, not by cheating, fallback or exception-handling mechanisms.
-- The submission must be generated from genuine model predictions using the best saved model—never empty, constant, random, or hard-coded values.
-- Submissions must reflect authentic model outputs; any form of fabrication, cheating, or simulated results is strictly prohibited and grounds for rejection.
-- Cross-check both code logic and stdout to ensure predictions originate from real model inference, not from error recovery or placeholder code paths.
-- Only check the format of the submission since only part of the data is provided; the submission might have a different index than the sample submission data.
-- Verify honest failure reporting if training issues occur.
-- If the code passes this step:
-- Proceed to Step 3.
-- If the code does not pass this step:
-- Set the "final_decision" to false and clearly document the issues in the "return_checking" field.
-
-### Step 3: Competition Alignment
+### Competition Alignment
 - Goal: Confirm strict adherence to the competition's evaluation rules and experimental setup.
 - Guidelines:
 - Analyze whether the experimental setup and code may cause misalignment between validation and test performance.
@@ -251,7 +235,7 @@ pipeline_eval:
 - Begin the "code" with `[Evaluation error]`, explicitly document any evaluation alignment issues causing experiment failure.

 {% if debug_mode %}
-### Step 4: Debug Mode Compliance
+### Debug Mode Compliance
 - Goal: Ensure the code follows debug mode requirements.
 - Guidelines:
 - Sufficient debugging information (print statements, clear error messages) should be included to facilitate automatic improvement processes.
@@ -263,15 +247,31 @@ pipeline_eval:
 - Debug time should be reasonable and the estimated time should be reasonable based on the debug time.
 - Data sampling should only be applied in debug mode. Always use the full data in the full run.
 - The label classes number should be the same as the full run even in debug mode.
-- If the code passes this step: Finalize evaluation.
+- If the code passes this step: Proceed to Next Aspects.
 - If the code does not pass this step: Clearly document the debug mode compliance issues and reject the implementation.{% endif %}

+
+### Submission File Format Check
 {% if mle_check %}
-### Step 5: Test format check
 - The user has done a format check for your submission. Since you didn't sample any test data, your debug mode output should be the same format as the full run.
 - The user will put the check result in the "Submission check" section of the execution output.
 - If the submission check returns a 'Submission is valid' or similar message, despite some warning messages, you should give the conclusion that the code executed successfully. If no other code related issues are found, set the "final_decision" to true.
 - If the submission check returns an error message, you should set the "final_decision" to false and clearly document the issues in the "return_checking" field.
+{% elif is_sub_enabled %}
+- Goal: Verify that the code correctly generates the final submission in the expected format and that the submission is authentic.
+- Guidelines:
+- The submission file must strictly match the required structure (correct columns, index format, data types). The index names and column names must be identical to the format specified in the Competition Information's '====== Submission Format ======' section.
+- Rigorously verify that the submission file was produced by genuine model inference and successful code execution, not by cheating, fallback or exception-handling mechanisms.
+- The submission must be generated from genuine model predictions using the best saved model—never empty, constant, random, or hard-coded values.
+- Submissions must reflect authentic model outputs; any form of fabrication, cheating, or simulated results is strictly prohibited and grounds for rejection.
+- Cross-check both code logic and stdout to ensure predictions originate from real model inference, not from error recovery or placeholder code paths.
+- Only check the format of the submission since only part of the data is provided; the submission might have a different index than expected due to data sampling.
+- Verify honest failure reporting if training issues occur.
+- If the code passes this step, Finalize evaluation.
+- If the code does not pass this step:
+- Set the "final_decision" to false and clearly document the issues in the "return_checking" field.
+{% else %}
+Submission File Format Check is not conducted since no target submission format is provided. You should consider this submission file is valid.
 {% endif %}

 {% if queried_similar_successful_knowledge|length != 0 %}
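The new `{% elif is_sub_enabled %}` branch asks the evaluator to judge the submission's structure and authenticity from the code and stdout. As a loose illustration only (the expected column names and the constant-output heuristic below are assumptions, not the repository's actual checker), a programmatic version of such a format check might look like:

```python
# Hypothetical sketch of a submission format check: compare a generated
# submission.csv against an assumed required column layout and flag outputs
# that look like fallback placeholders.
import pandas as pd

expected_columns = ["id", "prediction"]  # assumed target format for illustration

submission = pd.read_csv("submission.csv")

problems = []
if list(submission.columns) != expected_columns:
    problems.append(f"columns {list(submission.columns)} do not match expected {expected_columns}")
else:
    if submission["prediction"].isna().any():
        problems.append("submission contains missing predictions")
    if submission["prediction"].nunique() <= 1:
        problems.append("predictions are constant, which suggests a placeholder or fallback output")

if problems:
    print("Submission check failed:\n- " + "\n- ".join(problems))
else:
    print("Submission is valid")
```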
@@ -290,35 +290,16 @@ pipeline_eval:
 Please respond with your feedback in the following JSON format without anything else.
 ```json
 {
-    "execution": "Describe whether the code executed successfully, correctly integrating all components and generating the final submission. Include any errors or issues encountered, and append all error messages and full traceback details without summarizing or omitting any information. If errors occurred, analyze the root causes: (1) Are they fundamental algorithmic/approach issues, or (2) Implementation details that can be easily fixed, or (3) Environment/dependency problems?",
-    "return_checking": "Examine the generated files by cross-referencing the code logic and stdout output. Verify: (1) Format matches sample submission (index, column names, CSV content); (2) **File generation authenticity**: Is the file genuinely produced by successful model execution, or is it a result of exception handling/fallback mechanisms? Cite specific code sections and stdout evidence.",
+    "execution": "Describe whether the code executed successfully. Include any errors or issues encountered, and append all error messages and full traceback details without summarizing or omitting any information. If errors occurred, analyze the root causes: (1) Are they fundamental algorithmic/approach issues, or (2) Implementation details that can be easily fixed, or (3) Environment/dependency problems?",
+    "return_checking": "Examine the generated files by cross-referencing the code logic and stdout output. Verify: (1) Format matches required submission format (index, column names, CSV content); (2) **File generation authenticity**: Is the file genuinely produced by successful model execution, or is it a result of exception handling/fallback mechanisms? Cite specific code sections and stdout evidence.",
     "code": "Begin explicitly with [Code analysis] or [Evaluation error]. Provide structured analysis: (1) **Technical Appropriateness**: Does the chosen approach (algorithms, data processing, validation strategy) match this problem's data characteristics and competition requirements? (2) **Effective Components**: What specific parts work well and why are they effective for this problem type? (3) **Issues & Improvements**: Identify concrete problems and suggest actionable improvement directions (without providing actual code). (4) **Code Quality**: Assess readability, structure, and adherence to specifications.",
     "final_decision": <true/false>
 }
 ```
-{% else %}
-## Evaluation Scope
-Your focus is to check whether the workflow code executes successfully.

-You will be given the execution output (`stdout`) to determine correctness.
-
-[Note]
-1. Model performance is NOT a concern in this evaluation—only correct execution and formatting matter.
-
-Please respond with your feedback in the following JSON format and order
-```json
-{
-    "execution": "Describe whether the code executed successfully. Include any errors or issues encountered, and append all error messages and full traceback details without summarizing or omitting any information. If errors occurred, analyze the root causes: (1) Are they fundamental algorithmic/approach issues, or (2) Implementation details that can be easily fixed, or (3) Environment/dependency problems?",
-    "return_checking": "Describe the expected file to be generated.",
-    "code": "Provide structured analysis: (1) **Technical Appropriateness**: Does the chosen approach (algorithms, data processing, validation strategy) match this problem's data characteristics and requirements? (2) **Effective Components**: What specific parts work well and why are they effective for this problem type? (3) **Issues & Improvements**: Identify concrete problems and suggest actionable improvement directions (without providing actual code). (4) **Code Quality**: Assess readability, structure, and adherence to specifications.",
-    "final_decision": <true/false>
-}
-```
-{% endif %}
-# NOTE: when is_sub_enabled == False, we don't have any checking about the return. So it is just placeholder currently
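The evaluator is asked to reply with exactly this JSON object. As a hedged sketch (the field names follow the prompt above, but the parsing helper itself is hypothetical and not code from this PR), a caller could validate such a reply like so:

```python
# Hypothetical sketch: parse the evaluator's JSON reply and check the fields
# required by the prompt above. Not the repository's actual parsing code.
import json

REQUIRED_FIELDS = {"execution", "return_checking", "code", "final_decision"}

def parse_feedback(raw_reply: str) -> dict:
    """Parse the evaluator reply and ensure the required fields are present."""
    # Tolerate a ```json ... ``` fence around the object, since the prompt asks for one.
    cleaned = raw_reply.strip().removeprefix("```json").removesuffix("```").strip()
    feedback = json.loads(cleaned)
    missing = REQUIRED_FIELDS - feedback.keys()
    if missing:
        raise ValueError(f"Evaluator reply is missing fields: {sorted(missing)}")
    if not isinstance(feedback["final_decision"], bool):
        raise ValueError("final_decision must be a JSON boolean")
    return feedback

example = '{"execution": "ok", "return_checking": "format matches", "code": "[Code analysis] ...", "final_decision": true}'
print(parse_feedback(example)["final_decision"])
```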
-        stdout+=f"\nSubmission check:\n{submission_check_out}\nIf Submission check returns a 'Submission is valid' or similar message, despite some warning messages, you should still consider the submission as valid and give a positive final decision. "
+        stdout+=f"\n### Submission check:\n{submission_check_out}\nIf Submission check returns a 'Submission is valid' or similar message, despite some warning messages, you should still consider the submission as valid and give a positive final decision. "