rdagent/app/data_science/conf.py (2 changes: 1 addition & 1 deletion)

@@ -50,7 +50,7 @@ class DataScienceBasePropSetting(KaggleBasePropSetting):
     coder_max_loop: int = 10
     runner_max_loop: int = 3
 
-    sample_data_by_LLM: bool = True
+    use_sample_data: bool = True
     use_raw_description: bool = False
     show_nan_columns: bool = False
 
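Note: since both defaults are `True`, the rename also flips the default behavior. Under the old `sample_data_by_LLM`, the generated code subsampled data itself during a `--debug` run; under `use_sample_data`, the pipeline instead consumes a pre-generated sample dataset (see the `scen/__init__.py` change below) and the coder's debug mode is disabled.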
rdagent/components/coder/data_science/pipeline/__init__.py (3 changes: 2 additions & 1 deletion)

@@ -87,7 +87,8 @@ def implement_one_task(
             out_spec=PythonAgentOut.get_spec(),
             runtime_environment=runtime_environment,
             enable_model_dump=DS_RD_SETTING.enable_model_dump,
-            enable_debug_mode=DS_RD_SETTING.sample_data_by_LLM,
+            enable_debug_mode=not DS_RD_SETTING.use_sample_data,
+            debug_timeout=DS_RD_SETTING.debug_timeout,
         )
         user_prompt = T(".prompts:pipeline_coder.user").r(
             competition_info=competition_info,
rdagent/components/coder/data_science/pipeline/eval.py (12 changes: 6 additions & 6 deletions)

@@ -57,14 +57,13 @@ def evaluate(

         stdout = ""
         implementation.execute(env=env, entry=get_clear_ws_cmd())
-        if DS_RD_SETTING.sample_data_by_LLM:
-            # Because coder runs on full data, we need to run debug mode in advance to save time
+        if DS_RD_SETTING.use_sample_data:
             result = implementation.run(
-                env=env, entry=f"strace -e trace=file -f -o trace.log python -m coverage run main.py --debug"
+                env=env, entry=f"strace -e trace=file -f -o trace.log python -m coverage run main.py"
             )
         else:
             result = implementation.run(
-                env=env, entry=f"strace -e trace=file -f -o trace.log python -m coverage run main.py"
+                env=env, entry=f"strace -e trace=file -f -o trace.log python -m coverage run main.py --debug"
             )
 
         sample_submission_check = True
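For context: `strace -e trace=file -f -o trace.log` records the script's file-related system calls (following child processes) into `trace.log`, and `python -m coverage run main.py` executes the entry point under coverage measurement; the resulting trace and coverage data are presumably inspected later in the evaluation.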
@@ -86,7 +85,7 @@
                 stdout += f"Code failed to run. Please check the stdout:\n Following the stdout of the debug mode run:\n{result.stdout.strip()}\n"
             else:
                 stdout += f"Code ran successfully.\n Following the stdout of the debug mode run:\n{result.stdout.strip()}\n"
-            if DS_RD_SETTING.sample_data_by_LLM:
+            if not DS_RD_SETTING.use_sample_data:
                 debug_time, full_estimated_time = None, None
                 if match := re.search(r"debug_time:\s*(\d+(?:.\d+)?)", result.stdout, re.DOTALL):
                     debug_time = float(match.group(1))
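The collapsed lines that follow presumably extract `estimated_time` in the same way; a hypothetical sketch mirroring the pattern above:

```python
if match := re.search(r"estimated_time:\s*(\d+(?:\.\d+)?)", result.stdout, re.DOTALL):
    full_estimated_time = float(match.group(1))
```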
@@ -152,7 +151,8 @@

         system_prompt = T(".prompts:pipeline_eval.system").r(
             is_sub_enabled=test_eval.is_sub_enabled(self.scen.competition),
-            debug_mode=DS_RD_SETTING.sample_data_by_LLM,
+            spec=T("scenarios.data_science.share:component_spec.Pipeline").r(),
+            debug_mode=not DS_RD_SETTING.use_sample_data,
         )
         user_prompt = T(".prompts:pipeline_eval.user").r(
             scenario=self.scen.get_scenario_all_desc(eda_output=eda_output),
rdagent/components/coder/data_science/pipeline/prompts.yaml (49 changes: 32 additions & 17 deletions)

@@ -76,30 +76,49 @@ pipeline_coder:

 {% if enable_debug_mode %}
 ## Debug Mode
-Your code will be executed in a debug mode with following command:
+Your code will be executed in debug mode to quickly test the correctness of the code, using the following command:
 ```bash
 python main.py --debug
 ```
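For instance, the `--debug` flag can be wired up with `argparse` (a minimal sketch; the actual entry script may parse arguments differently):

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--debug", action="store_true",
                    help="run with a limited iteration budget to sanity-check the pipeline")
args = parser.parse_args()
```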
-In debug mode, you should only sample ten percent of the training data and run the minimum epochs to quickly test the correctness of the code.
+In debug mode, the timeout for your code is {{ debug_timeout }} seconds. You should set a reasonable maximum number of iterations for your debug run so that the code completes within the timeout limit.
 In debug mode, you should implement a timer to measure the time taken for your debug configuration and estimate the time required for the full run.
-For example, you can sample ten percent of the training data and run for one epoch, then the full run with ten epochs will take one hundred times the time taken for the debug run. The scale is calculated by yourself depending on the data sampling and epoch number you choose. If your full run enables early stopping, the scale should be smaller considering the early stopping will stop the training earlier than the full epochs.
-You should sample the data after train valid split. When you split the data after sampling, you might get a class with only one sample which might cause the split strategy to fail.
-Your debug code should run exactly the same as the full run, except for the data sampling and epoch number, to ensure the correctness of the code.
+You may simply estimate the full time as `estimated_time = debug_time * full_iter / max_iter`. If your full run enables early stopping, scale the estimate down, since early stopping may end training before the full number of epochs.
 
+The number of debug iterations should only be applied to the model's training processes.
+Do NOT apply it to unrelated parts of the code such as data loading, preprocessing, or inference.
+Do NOT sample data in debug mode; always use the full data, as in the full run.
+Your debug code should run exactly the same as the full run, except for the maximum number of iterations, to ensure the correctness of the code.
+The iteration count should be small but reasonable (e.g., 100 or 1000), depending on the dataset size, model complexity, and time limit.
 
 You should print total time and estimated time in standard output using print function in the following schema:
 ```
 === Start of Debug Information ===
 debug_time: time_taken_for_debug_run_in_seconds (e.g., 'debug_time: 10.0')
 estimated_time: estimated_time_for_full_run_in_seconds (e.g., 'estimated_time: 100.0')
 === End of Debug Information ===
 ```
 User will use the following code to match: re.search(r"(.*?)=== Start of Debug Information ===(.*)=== End of Debug Information ===", stdout, re.DOTALL).groups()[1]
-Notice, data sampling should only be applied in debug mode. Always use the full data in the full run!
+Note that the maximum-iteration limit should only be applied in debug mode.
+Example code:
+```python
+class BreakLoop(Exception): pass
+
+if args.debug:
+    max_iter = N  # Set an iteration budget for debug mode based on the debug timeout
+else:
+    max_iter = len(train_loader) * max_epochs  # Use all iterations and all epochs in the full run
+
+iterations_count = 0
+try:
+    for epoch in range(max_epochs):
+        for data, target in train_loader:
+            # Your training code here
+            if args.debug and iterations_count >= max_iter:
+                raise BreakLoop  # Use a custom exception to break out of nested loops in debug mode
+            iterations_count += 1
+except BreakLoop:
+    pass
+```
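To produce the debug information block described above, a timer along these lines would work (a sketch assuming the `args`, `max_epochs`, `train_loader`, `max_iter`, and `iterations_count` names from the example; `full_iter` is the iteration count of a full run):

```python
import time

full_iter = len(train_loader) * max_epochs  # iterations a full run would execute

start = time.time()
# ... run the training loop shown in the example above ...
debug_time = time.time() - start

if args.debug:
    # Linear extrapolation from the debug iteration budget to the full run.
    estimated_time = debug_time * full_iter / max_iter
    print("=== Start of Debug Information ===")
    print(f"debug_time: {debug_time}")
    print(f"estimated_time: {estimated_time}")
    print("=== End of Debug Information ===")
```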
 You should be very careful about the label classes number in the debug mode. The label classes should be the same as the full run even when you are in the debug mode. The label classes number is often used to build the model.
+For packages that cannot easily track iteration counts, you can instead train the model for a small number of epochs in debug mode, such as one epoch, to ensure the correctness of the code.
 {% endif %}
 
 ## General Guidelines
@@ -214,16 +233,12 @@ pipeline_eval:
 {% if debug_mode %}
 ### Step 4: Debug Mode Compliance
 - Goal: Ensure the code follows debug mode requirements.
-- Guidlines:
+- Guidelines:
   - Sufficient debugging information (print statements, clear error messages) should be included to facilitate automatic improvement processes.
   - The code should be executed in debug mode with the command `python main.py --debug`.
-  - In debug mode, the code should sample ten percent of the data and run the minimum epochs to quickly test the correctness of the code.
   - Check whether the code follows these requirements. If not, emphasize it in your feedback and reject this implementation.
   - Execution time and estimated time for the full run should be checked. Estimated time should not be too large to finish in the given time limit.
   - Consider the early stopping mechanism in the code. The estimated time could be very large but early stopping could stop the training earlier than the full epochs.
   - Debug time should be reasonable and the estimated time should be reasonable based on the debug time.
-  - Data sampling should only be applied in debug mode. Always use the full data in the full run.
-  - The label classes number should be the same as the full run even in debug mode.
+  - In debug mode, the code must limit the number of training iterations or epochs to ensure completion within the specified timeout (`{{ debug_timeout }}` seconds).
+  - The debug mode should only affect the training loop (not data loading, preprocessing, or inference).
+  - The debug run must use the full dataset (no sampling or subsetting), and the workflow should remain identical to the full run except for the iteration/epoch limit.
 - If the code passes this step: Finalize evaluation.
 - If the code does not pass this step: Clearly document the debug mode compliance issues and reject the implementation.
 {% endif %}
rdagent/scenarios/data_science/scen/__init__.py (2 changes: 1 addition & 1 deletion)

@@ -30,7 +30,7 @@ def __init__(self, competition: str) -> None:
             raise FileNotFoundError(f"Cannot find {competition} in {DS_RD_SETTING.local_data_path}")
 
         local_path = DS_RD_SETTING.local_data_path
-        if not DS_RD_SETTING.sample_data_by_LLM:
+        if DS_RD_SETTING.use_sample_data:
             self.debug_path = f"{local_path}/sample/{competition}"
             if not Path(self.debug_path).exists():
                 sample_py_path = Path(local_path) / competition / "sample.py"