Commit 2dde8b8

peteryang1, peteryangms, and jingyuanlm authored
fix: split then sample & remove simple model guide in ds proposal (#1034)
* fix code timeout & split_then_sample
* change- code
* change-prompts_v2
* remove more simple guidance in proposal

---------

Co-authored-by: Xu Yang <[email protected]>
Co-authored-by: jingyuanlm <[email protected]>
1 parent ad37417 commit 2dde8b8

5 files changed: +20 -43 lines changed


rdagent/components/coder/data_science/conf.py

Lines changed: 1 addition & 3 deletions
@@ -27,9 +27,7 @@ class Config:
 def get_ds_env(
     conf_type: Literal["kaggle", "mlebench"] = "kaggle",
     extra_volumes: dict = {},
-    running_timeout_period: int = (
-        DS_RD_SETTING.debug_timeout if not DS_RD_SETTING.sample_data_by_LLM else DS_RD_SETTING.full_timeout
-    ),
+    running_timeout_period: int = DS_RD_SETTING.debug_timeout,
 ) -> Env:
     """
     Retrieve the appropriate environment configuration based on the env_type setting.
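
Note on the new default: Python evaluates a default argument once, when the function is defined, so `running_timeout_period` now always defaults to whatever `DS_RD_SETTING.debug_timeout` was at import time. A caller that needs the longer timeout would presumably have to pass it explicitly. The sketch below illustrates this with a hypothetical `Settings` stand-in and a hypothetical `get_env_timeout` function; the timeout values are made up for illustration and are not taken from the repository.

```python
from dataclasses import dataclass


@dataclass
class Settings:
    """Hypothetical stand-in for DS_RD_SETTING; the values are illustrative only."""
    debug_timeout: int = 600
    full_timeout: int = 3600


SETTINGS = Settings()


def get_env_timeout(running_timeout_period: int = SETTINGS.debug_timeout) -> int:
    """Hypothetical function mirroring the new signature: the default is the debug timeout."""
    return running_timeout_period


# Without an argument, the debug timeout is used.
default_timeout = get_env_timeout()

# A caller that wants the full timeout passes it explicitly.
full_run_timeout = get_env_timeout(running_timeout_period=SETTINGS.full_timeout)

# The default was bound at definition time, so later changes to the
# setting do not affect calls that rely on the default.
SETTINGS.debug_timeout = 1
still_default = get_env_timeout()
```

If the setting may change after import, a common alternative is a `None` default that is resolved inside the function body.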

rdagent/components/coder/data_science/pipeline/__init__.py

Lines changed: 1 addition & 10 deletions
@@ -22,10 +22,7 @@
 - Each coder could be tested.
 """

-import json
-import re
 from pathlib import Path
-from typing import Dict

 from rdagent.app.data_science.conf import DS_RD_SETTING
 from rdagent.components.coder.CoSTEER import CoSTEER
@@ -39,14 +36,8 @@
 from rdagent.components.coder.CoSTEER.knowledge_management import (
     CoSTEERQueriedKnowledge,
 )
-from rdagent.components.coder.data_science.conf import (
-    DSCoderCoSTEERSettings,
-    get_ds_env,
-)
+from rdagent.components.coder.data_science.conf import DSCoderCoSTEERSettings
 from rdagent.components.coder.data_science.pipeline.eval import PipelineCoSTEEREvaluator
-from rdagent.components.coder.data_science.raw_data_loader.eval import (
-    DataLoaderCoSTEEREvaluator,
-)
 from rdagent.components.coder.data_science.raw_data_loader.exp import DataLoaderTask
 from rdagent.components.coder.data_science.share.eval import ModelDumpEvaluator
 from rdagent.core.exception import CoderError

rdagent/components/coder/data_science/pipeline/prompts.yaml

Lines changed: 3 additions & 2 deletions
@@ -76,9 +76,10 @@ pipeline_coder:
 ```bash
 python main.py --debug
 ```
-In debug mode, you should only sample ten percent of the data and run the minimum epochs to quickly test the correctness of the code.
+In debug mode, you should only sample ten percent of the training data and run the minimum epochs to quickly test the correctness of the code.
 In debug mode, you should implement a timer to measure the time taken for your debug configuration and estimate the time required for the full run.
-For example, you can sample ten percent of the data and run for one epoch, then the full run with ten epochs will take one hundred times the time taken for the debug run. The scale is calculated by yourself depending on the data sampling and epoch number you choose. If your full run enables early stopping, the scale should be smaller considering the early stopping will stop the training earlier than the full epochs.
+For example, you can sample ten percent of the training data and run for one epoch, then the full run with ten epochs will take one hundred times the time taken for the debug run. The scale is calculated by yourself depending on the data sampling and epoch number you choose. If your full run enables early stopping, the scale should be smaller considering the early stopping will stop the training earlier than the full epochs.
+You should sample the data after train valid split. When you split the data after sampling, you might get a class with only one sample which might cause the split strategy to fail.
 Your debug code should run exactly the same as the full run, except for the data sampling and epoch number, to ensure the correctness of the code.
 You should print total time and estimated time in standard output using print function in the following schema:
 === Start of Debug Information ===
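
Note on the prompt change: the new guidance (split first, then sample only the training part, then scale the measured debug time) can be sketched roughly as follows. This is a minimal standard-library illustration, not the code the agent generates. The helpers `stratified_split`, `sample_train`, and `estimate_full_seconds` are hypothetical names, and the linear scaling assumes run time grows proportionally with data size and epoch count.

```python
import random
import time
from collections import defaultdict


def stratified_split(labels, valid_frac=0.2, seed=0):
    """Hypothetical helper: split indices so every class appears in train and valid.

    Raises ValueError when a class has fewer than two samples, because such a
    class cannot be placed in both parts.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, valid_idx = [], []
    for label, idxs in by_class.items():
        if len(idxs) < 2:
            raise ValueError(f"class {label!r} has only {len(idxs)} sample(s)")
        rng.shuffle(idxs)
        # Keep at least one sample per class on each side of the split.
        n_valid = min(max(1, round(len(idxs) * valid_frac)), len(idxs) - 1)
        valid_idx.extend(idxs[:n_valid])
        train_idx.extend(idxs[n_valid:])
    return sorted(train_idx), sorted(valid_idx)


def sample_train(train_idx, sample_frac=0.1, seed=0):
    """Hypothetical helper: keep a fraction of the training indices only."""
    rng = random.Random(seed)
    n_keep = max(1, int(len(train_idx) * sample_frac))
    return sorted(rng.sample(train_idx, n_keep))


def estimate_full_seconds(debug_seconds, sample_frac, debug_epochs, full_epochs):
    """Scale the debug time by the data ratio and the epoch ratio (assumes linear cost)."""
    return debug_seconds * (1.0 / sample_frac) * (full_epochs / debug_epochs)


# 95 samples of class "a" and 5 of class "b": a rare class that sampling can thin out.
labels = ["a"] * 95 + ["b"] * 5

start = time.perf_counter()
train_idx, valid_idx = stratified_split(labels)  # split on the full data first
debug_train = sample_train(train_idx)            # then sample the training part only
debug_seconds = time.perf_counter() - start

# The prompt's example: 10% of the data for 1 epoch versus a 10-epoch full run.
scale_example = estimate_full_seconds(1.0, sample_frac=0.1, debug_epochs=1, full_epochs=10)

# Sampling first could leave a single "b" sample, and the split then fails.
sampled_first = ["a"] * 9 + ["b"]
try:
    stratified_split(sampled_first)
    sample_first_ok = True
except ValueError:
    sample_first_ok = False
```

With early stopping enabled in the full run, the estimate from `estimate_full_seconds` would be an upper bound, which matches the prompt's advice to use a smaller scale in that case.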

rdagent/scenarios/data_science/dev/runner/__init__.py

Lines changed: 0 additions & 5 deletions
@@ -1,6 +1,3 @@
-from pathlib import Path
-from typing import Dict
-
 import pandas as pd

 from rdagent.app.data_science.conf import DS_RD_SETTING
@@ -17,7 +14,6 @@
     MultiProcessEvolvingStrategy,
 )
 from rdagent.components.coder.CoSTEER.task import CoSTEERTask
-from rdagent.components.coder.data_science.conf import get_ds_env
 from rdagent.components.coder.data_science.share.eval import ModelDumpEvaluator
 from rdagent.core.exception import RunnerError
 from rdagent.core.scenario import Scenario
@@ -26,7 +22,6 @@
 from rdagent.scenarios.data_science.dev.runner.eval import DSCoSTEERCoSTEEREvaluator
 from rdagent.utils.agent.ret import PythonBatchEditOut
 from rdagent.utils.agent.tpl import T
-from rdagent.utils.env import DockerEnv, MLEBDockerConf


 class DSRunnerMultiProcessEvolvingStrategy(MultiProcessEvolvingStrategy):
