
Commit 909c7d6

qew21, you-n-g, v-jianwan
authored
feat: enable finetune llm (#1055)
* feat: start with previous workspace
* feat: finetune llm
* add PrevModelLoadEvaluator

---------

Co-authored-by: Young <[email protected]>
Co-authored-by: v-jianwan <[email protected]>
Co-authored-by: you-n-g <[email protected]>
1 parent 6d01e3e commit 909c7d6

File tree

25 files changed: +806 -17 lines changed

docs/scens/catalog.rst

Lines changed: 1 addition & 0 deletions
@@ -43,3 +43,4 @@ The supported scenarios are listed below:
    model_agent_fin
    model_copilot_general
    data_science
+   finetune

docs/scens/finetune.rst

Lines changed: 163 additions & 0 deletions
@@ -0,0 +1,163 @@
.. _finetune_agent:


=============================
Fine-tuning an Existing Model
=============================


🎯 Scenario: Continue Training on a Pre-trained Model
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

In this workflow the **Data Science Agent** starts from a *previously trained* model (and its training script), performs additional fine-tuning on new data, and then re-uses the updated weights for subsequent inference runs.

🚧 Directory Structure
~~~~~~~~~~~~~~~~~~~~~~

Your competition folder (here called ``custom_data``) must contain **one extra sub-directory** named ``prev_model`` where you keep the old weights and the code that produced them:

.. code-block:: text

    ds_data
    └── custom_data
        ├── train.csv
        ├── test.csv
        ├── sample_submission.csv   # optional
        ├── description.md          # optional
        ├── sample.py               # optional
        └── prev_model              # ← NEW
            ├── models/             # previous checkpoints (e.g. *.bin, *.pt, *.ckpt)
            └── main.py             # training/inference scripts you used before

If your competition provides custom grading/validation scripts, keep them under ``ds_data/eval/custom_data`` exactly as before.
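A quick way to sanity-check this layout before launching the loop is a small script such as the one below (a minimal sketch, not part of RD-Agent; the paths follow the structure above and assume you run it from the directory that contains ``ds_data``):

.. code-block:: python

    from pathlib import Path

    root = Path("ds_data") / "custom_data"
    prev_model = root / "prev_model"

    assert (root / "train.csv").exists(), "train.csv is required"
    assert (root / "test.csv").exists(), "test.csv is required"
    assert (prev_model / "main.py").exists(), "previous training/inference script goes here"
    assert any((prev_model / "models").glob("*")), "put previous checkpoints under prev_model/models/"
    print("Folder layout looks good.")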
🔧 Environment Setup
~~~~~~~~~~~~~~~~~~~~

Add or update the following variables in **.env** (examples shown):

.. code-block:: sh

    # required for all Data-Science runs
    dotenv set DS_LOCAL_DATA_PATH <your local path>/ds_data

    # optional: choose docker / conda, etc.
    dotenv set DS_CODER_COSTEER_ENV_TYPE docker
🚀 How It Works at Runtime
~~~~~~~~~~~~~~~~~~~~~~~~~~

1. **First run**

   * `rdagent` detects `prev_model/models`.
   * It loads the latest checkpoint and prepares the fine-tuning based on the code found under `prev_model/*.py` (or your own pipeline if you override it).
   * Fine-tuned weights are written to `./workspace_input/models`.

2. **Subsequent runs**

   * When you execute `python ./workspace_input/main.py`, the script first looks for a checkpoint in `./workspace_input/models`.
   * If found, it **skips fine-tuning** and goes straight to prediction / submission generation, as sketched below.
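A minimal sketch of that "load if present, otherwise fine-tune" check (illustrative only; it assumes a PyTorch pipeline and ``*.pt`` checkpoints, neither of which is mandated by the agent):

.. code-block:: python

    from pathlib import Path

    import torch  # assumption: the generated pipeline is PyTorch-based

    MODEL_DIR = Path("./workspace_input/models")

    def load_or_finetune(finetune_fn):
        """Reuse an existing checkpoint when one is found; otherwise fine-tune and save one."""
        checkpoints = sorted(MODEL_DIR.glob("*.pt"))
        if checkpoints:                                   # subsequent runs: skip fine-tuning
            return torch.load(checkpoints[-1], map_location="cpu")
        state_dict = finetune_fn()                        # first run: fine-tune starting from prev_model
        MODEL_DIR.mkdir(parents=True, exist_ok=True)
        torch.save(state_dict, MODEL_DIR / "model.pt")
        return state_dict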
⏰ Managing Timeouts
~~~~~~~~~~~~~~~~~~~~

By default:

* **Debug loop**: 1 hour (``DS_DEBUG_TIMEOUT=3600`` seconds)
* **Full run**: 3 hours (``DS_FULL_TIMEOUT=10800`` seconds)

Override either value in **.env**:

.. code-block:: sh

    # give the debug loop 45 min and the full loop 6 h
    dotenv set DS_DEBUG_TIMEOUT 2700
    dotenv set DS_FULL_TIMEOUT 21600

- 🚀 **Run the Application**

  - You can directly run the application by using the following command:

    .. code-block:: sh

        dotenv run -- python rdagent/app/finetune/data_science/loop.py --competition <Competition ID>

  - Then you can compute the test-set score for each round of the loop:

    .. code-block:: sh

        dotenv run -- python rdagent/log/mle_summary.py grade <url_to_log>

    Here, ``<url_to_log>`` refers to the parent directory of the log folder generated during the run.

- 📥 **Visualize the R&D Process**

  - We provide a web UI to visualize the log. You just need to run:

    .. code-block:: sh

        streamlit run rdagent/log/ui/dsapp.py

  - Then you can input the log path and visualize the R&D process.
🔍 MLE-bench Guide: Running ML Engineering via MLE-bench
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

- 📝 **MLE-bench Overview**

  - MLE-bench is a comprehensive benchmark designed to evaluate the ML engineering capabilities of AI systems using real-world scenarios. The dataset comprises 75 Kaggle competitions. Since Kaggle does not provide held-out test sets for these competitions, the benchmark includes preparation scripts that split the publicly available training data into new training and test sets, and grading scripts are provided for each competition to accurately evaluate submission scores.

- 🔧 **Set up Environment for MLE-bench**

  - Running R&D-Agent on MLE-bench is designed for full automation. There is no need for manual downloads or data preparation. Simply set the environment variable ``DS_IF_USING_MLE_DATA`` to True.

  - At runtime, R&D-Agent will automatically build the Docker image specified at ``rdagent/scenarios/kaggle/docker/mle_bench_docker/Dockerfile``. This image is responsible for downloading the required datasets and grading files for MLE-bench.

  - Note: The first run may take longer than subsequent runs because the Docker image and data are downloaded and set up for the first time.

  .. code-block:: sh

      dotenv set DS_LOCAL_DATA_PATH <your local directory>/ds_data
      dotenv set DS_IF_USING_MLE_DATA True

- 🔨 **Configuring the Kaggle API**

  - Downloading Kaggle competition data requires the Kaggle API. You can set it up by following these steps:

    - Register and log in on the `Kaggle <https://www.kaggle.com/>`_ website.
    - Click on the avatar (usually in the top right corner of the page) -> ``Settings`` -> ``Create New Token``. A file called ``kaggle.json`` will be downloaded.
    - Move ``kaggle.json`` to ``~/.config/kaggle/``.
    - Modify the permissions of the ``kaggle.json`` file:

      .. code-block:: sh

          chmod 600 ~/.config/kaggle/kaggle.json

  - For more information about Kaggle API settings, refer to the `Kaggle API <https://github.com/Kaggle/kaggle-api>`_.

- 🔩 **Setting the Environment Variables for MLE-bench**

  - In addition to auto-downloading the benchmark data, you must also configure the runtime environment for executing the competition code.
  - Use the environment variable ``DS_CODER_COSTEER_ENV_TYPE`` to select the execution mode:

    - When set to ``docker`` (the default), RD-Agent uses the official Kaggle Docker image (``gcr.io/kaggle-gpu-images/python:latest``) to ensure that all required packages are available.
    - If you prefer a custom Docker setup, you can modify the configuration using ``DS_DOCKER_IMAGE`` or ``DS_DOCKERFILE_FOLDER_PATH``.
    - Alternatively, if your competition work only demands basic libraries, you may set ``DS_CODER_COSTEER_ENV_TYPE`` to ``conda``. In this mode, you must create a local conda environment named "kaggle" and pre-install the necessary packages. RD-Agent will execute the competition code within this "kaggle" conda environment.

  .. code-block:: sh

      # Configure the runtime environment: choose 'docker' (default) or 'conda'
      dotenv set DS_CODER_COSTEER_ENV_TYPE docker

- **Additional Guidance**

  - **Combine different LLM models at the R&D stage**

    - You can combine different LLM models at the R&D stage.
    - By default, the environment variable ``CHAT_MODEL`` covers both the Research and Development stages. To customize the model used in the development stage, set:

      .. code-block:: sh

          # This example sets the model to "o3-mini". For some models, the reasoning effort should be set to "None".
          dotenv set LITELLM_CHAT_MODEL_MAP '{"coding":{"model":"o3-mini","reasoning_effort":"high"},"running":{"model":"o3-mini","reasoning_effort":"high"}}'
Lines changed: 40 additions & 0 deletions
@@ -0,0 +1,40 @@
import os

from pydantic_settings import SettingsConfigDict

from rdagent.app.data_science.conf import DS_RD_SETTING
from rdagent.core.conf import RD_AGENT_SETTINGS, ExtendedBaseSettings


class DSFinetuneScen(ExtendedBaseSettings):
    model_config = SettingsConfigDict(env_prefix="FT_", protected_namespaces=())
    scen: str = "rdagent.app.finetune.data_science.scen.DSFinetuneScen"
    """
    Scenario class for data science tasks.
    - For Kaggle competitions, use: "rdagent.scenarios.data_science.scen.KaggleScen"
    - For custom data science scenarios, use: "rdagent.scenarios.data_science.scen.DataScienceScen"
    - For LLM finetune scenarios, use: "rdagent.app.finetune.llm.scen.LLMFinetuneScen"
    - For data science finetune scenarios, use: "rdagent.app.finetune.data_science.scen.DSFinetuneScen"
    """

    debug_timeout: int = 3600
    """The timeout limit for running on debugging data"""
    full_timeout: int = 10800
    """The timeout limit for running on full data"""

    coder_on_whole_pipeline: bool = True
    enable_model_dump: bool = True
    app_tpl: str = "app/finetune/data_science/tpl"


def update_settings(competition: str):
    """
    Update the RD_AGENT_SETTINGS with the values from DS_FINETUNE_SETTINGS.
    """
    DS_FINETUNE_SETTINGS = DSFinetuneScen()
    RD_AGENT_SETTINGS.app_tpl = DS_FINETUNE_SETTINGS.app_tpl
    os.environ["DS_CODER_COSTEER_EXTRA_EVALUATOR"] = '["rdagent.app.finetune.share.eval.PrevModelLoadEvaluator"]'
    for field_name, new_value in DS_FINETUNE_SETTINGS.model_dump().items():
        if hasattr(DS_RD_SETTING, field_name):
            setattr(DS_RD_SETTING, field_name, new_value)
    DS_RD_SETTING.competition = competition
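Because model_config sets env_prefix="FT_", these fields can presumably also be overridden through FT_-prefixed environment variables before update_settings copies them onto DS_RD_SETTING (assuming ExtendedBaseSettings follows standard pydantic-settings behaviour). A minimal, self-contained sketch of that mechanism, using plain BaseSettings instead of RD-Agent's ExtendedBaseSettings, with field names and defaults mirroring the class above:

    import os

    from pydantic_settings import BaseSettings, SettingsConfigDict


    class DemoFinetuneSettings(BaseSettings):
        # Same prefix as DSFinetuneScen above; fields map to FT_DEBUG_TIMEOUT / FT_FULL_TIMEOUT.
        model_config = SettingsConfigDict(env_prefix="FT_", protected_namespaces=())
        debug_timeout: int = 3600
        full_timeout: int = 10800


    os.environ["FT_DEBUG_TIMEOUT"] = "2700"      # e.g. set via `dotenv set FT_DEBUG_TIMEOUT 2700`
    print(DemoFinetuneSettings().debug_timeout)  # -> 2700, read from the FT_-prefixed variable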
Lines changed: 40 additions & 0 deletions
@@ -0,0 +1,40 @@
import asyncio
from pathlib import Path

import fire

from rdagent.app.data_science.conf import DS_RD_SETTING
from rdagent.app.finetune.data_science.conf import update_settings
from rdagent.core.utils import import_class
from rdagent.log import rdagent_logger as logger
from rdagent.scenarios.data_science.loop import DataScienceRDLoop


def main(
    model: str | None = None,
    competition: str | None = None,
):
    """
    Parameters
    ----------
    competition :
        Competition name.

    Auto R&D evolving loop for model finetuning.
    You can continue running a session by using the command:

    .. code-block:: bash

        dotenv run -- python rdagent/app/finetune/data_science/loop.py --competition aerial-cactus-identification
    """
    if not competition:
        raise Exception("Please specify competition name.")

    model_folder = Path(DS_RD_SETTING.local_data_path) / competition / "prev_model"
    if not model_folder.exists():
        raise Exception(f"Please put the previous model under {model_folder}.")
    update_settings(competition)
    rd_loop: DataScienceRDLoop = DataScienceRDLoop(DS_RD_SETTING)
    asyncio.run(rd_loop.run())


if __name__ == "__main__":
    fire.Fire(main)
Lines changed: 20 additions & 0 deletions
@@ -0,0 +1,20 @@
from pathlib import Path

from rdagent.app.data_science.conf import DS_RD_SETTING
from rdagent.core.scenario import Scenario
from rdagent.log import rdagent_logger as logger
from rdagent.scenarios.data_science.scen import DataScienceScen
from rdagent.scenarios.data_science.scen.utils import describe_data_folder_v2
from rdagent.utils.agent.tpl import T


class DSFinetuneScen(DataScienceScen):
    """DSFinetuneScen Scenario"""

    def _get_data_folder_description(self) -> str:
        folder_desc = describe_data_folder_v2(
            Path(DS_RD_SETTING.local_data_path) / self.competition,
            show_nan_columns=DS_RD_SETTING.show_nan_columns,
            max_length=20000,  # more context for model script
        )
        return folder_desc
Lines changed: 4 additions & 0 deletions
@@ -0,0 +1,4 @@
pipeline_coder:
  system: |-
    {% include "rdagent.components.coder.data_science.pipeline.prompts:pipeline_coder.system" %}
    NOTE: Ensure that the base model from `{% include "scenarios.data_science.share:scen.input_path" %}prev_model` is correctly loaded; you are supposed to finetune the base model.
Lines changed: 6 additions & 0 deletions
@@ -0,0 +1,6 @@
task_gen:
  system: |-
    {% include "rdagent.scenarios.data_science.proposal.exp_gen.prompts_v2:task_gen.system" %}
    NOTE: You MUST load the base model from `{% include "scenarios.data_science.share:scen.input_path" %}prev_model`. Your main goal is to finetune it.
rdagent/app/finetune/llm/conf.py

Lines changed: 43 additions & 0 deletions
@@ -0,0 +1,43 @@
import os

from pydantic_settings import SettingsConfigDict

from rdagent.app.data_science.conf import DS_RD_SETTING
from rdagent.core.conf import RD_AGENT_SETTINGS, ExtendedBaseSettings


class LLMFinetuneScen(ExtendedBaseSettings):
    model_config = SettingsConfigDict(env_prefix="FT_", protected_namespaces=())
    scen: str = "rdagent.app.finetune.llm.scen.LLMFinetuneScen"
    """
    Scenario class for data science tasks.
    - For Kaggle competitions, use: "rdagent.scenarios.data_science.scen.KaggleScen"
    - For custom data science scenarios, use: "rdagent.scenarios.data_science.scen.DataScienceScen"
    - For LLM finetune scenarios, use: "rdagent.app.finetune.llm.scen.LLMFinetuneScen"
    - For data science finetune scenarios, use: "rdagent.app.finetune.data_science.scen.DSFinetuneScen"
    """

    hypothesis_gen: str = "rdagent.app.finetune.llm.proposal.FinetuneExpGen"
    """Hypothesis generation class"""

    debug_timeout: int = 36000
    """The timeout limit for running on debugging data"""
    full_timeout: int = 360000
    """The timeout limit for running on full data"""

    coder_on_whole_pipeline: bool = True
    enable_model_dump: bool = True
    app_tpl: str = "app/finetune/llm/tpl"


def update_settings(competition: str):
    """
    Update the RD_AGENT_SETTINGS with the values from LLM_FINETUNE_SETTINGS.
    """
    LLM_FINETUNE_SETTINGS = LLMFinetuneScen()
    RD_AGENT_SETTINGS.app_tpl = LLM_FINETUNE_SETTINGS.app_tpl
    os.environ["DS_CODER_COSTEER_EXTRA_EVALUATOR"] = '["rdagent.app.finetune.share.eval.PrevModelLoadEvaluator"]'
    for field_name, new_value in LLM_FINETUNE_SETTINGS.model_dump().items():
        if hasattr(DS_RD_SETTING, field_name):
            setattr(DS_RD_SETTING, field_name, new_value)
    DS_RD_SETTING.competition = competition

rdagent/app/finetune/llm/loop.py

Lines changed: 40 additions & 0 deletions
@@ -0,0 +1,40 @@
import asyncio
from pathlib import Path

import fire

from rdagent.app.data_science.conf import DS_RD_SETTING
from rdagent.app.finetune.llm.conf import update_settings
from rdagent.core.utils import import_class
from rdagent.log import rdagent_logger as logger
from rdagent.scenarios.data_science.loop import DataScienceRDLoop


def main(
    model: str | None = None,
    dataset: str | None = None,
):
    """
    Parameters
    ----------
    dataset :
        Dataset name, used for finetuning.

    Auto R&D evolving loop for model finetuning.
    You can continue running a session by using the command:

    .. code-block:: bash

        dotenv run -- python rdagent/app/finetune/llm/loop.py --dataset shibing624/alpaca-zh
    """
    if not dataset:
        raise Exception("Please specify dataset name.")

    model_folder = Path(DS_RD_SETTING.local_data_path) / dataset / "prev_model"
    if not model_folder.exists():
        raise Exception(f"Please put the previous model under {model_folder}.")
    update_settings(dataset)
    rd_loop: DataScienceRDLoop = DataScienceRDLoop(DS_RD_SETTING)
    asyncio.run(rd_loop.run())


if __name__ == "__main__":
    fire.Fire(main)

rdagent/app/finetune/llm/prompts.yaml

Lines changed: 13 additions & 0 deletions
@@ -0,0 +1,13 @@
scenario_description: |-
  ------Background of the scenario------
  You are a world-class machine learning engineer. Your task is to finetune a model on the given dataset using the QLoRA method.
  ------Dataset Description------
  {{ raw_description }}

competition_background: |-
  ## QLoRA Fine-Tuning
  You are a world-class machine learning engineer and prompt engineer specializing in parameter-efficient fine-tuning of large language models using **QLoRA**. Your expertise includes 4-bit quantization, low-rank adaptation, and maximizing performance on GPU clusters. You are committed to building accurate, resource-efficient, and robust LLMs.

  - **Fine-Tuning Method**: QLoRA (4-bit quantized LoRA)
  - **Training Dataset**:
    > {{ raw_description }}
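For reference, a typical QLoRA setup matching this prompt's description (a 4-bit quantized base model plus low-rank adapters) looks roughly like the sketch below. It is illustrative only, not code from this commit; it assumes the transformers/peft/bitsandbytes stack, and the checkpoint path is a placeholder based on the prev_model layout described in the docs:

    import torch
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig
    from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

    # 4-bit NF4 quantization keeps the frozen base model small in GPU memory.
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    )
    base = AutoModelForCausalLM.from_pretrained(
        "./workspace_input/prev_model/models",  # assumption: location of the previous checkpoint
        quantization_config=bnb_config,
    )
    base = prepare_model_for_kbit_training(base)

    # Low-rank adapters are the only trainable parameters.
    lora = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM")
    model = get_peft_model(base, lora)
    model.print_trainable_parameters()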
