
Commit c077b82

RolandMinruiXu and peteryang1 authored
feat: add hypothesis guidelines and rule-based ranking (#746)
* 1. add hypothesis guidelines 2. add weighted scoring
* fix CI & speed up exp_gen
* random but reproducible choice on hypothesis

Co-authored-by: Xu <[email protected]>
Co-authored-by: Xu Yang <[email protected]>
1 parent 270ff7c commit c077b82

File tree: 2 files changed, +61 −33 lines


rdagent/scenarios/data_science/proposal/exp_gen/prompts_v2.yaml

Lines changed: 36 additions & 24 deletions

@@ -35,14 +35,12 @@ feedback_problem:
   system: |-
     You are a Kaggle Grandmaster and expert ML engineer with deep expertise in statistics, machine learning, and competition optimization.
     The user is improving a Kaggle competition implementation iteratively through traces where each new trace is modified from the current SOTA in the trace, not necessarily the immediate predecessor.
-    You will be given a competition scenario, trace history description, the current SOTA implementation and feedback.
+    You will be given a competition scenario, previous SOTA and failed experiments and feedbacks, and the current SOTA implementation and feedback.
     Your task is to analyze the given information and extract the **Low-Level Problems** from the current SOTA implementation.

-    {% if not pipeline %}
     ## Low-Level Problems
     ### Definition
-    Low-level problems are specific and fine-grained technical, or methodological issues within one or more of the five components ('DataLoadSpec', 'FeatureEng', 'Model', 'Ensemble', 'Workflow') in the implementation.
-    {% endif %}
+    Low-level problems are specific and fine-grained technical, or methodological issues within the implementation.

     ### Specification
     {{ problem_spec }}
@@ -54,10 +52,10 @@ feedback_problem:
     # Scenario Description
     {{ scenario_desc }}

-    Here's the former SOTA experiments and their feedbacks:
+    # Previous SOTA Experiments and Feedbacks:
     {{ sota_exp_and_feedback_list_desc }}

-    Also, here's the former failed experiments and their feedbacks:
+    # Previous Failed Experiments and Feedbacks:
     {{ failed_exp_and_feedback_list_desc }}

     # Current SOTA Implementation
@@ -66,8 +64,8 @@ feedback_problem:
 hypothesis_gen:
   system: |-
     You are a Kaggle Grandmaster and expert ML engineer with deep expertise in statistics, machine learning, and competition optimization.
-    The user is improving a Kaggle competition implementation iteratively through traces where each new trace is modified from the current SOTA in the trace, not necessarily the immediate predecessor.
-    You will be given a competition scenario, trace history description, the current SOTA implementation, and a list of identified problems.
+    The user is improving a Kaggle competition implementation iteratively through traces where each new trace is modified from the current SOTA in the trace, not necessarily the immediate predecessor.
+    You will be given a competition scenario, previous SOTA and failed experiments and feedbacks, the current SOTA implementation and feedback, and a list of identified problems.
     Your role involves two tasks:
     1. **Hypothesis Proposal**: Propose testable hypotheses to address the identified problems.
     2. **Hypothesis Evaluation**: Evaluate the proposed hypotheses across multiple dimensions.
@@ -82,11 +80,21 @@ hypothesis_gen:
     Each hypothesis should focus on the whole pipeline.
     {% endif %}

+    ## Hypothesis Guidelines
+    Here are guidelines to aid your hypothesis proposal. You don't need to answer all the questions.
+    1. Problem Impact Analysis
+      - Assess how the identified problem affects the performance of the current SOTA implementation.
+    2. Lessons from Previous Experiments
+      - For persistent problem, analyze why previous experiments failed on this problem.
+      - Review why previous experiments failed to address the problem. Identify patterns, overlooked factors, or misaligned assumptions.
+      - Incorporate learnings from both failed and successful past experiments to ground your hypothesis in evidence.
+    3. Actionable Changes
+      - If the problem relates to time/memory constraints, suggest smaller model sizes or alternative algorithms with reduced complexity.
+      - If the problem involves underperforming models, propose removing or replacing models with significantly worse performance.
+      - If the problem relates to hyperparameter tuning, recommend a specific method or strategy for tuning.
+
     ## Hypothesis Specification
-    1. The hypothesis should be precise, testable, and directly actionable. Avoid general or vague statements. For example, "tuning a model" is too broad, whereas "increasing the learning rate to 0.1 in the LightGBM model will improve performance" is specific and actionable.
-    2. Each hypothesis should focus on a single direction per experiment. Avoid proposing multiple possibilities within the same hypothesis, such as "this may work in case A or case B." Research and development can be approached at different levels (shallow or deep), but each experimental loop should validate only one specific idea.
-    3. The hypothesis should based on current SOTA solution. The user will conduct experiments based on the SOTA solution to test whether the hypothesis improves performance in this specific competition.
-    4. For problems which you think are covered by the current SOTA implementation or by the former hypothesis, you should ignore that problem and not include it in your response. But you should not respond an empty hypothesis list.
+    {{ hypothesis_spec }}


     # Task 2: Hypothesis Evaluation
@@ -96,22 +104,21 @@ hypothesis_gen:
     Please score the proposed hypothesis from 1 to 10 for each of the following dimensions (where 1 means lowest and 10 means highest):
     1. Problem-Hypothesis Alignment: How well the hypothesis addresses the identified problem.
     2. Expected Impact: The estimated improvement after applying the hypothesis to current SOTA implementation.
-    3. Novelty: Degree of innovation compared to previous attempts.
+    3. Novelty: Degree of innovation compared to previous attempts. If the proposed hypothesis is very similar to previous experiments' hypothesis, assign low novelty score.
     4. Feasibility: The ease of implementing the proposed hypothesis in the current SOTA implementation.
     5. Risk-Reward Balance: The exploration-exploitation balance of the proposed hypothesis.

     ## Final Output Format in JSON Schema:
     {{ hypothesis_output_format }}

-
   user: |-
     # Scenario Description
     {{ scenario_desc }}

-    Here's the former SOTA experiments and their feedbacks:
+    # Previous SOTA Experiments and Feedbacks:
     {{ sota_exp_and_feedback_list_desc }}

-    Also, here's the former failed experiments and their feedbacks:
+    # Previous Failed Experiments and Feedbacks:
     {{ failed_exp_and_feedback_list_desc }}

     # Current SOTA Implementation
@@ -133,7 +140,13 @@ task_gen:
     ## Specification
     {{ task_specification }}

-    ## [Partial Response Format 1] Task Output Format
+    ## Task Design Guidelines
+    The task should be concise with several steps each only in a few sentences.
+    DON'T repeat the details which has already included in the SOTA code. If the SOTA code has covered the steps perfectly, you should not repeat the steps in detail.
+    You SHOULD NOT write any code in the task description.
+
+
+    ## [Partial Response Format 1] Task Output Format:
     {{ task_output_format }}

     {% if workflow_check %}
@@ -163,36 +176,35 @@ specification:
   problem: |-
     1. The problem should be specific and fine-grained. Avoid general or vague statements.
     2. The problem should technical or methodological. Focus on design and implementation flaws, not runtime errors.
+
   hypothesis: |-
     1. The hypothesis should be precise, testable, and directly actionable. Avoid general or vague statements. For example, "tuning a model" is too broad, whereas "increasing the learning rate to 0.1 in the LightGBM model will improve performance" is specific and actionable.
     2. Each hypothesis should focus on a single direction per experiment. Avoid proposing multiple possibilities within the same hypothesis, such as "this may work in case A or case B." Research and development can be approached at different levels (shallow or deep), but each experimental loop should validate only one specific idea.
     3. The hypothesis should based on current SOTA solution. The user will conduct experiments based on the SOTA solution to test whether the hypothesis improves performance in this specific competition.

-
 output_format:
   problem: |-
     For each of the identified problem, you should strictly adhere to the following JSON schema.
     Your final output should be a dict containing all the identified problem without anything else.
     Please respond at most five problems considering the most valuable and recently not explored.
     {
         "problem name 1": {
-            "problem": "Description of the first issue",
-            "reason": "Brief explanation of why this is a problem, based on the feedback or inferred from provided materials."
+            "problem": "Description of the first issue in no more than three sentences.",
+            "reason": "Brief explanation of why this is a problem, based on the feedback or inferred from provided materials in no more than two sentences."
         },
         "problem name 2": {
-            "problem": "Description of the second issue",
-            "reason": "Brief explanation of why this is a problem, based on the feedback or inferred from provided materials."
+            "problem": "Description of the second issue in no more than three sentences.",
+            "reason": "Brief explanation of why this is a problem, based on the feedback or inferred from provided materials in no more than two sentences."
         }
     }
   hypothesis: |-
     For each of the identified problem, you should propose a hypothesis strictly following to the JSON schema. Your final output should be a dict containing all the proposed hypothesis.
     {
         "problem name 1": {
-            "observation": "The observation of the given scenario, data characteristics, or trace history.",
+            "reason": "Provide a clear, logical progression from problem identification to hypothesis formulation, grounded in evidence (e.g., trace history, domain principles, or competition constraints). Refer to the Hypothesis Guidelines for better understanding. Reason should be short with no more than two sentences.",
             {% if not pipeline %}"component": "The component name that the hypothesis focus on. Must be one of ('DataLoadSpec', 'FeatureEng', 'Model', 'Ensemble', 'Workflow').",
             {% else %}"component": "The component name that the hypothesis focus on. Must be 'Pipeline'.",
             {% endif %}
-            "reason": "A brief explanation, also in one or two sentences, outlining the rationale behind the hypothesis. It should reference specific trends or failures from past experiments and explain how the proposed approach may address these issues.",
             "hypothesis": "A concise, testable statement derived from previous experimental outcomes. Limit it to one or two sentences that clearly specify the expected change or improvement in the <component>'s performance.",
             "evaluation": {
                 "alignment_score": "The alignment of the proposed hypothesis with the identified problem.",
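The `{% if not pipeline %}` switches in the prompt file are Jinja-style conditionals; rdagent renders its templates through a `T(...).r(...)` helper, as the second file's diff shows. A minimal sketch of the toggling behavior, using plain `jinja2` rather than rdagent's helper and an invented snippet text:

```python
# Illustration only: how a boolean `pipeline` flag flips the rendered
# instruction in a Jinja-style template. The snippet text is invented,
# not copied from prompts_v2.yaml.
from jinja2 import Template

snippet = (
    "{% if not pipeline %}component must be one of the five components"
    "{% else %}component must be 'Pipeline'{% endif %}"
)

# pipeline=False takes the first branch, pipeline=True the second.
print(Template(snippet).render(pipeline=False))  # component must be one of the five components
print(Template(snippet).render(pipeline=True))   # component must be 'Pipeline'
```

Note that the commit also stops passing `pipeline=` into the `scenario_problem.system` template, consistent with removing the `{% if not pipeline %}` block from the feedback-problem prompt.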

rdagent/scenarios/data_science/proposal/exp_gen/proposal.py

Lines changed: 25 additions & 9 deletions

@@ -11,7 +11,7 @@
 from rdagent.components.coder.data_science.raw_data_loader.exp import DataLoaderTask
 from rdagent.components.coder.data_science.workflow.exp import WorkflowTask
 from rdagent.core.proposal import ExpGen
-from rdagent.oai.llm_utils import APIBackend
+from rdagent.oai.llm_utils import APIBackend, md5_hash
 from rdagent.scenarios.data_science.experiment.experiment import DSExperiment
 from rdagent.scenarios.data_science.proposal.exp_gen.base import DSHypothesis, DSTrace
 from rdagent.utils.agent.tpl import T
@@ -268,7 +268,6 @@ def identify_feedback_problem(
         sys_prompt = T(".prompts_v2:scenario_problem.system").r(
             problem_spec=T(".prompts_v2:specification.problem").r(),
             problem_output_format=T(".prompts_v2:output_format.problem").r(),
-            pipeline=pipeline,
         )
         user_prompt = T(".prompts_v2:feedback_problem.user").r(
             scenario_desc=scenario_desc,
@@ -320,13 +319,30 @@ def hypothesis_rank(self, hypothesis_dict: dict, problem_dict: dict, pipeline: b
         if pipeline:
             problem_dict = {k: v for k, v in hypothesis_dict.items() if v.get("component", "") == "Pipeline"}

-        max_score_problem_name = (
-            pd.DataFrame(
-                {problem_name: hypothesis_dict[problem_name]["evaluation"] for problem_name in hypothesis_dict}
-            )
-            .sum()
-            .idxmax(axis=0)
-        )
+        weights = {
+            "alignment_score": 0.2,
+            "impact_score": 0.4,
+            "novelty_score": 0.2,
+            "feasibility_score": 0.1,
+            "risk_reward_balance_score": 0.1,
+        }
+        scores = pd.DataFrame(
+            {
+                problem_name: {
+                    score_key: hypothesis_dict[problem_name]["evaluation"].get(score_key, 0) * weight
+                    for score_key, weight in weights.items()
+                }
+                for problem_name in hypothesis_dict
+            }
+        )
+        scores_sorted = scores.sum().sort_values(ascending=False)
+        if len(scores_sorted) > 5:
+            scores_sorted = scores_sorted[: len(scores_sorted) // 2]
+
+        reproducible_int = int.from_bytes(bytes.fromhex(md5_hash(scores_sorted.to_string())), byteorder="big") % len(
+            scores_sorted
+        )
+        max_score_problem_name = scores_sorted.index[reproducible_int]
         problem = problem_dict.get(max_score_problem_name, {}).get("problem", "Problem not provided")

         return DSHypothesis(
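The ranking change above can be sketched as a standalone function: weight each evaluation dimension, sum per problem, shortlist when there are many candidates, then make a deterministic pseudo-random pick by hashing the score table. This is an illustrative reconstruction, not the repository's code: `md5_hash` is re-implemented with `hashlib` on the assumption that rdagent's helper returns a hex MD5 digest, and the example problem names and scores are invented.

```python
# Sketch of the weighted, "random but reproducible" hypothesis ranking.
import hashlib

import pandas as pd

WEIGHTS = {
    "alignment_score": 0.2,
    "impact_score": 0.4,
    "novelty_score": 0.2,
    "feasibility_score": 0.1,
    "risk_reward_balance_score": 0.1,
}


def md5_hash(text: str) -> str:
    """Hex MD5 digest (assumed stand-in for rdagent.oai.llm_utils.md5_hash)."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def rank_hypotheses(hypothesis_dict: dict) -> str:
    """Return one problem name chosen from the weighted-score shortlist."""
    # One column per problem, one row per weighted evaluation dimension.
    scores = pd.DataFrame(
        {
            name: {key: hyp["evaluation"].get(key, 0) * w for key, w in WEIGHTS.items()}
            for name, hyp in hypothesis_dict.items()
        }
    )
    scores_sorted = scores.sum().sort_values(ascending=False)
    # With more than five candidates, only the top half stays eligible.
    if len(scores_sorted) > 5:
        scores_sorted = scores_sorted[: len(scores_sorted) // 2]
    # Hash the rendered score table: identical inputs always yield the
    # same index, so the "random" choice is reproducible across runs.
    digest = md5_hash(scores_sorted.to_string())
    idx = int.from_bytes(bytes.fromhex(digest), byteorder="big") % len(scores_sorted)
    return scores_sorted.index[idx]


example = {
    "overfitting": {"evaluation": {"alignment_score": 8, "impact_score": 9,
                                   "novelty_score": 7, "feasibility_score": 6,
                                   "risk_reward_balance_score": 7}},
    "weak_ensemble": {"evaluation": {"alignment_score": 5, "impact_score": 4,
                                     "novelty_score": 3, "feasibility_score": 9,
                                     "risk_reward_balance_score": 6}},
}
print(rank_hypotheses(example))  # same name on every run with these inputs
```

Compared with the old `.sum().idxmax(axis=0)`, which always took the single top scorer, this keeps some exploration across the shortlist while staying deterministic for a given trace, which matters for reproducing experiment runs.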
