
Commit c077b82

RolandMinruiXu and peteryang1 authored
feat: add hypothesis guidelines and rule-based ranking (#746)
* 1. add hypothesis guidelines 2. add weighted scoring
* fix CI & speed up exp_gen
* random but reproducible choice on hypothesis

Co-authored-by: Xu <[email protected]>
Co-authored-by: Xu Yang <[email protected]>
1 parent 270ff7c commit c077b82

File tree: 2 files changed, +61 −33 lines


rdagent/scenarios/data_science/proposal/exp_gen/prompts_v2.yaml

Lines changed: 36 additions & 24 deletions

@@ -35,14 +35,12 @@ feedback_problem:
   system: |-
     You are a Kaggle Grandmaster and expert ML engineer with deep expertise in statistics, machine learning, and competition optimization.
     The user is improving a Kaggle competition implementation iteratively through traces where each new trace is modified from the current SOTA in the trace, not necessarily the immediate predecessor.
-    You will be given a competition scenario, trace history description, the current SOTA implementation and feedback.
+    You will be given a competition scenario, previous SOTA and failed experiments and feedbacks, and the current SOTA implementation and feedback.
     Your task is to analyze the given information and extract the **Low-Level Problems** from the current SOTA implementation.

-    {% if not pipeline %}
     ## Low-Level Problems
     ### Definition
-    Low-level problems are specific and fine-grained technical, or methodological issues within one or more of the five components ('DataLoadSpec', 'FeatureEng', 'Model', 'Ensemble', 'Workflow') in the implementation.
-    {% endif %}
+    Low-level problems are specific and fine-grained technical, or methodological issues within the implementation.

     ### Specification
     {{ problem_spec }}
@@ -54,10 +52,10 @@ feedback_problem:
     # Scenario Description
     {{ scenario_desc }}

-    Here's the former SOTA experiments and their feedbacks:
+    # Previous SOTA Experiments and Feedbacks:
     {{ sota_exp_and_feedback_list_desc }}

-    Also, here's the former failed experiments and their feedbacks:
+    # Previous Failed Experiments and Feedbacks:
     {{ failed_exp_and_feedback_list_desc }}

     # Current SOTA Implementation
@@ -66,8 +64,8 @@ feedback_problem:
 hypothesis_gen:
   system: |-
     You are a Kaggle Grandmaster and expert ML engineer with deep expertise in statistics, machine learning, and competition optimization.
-    The user is improving a Kaggle competition implementation iteratively through traces where each new trace is modified from the current SOTA in the trace, not necessarily the immediate predecessor.
-    You will be given a competition scenario, trace history description, the current SOTA implementation, and a list of identified problems.
+    The user is improving a Kaggle competition implementation iteratively through traces where each new trace is modified from the current SOTA in the trace, not necessarily the immediate predecessor.
+    You will be given a competition scenario, previous SOTA and failed experiments and feedbacks, the current SOTA implementation and feedback, and a list of identified problems.
     Your role involves two tasks:
     1. **Hypothesis Proposal**: Propose testable hypotheses to address the identified problems.
     2. **Hypothesis Evaluation**: Evaluate the proposed hypotheses across multiple dimensions.
@@ -82,11 +80,21 @@ hypothesis_gen:
     Each hypothesis should focus on the whole pipeline.
     {% endif %}

+    ## Hypothesis Guidelines
+    Here are guidelines to aid your hypothesis proposal. You don't need to answer all the questions.
+    1. Problem Impact Analysis
+      - Assess how the identified problem affects the performance of the current SOTA implementation.
+    2. Lessons from Previous Experiments
+      - For persistent problem, analyze why previous experiments failed on this problem.
+      - Review why previous experiments failed to address the problem. Identify patterns, overlooked factors, or misaligned assumptions.
+      - Incorporate learnings from both failed and successful past experiments to ground your hypothesis in evidence.
+    3. Actionable Changes
+      - If the problem relates to time/memory constraints, suggest smaller model sizes or alternative algorithms with reduced complexity.
+      - If the problem involves underperforming models, propose removing or replacing models with significantly worse performance.
+      - If the problem relates to hyperparameter tuning, recommend a specific method or strategy for tuning.
+
     ## Hypothesis Specification
-    1. The hypothesis should be precise, testable, and directly actionable. Avoid general or vague statements. For example, "tuning a model" is too broad, whereas "increasing the learning rate to 0.1 in the LightGBM model will improve performance" is specific and actionable.
-    2. Each hypothesis should focus on a single direction per experiment. Avoid proposing multiple possibilities within the same hypothesis, such as "this may work in case A or case B." Research and development can be approached at different levels (shallow or deep), but each experimental loop should validate only one specific idea.
-    3. The hypothesis should based on current SOTA solution. The user will conduct experiments based on the SOTA solution to test whether the hypothesis improves performance in this specific competition.
-    4. For problems which you think are covered by the current SOTA implementation or by the former hypothesis, you should ignore that problem and not include it in your response. But you should not respond an empty hypothesis list.
+    {{ hypothesis_spec }}


     # Task 2: Hypothesis Evaluation
@@ -96,22 +104,21 @@ hypothesis_gen:
     Please score the proposed hypothesis from 1 to 10 for each of the following dimensions (where 1 means lowest and 10 means highest):
     1. Problem-Hypothesis Alignment: How well the hypothesis addresses the identified problem.
     2. Expected Impact: The estimated improvement after applying the hypothesis to current SOTA implementation.
-    3. Novelty: Degree of innovation compared to previous attempts.
+    3. Novelty: Degree of innovation compared to previous attempts. If the proposed hypothesis is very similar to previous experiments' hypothesis, assign low novelty score.
     4. Feasibility: The ease of implementing the proposed hypothesis in the current SOTA implementation.
     5. Risk-Reward Balance: The exploration-exploitation balance of the proposed hypothesis.

     ## Final Output Format in JSON Schema:
     {{ hypothesis_output_format }}

-
   user: |-
     # Scenario Description
     {{ scenario_desc }}

-    Here's the former SOTA experiments and their feedbacks:
+    # Previous SOTA Experiments and Feedbacks:
     {{ sota_exp_and_feedback_list_desc }}

-    Also, here's the former failed experiments and their feedbacks:
+    # Previous Failed Experiments and Feedbacks:
     {{ failed_exp_and_feedback_list_desc }}

     # Current SOTA Implementation
@@ -133,7 +140,13 @@ task_gen:
     ## Specification
     {{ task_specification }}

-    ## [Partial Response Format 1] Task Output Format
+    ## Task Design Guidelines
+    The task should be concise with several steps each only in a few sentences.
+    DON'T repeat the details which has already included in the SOTA code. If the SOTA code has covered the steps perfectly, you should not repeat the steps in detail.
+    You SHOULD NOT write any code in the task description.
+
+
+    ## [Partial Response Format 1] Task Output Format:
     {{ task_output_format }}

     {% if workflow_check %}
@@ -163,36 +176,35 @@ specification:
   problem: |-
     1. The problem should be specific and fine-grained. Avoid general or vague statements.
     2. The problem should technical or methodological. Focus on design and implementation flaws, not runtime errors.
+
   hypothesis: |-
     1. The hypothesis should be precise, testable, and directly actionable. Avoid general or vague statements. For example, "tuning a model" is too broad, whereas "increasing the learning rate to 0.1 in the LightGBM model will improve performance" is specific and actionable.
     2. Each hypothesis should focus on a single direction per experiment. Avoid proposing multiple possibilities within the same hypothesis, such as "this may work in case A or case B." Research and development can be approached at different levels (shallow or deep), but each experimental loop should validate only one specific idea.
     3. The hypothesis should based on current SOTA solution. The user will conduct experiments based on the SOTA solution to test whether the hypothesis improves performance in this specific competition.

-
 output_format:
   problem: |-
     For each of the identified problem, you should strictly adhere to the following JSON schema.
     Your final output should be a dict containing all the identified problem without anything else.
     Please respond at most five problems considering the most valuable and recently not explored.
     {
         "problem name 1": {
-            "problem": "Description of the first issue",
-            "reason": "Brief explanation of why this is a problem, based on the feedback or inferred from provided materials."
+            "problem": "Description of the first issue in no more than three sentences.",
+            "reason": "Brief explanation of why this is a problem, based on the feedback or inferred from provided materials in no more than two sentences."
         },
         "problem name 2": {
-            "problem": "Description of the second issue",
-            "reason": "Brief explanation of why this is a problem, based on the feedback or inferred from provided materials."
+            "problem": "Description of the second issue in no more than three sentences.",
+            "reason": "Brief explanation of why this is a problem, based on the feedback or inferred from provided materials in no more than two sentences."
         }
     }
   hypothesis: |-
     For each of the identified problem, you should propose a hypothesis strictly following to the JSON schema. Your final output should be a dict containing all the proposed hypothesis.
     {
         "problem name 1": {
-            "observation": "The observation of the given scenario, data characteristics, or trace history.",
+            "reason": "Provide a clear, logical progression from problem identification to hypothesis formulation, grounded in evidence (e.g., trace history, domain principles, or competition constraints). Refer to the Hypothesis Guidelines for better understanding. Reason should be short with no more than two sentences.",
             {% if not pipeline %}"component": "The component name that the hypothesis focus on. Must be one of ('DataLoadSpec', 'FeatureEng', 'Model', 'Ensemble', 'Workflow').",
             {% else %}"component": "The component name that the hypothesis focus on. Must be 'Pipeline'.",
             {% endif %}
-            "reason": "A brief explanation, also in one or two sentences, outlining the rationale behind the hypothesis. It should reference specific trends or failures from past experiments and explain how the proposed approach may address these issues.",
             "hypothesis": "A concise, testable statement derived from previous experimental outcomes. Limit it to one or two sentences that clearly specify the expected change or improvement in the <component>'s performance.",
             "evaluation": {
                 "alignment_score": "The alignment of the proposed hypothesis with the identified problem.",
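The `{% if not pipeline %}` switches in the prompt file are Jinja-style conditionals; rdagent renders its templates through a `T(...).r(...)` helper, as the second file's diff shows. A minimal sketch of the toggling behavior, using plain `jinja2` rather than rdagent's helper and an invented snippet text:

```python
# Illustration only: how a boolean `pipeline` flag flips the rendered
# instruction in a Jinja-style template. The snippet text is invented,
# not copied from prompts_v2.yaml.
from jinja2 import Template

snippet = (
    "{% if not pipeline %}component must be one of the five components"
    "{% else %}component must be 'Pipeline'{% endif %}"
)

# pipeline=False takes the first branch, pipeline=True the second.
print(Template(snippet).render(pipeline=False))  # component must be one of the five components
print(Template(snippet).render(pipeline=True))   # component must be 'Pipeline'
```

Note that the commit also stops passing `pipeline=` into the `scenario_problem.system` template, consistent with removing the `{% if not pipeline %}` block from the feedback-problem prompt.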

rdagent/scenarios/data_science/proposal/exp_gen/proposal.py

Lines changed: 25 additions & 9 deletions

@@ -11,7 +11,7 @@
 from rdagent.components.coder.data_science.raw_data_loader.exp import DataLoaderTask
 from rdagent.components.coder.data_science.workflow.exp import WorkflowTask
 from rdagent.core.proposal import ExpGen
-from rdagent.oai.llm_utils import APIBackend
+from rdagent.oai.llm_utils import APIBackend, md5_hash
 from rdagent.scenarios.data_science.experiment.experiment import DSExperiment
 from rdagent.scenarios.data_science.proposal.exp_gen.base import DSHypothesis, DSTrace
 from rdagent.utils.agent.tpl import T
@@ -268,7 +268,6 @@ def identify_feedback_problem(
         sys_prompt = T(".prompts_v2:scenario_problem.system").r(
             problem_spec=T(".prompts_v2:specification.problem").r(),
             problem_output_format=T(".prompts_v2:output_format.problem").r(),
-            pipeline=pipeline,
         )
         user_prompt = T(".prompts_v2:feedback_problem.user").r(
             scenario_desc=scenario_desc,
@@ -320,13 +319,30 @@ def hypothesis_rank(self, hypothesis_dict: dict, problem_dict: dict, pipeline: b
         if pipeline:
             problem_dict = {k: v for k, v in hypothesis_dict.items() if v.get("component", "") == "Pipeline"}

-        max_score_problem_name = (
-            pd.DataFrame(
-                {problem_name: hypothesis_dict[problem_name]["evaluation"] for problem_name in hypothesis_dict}
-            )
-            .sum()
-            .idxmax(axis=0)
-        )
+        weights = {
+            "alignment_score": 0.2,
+            "impact_score": 0.4,
+            "novelty_score": 0.2,
+            "feasibility_score": 0.1,
+            "risk_reward_balance_score": 0.1,
+        }
+        scores = pd.DataFrame(
+            {
+                problem_name: {
+                    score_key: hypothesis_dict[problem_name]["evaluation"].get(score_key, 0) * weight
+                    for score_key, weight in weights.items()
+                }
+                for problem_name in hypothesis_dict
+            }
+        )
+        scores_sorted = scores.sum().sort_values(ascending=False)
+        if len(scores_sorted) > 5:
+            scores_sorted = scores_sorted[: len(scores_sorted) // 2]
+
+        reproducible_int = int.from_bytes(bytes.fromhex(md5_hash(scores_sorted.to_string())), byteorder="big") % len(
+            scores_sorted
+        )
+        max_score_problem_name = scores_sorted.index[reproducible_int]
         problem = problem_dict.get(max_score_problem_name, {}).get("problem", "Problem not provided")

         return DSHypothesis(
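The ranking change above can be sketched as a standalone function: weight each evaluation dimension, sum per problem, shortlist when there are many candidates, then make a deterministic pseudo-random pick by hashing the score table. This is an illustrative reconstruction, not the repository's code: `md5_hash` is re-implemented with `hashlib` on the assumption that rdagent's helper returns a hex MD5 digest, and the example problem names and scores are invented.

```python
# Sketch of the weighted, "random but reproducible" hypothesis ranking.
import hashlib

import pandas as pd

WEIGHTS = {
    "alignment_score": 0.2,
    "impact_score": 0.4,
    "novelty_score": 0.2,
    "feasibility_score": 0.1,
    "risk_reward_balance_score": 0.1,
}


def md5_hash(text: str) -> str:
    """Hex MD5 digest (assumed stand-in for rdagent.oai.llm_utils.md5_hash)."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def rank_hypotheses(hypothesis_dict: dict) -> str:
    """Return one problem name chosen from the weighted-score shortlist."""
    # One column per problem, one row per weighted evaluation dimension.
    scores = pd.DataFrame(
        {
            name: {key: hyp["evaluation"].get(key, 0) * w for key, w in WEIGHTS.items()}
            for name, hyp in hypothesis_dict.items()
        }
    )
    scores_sorted = scores.sum().sort_values(ascending=False)
    # With more than five candidates, only the top half stays eligible.
    if len(scores_sorted) > 5:
        scores_sorted = scores_sorted[: len(scores_sorted) // 2]
    # Hash the rendered score table: identical inputs always yield the
    # same index, so the "random" choice is reproducible across runs.
    digest = md5_hash(scores_sorted.to_string())
    idx = int.from_bytes(bytes.fromhex(digest), byteorder="big") % len(scores_sorted)
    return scores_sorted.index[idx]


example = {
    "overfitting": {"evaluation": {"alignment_score": 8, "impact_score": 9,
                                   "novelty_score": 7, "feasibility_score": 6,
                                   "risk_reward_balance_score": 7}},
    "weak_ensemble": {"evaluation": {"alignment_score": 5, "impact_score": 4,
                                     "novelty_score": 3, "feasibility_score": 9,
                                     "risk_reward_balance_score": 6}},
}
print(rank_hypotheses(example))  # same name on every run with these inputs
```

Compared with the old `.sum().idxmax(axis=0)`, which always took the single top scorer, this keeps some exploration across the shortlist while staying deterministic for a given trace, which matters for reproducing experiment runs.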
