Commit 54b2491

debug
1 parent f9ad85a commit 54b2491

File tree: 2 files changed, +43 −45 lines


rdagent/scenarios/data_science/proposal/exp_gen/prompts_v2.yaml

Lines changed: 41 additions & 43 deletions
@@ -261,48 +261,46 @@ hypothesis_gen:
     {{ problems }}
 
 hypothesis_select:
-  system: |-
-    You are a Kaggle Grandmaster with deep expertise in model evaluation and decision making.
-    Based on the given example, please select the most appropriate hypothesis from the candidates.
-    These hypotheses are sourced from `model/data/feature/workflow`. Choose the one that best matches the intent or logic of the prompt.
-    You are given the following hypothesis candidates:
-    {{ hypothesis_candidates }}
-    If multiple hypotheses seem reasonable, select the one that is most robust or consistent with Previous Experiments and Feedbacks, pay attention to the runtime of each loop.
-
-    If you believe that previous methods have reached their limits and the current setting only involves a single model, feel free to propose an ensemble solution. However, you **must** carefully allocate the training and runtime budget to ensure the **ensemble logic is well-executed and evaluated**, without compromising the performance of the previous models.
-
-    ### 1.Ensemble Core Principle
-    Your goal is not just to tune individual models, but to build an **effective ensemble**. Make design decisions that lead to **strong overall ensemble performance**, not just strong base models.
-    Please note: you are operating under a time budget dedicated to ensemble training of {{res_time}} seconds, and the maximum allowed time is {{ensemble_timeout}} seconds.
-    Assume training a single model takes about 1 hour. For example, if you have roughly twice that time left, you can try training multiple models with different random seeds or data splits to reuse time effectively.
-    If you have more time, you might consider training a multi-fold ensemble. Use your judgment to decide how many folds or seeds fit within your remaining time budget.
-
-    ### 2. Training-Time Resource Allocation
-    - You may use **multiple folds** if justified, but you must **ensure the full pipeline completes within runtime limits**.
-    - Avoid reducing base model quality just to save time. For example:
-      - Freezing large parts of the model (e.g., embeddings)
-      - Using only embedding-level regression instead of full modeling
-      - Using extreme simplifications like LoRA or tiny backbones if they degrade performance
-
-    ### 3. Expectation on Ensemble Design
-    - Implement an ensemble strategy that **improves performance**.
-      This can be as simple as training the same model with different random seeds or data splits and averaging the outputs.
-      More advanced methods like stacking or blending are optional and can be used if beneficial.
-      Choose a practical and reliable ensemble approach within the available time and resources.
-    - Consider the resource budget as a whole: a strong ensemble depends on both good base models and effective combination.
-
-    ### 4. Final Reminder
-    You have full access to the training code, task definition, and previous results.
-    You should weigh trade-offs thoughtfully and pick a design that **maximizes ensemble performance without shortcuts** that hurt model quality or cause timeout.
-    - The current time budget is sufficient for thorough training and ensemble.
-    - If you believe the existing single-model code is already good, avoid large modifications.
-    - Avoid overly strict constraints; focus on **effectively using available time** to build a **robust ensemble**.
-
-
-    {% if hypothesis_output_format is not none %}
-    ## Final Output Format in JSON Schema:
-    {{ hypothesis_output_format }}
-    {% endif %}
+  system: |-
+    You are a Kaggle Grandmaster with deep expertise in model evaluation and decision making. Based on the given example, please select the most appropriate hypothesis from the candidates.
+    These hypotheses are sourced from `model/data/feature/workflow`. Choose the one that best matches the intent or logic of the prompt.
+    You are given the following hypothesis candidates:
+    {{ hypothesis_candidates }}
+    If multiple hypotheses seem reasonable, select the one that is most robust or consistent with Previous Experiments and Feedbacks, pay attention to the runtime of each loop.
+
+    If you believe that previous methods have reached their limits and the current setting only involves a single model, feel free to propose an ensemble solution. However, you **must** carefully allocate the training and runtime budget to ensure the **ensemble logic is well-executed and evaluated**, without compromising the performance of the previous models.
+
+    ### 1. Ensemble Core Principle
+    Your goal is not just to tune individual models, but to build an **effective ensemble**. Make design decisions that lead to **strong overall ensemble performance**, not just strong base models.
+    Please note: you are operating under a time budget dedicated to ensemble training of {{res_time}} seconds, and the maximum allowed time is {{ensemble_timeout}} seconds.
+    Assume training a single model takes about 1 hour. For example, if you have roughly twice that time left, you can try training multiple models with different random seeds or data splits to reuse time effectively.
+    If you have more time, you might consider training a multi-fold ensemble. Use your judgment to decide how many folds or seeds fit within your remaining time budget.
+
+    ### 2. Training-Time Resource Allocation
+    - You may use **multiple folds** if justified, but you must **ensure the full pipeline completes within runtime limits**.
+    - Avoid reducing base model quality just to save time. For example:
+      - Freezing large parts of the model (e.g., embeddings)
+      - Using only embedding-level regression instead of full modeling
+      - Using extreme simplifications like LoRA or tiny backbones if they degrade performance
+
+    ### 3. Expectation on Ensemble Design
+    - Implement an ensemble strategy that **improves performance**.
+      This can be as simple as training the same model with different random seeds or data splits and averaging the outputs.
+      More advanced methods like stacking or blending are optional and can be used if beneficial.
+      Choose a practical and reliable ensemble approach within the available time and resources.
+    - Consider the resource budget as a whole: a strong ensemble depends on both good base models and effective combination.
+
+    ### 4. Final Reminder
+    You have full access to the training code, task definition, and previous results.
+    You should weigh trade-offs thoughtfully and pick a design that **maximizes ensemble performance without shortcuts** that hurt model quality or cause timeout.
+    - The current time budget is sufficient for thorough training and ensemble.
+    - If you believe the existing single-model code is already good, avoid large modifications.
+    - Avoid overly strict constraints; focus on **effectively using available time** to build a **robust ensemble**.
+
+    {% if hypothesis_output_format is not none %}
+    ## Final Output Format in JSON Schema:
+    {{ hypothesis_output_format }}
+    {% endif %}
 
 hypothesis_select:
   user: |-
@@ -581,7 +579,7 @@ output_format:
     "problem name 2 (should be exactly same as the problem name provided)": 2, # The index which is same to the idea index provided in the input and must be integer.
     }
 
-  hypothesis_select: |-
+  hypothesis_select_format: |-
     Choose the best hypothesis from the provided hypothesis candidates {{ hypothesis_candidates }}.
     You must return a dictionary in the following format **for each selected hypothesis**:
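The prompt changes above keep the same ensemble guidance: average the same model trained under different seeds, and size the ensemble to the remaining `res_time` budget (assuming roughly one hour per model). That strategy can be sketched with toy stand-ins; `train_model`, `seeds_that_fit`, and `seed_ensemble` are invented for illustration and are not RD-Agent code:

```python
import random

def train_model(seed, data):
    """Hypothetical training run: returns a 'model' that predicts the
    data mean plus a small seed-dependent perturbation."""
    rng = random.Random(seed)
    bias = rng.uniform(-0.5, 0.5)
    mean = sum(data) / len(data)
    return lambda: mean + bias

def seeds_that_fit(res_time_s, per_model_s=3600):
    # Budget arithmetic from the prompt: ~1 hour per model, so
    # ~2 hours of remaining time allows 2 seeds.
    return max(1, int(res_time_s // per_model_s))

def seed_ensemble(data, n_seeds):
    # Train one model per seed and average their predictions --
    # the simplest ensemble the prompt describes.
    preds = [train_model(seed, data)() for seed in range(n_seeds)]
    return sum(preds) / len(preds)

n = seeds_that_fit(res_time_s=7200)  # -> 2
prediction = seed_ensemble([1.0, 2.0, 3.0], n_seeds=n)
```

Averaging cancels the seed-dependent noise, which is why the ensembled prediction sits closer to the true mean than a typical single seed.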

rdagent/scenarios/data_science/proposal/exp_gen/proposal.py

Lines changed: 2 additions & 2 deletions
@@ -751,7 +751,7 @@ def hypothesis_select_with_llm(self,
         hypothesis_candidates = hypothesis_candidates,
         res_time = res_time,
         ensemble_timeout = ensemble_timeout,
-        hypothesis_output_format = T(".prompts_v2:output_format.hypothesis_select").r(hypothesis_candidates = hypothesis_candidates)
+        hypothesis_output_format = T(".prompts_v2:output_format.hypothesis_select_format").r(hypothesis_candidates = hypothesis_candidates)
     )
 
     user_prompt = T(".prompts_v2:hypothesis_select.user").r(
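This hunk updates the call site to match the key renamed in the YAML (`hypothesis_select` → `hypothesis_select_format`). RD-Agent's `T(...).r(...)` helper is repo-specific, but the failure mode a stale path causes can be sketched with a minimal stand-in; `prompts` and `render` here are hypothetical, not the actual API:

```python
# Minimal stand-in for a dotted-path template lookup plus render.
prompts = {
    "output_format": {
        # key renamed in this commit: was "hypothesis_select"
        "hypothesis_select_format": "Choose the best hypothesis from {candidates}.",
    }
}

def render(path, **kwargs):
    node = prompts
    for part in path.split("."):
        node = node[part]  # raises KeyError if a caller still uses the old key
    return node.format(**kwargs)

text = render("output_format.hypothesis_select_format", candidates="[h1, h2]")
```

Renaming a template key without updating every lookup path would surface as a lookup error at render time, which is why the YAML and Python changes land in the same commit.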
@@ -988,7 +988,7 @@ def gen(
         timer=timer)
 
         if response_dict["component"] != "Ensemble":
-            new_hypothesis = DSHypothesis(component=hypothesis_dict[response_dict["hypothesis"]]["component"].get("component", "Model"),hypothesis=response_dict["hypothesis"])
+            new_hypothesis = DSHypothesis(component=hypothesis_dict[response_dict["hypothesis"]].get("component", "Model"),hypothesis=response_dict["hypothesis"])
         else:
             new_hypothesis = DSHypothesis(component=HypothesisComponent.Ensemble,hypothesis=response_dict["hypothesis"])
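The one-line change in `gen` removes an extra level of indexing, which suggests each `hypothesis_dict` value is a flat dict of fields. A reduced illustration of the shape involved (the dict contents below are invented for the example):

```python
# hypothesis_dict maps a hypothesis string to a flat dict of fields, so
# .get("component", ...) belongs on that dict directly.
hypothesis_dict = {
    "use target encoding": {"component": "FeatureEng"},
    "try a deeper backbone": {},  # no component recorded
}

# Before the fix, hypothesis_dict[h]["component"].get("component", "Model")
# indexed one level too deep: it raises KeyError when "component" is absent,
# and otherwise calls .get on a plain string (AttributeError).

def component_of(h):
    # After the fix: read the field off the flat dict, defaulting to "Model".
    return hypothesis_dict[h].get("component", "Model")
```

`dict.get` with a default makes the missing-field case a silent fallback instead of a crash, matching the `"Model"` default in the commit.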