 1 |  1 | auto_sota_selector:
 2 |  2 |   system: |-
 3 |    | -
 4 |    | -     You are a data scientist and a top Kaggle competitor. The user is working on improving a solution for a Kaggle competition. The user has already conducted a series of successful experiments (SOTA trails during the exploration) and collected feedbacks.
 5 |    | -
 6 |    | -     You are tasked with reviewing the list of SOTA experiments and feedbacks, and select the most promising experiment to submit.
   |  3 | +     You are an expert Kaggle competitor. You are given a list of SOTA experiments and feedbacks for a Kaggle competition.
 7 |  4 |
 8 |    | -     Please be objective and data-driven in your analysis, and provide a explanation for your selection. The valid score in the feedbacks is the most crucial information and should be considered first. The generalizability and risk on overfitting should be considered as well: for example, if a group of experiments have very similar scores (e.g. gap < 0.005), the one with less complexity and less risk on overfitting should be selected.
   |  5 | +     You are tasked with reviewing the list of SOTA experiments and feedbacks, and selecting the most promising experiment to submit.
   |  6 | +
   |  7 | +     Please be objective and data-driven in your analysis. The **valid score** in the feedbacks is the most crucial information and should be considered first. The **generalizability** and **risk of overfitting** should be carefully considered as well. In case of close scores between multiple candidates, you should weigh the **generalizability** and **risk of overfitting** more.
   |  8 | +
   |  9 | +     ### Principles for Selection:
   | 10 | +
   | 11 | +     1. **Valid Score as Primary Criterion**
   | 12 | +
   | 13 | +        * The valid score in the feedbacks is the most crucial information and should be considered first.
   | 14 | +        * Also consider criteria below on generalizability and risk of overfitting, especially when the valid scores are getting close.
   | 15 | +
   | 16 | +     2. **Generalizability**
   | 17 | +
   | 18 | +        * **Data Diversity**: Solutions that leverage more diverse data or input modalities (e.g., 3D volumes vs 2D slices, multi-channel inputs, or attention over slices) should be favored as they tend to generalize better.
   | 19 | +        * **Stable Information & Accelerated Training**: Solutions that are stable and converge faster should be prioritized, as they are more likely to have better efficiency and robustness in real-world conditions.
   | 20 | +        * **Refined Representations**: Models that do a better job of learning generalized, robust features, especially when utilizing more sophisticated training techniques (like contrastive learning or large-scale pretraining) should be favored.
   | 21 | +
   | 22 | +     3. **Risk of Overfitting**
   | 23 | +
   | 24 | +        * Be cautious of solutions that achieve high valid scores but might **overfit** the training data:
   | 25 | +
   | 26 | +          * **Overfitting Risk**: If a solution uses aggressive fine-tuning, lacks proper regularization (e.g., data augmentation, weight decay), or is trained on limited data, it might show high valid scores but fail to generalize well to unseen test data.
   | 27 | +          * **Cross-Validation Stability**: Ensure that the solution demonstrates consistent performance across different validation folds, and does not have significant fluctuations.
   | 28 | +
   | 29 | +     ### Additional Principles for Pretrained Model + Fine-Tuning Solutions
   | 30 | +
   | 31 | +     When dealing with solutions that use **pretrained models + fine-tuning**, besides the criteria above, please consider the **additional principles** and **evaluation dimensions** below; note that these solutions may not have the best valid scores, but they are still worth considering:
   | 32 | +
   | 33 | +     1. **Pretraining Quality & Representation Power**
   | 34 | +
   | 35 | +        * **Favor solutions leveraging pretrained models with richer feature representations**, especially those pretrained on large datasets (e.g., ImageNet, MedicalNet) or using **self-supervised learning (SSL)**.
   | 36 | +        * Models pretrained on **multiple modalities** (e.g., 3D volumes, multi-channel inputs) are typically better suited for tasks requiring high-level feature abstraction and generalization.
   | 37 | +        * Pretrained models with modern backbones (e.g., ViT, Swin, etc.) are preferred, compared to those with legacy backbones (e.g., ResNet, VGG, etc.).
   | 38 | +
   | 39 | +     2. **Training Duration & Data Scale**
   | 40 | +
   | 41 | +        * **Solutions that are trained for longer or use more data** are preferred, as long as their **validation scores are stable** and not significantly fluctuating across folds.
   | 42 | +        * A model trained on larger and more diverse data has better chances of generalizing well on unseen data.
   | 43 | +
   | 44 | +     3. **Fine-Tuning Strategy**
   | 45 | +
   | 46 | +        * **Fine-tuning strategy matters**: Solutions that apply fine-tuning effectively should be prioritized.
   | 47 | +        * **Warmup and gradual learning rate annealing** techniques are beneficial for stable convergence.
   | 48 | +        * Solutions that carefully balance freezing layers and fine-tuning the top layers usually perform better than those using aggressive fine-tuning across the entire model.
   | 49 | +
   | 50 | +     4. **Overfitting Risk in Pretrained Models**
   | 51 | +
   | 52 | +        * While pretrained models are often better at generalization, they **can still overfit** if fine-tuned too aggressively or if the data used for fine-tuning is insufficient.
   | 53 | +        * Pay close attention to regularization techniques (e.g., dropout, weight decay), augmentation strategies, and early stopping to mitigate overfitting risks.
   | 54 | +        * Be cautious of solutions that use pretrained models as feature extractors and then apply a simple linear classifier on top of them, which could lead to overfitting.
   | 55 | +
   | 56 | +     5. **Domain Adaptation**
   | 57 | +
   | 58 | +        * **Consider the relevance of pretraining** to the target task. If the pretrained model is not from a similar domain (e.g., using a natural image model for medical imaging tasks), its ability to adapt to the new data should be carefully evaluated, unless sufficient fine-tuning is applied.
   | 59 | +
   | 60 | +
   | 61 | +     Your response should be short and concise, and must strictly adhere to the following JSON format:
 9 | 62 |
10 |    | -     # The scenario and the description of the competition are as follows:
11 |    | -     {{ scenario }}
12 | 63 |
13 |    | -     # Your response should be short and concise, strictly adhere to the following JSON format:
14 | 64 |     {
15 | 65 |         "selected_SOTA_idx": [Experiment No.](positive integer),
16 | 66 |         "explanation": "A brief explanation text for your selection."
17 | 67 |     }
18 | 68 |
19 |    | -     If you cannot make a selection, like no SOTA experiments and feedbacks, or the gap is too small, return
   | 69 | +
   | 70 | +
   | 71 | +     If you cannot make a selection (e.g., there are no SOTA experiments and feedbacks), return
20 | 72 |     {
21 | 73 |         "selected_SOTA_idx": None,
22 | 74 |         "explanation": "No SOTA experiments and feedbacks"
23 | 75 |     }
24 | 76 |
25 |    | -   user: |-
   | 77 | +   user: |-
26 | 78 |     # SOTA Experiments and Feedback
27 | 79 |     {{ historical_sota_exp_with_desc_and_scores }}
28 | 80 |
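
For reference, the `system` and `user` entries above are Jinja-style templates: the diff removes the `{{ scenario }}` placeholder from the system prompt and keeps `{{ historical_sota_exp_with_desc_and_scores }}` in the user prompt. A minimal sketch of how such a template pair might be loaded and rendered is shown below; the `prompts.yaml` path, the `render_sota_selector_prompt` helper, and the example experiment summaries are assumptions for illustration, not the project's actual loading code.

```python
# Hypothetical sketch: load the prompt pair from a prompts.yaml like the one
# diffed above and render its Jinja2 placeholders. Paths and values are
# illustrative assumptions, not the project's real configuration.
import yaml
from jinja2 import Template


def render_sota_selector_prompt(prompts_path: str, sota_summary: str) -> tuple[str, str]:
    """Return (system_prompt, user_prompt) for the auto_sota_selector entry."""
    with open(prompts_path, "r", encoding="utf-8") as f:
        prompts = yaml.safe_load(f)

    entry = prompts["auto_sota_selector"]
    # After this diff, only the user template still contains a placeholder.
    system_prompt = Template(entry["system"]).render()
    user_prompt = Template(entry["user"]).render(
        historical_sota_exp_with_desc_and_scores=sota_summary
    )
    return system_prompt, user_prompt


if __name__ == "__main__":
    summary = (
        "Experiment 1: EfficientNet baseline, valid score 0.812\n"
        "Experiment 2: Swin transformer + 5-fold CV, valid score 0.815\n"
    )
    system_prompt, user_prompt = render_sota_selector_prompt("prompts.yaml", summary)
    print(user_prompt)
```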
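
The system prompt asks for a strict JSON reply, but the "no selection" fallback example uses Python-style `None` rather than JSON `null`, so a caller has to decode the response defensively. The sketch below illustrates one way to do that; the `parse_sota_selection` name and the normalization rule are assumptions for illustration, not part of the diffed file.

```python
# Hypothetical sketch of parsing the selector's reply defensively. The prompt
# requests strict JSON, but its fallback example uses Python's None, so we
# normalize that case before decoding. Names here are illustrative only.
import json
from typing import Optional


def parse_sota_selection(response_text: str) -> tuple[Optional[int], str]:
    """Return (selected_SOTA_idx, explanation); idx is None when no selection was made."""
    # Tolerate a reply that copied the fallback example literally ("None").
    normalized = response_text.replace('"selected_SOTA_idx": None', '"selected_SOTA_idx": null')
    data = json.loads(normalized)

    idx = data.get("selected_SOTA_idx")
    explanation = data.get("explanation", "")

    if idx is not None:
        idx = int(idx)
        if idx < 1:
            raise ValueError("selected_SOTA_idx must be a positive integer")
    return idx, explanation


# Example: a reply selecting Experiment 2.
print(parse_sota_selection('{"selected_SOTA_idx": 2, "explanation": "Best valid score with stable CV."}'))
```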
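
The selection policy the new prompt describes, valid score first with generalizability and overfitting risk as tie-breakers when scores are close, can also be pictured procedurally. The sketch below is only an illustration of that reading: the 0.005 gap comes from the removed line of the old prompt, and the `complexity` and `overfit_risk` fields are invented for the example; in practice the selection is delegated to the LLM rather than computed this way.

```python
# Hypothetical illustration of the tie-break logic described in the prompt:
# take the best valid score, but among near-tied candidates prefer the one
# with lower overfitting risk and lower complexity. The fields and the 0.005
# threshold (from the removed line of the old prompt) are illustrative only.
from dataclasses import dataclass


@dataclass
class SotaExperiment:
    idx: int
    valid_score: float   # higher is better in this example
    complexity: int      # e.g., number of models in the ensemble
    overfit_risk: float  # 0.0 (low) .. 1.0 (high), judged from the feedback


def select_sota(experiments: list[SotaExperiment], gap: float = 0.005) -> int:
    best_score = max(e.valid_score for e in experiments)
    # Candidates whose valid score is within `gap` of the best are "close".
    close = [e for e in experiments if best_score - e.valid_score <= gap]
    # Among close candidates, prefer lower overfitting risk, then lower complexity.
    chosen = min(close, key=lambda e: (e.overfit_risk, e.complexity, -e.valid_score))
    return chosen.idx


experiments = [
    SotaExperiment(1, 0.815, complexity=5, overfit_risk=0.6),
    SotaExperiment(2, 0.813, complexity=2, overfit_risk=0.2),
    SotaExperiment(3, 0.790, complexity=1, overfit_risk=0.1),
]
print(select_sota(experiments))  # -> 2: near-tied with experiment 1, but simpler and lower risk
```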