Allow updating additional instructions from a file after the run has started

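The change in `weco/cli.py` re-reads the additional instructions before every suggest call, so edits to the instructions file take effect on the next step of a run that is already in progress. A minimal sketch of that pattern follows. `load_instructions` and `run_steps` are hypothetical stand-ins written for illustration; in particular, the "treat the value as a path if such a file exists, otherwise as literal text" behaviour is an assumption, not a statement of what `read_additional_instructions` does internally.

```python
import pathlib
from typing import Callable, List, Optional


def load_instructions(value: Optional[str]) -> Optional[str]:
    """Hypothetical helper: resolve instructions from a file path or a literal string."""
    if value is None:
        return None
    try:
        path = pathlib.Path(value)
        if path.is_file():
            # Assumption: a value naming an existing file means "use the file's contents".
            return path.read_text(encoding="utf-8").strip()
    except (OSError, ValueError):
        # Long or otherwise invalid path strings are treated as literal text.
        pass
    return value


def run_steps(
    value: Optional[str],
    steps: int,
    on_step: Optional[Callable[[int], None]] = None,
) -> List[Optional[str]]:
    """Hypothetical loop: re-read the instructions at the start of every step."""
    seen: List[Optional[str]] = []
    for step in range(1, steps):
        # Re-reading here is what lets a file edit made mid-run reach the next step.
        current = load_instructions(value)
        seen.append(current)
        if on_step is not None:
            on_step(step)  # stands in for the evaluate-and-suggest call
    return seen
```

Reading the file once before the loop would freeze the instructions for the whole run; reading inside the loop costs one small file read per step.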
ZhengyaoJiang · ZhengyaoJiang · commit a6079f6e3035 · 2025-04-19T17:16:04.000+01:00
diff --git a/examples/prompt/eval.py b/examples/prompt/eval.py
@@ -22,7 +22,7 @@
 TOTAL_SAMPLES = 30  # how many problems to load
 NUM_WORKERS = 30  # concurrent LLM calls
 LOG_EVERY = 5  # print progress after this many
-MODEL_TO_USE = "gpt-4.1-mini" # Define the model to use HERE
+MODEL_TO_USE = "gpt-4.1" # Define the model to use HERE
 TASK_TIMEOUT = 300  # seconds per LLM call
 # ---------------------------------------------------------------------
 
diff --git a/examples/prompt/optimize.py b/examples/prompt/optimize.py
@@ -22,7 +22,6 @@
 Solution:
 """
 
-# Modify the function signature to accept model_name
 def solve(problem: str, model_name: str) -> str:
     """Return the model's raw text answer for one problem using the specified model."""
     prompt = PROMPT_TEMPLATE.format(problem=problem)
diff --git a/examples/prompt/prompt_guide.md b/examples/prompt/prompt_guide.md
@@ -2,7 +2,7 @@
 
 ## 1. Goal
 
-Your objective is to modify the `PROMPT_TEMPLATE` string within the `optimize.py` file to improve the `accuracy` metric when solving AIME math problems. The modifications should leverage the capabilities of the target model, **GPT-4.1**.
+Your objective is to modify the `optimize.py` file to improve the `accuracy` metric when solving AIME math problems. The modifications should leverage the capabilities of the target model, **GPT-4.1**.
 
 ## 2. Files and Workflow
 
@@ -25,21 +25,21 @@ You are optimizing the prompt for `gpt-4.1`. Based on its characteristics, consi
 
 ## 4. Optimization Strategies (Focus on `PROMPT_TEMPLATE` in `optimize.py`)
 
-The primary goal is to enhance the model's reasoning process for these challenging math problems. Focus on **complex Chain-of-Thought (CoT)** designs within the `PROMPT_TEMPLATE`.
+The primary goal is to enhance the model's reasoning process for these challenging math problems. Focus on Chain-of-Thought (CoT) designs within the `PROMPT_TEMPLATE`.
 
 **Ideas to Explore:**
 You don't have to implement all of them, but the following ideas might be helpful:
-*   **Workflow Patterns:**
-    *  **Linear**: step-by-step thinking process could be a good starting point E.g., "1. Understand the problem constraints. 2. Identify relevant theorems/formulas. 3. Formulate a plan. 4. Execute calculations step-by-step. 5. Verify intermediate results. 6. State the final answer in the required format."
+*   **Workflow Patterns:** Try some of the following patterns:
+    *  **Linear**: A linear workflow, i.e. standard CoT. For example, consider the following thinking steps (you don't have to include all of them): "1. Understand the problem constraints. 2. Identify relevant theorems/formulas. 3. Formulate a plan. 4. Execute calculations step-by-step. 5. Verify intermediate results. 6. State the final answer in the required format."
     *  **List Candidates**: You can ask the model to propose a few solutions in a particular step and pick the best solution. You can potentially also set the criterias in the prompt.
-    *  **Code** Write pesudo code to define even more complex workflows with loops, conditional statement, or go to statement.
+    *  **Code**: Use pseudocode to define even more complex workflows with loops, conditional statements, or goto statements.
 *   **Other CoT Techniques:**
     *   Self-Correction/Reflection
     *   Plan Generation
     *   Debate, simulating multiple characters
     *   Tree of thought
-*   **Few-Shot Examples:** You *could* experiment with adding 1-2 high-quality AIME problem/solution examples directly into the `PROMPT_TEMPLATE` string (similar to how Weco attempted in one of the runs). Ensure the examples clearly show the desired reasoning style and the final `\boxed{XXX}` format. *Caution: This significantly increases prompt length and cost.*
+*   **Few-Shot Examples:** You *could* experiment with adding 1-2 high-quality AIME problem/solution examples directly into the `PROMPT_TEMPLATE` string (similar to how Weco attempted in one of the runs). Ensure the examples clearly show the desired reasoning style and the final `\boxed{XXX}` format.
 *   **Play with format:** The way you format the prompt. Markdown, xml, json, code or natural language. Similarly for the thinking tokens themselves you can also try out different formats.
 
 ## 5. Constraints
-*   **Ensure the final output reliably contains `\boxed{XXX}` as the evaluation script depends on it.**
+*   **Ensure the final output reliably contains `\boxed{XXX}` as the evaluation script depends on it.**
diff --git a/weco/cli.py b/weco/cli.py
@@ -74,15 +74,16 @@ def main() -> None:
                 "debug_prob": 0.5,
                 "max_debug_depth": max(1, math.ceil(0.1 * steps)),  # 10% of steps
             }
+            # Read API keys
+            api_keys = read_api_keys_from_env()
+            # API request timeout
+            timeout = 800
+
             # Read additional instructions
             additional_instructions = read_additional_instructions(additional_instructions=args.additional_instructions)
             # Read source code
             source_fp = pathlib.Path(args.source)
             source_code = read_from_path(fp=source_fp, is_json=False)
-            # Read API keys
-            api_keys = read_api_keys_from_env()
-            # API request timeout
-            timeout = 800
 
         # Initialize panels
         summary_panel = SummaryPanel(
@@ -193,12 +194,14 @@ def main() -> None:
             )
 
             for step in range(1, steps):
+                # Re-read instructions from the original source (file path or string) BEFORE each suggest call
+                current_additional_instructions = read_additional_instructions(additional_instructions=args.additional_instructions)
                 # Evaluate the current output and get the next solution
                 eval_and_next_solution_response = evaluate_feedback_then_suggest_next_solution(
                     console=console,
                     session_id=session_id,
                     execution_output=term_out,
-                    additional_instructions=additional_instructions,
+                    additional_instructions=current_additional_instructions,
                     api_keys=api_keys,
                     timeout=timeout,
                 )
@@ -286,12 +289,14 @@ def main() -> None:
                     transition_delay=0.1,  # Slightly longer delay for evaluation results
                 )
 
+            # Re-read instructions before the final feedback step
+            current_additional_instructions = read_additional_instructions(additional_instructions=args.additional_instructions)
             # Ensure we pass evaluation results for the last step's generated solution
             eval_and_next_solution_response = evaluate_feedback_then_suggest_next_solution(
                 console=console,
                 session_id=session_id,
                 execution_output=term_out,
-                additional_instructions=additional_instructions,
+                additional_instructions=current_additional_instructions,
                 api_keys=api_keys,
                 timeout=timeout,
             )