
Commit 1cb2b21

manager: make standalone manager
Signed-off-by: vsoch <[email protected]>
1 parent ed47b15 commit 1cb2b21

File tree: 19 files changed, +823 -366 lines

README.md

Lines changed: 37 additions & 8 deletions
@@ -9,13 +9,48 @@ This library is primarily being used for development for the descriptive thrust
 
 ## Design
 
-### Simple
+### Agents
+
+The `fractale agent` command provides a means to run build, job generation, and deployment agents.
+This part of the library is under development.
+
+See [examples/agent](examples/agent) for an example.
+
+#### To do items
+
+- Get the basic runner working
+- Add the ability to get logs and optimize - the manager will need to use the goal
+- We likely want the manager to be able to edit the prompt.
+  - Should it be provided with the entire prompt?
+- When a pod is pending, it can be due to resource issues (and it will never start). Right now we time out, but we should be able to catch that earlier.
+
+#### Research Questions
+
+**And experiment ideas**
+
+- What are the increments of change (e.g., "adding a library")? We should be able to keep track of times for each stage and what changed, and an analyzer LLM can look at the result and understand (categorize) the most salient contributions to change.
+- We can also time how long subsequent changes take, when relevant. For example, if we are building, we should be able to use cached layers (so build times speed up) when the LLM changes content later in the Dockerfile.
+- We can also save the successful results (Dockerfile builds, for example) and compare them for similarity. How consistent is the LLM?
+- How does the specificity of the prompt influence the result?
+- For an experiment, we would want to do a build -> deploy and successful run for a series of apps and get distributions of attempts, reasons for failure, and a general sense of similarities / differences.
+- For the optimization experiment, we'd want to do the same, but understand the gradients of change that led to improvement.
+
+## Observations
+
+- Specifying CPU seems important - if you don't, it wants to use GPU.
+- If you ask for a specific example, it sometimes tries to download data (tell it where the data is).
+- Always include common issues in the initial prompt.
+- If you are too specific about instance types, it adds node selectors/affinity, and that often doesn't work.
+
+### Job Specifications
+
+#### Simple
 
 We provide a simple translation layer between job specifications. We take the assumption that although each manager has many options, the actual set of options a user would use is much smaller, and it's relatively straightforward to translate (with better accuracy).
 
 See [examples/transform](examples/transform) for an example.
 
-### Complex
+#### Complex
 
 We want to:
 
@@ -32,12 +67,6 @@ For graph tool:
 conda install -c conda-forge graph-tool
 ```
 
-## Questions
-
-- Should other subsystem types have edges? How used?
-- Should we try to map them to nodes in the graph or use another means (or assume global across cluster nodes as we do now)?
-- Can we simplify spack subsystem graph (it's really big...)
-
 <!-- ⭐️ [Documentation](https://compspec.github.io/fractale) ⭐️ -->
 
 ## License
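The "simple translation layer" mentioned in the README diff above can be sketched as a mapping over the small common subset of options users actually set. This is a hypothetical illustration of the design idea, not fractale's actual schema or API; the option names and the Slurm target are assumptions:

```python
# Hypothetical sketch of a minimal jobspec translation layer: rather than
# mapping every scheduler flag, we support a small common subset of options
# (nodes, tasks, time), which keeps translation straightforward and accurate.

COMMON_KEYS = {"nodes", "tasks", "time_minutes"}


def to_slurm(spec):
    """Render a common-subset jobspec as Slurm batch directives."""
    unknown = set(spec) - COMMON_KEYS
    if unknown:
        # Refusing to guess keeps the translation accurate.
        raise ValueError(f"Cannot translate options: {sorted(unknown)}")
    lines = [
        "#SBATCH --nodes=%d" % spec["nodes"],
        "#SBATCH --ntasks=%d" % spec["tasks"],
    ]
    if "time_minutes" in spec:
        lines.append("#SBATCH --time=%d" % spec["time_minutes"])
    return "\n".join(lines)


directives = to_slurm({"nodes": 2, "tasks": 8})
print(directives)
```

The design choice here mirrors the README's assumption: a restricted common subset trades coverage for translation accuracy, and unknown options fail loudly instead of being silently dropped.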

examples/agent/README.md

Lines changed: 24 additions & 3 deletions
@@ -1,8 +1,10 @@
 # Agents
 
-Let's use fractale to run build, execute, and deploy agents. Right now we will run these a-la-carte, and eventually we will group them together to loop back and repeat steps if needed.
+Let's use fractale to run build, execute, and deploy agents. At first we will run these a-la-carte, and then we will group them together to be run by an agent that requests steps when needed.
 
-## Build
+## A-la-carte
+
+### Build
 
 The build agent will use the Gemini API to generate a Dockerfile and then build until it succeeds. We would need subsequent agents to test it.
 Here is how to first ask the build agent to generate a lammps container for Google Cloud.
@@ -13,7 +15,7 @@ fractale agent build lammps --environment "google cloud" --outfile dockerfile
 
 That might generate the [Dockerfile](Dockerfile) here, and a container that defaults to the application name "lammps".
 
-## Kubernetes Job
+### Kubernetes Job
 
 The kubernetes job agent will be asked to run a command, and will be provided the Dockerfile and the name of the container. We assume that another agent (or you) has built the image and either pushed it to a registry or loaded it. Let's create our cluster and load the image:
 
@@ -28,3 +30,22 @@ To start, we will assume a kind cluster running and tell the agent the image is
 fractale agent kubernetes-job lammps --environment "google cloud CPU" --context-file ./Dockerfile --no-pull
 ```
 
+## Manager
+
+Let's run with a manager. Using a manager means we provide a plan along with a goal. The manager itself takes on a similar structure to a step agent, but it has a high-level goal. The manager will follow the high-level structure of the plan, and step
+managers can often run independently for some number of attempts. If a step manager
+returns after these attempts still with a failure, or if the last step is a failure,
+the manager can decide how to act. For example, if a Kubernetes job deploys but fails,
+it could be due to the Dockerfile build (the build manager) or the manifest for the Job.
+The Job manager needs to prepare the updated context to return to this step, and then
+try again.
+
+```bash
+fractale agent --plan ./plans/run-lammps.yaml
+```
+
+For this first design, we are taking an approach where we only re-assess the state and go back to a previous step given a last-step failure. The assumption is that if a previous step fails, we keep trying until it succeeds. We only need to backtrack if the last step in a sequence is not successful, and it is due to failure at some stage in the process. But I do think we have a few options:
+
+1. Allow the manager to decide what to do on _every_ step (likely not ideal)
+2. Allow step managers to execute until success, always (too much of an issue if a step is failing because of a dependency)
+3. Allow step managers to execute until success unless a limit is set, and then let the manager take over (in other words, too many failures means we hand it back to the manager to look).
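Option 3 above (a step retries until success unless an attempt limit is hit, at which point the manager takes over) can be sketched as a simple control loop. This is a hedged illustration of the described strategy, not fractale's actual manager implementation; the callable-step interface and `decide_backtrack` hook are assumptions:

```python
# Hypothetical sketch of option 3: each step retries up to a per-step limit;
# on exhausting its attempts, control returns to the "manager" (here, a
# decide_backtrack hook), which picks which earlier step to resume from.
# A real manager would also re-prepare context and eventually give up.

def run_plan(steps, max_attempts=3, decide_backtrack=lambda i: max(i - 1, 0)):
    """steps: list of callables returning True on success."""
    i = 0
    while i < len(steps):
        attempts = 0
        while attempts < max_attempts:
            attempts += 1
            if steps[i]():
                break
        else:
            # Step exhausted its attempts: the manager chooses where to resume.
            i = decide_backtrack(i)
            continue
        i += 1
    return True


# A build step that fails once before succeeding, then a passing deploy step.
calls = {"build": 0}

def build():
    calls["build"] += 1
    return calls["build"] >= 2

ok = run_plan([build, lambda: True])
```

The `while/else` form makes the "exhausted without success" branch explicit: the `else` only runs when the inner loop finishes without a `break`.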

examples/agent/plans/run-lammps.yaml

Lines changed: 23 additions & 0 deletions
@@ -0,0 +1,23 @@
+name: Build and Deploy LAMMPS
+description: Build a Docker container and deploy it as a Kubernetes Job.
+plan:
+  - agent: build
+    context:
+      environment: "google cloud CPU instance in Kubernetes"
+      application: lammps
+      details: |
+        Please build with the reaxff HNS example located in examples/reaxff/HNS.
+        Make sure to keep the full content of that directory as is, and
+        the target file is named in.reaxff.hns. To make it easy for the
+        next agent, please make the working directory where the data is
+        located.
+
+  - agent: kubernetes-job
+    context:
+      no_pull: true
+      environment: "google cloud CPU instance in Kubernetes"
+      details: |
+        Please execute the reaxff HNS example, and assume the data is in the PWD.
+        Run lammps with params -v x 2 -v y 2 -v z 2 -in ./in.reaxff.hns
+        and with the -nocite flag for CPU. Do not try to generate configmap data.
+        Do not add any nodeSelector or affinity rules since we are testing.
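Once parsed, a plan like the one above is just a name plus an ordered list of agent steps. Here is a hedged sketch of validating that shape before running it (using a dict literal in place of the loaded YAML; the validation rules beyond the fields shown in the plan are assumptions):

```python
# Hypothetical sketch: validate a parsed plan shaped like plans/run-lammps.yaml.
# The required keys mirror the YAML above; anything stricter is an assumption.

def validate_plan(plan, known_agents=("build", "kubernetes-job")):
    for key in ("name", "plan"):
        if key not in plan:
            raise ValueError(f"plan is missing required key: {key}")
    for i, step in enumerate(plan["plan"]):
        if step.get("agent") not in known_agents:
            raise ValueError(f"step {i}: unknown agent {step.get('agent')!r}")
    # Return the execution order for the caller to iterate over.
    return [step["agent"] for step in plan["plan"]]


parsed = {
    "name": "Build and Deploy LAMMPS",
    "plan": [
        {"agent": "build", "context": {"application": "lammps"}},
        {"agent": "kubernetes-job", "context": {"no_pull": True}},
    ],
}
order = validate_plan(parsed)
```

Validating up front means a typo in an agent name fails before any build time is spent.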

fractale/agent/__init__.py

Lines changed: 4 additions & 1 deletion
@@ -1,6 +1,9 @@
 from fractale.agent.build import BuildAgent
 from fractale.agent.kubernetes_job import KubernetesJobAgent
+from fractale.agent.manager import ManagerAgent
 
 
 def get_agents():
-    return {"build": BuildAgent, "kubernetes-job": KubernetesJobAgent}
+    # The Manager Agent is a special kind that can orchestrate other managers.
+    # We could technically nest them.
+    return {"build": BuildAgent, "kubernetes-job": KubernetesJobAgent, "manager": ManagerAgent}
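With a registry like `get_agents()`, dispatch by name is a dict lookup. A minimal, self-contained sketch of the pattern follows; the stub classes and the `get_agent` helper are placeholders for illustration, not fractale's real agents or API:

```python
# Minimal sketch of the name -> class registry pattern used by get_agents().
# The stub classes stand in for the real agents.

class BuildAgent:
    name = "build"

class KubernetesJobAgent:
    name = "kubernetes-job"

class ManagerAgent:
    # The manager is the special agent that can orchestrate the others.
    name = "manager"


def get_agents():
    return {"build": BuildAgent, "kubernetes-job": KubernetesJobAgent, "manager": ManagerAgent}


def get_agent(name):
    """Look up an agent class by name and instantiate it."""
    agents = get_agents()
    if name not in agents:
        raise ValueError(f"unknown agent: {name} (choices: {sorted(agents)})")
    return agents[name]()


agent = get_agent("manager")
```

Keeping the registry as plain data also makes it easy for argparse to enumerate valid agent names.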

fractale/agent/base.py

Lines changed: 89 additions & 6 deletions
@@ -1,22 +1,105 @@
+import fractale.utils as utils
+
+
 class Agent:
     """
-    A base for an agent
+    A base for an agent. Each agent should:
+
+    1. Have a run function that accepts (and modifies) a context.
+    2. Set the context.result as the final result (or set it to None)
+       - On failure, set context.result to the error context.
+    3. Set any other variables needed by future build steps.
+    4. Have a get_prompt function that takes the same input as run,
+       but returns the prompt for the LLM. This can be modified by
+       the manager.
+    5. On receiving a prompt in the context, use it instead.
+       This indicates the manager has set it.
     """
 
     # name and description should be on the class
 
-    def __init__(self):
-        self.attempts = None
-
+    def __init__(self):
+
+        # Max attempts defaults to unlimited.
+        # We start counting at 1 for the user to see.
+        # Eat your heart out, Matlab.
+        self.attempts = 1
+        self.max_attempts = None
+
+        # Custom initialization functions
+        self.init()
+
+    def init(self):
+        pass
+
+    def return_on_failure(self):
+        """
+        On failure, have we reached max attempts and should return?
+        """
+        if not self.max_attempts:
+            return False
+        return self.attempts > self.max_attempts
+
+    def set_max_attempts(self, max_attempts):
+        self.max_attempts = max_attempts
+
     def add_arguments(self, subparser):
         """
         Add arguments for the agent to show up in argparse
 
         This is added by the plugin class
         """
-        pass
+        assert subparser
 
-    def run(self, args, extra):
+    def write_file(self, context, content, add_comment=True):
+        """
+        Shared function to write content to a file, if context.outfile is defined.
+        """
+        outfile = context.get("outfile")
+        if not outfile:
+            return
+        # Add generation line
+        if add_comment:
+            content += f"\n# Generated by fractale {self.name} agent"
+        utils.write_file(content, outfile)
+
+    def ask_gemini(self, prompt):
+        """
+        Ask gemini adds a wrapper with some error handling.
+        """
+        try:
+            response = self.chat.send_message(prompt)
+
+            # This line can fail. If it succeeds, return the entire response.
+            text_content = response.text
+            assert text_content
+            return response
+
+        except ValueError as e:
+            print(f"[Error] The API response was blocked and contained no text: {str(e)}")
+
+            print("VANESSA DEBUG WHAT TO DO")
+            import IPython
+
+            IPython.embed()
+            # We probably want to retry if it is 1 (STOP) and empty.
+            # Otherwise we need to somehow retry fixing the dockerfile.
+            # For robust logging, you can inspect the reason.
+            if response.candidates:
+                finish_reason = response.candidates[0].finish_reason.name
+                print(f"Finish Reason: {finish_reason}")
+
+    def run(self, context):
         """
         Run the agent.
         """
+        assert context
         raise NotImplementedError(f"The {self.name} agent is missing a 'run' function")
+
+    def get_prompt(self, context):
+        """
+        This function should take the same context as run and return the parsed prompt that
+        would be used. We do this so we can hand it to the manager for tweaking.
+        """
+        assert context
+        raise NotImplementedError(f"The {self.name} agent is missing a 'get_prompt' function")
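The base-class contract in the diff above (a `run` that mutates a context, a `get_prompt` that prefers a manager-supplied prompt, and attempt counting via `return_on_failure`) can be exercised with a toy subclass. The `EchoAgent` and dict-based context below are hypothetical stand-ins, and the base class is re-declared here so the sketch is self-contained:

```python
# Toy subclass exercising the Agent contract: attempts start at 1, and
# return_on_failure() becomes True once the counter passes max_attempts.
# EchoAgent and the plain-dict context are illustrative assumptions.

class Agent:
    def __init__(self):
        self.attempts = 1         # start counting at 1 for the user's benefit
        self.max_attempts = None  # None means unlimited

    def return_on_failure(self):
        if not self.max_attempts:
            return False
        return self.attempts > self.max_attempts

    def set_max_attempts(self, max_attempts):
        self.max_attempts = max_attempts


class EchoAgent(Agent):
    name = "echo"

    def get_prompt(self, context):
        # A manager-provided prompt in the context takes precedence (rule 5).
        return context.get("prompt") or f"Echo: {context.get('goal', '')}"

    def run(self, context):
        context["result"] = self.get_prompt(context)
        self.attempts += 1
        return context


agent = EchoAgent()
agent.set_max_attempts(2)
ctx = agent.run({"goal": "hello"})                 # uses its own prompt
ctx2 = agent.run({"prompt": "use this exact prompt"})  # manager override wins
```

After two runs the counter is 3, which exceeds the limit of 2, so a manager polling `return_on_failure()` would take control back at this point.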
