
Commit 1cb2b21

manager: make standalone manager
Signed-off-by: vsoch <[email protected]>
1 parent ed47b15 commit 1cb2b21

File tree: 19 files changed, +823 -366 lines

README.md

Lines changed: 37 additions & 8 deletions
@@ -9,13 +9,48 @@ This library is primarily being used for development for the descriptive thrust
 
 ## Design
 
-### Simple
+### Agents
+
+The `fractale agent` command provides a means to run build, job generation, and deployment agents.
+This part of the library is under development.
+
+See [examples/agent](examples/agent) for an example.
+
+#### To do items
+
+- Get the basic runner working
+- Add the ability to get logs and optimize - the manager will need to use the goal
+- We likely want the manager to be able to edit the prompt.
+  - Should it be provided with the entire prompt?
+- When a pod is pending, it can be due to resource issues (and it will never start). Right now we time out, but we should be able to catch that earlier.
+
+#### Research Questions
+
+**And experiment ideas**
+
+- What are the increments of change (e.g., "adding a library")? We should be able to keep track of times for each stage and what changed, and an analyzer LLM can look at the result and understand (categorize) the most salient contributions to change.
+- We can also time how long subsequent changes take, when relevant. For example, if we are building, we should be able to use cached layers (so build times speed up) when the LLM changes content later in the Dockerfile.
+- We can also save the successful results (Dockerfile builds, for example) and compare them for similarity. How consistent is the LLM?
+- How does the specificity of the prompt influence the result?
+- For an experiment, we would want to do a build -> deploy and successful run for a series of apps and get distributions of attempts, reasons for failure, and a general sense of similarities / differences.
+- For the optimization experiment, we'd want to do the same, but understand the gradients of change that led to improvement.
+
+## Observations
+
+- Specifying CPU seems important - if you don't, it wants to use GPU.
+- If you ask for a specific example, it sometimes tries to download data (tell it where the data is).
+- Always include common issues in the initial prompt.
+- If you are too specific about instance types, it adds node selectors/affinity, and that often doesn't work.
+
+### Job Specifications
+
+#### Simple
 
 We provide a simple translation layer between job specifications. We take the assumption that although each manager has many options, the actual set of options a user would use is much smaller, and it's relatively straightforward to translate (with better accuracy).
 
 See [examples/transform](examples/transform) for an example.
 
-### Complex
+#### Complex
 
 We want to:
 
@@ -32,12 +67,6 @@ For graph tool:
 conda install -c conda-forge graph-tool
 ```
 
-## Questions
-
-- Should other subsystem types have edges? How used?
-- Should we try to map them to nodes in the graph or use another means (or assume global across cluster nodes as we do now)?
-- Can we simplify spack subsystem graph (it's really big...)
-
 <!-- ⭐️ [Documentation](https://compspec.github.io/fractale) ⭐️ -->
 
 ## License
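The "simple translation layer" mentioned in the README diff above can be sketched as a mapping over the small common subset of options users actually set. This is a hypothetical illustration of the design idea, not fractale's actual schema or API; the option names and the Slurm target are assumptions:

```python
# Hypothetical sketch of a minimal jobspec translation layer: rather than
# mapping every scheduler flag, we support a small common subset of options
# (nodes, tasks, time), which keeps translation straightforward and accurate.

COMMON_KEYS = {"nodes", "tasks", "time_minutes"}


def to_slurm(spec):
    """Render a common-subset jobspec as Slurm batch directives."""
    unknown = set(spec) - COMMON_KEYS
    if unknown:
        # Refusing to guess keeps the translation accurate.
        raise ValueError(f"Cannot translate options: {sorted(unknown)}")
    lines = [
        "#SBATCH --nodes=%d" % spec["nodes"],
        "#SBATCH --ntasks=%d" % spec["tasks"],
    ]
    if "time_minutes" in spec:
        lines.append("#SBATCH --time=%d" % spec["time_minutes"])
    return "\n".join(lines)


directives = to_slurm({"nodes": 2, "tasks": 8})
print(directives)
```

The design choice here mirrors the README's assumption: a restricted common subset trades coverage for translation accuracy, and unknown options fail loudly instead of being silently dropped.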

examples/agent/README.md

Lines changed: 24 additions & 3 deletions
@@ -1,8 +1,10 @@
 # Agents
 
-Let's use fractale to run build, execute, and deploy agents. Right now we will run these a-la-carte, and eventually we will group them together to loop back and repeat steps if needed.
+Let's use fractale to run build, execute, and deploy agents. At first we will run these a-la-carte, and then we will group them together to be run by an agent that requests steps when needed.
 
-## Build
+## A-la-carte
+
+### Build
 
 The build agent will use the Gemini API to generate a Dockerfile and then build until it succeeds. We would need subsequent agents to test it.
 Here is how to first ask the build agent to generate a lammps container for Google Cloud.
@@ -13,7 +15,7 @@ fractale agent build lammps --environment "google cloud" --outfile dockerfile
 
 That might generate the [Dockerfile](Dockerfile) here, and a container that defaults to the application name "lammps".
 
-## Kubernetes Job
+### Kubernetes Job
 
 The kubernetes job agent will be asked to run a command, and will be provided the Dockerfile and the name of the container. We assume that another agent (or you) has built the image and either pushed it to a registry or loaded it. Let's create our cluster and load the image:
 
@@ -28,3 +30,22 @@ To start, we will assume a kind cluster running and tell the agent the image is
 fractale agent kubernetes-job lammps --environment "google cloud CPU" --context-file ./Dockerfile --no-pull
 ```
 
+## Manager
+
+Let's run with a manager. Using a manager means we provide a plan along with a goal. The manager itself takes on a similar structure to a step agent, but it has a high-level goal. The manager will follow the high-level structure of the plan, and step
+managers can often run independently for some number of attempts. If a step manager
+returns after these attempts still with a failure, or if the last step is a failure,
+the manager can decide how to act. For example, if a Kubernetes job deploys but fails,
+it could be due to the Dockerfile build (the build manager) or the manifest for the Job.
+The Job manager needs to prepare the updated context to return to this step, and then
+try again.
+
+```bash
+fractale agent --plan ./plans/run-lammps.yaml
+```
+
+For this first design, we are taking an approach where we only re-assess the state and go back to a previous step given a last-step failure. The assumption is that if a previous step fails, we keep trying until it succeeds. We only need to backtrack if the last step in a sequence is not successful, and it is due to failure at some stage in the process. But I do think we have a few options:
+
+1. Allow the manager to decide what to do on _every_ step (likely not ideal)
+2. Allow step managers to execute until success, always (too much of an issue if a step is failing because of a dependency)
+3. Allow step managers to execute until success unless a limit is set, and then let the manager take over (in other words, too many failures means we hand it back to the manager to look).
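Option 3 above (a step retries until success unless an attempt limit is hit, at which point the manager takes over) can be sketched as a simple control loop. This is a hedged illustration of the described strategy, not fractale's actual manager implementation; the callable-step interface and `decide_backtrack` hook are assumptions:

```python
# Hypothetical sketch of option 3: each step retries up to a per-step limit;
# on exhausting its attempts, control returns to the "manager" (here, a
# decide_backtrack hook), which picks which earlier step to resume from.
# A real manager would also re-prepare context and eventually give up.

def run_plan(steps, max_attempts=3, decide_backtrack=lambda i: max(i - 1, 0)):
    """steps: list of callables returning True on success."""
    i = 0
    while i < len(steps):
        attempts = 0
        while attempts < max_attempts:
            attempts += 1
            if steps[i]():
                break
        else:
            # Step exhausted its attempts: the manager chooses where to resume.
            i = decide_backtrack(i)
            continue
        i += 1
    return True


# A build step that fails once before succeeding, then a passing deploy step.
calls = {"build": 0}

def build():
    calls["build"] += 1
    return calls["build"] >= 2

ok = run_plan([build, lambda: True])
```

The `while/else` form makes the "exhausted without success" branch explicit: the `else` only runs when the inner loop finishes without a `break`.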

examples/agent/plans/run-lammps.yaml

Lines changed: 23 additions & 0 deletions
@@ -0,0 +1,23 @@
+name: Build and Deploy LAMMPS
+description: Build a Docker container and deploy it as a Kubernetes Job.
+plan:
+  - agent: build
+    context:
+      environment: "google cloud CPU instance in Kubernetes"
+      application: lammps
+      details: |
+        Please build with the reaxff HNS example located in examples/reaxff/HNS.
+        Make sure to keep the full content of that directory as is, and
+        the target file is named in.reaxff.hns. To make it easy for the
+        next agent, please make the working directory where the data is
+        located.
+
+  - agent: kubernetes-job
+    context:
+      no_pull: true
+      environment: "google cloud CPU instance in Kubernetes"
+      details: |
+        Please execute the reaxff HNS example, and assume the data is in the PWD.
+        Run lammps with params -v x 2 -v y 2 -v z 2 -in ./in.reaxff.hns
+        and with the -nocite flag for CPU. Do not try to generate configmap data.
+        Do not add any nodeSelector or affinity rules since we are testing.
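Once parsed, a plan like the one above is just a name plus an ordered list of agent steps. Here is a hedged sketch of validating that shape before running it (using a dict literal in place of the loaded YAML; the validation rules beyond the fields shown in the plan are assumptions):

```python
# Hypothetical sketch: validate a parsed plan shaped like plans/run-lammps.yaml.
# The required keys mirror the YAML above; anything stricter is an assumption.

def validate_plan(plan, known_agents=("build", "kubernetes-job")):
    for key in ("name", "plan"):
        if key not in plan:
            raise ValueError(f"plan is missing required key: {key}")
    for i, step in enumerate(plan["plan"]):
        if step.get("agent") not in known_agents:
            raise ValueError(f"step {i}: unknown agent {step.get('agent')!r}")
    # Return the execution order for the caller to iterate over.
    return [step["agent"] for step in plan["plan"]]


parsed = {
    "name": "Build and Deploy LAMMPS",
    "plan": [
        {"agent": "build", "context": {"application": "lammps"}},
        {"agent": "kubernetes-job", "context": {"no_pull": True}},
    ],
}
order = validate_plan(parsed)
```

Validating up front means a typo in an agent name fails before any build time is spent.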

fractale/agent/__init__.py

Lines changed: 4 additions & 1 deletion
@@ -1,6 +1,9 @@
 from fractale.agent.build import BuildAgent
 from fractale.agent.kubernetes_job import KubernetesJobAgent
+from fractale.agent.manager import ManagerAgent
 
 
 def get_agents():
-    return {"build": BuildAgent, "kubernetes-job": KubernetesJobAgent}
+    # The Manager Agent is a special kind that can orchestrate other managers.
+    # We could technically nest them.
+    return {"build": BuildAgent, "kubernetes-job": KubernetesJobAgent, "manager": ManagerAgent}
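With a registry like `get_agents()`, dispatch by name is a dict lookup. A minimal, self-contained sketch of the pattern follows; the stub classes and the `get_agent` helper are placeholders for illustration, not fractale's real agents or API:

```python
# Minimal sketch of the name -> class registry pattern used by get_agents().
# The stub classes stand in for the real agents.

class BuildAgent:
    name = "build"

class KubernetesJobAgent:
    name = "kubernetes-job"

class ManagerAgent:
    # The manager is the special agent that can orchestrate the others.
    name = "manager"


def get_agents():
    return {"build": BuildAgent, "kubernetes-job": KubernetesJobAgent, "manager": ManagerAgent}


def get_agent(name):
    """Look up an agent class by name and instantiate it."""
    agents = get_agents()
    if name not in agents:
        raise ValueError(f"unknown agent: {name} (choices: {sorted(agents)})")
    return agents[name]()


agent = get_agent("manager")
```

Keeping the registry as plain data also makes it easy for argparse to enumerate valid agent names.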

fractale/agent/base.py

Lines changed: 89 additions & 6 deletions
@@ -1,22 +1,105 @@
+import fractale.utils as utils
+
+
 class Agent:
     """
-    A base for an agent
+    A base for an agent. Each agent should:
+
+    1. Have a run function that accepts (and modifies) a context.
+    2. Set the context.result as the final result (or set it to None)
+       - On failure, set context.result to the error context.
+    3. Set any other variables needed by future build steps.
+    4. Have a get_prompt function that takes the same input as run,
+       but returns the prompt for the LLM. This can be modified by
+       the manager.
+    5. On receiving a prompt in the context, use it instead.
+       This indicates the manager has set it.
     """
 
     # name and description should be on the class
 
-    def __init__(self):
-        self.attempts = None
-
+    def __init__(self):
+
+        # Max attempts defaults to unlimited.
+        # We start counting at 1 for the user to see.
+        # Eat your heart out, Matlab.
+        self.attempts = 1
+        self.max_attempts = None
+
+        # Custom initialization functions
+        self.init()
+
+    def init(self):
+        pass
+
+    def return_on_failure(self):
+        """
+        On failure, have we reached max attempts and should return?
+        """
+        if not self.max_attempts:
+            return False
+        return self.attempts > self.max_attempts
+
+    def set_max_attempts(self, max_attempts):
+        self.max_attempts = max_attempts
+
     def add_arguments(self, subparser):
         """
         Add arguments for the agent to show up in argparse
 
         This is added by the plugin class
         """
-        pass
+        assert subparser
 
-    def run(self, args, extra):
+    def write_file(self, context, content, add_comment=True):
+        """
+        Shared function to write content to a file, if context.outfile is defined.
+        """
+        outfile = context.get("outfile")
+        if not outfile:
+            return
+        # Add generation line
+        if add_comment:
+            content += f"\n# Generated by fractale {self.name} agent"
+        utils.write_file(content, outfile)
+
+    def ask_gemini(self, prompt):
+        """
+        Ask gemini adds a wrapper with some error handling.
+        """
+        try:
+            response = self.chat.send_message(prompt)
+
+            # This line can fail. If it succeeds, return the entire response.
+            text_content = response.text
+            assert text_content
+            return response
+
+        except ValueError as e:
+            print(f"[Error] The API response was blocked and contained no text: {str(e)}")
+
+            print("VANESSA DEBUG WHAT TO DO")
+            import IPython
+
+            IPython.embed()
+            # We probably want to retry if it is 1 (STOP) and empty.
+            # Otherwise we need to somehow retry fixing the dockerfile.
+            # For robust logging, you can inspect the reason.
+            if response.candidates:
+                finish_reason = response.candidates[0].finish_reason.name
+                print(f"Finish Reason: {finish_reason}")
+
+    def run(self, context):
         """
         Run the agent.
         """
+        assert context
         raise NotImplementedError(f"The {self.name} agent is missing a 'run' function")
+
+    def get_prompt(self, context):
+        """
+        This function should take the same context as run and return the parsed prompt that
+        would be used. We do this so we can hand it to the manager for tweaking.
+        """
+        assert context
+        raise NotImplementedError(f"The {self.name} agent is missing a 'get_prompt' function")
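The base-class contract in the diff above (a `run` that mutates a context, a `get_prompt` that prefers a manager-supplied prompt, and attempt counting via `return_on_failure`) can be exercised with a toy subclass. The `EchoAgent` and dict-based context below are hypothetical stand-ins, and the base class is re-declared here so the sketch is self-contained:

```python
# Toy subclass exercising the Agent contract: attempts start at 1, and
# return_on_failure() becomes True once the counter passes max_attempts.
# EchoAgent and the plain-dict context are illustrative assumptions.

class Agent:
    def __init__(self):
        self.attempts = 1         # start counting at 1 for the user's benefit
        self.max_attempts = None  # None means unlimited

    def return_on_failure(self):
        if not self.max_attempts:
            return False
        return self.attempts > self.max_attempts

    def set_max_attempts(self, max_attempts):
        self.max_attempts = max_attempts


class EchoAgent(Agent):
    name = "echo"

    def get_prompt(self, context):
        # A manager-provided prompt in the context takes precedence (rule 5).
        return context.get("prompt") or f"Echo: {context.get('goal', '')}"

    def run(self, context):
        context["result"] = self.get_prompt(context)
        self.attempts += 1
        return context


agent = EchoAgent()
agent.set_max_attempts(2)
ctx = agent.run({"goal": "hello"})                 # uses its own prompt
ctx2 = agent.run({"prompt": "use this exact prompt"})  # manager override wins
```

After two runs the counter is 3, which exceeds the limit of 2, so a manager polling `return_on_failure()` would take control back at this point.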
