
Commit b447d25

Merge pull request #11 from compspec/add-agent-manager
Add agent manager
2 parents b4eca25 + d597c2b commit b447d25

28 files changed (+1697, -481 lines)

README.md

Lines changed: 18 additions & 8 deletions
@@ -9,13 +9,29 @@ This library is primarily being used for development for the descriptive thrust
 
 ## Design
 
-### Simple
+### Agents
+
+The `fractale agent` command provides a means to run build, job generation, and deployment agents.
+This part of the library is under development. There are three kinds of agents:
+
+- `step` agents are experts at specific tasks (they do hold state)
+- `manager` agents know how to orchestrate step agents and choose between them (they don't hold state, but could)
+- `helper` agents are used by step agents to do small tasks (e.g., suggest a fix for an error)
+
+The design is simple in that each agent responds to a state of error vs. success. For a step agent, the return code determines whether to continue or try again. For a helper, the input is typically an erroneous response (or something that needs changing) with respect to a goal.
+For a manager, we make a choice based on a previous erroneous step.
+
+See [examples/agent](examples/agent) for an example, along with observations, research questions, ideas, and experiment brainstorming!
+
+### Job Specifications
+
+#### Simple
 
 We provide a simple translation layer between job specifications. We take the assumption that although each manager has many options, the actual options a user would use are a much smaller set, and it's relatively straightforward to translate (and achieve better accuracy).
 
 See [examples/transform](examples/transform) for an example.
 
-### Complex
+#### Complex
 
 We want to:
 
@@ -32,12 +48,6 @@ For graph tool:
 conda install -c conda-forge graph-tool
 ```
 
-## Questions
-
-- Should other subsystem types have edges? How used?
-- Should we try to map them to nodes in the graph or use another means (or assume global across cluster nodes as we do now)?
-- Can we simplify spack subsystem graph (it's really big...)
-
 <!-- ⭐️ [Documentation](https://compspec.github.io/fractale) ⭐️ -->
 
 ## License
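The error-vs-success loop described in the new Agents section can be sketched in a few lines of Python. This is a hypothetical illustration, not fractale's implementation; every name here (`run_step`, `flaky_step`, `suggest_fix`) is invented for the example.

```python
# Sketch (assumption, not fractale's API): a step agent retries until success,
# and a helper agent is consulted to patch the context after each failure.

def run_step(step, helper, max_attempts=3):
    """Run a step agent, asking a helper to fix the context on failure."""
    context = {}
    for attempt in range(max_attempts):
        ok, result = step(context)
        if ok:  # success return code: stop retrying
            return result
        # on error, the helper proposes a small fix with respect to the goal
        context = helper(context, result)
    raise RuntimeError("step failed after %d attempts" % max_attempts)

# Toy step agent that fails until the helper has added a fix to the context
def flaky_step(context):
    if context.get("fix"):
        return True, "built"
    return False, "missing dependency"

def suggest_fix(context, error):
    return {**context, "fix": f"address: {error}"}

print(run_step(flaky_step, suggest_fix))  # -> built
```

A manager layers on top of this: it observes which step returned the error and chooses which step to re-enter, rather than retrying blindly.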

examples/agent/Dockerfile

Lines changed: 71 additions & 123 deletions
@@ -1,123 +1,71 @@
-# Dockerfile for LAMMPS (Large-scale Atomic/Molecular Massively Parallel Simulator)
-# Target: Production HPC environment on Google Cloud
-# Strategy: Multi-stage build for a lean final image with MPI support.
-
-# Use ARGs at the top to easily update versions of key components globally
-ARG LAMMPS_VERSION=stable_2Aug2023
-ARG OPENMPI_VERSION=4.1.6
-
-# =====================================================================
-# Stage 1: Builder
-# This stage compiles Open MPI and LAMMPS from source. It contains all
-# the build-time dependencies, which will be discarded later.
-# =====================================================================
-FROM debian:bullseye AS builder
-
-# Inherit ARGs from the global scope
-ARG LAMMPS_VERSION
-ARG OPENMPI_VERSION
-
-# Set environment variables for the Open MPI build location and path
-ENV OMPI_DIR=/opt/openmpi-${OPENMPI_VERSION}
-ENV PATH=$OMPI_DIR/bin:$PATH
-ENV LD_LIBRARY_PATH=$OMPI_DIR/lib
-
-# Prevent interactive prompts during package installation
-ENV DEBIAN_FRONTEND=noninteractive
-
-# Install essential build tools and libraries for both Open MPI and LAMMPS
-# Added ca-certificates to allow git and wget to verify SSL certificates securely.
-RUN apt-get update && apt-get install -y --no-install-recommends \
-    build-essential \
-    ca-certificates \
-    cmake \
-    g++ \
-    gfortran \
-    git \
-    libevent-dev \
-    libhwloc-dev \
-    wget \
-    && rm -rf /var/lib/apt/lists/*
-
-# --- Build Open MPI from source ---
-# Building from source gives control over the configuration, crucial for
-# containerized HPC environments. We enable PMIx for modern process management.
-WORKDIR /tmp
-RUN wget https://download.open-mpi.org/release/open-mpi/v${OPENMPI_VERSION%.*}/openmpi-${OPENMPI_VERSION}.tar.gz && \
-    tar -xzf openmpi-${OPENMPI_VERSION}.tar.gz
-
-WORKDIR /tmp/openmpi-${OPENMPI_VERSION}
-RUN ./configure \
-    --prefix=${OMPI_DIR} \
-    --with-pmix \
-    --disable-pty-support
-RUN make -j$(nproc) all && make install
-
-# --- Build LAMMPS from source ---
-# Clone a specific stable release tag for reproducibility.
-WORKDIR /opt
-RUN git clone --depth 1 --branch ${LAMMPS_VERSION} https://github.com/lammps/lammps.git lammps
-
-# Use CMake to configure the LAMMPS build. Enable common packages.
-WORKDIR /opt/lammps/build
-RUN cmake ../cmake \
-    -D CMAKE_INSTALL_PREFIX=/usr/local \
-    -D BUILD_MPI=yes \
-    -D PKG_KSPACE=yes \
-    -D PKG_MOLECULE=yes \
-    -D PKG_RIGID=yes \
-    -D PKG_MANYBODY=yes \
-    -D PKG_REPLICA=yes \
-    -D CMAKE_BUILD_TYPE=Release \
-    -D LAMMPS_EXCEPTIONS=yes
-
-# Compile and install LAMMPS
-RUN make -j$(nproc) && make install
-
-# =====================================================================
-# Stage 2: Final Image
-# This stage creates the lean, final image. It starts from a minimal
-# base and only copies the necessary executables, libraries, and runtime
-# dependencies from the builder stage.
-# =====================================================================
-FROM debian:bullseye-slim
-
-# Inherit ARG for version consistency
-ARG OPENMPI_VERSION
-
-# Set environment variables for Open MPI runtime
-ENV OMPI_DIR=/opt/openmpi-${OPENMPI_VERSION}
-ENV PATH=/usr/local/bin:$OMPI_DIR/bin:$PATH
-
-# Install only the essential runtime dependencies.
-# libgfortran5 is required by the Fortran-compiled parts of LAMMPS.
-RUN apt-get update && apt-get install -y --no-install-recommends \
-    libevent-2.1-7 \
-    libgfortran5 \
-    libhwloc15 \
-    && rm -rf /var/lib/apt/lists/*
-
-# Copy the compiled Open MPI installation from the builder stage
-COPY --from=builder ${OMPI_DIR} ${OMPI_DIR}
-
-# Copy the entire LAMMPS installation (binary, libs, potentials) from the builder stage
-COPY --from=builder /usr/local /usr/local
-
-# Configure the dynamic linker to find Open MPI and LAMMPS libraries.
-# This is more robust than setting LD_LIBRARY_PATH.
-RUN echo "${OMPI_DIR}/lib" > /etc/ld.so.conf.d/openmpi.conf && \
-    echo "/usr/local/lib" > /etc/ld.so.conf.d/lammps.conf && \
-    ldconfig
-
-# Create a dedicated, non-root user for running the application for security
-RUN useradd --create-home --shell /bin/bash lammps
-USER lammps
-WORKDIR /home/lammps
-
-# Set the entrypoint to the LAMMPS executable.
-# Allows running the container with LAMMPS args directly, e.g., `docker run <image> -in in.lj`
-ENTRYPOINT ["lmp"]
-
-# Provide a default command to display help if no other args are provided.
-CMD ["-h"]
-# Generated by fractale build agent
+# Base image: Ubuntu 22.04 LTS for a stable and recent environment
+FROM ubuntu:22.04
+
+# Set non-interactive frontend for package managers to avoid prompts
+ENV DEBIAN_FRONTEND=noninteractive
+
+# Configure OpenMPI for containerized environments
+# Allow running MPI as root, a requirement for this specific Dockerfile
+ENV OMPI_ALLOW_RUN_AS_ROOT=1
+ENV OMPI_ALLOW_RUN_AS_ROOT_CONFIRM=1
+# Force components to work over TCP, common in container orchestrators
+ENV OMPI_MCA_btl=self,tcp
+ENV OMPI_MCA_pml=ob1
+ENV OMPI_MCA_btl_tcp_if_include=eth0
+ENV OMPI_MCA_oob_tcp_if_include=eth0
+
+# Install build dependencies, git, cmake, and MPI libraries
+# Added python3 to satisfy the LAMMPS cmake build system dependency
+RUN apt-get update && \
+    apt-get install -y --no-install-recommends \
+    build-essential \
+    cmake \
+    git \
+    wget \
+    ca-certificates \
+    g++ \
+    openmpi-bin \
+    libopenmpi-dev \
+    libfftw3-dev \
+    python3 && \
+    rm -rf /var/lib/apt/lists/*
+
+# Clone, build, and install LAMMPS
+# Using 'develop' branch as 'master' is no longer a valid branch in the LAMMPS repository
+# A selection of common CPU packages are enabled, including REAXFF as requested
+RUN git clone --depth 1 -b develop https://github.com/lammps/lammps.git /lammps && \
+    cd /lammps && \
+    mkdir build && \
+    cd build && \
+    cmake ../cmake \
+    -D CMAKE_INSTALL_PREFIX=/usr/local \
+    -D BUILD_MPI=yes \
+    -D BUILD_OMP=yes \
+    -D PKG_KSPACE=yes \
+    -D PKG_MOLECULE=yes \
+    -D PKG_RIGID=yes \
+    -D PKG_MANYBODY=yes \
+    -D PKG_REAXFF=yes \
+    -D PKG_MISC=yes \
+    -D PKG_EXTRA-COMPUTE=yes \
+    -D PKG_EXTRA-DUMP=yes \
+    -D PKG_EXTRA-FIX=yes \
+    -D PKG_EXTRA-MOLECULE=yes && \
+    make -j$(nproc) && \
+    make install
+
+# Set the working directory for the container
+WORKDIR /data
+
+# Copy the requested example files into the working directory
+# These files can be used for initial testing or as templates
+RUN cp /lammps/examples/reaxff/HNS/* /data/ && \
+    # Clean up the source code to reduce final image size
+    rm -rf /lammps
+
+# Set the default entrypoint to the LAMMPS executable
+# The executable is on the PATH due to the CMAKE_INSTALL_PREFIX
+ENTRYPOINT ["lmp"]
+
+# Default command can be overridden, e.g., docker run <image> -in in.script
+CMD ["-h"]
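The regenerated Dockerfile above is the product of the build agent's generate-build-retry loop described in the agent README. A minimal sketch of that loop, under stated assumptions: the `generate` and `build` callables stand in for the LLM call and `docker build`, and none of these names are fractale's real API.

```python
# Hypothetical sketch of a build agent: generate a Dockerfile, try to build it,
# and feed the build error back into the next generation until it succeeds.

def build_until_success(generate, build, max_attempts=5):
    error = None
    for _ in range(max_attempts):
        dockerfile = generate(error)   # an LLM call in the real agent
        ok, error = build(dockerfile)  # e.g. `docker build` in the real agent
        if ok:
            return dockerfile
    raise RuntimeError(f"build failed: {error}")

# Toy stand-ins: the second generation "fixes" the reported error
def fake_generate(error):
    return "FROM ubuntu:22.04\n" + ("RUN apt-get update\n" if error else "RUN bogus\n")

def fake_build(dockerfile):
    return ("bogus" not in dockerfile, "bogus: command not found")

print(build_until_success(fake_generate, fake_build))
```

The interesting design point is that the error string is the only feedback channel: the quality of the retry depends entirely on how informative the build error is.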

examples/agent/README.md

Lines changed: 65 additions & 5 deletions
@@ -1,19 +1,22 @@
 # Agents
 
-Let's use fractale to run build, execute, and deploy agents. Right now we will run these a-la-carte, and eventually we will group them together to loop back and repeat steps if needed.
+Let's use fractale to run build, execute, and deploy agents. For now we will run these a la carte, and then we will group them together to be run by a manager agent that requests steps when needed.
 
-## Build
+## A-la-carte
+
+### Build
 
 The build agent will use the Gemini API to generate a Dockerfile and then build until it succeeds. We would need subsequent agents to test it.
 Here is how to first ask the build agent to generate a lammps container for Google Cloud.
 
 ```bash
-fractale agent build lammps --environment "google cloud" --outfile dockerfile
+fractale agent build lammps --environment "google cloud CPU" --outfile Dockerfile --details "Ensure all globbed files from examples/reaxff/HNS from the root of the lammps codebase are in the WORKDIR. Clone the latest branch of LAMMPS."
 ```
 
+Note that we are specific about the data and about using CPU, which is otherwise something the builder agent would have to guess.
 That might generate the [Dockerfile](Dockerfile) here, and a container that defaults to the application name "lammps".
 
-## Kubernetes Job
+### Kubernetes Job
 
 The Kubernetes job agent will be asked to run a command, and will be provided the Dockerfile and the name of the container. We assume that another agent (or you) has built the image and either pushed it to a registry or loaded it. Let's create our cluster and load the image:
 
@@ -25,6 +28,63 @@ kind load docker-image lammps
 To start, we will assume a kind cluster is running and tell the agent the image is loaded into it (and so the pull policy will be never).
 
 ```bash
-fractale agent kubernetes-job lammps --environment "google cloud CPU" --context-file ./Dockerfile --no-pull
+fractale agent kubernetes-job lammps --environment "google cloud CPU" --context-file ./Dockerfile --no-pull --details "Run in.reaxff.hns in the pwd with lmp" --outfile ./job.yaml
+```
+
+## With Cache
+
+The same steps can be run using a cache. This saves to a deterministic path in the present working directory, which means you can run steps a la carte and then run a workflow later that re-uses the context (so you don't wait again).
+Note that when you save a cache, you often don't need to save the output file, because it will be the result in the context.
+
+```bash
+fractale agent build lammps --environment "google cloud CPU" --details "Ensure all globbed files from examples/reaxff/HNS from the root of the lammps codebase are in the WORKDIR. Clone the latest branch of LAMMPS." --use-cache
 ```
 
+Then try running with the manager (below) with the cache to see it being used.
+
+## Manager
+
+Let's run with a manager. Using a manager means we provide a plan along with a goal. The manager itself takes on a similar structure to a step agent, but it has a high-level goal. The manager will follow the high-level structure of the plan, and step
+agents can often run independently for some number of attempts. If a step agent
+returns after these attempts still with a failure, or if the last step is a failure,
+the manager can decide how to act. For example, if a Kubernetes job deploys but fails,
+it could be due to the Dockerfile build (the build step) or the manifest for the Job.
+The manager needs to prepare the updated context to return to that step, and then
+try again.
+
+```bash
+fractale agent --plan ./plans/run-lammps.yaml
+
+# or try using with the cache
+fractale agent --plan ./plans/run-lammps.yaml --use-cache
+```
+
+We haven't hit the case yet where the manager needs to take over - that needs further development, along with being goal-oriented (e.g., parsing a log and getting an output).
+
+## Notes
+
+### To-do items
+
+- Figure out the optimization agent (with some goal)
+
+### Research questions
+
+**And experiment ideas**
+
+- How do we define stability?
+- What are the increments of change (e.g., "adding a library")? We should be able to keep track of times for each stage and what changed, and an analyzer LLM can look at the result and understand (categorize) the most salient contributions to change.
+- We can also measure the time it takes to do subsequent changes, when relevant. For example, if we are building, we should be able to use cached layers (and the build times speed up) if the LLM is changing content later in the Dockerfile.
+- We can also save the successful results (Dockerfile builds, for example) and compare them for similarity. How consistent is the LLM?
+- How does the specificity of the prompt influence the result?
+- For an experiment, we would want to do a build -> deploy and successful run for a series of apps and get distributions of attempts, reasons for failure, and a general sense of similarity / differences.
+- For the optimization experiment, we'd want to do the same, but understand the gradients of change that led to improvement.
+
+### Observations
+
+- Specifying CPU seems important - if you don't, it wants to do GPU
+- If you ask for a specific example, it sometimes tries to download data (tell it where the data is)
+- There are issues that result from not enough information. E.g., if you don't tell it what to run or where the data is, it can only guess, and it will loop forever.
+  - As an example, we know where in a git clone the data of interest is. The LLM can only guess. It's easier to tell it exactly.
+- An LLM has no sense of time with respect to versions. For example, the reax data changed from reaxc to reaxff in the same path, and which you get depends on the clone. Depending on when the LLM was trained on how to build lammps, it might select an older (or the latest) branch. Instead of a guessing game that (again) would result in an infinite loop, we need to tell it the branch and data file explicitly.
+- Always include common issues in the initial prompt
+- If you are too specific about instance types, it adds node selectors/affinity, and that often doesn't work.
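The cache section above mentions saving to a deterministic path in the present working directory. One plausible scheme for such a path is to hash the agent name and its context, so identical requests map to the same cache entry. This is an assumption for illustration, not fractale's actual cache layout; `cache_path` and `.fractale-cache` are invented names.

```python
# Sketch (assumption): a deterministic cache path derived from the step context.
import hashlib
import json

def cache_path(agent, context, root=".fractale-cache"):
    # Canonical JSON (sorted keys) makes the hash independent of dict ordering
    key = json.dumps({"agent": agent, "context": context}, sort_keys=True)
    digest = hashlib.sha256(key.encode()).hexdigest()[:16]
    return f"{root}/{agent}-{digest}.json"

p1 = cache_path("build", {"application": "lammps", "environment": "google cloud CPU"})
p2 = cache_path("build", {"environment": "google cloud CPU", "application": "lammps"})
assert p1 == p2  # key order does not change the path
print(p1)
```

Determinism is what lets the a-la-carte runs and a later managed workflow share context without re-running the expensive steps.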

examples/agent/plans/run-lammps.yaml

Lines changed: 21 additions & 0 deletions
@@ -0,0 +1,21 @@
+name: Build and Deploy LAMMPS
+description: Build a Docker container and deploy it as a Kubernetes Job.
+plan:
+
+  # Important: everything you want to provide to the manager agent should be defined.
+  # Agents can pass steps in between, but the manager is always given stateless context.
+  - agent: build
+    context:
+      environment: "google cloud CPU instance in Kubernetes"
+      application: lammps
+      max_attempts: 10
+      details: |
+        Ensure all globbed files from examples/reaxff/HNS from the root of the lammps codebase are in the WORKDIR. Clone the latest branch of LAMMPS.
+
+  - agent: kubernetes-job
+    context:
+      no_pull: true
+      environment: "google cloud CPU instance in Kubernetes"
+      max_attempts: 10
+      details: |
+        Run in.reaxff.hns in the pwd with lmp -v x 2 -v y 2 -v z 2 -in ./in.reaxff.hns -nocite
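A manager consuming a plan like the one above might walk the steps in order, giving each its `max_attempts` budget before escalating. A minimal Python sketch, with the plan inlined as a dict (in practice it would be parsed from the YAML) and a stand-in runner that is not fractale's real agent API:

```python
# Hypothetical sketch of a manager walking a plan's steps with retry budgets.
plan = {
    "name": "Build and Deploy LAMMPS",
    "plan": [
        {"agent": "build", "context": {"application": "lammps", "max_attempts": 10}},
        {"agent": "kubernetes-job", "context": {"no_pull": True, "max_attempts": 10}},
    ],
}

def run_plan(plan, run_agent):
    results = {}
    for step in plan["plan"]:
        attempts = step["context"].get("max_attempts", 1)
        for _ in range(attempts):
            ok, output = run_agent(step["agent"], step["context"], results)
            if ok:
                results[step["agent"]] = output
                break
        else:
            # step exhausted its attempts: here the manager must decide how to act
            raise RuntimeError(f"{step['agent']} failed after {attempts} attempts")
    return results

# Toy runner that always succeeds on the first attempt
print(run_plan(plan, lambda agent, ctx, prior: (True, f"{agent} ok")))
# -> {'build': 'build ok', 'kubernetes-job': 'kubernetes-job ok'}
```

The `results` dict passed back into each step mirrors the README's note that agents can pass state between steps while the manager itself stays stateless.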
