
real container resume redux redux #129

Merged
cwlbraa merged 10 commits into main from real-container-resume-redux-redux on Jun 26, 2025

Conversation

@cwlbraa (Contributor) commented Jun 24, 2025

it's working!

this PR implements "real" container resume: we store each Run's WithExec in the container LLB definition and shore up the base of that container by saving off InitialSourceDir in state. InitialSourceDir is made consistent and re-hydratable by using dag.Host().Directory(forkRepoPath).AsGit().Ref(initialWorktreeHead).Tree(). because we can rely on that SHA always pointing at the exact right base content, and on the forkRepo always having that SHA, we can source the base sourcedir without keeping a copy of the first commit on disk.
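a minimal sketch of that rehydration, assuming the dagger Go SDK with a client dag in scope; the wrapping function and parameter names are hypothetical:

func initialSourceDir(forkRepoPath, initialWorktreeHead string) *dagger.Directory {
    // the SHA always points at the exact right base content, and the fork
    // repo always has that SHA, so no copy of the first commit lives on disk
    return dag.Host().
        Directory(forkRepoPath).
        AsGit().
        Ref(initialWorktreeHead).
        Tree()
}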

6/25: this PR does have a big piece of "magic" in it. we template in the .git file to be gitdir: repos/name/worktrees/id so that, on export, we don't constantly overwrite and reset the git history.
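for concreteness, a hedged sketch of that templating via the dagger Go SDK's WithNewFile; repoName and worktreeID are hypothetical stand-ins for the real values:

// write a gitfile (a plain file named .git) into the exported sourcedir,
// pointing it at the worktree's git dir as described above
gitFile := fmt.Sprintf("gitdir: repos/%s/worktrees/%s", repoName, worktreeID)
sourceDir = sourceDir.WithNewFile(".git", gitFile)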

Other stuff buried in here:

  • env.apply no longer takes arguments that give the false impression it's gonna make a commit. it just takes newState (the container). see the sketch after this list.
  • env.run calls apply
    • run_background still doesn't. to make those restorable we've gotta do some additional thinking around how we data-model service persistence.
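a hypothetical sketch of the shape those bullets imply; everything beyond the apply/run names is a guess:

// apply only records the new container state; it makes no commit
func (env *Environment) apply(newState *dagger.Container) {
    env.Container = newState
}

// run executes the command and applies the resulting container
func (env *Environment) run(args []string) {
    env.apply(env.container().WithExec(args))
}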

additionally on 6/25 cc @aluzzardi:

  • better error handling in the event of failed commands. we now no longer fail the withexec for failed commands, but still propagate error information to the git notes and the tool responses. this is necessary for saving failed commands into the container state. see the sketch after this list.
  • better prompt engineering around endpoints. the agent was getting confused and using localhost to try to talk to things it had run_backgrounded.
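a sketch of that error handling, assuming the dagger Go SDK; everything around the Expect option is inferred from this thread, not verbatim from the PR:

newState := env.container().WithExec(args, dagger.ContainerWithExecOpts{
    Expect: dagger.ReturnTypeAny, // a failing command no longer fails the WithExec
})
exitCode, err := newState.ExitCode(ctx)
if err != nil {
    return err // an engine error, not a command failure
}
env.apply(newState) // persist the container state even when the command failed
if exitCode != 0 {
    stderr, _ := newState.Stderr(ctx)
    // propagate failure details to the git notes and the tool response
    return fmt.Errorf("command failed with exit code %d: %s", exitCode, stderr)
}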

closes #115.

cwlbraa added 4 commits June 24, 2025 14:12 (all Signed-off-by: Connor Braa <connor@dagger.io>)
@cwlbraa cwlbraa force-pushed the real-container-resume-redux-redux branch from d1a9b33 to 8baab20 on June 24, 2025 21:13
@cwlbraa cwlbraa requested a review from aluzzardi June 24, 2025 22:01
@cwlbraa cwlbraa marked this pull request as ready for review June 24, 2025 22:01
@aluzzardi (Contributor) left a comment

Looks good!

Just to make sure I have the right mental model: worktrees are now only used as a staging area to dump data to be committed. Reading data is performed directly on the forked repo.

Slight but important change.

nit without a better suggestion: initialSourceDirID took me a minute to understand. Now it makes sense but I don't have a better name to offer.


ID string
// TODO(braa): I think we only need this for Export, now. remove and pass explicitly.
@aluzzardi (Contributor) replied:

Yep, AFAIK only Export() and Config.Locked() need this.

We could remove Config.Locked (undocumented, unused feature) and make this explicit on Export().

This would be much cleaner, since now repository is the only thing dealing with worktrees.

@cwlbraa (Contributor Author) replied:
yeah, i intend to try this in a follow-up.

@cwlbraa (Contributor Author) commented Jun 25, 2025

> nit without a better suggestion: initialSourceDirID took me a minute to understand. Now it makes sense but I don't have a better name to offer.

Strongly agree with this, I couldn't decide on a name so I went verbose. Other options include BaseSourceDirID, InitialCommitDirID, and variations... having typed it out, maybe now I'm partial to BaseSourceDirID?

@aluzzardi (Contributor) commented:
BREAKING CHANGE: currently this breaks existing environments; if they don't have an InitialCommitDir, they can no longer be opened.

Random collection of thoughts:

  • State.InitialSourceDir is actually only ever used by EnvironmentUpdate -- the "main" state resumer just re-hydrates Container (which happens to have the source dir as the first operation, but should be entirely backward compatible)
  • At the very least, with a bit of error handling, we should be able to resume old environments (with environment_update explicitly erroring out because of the lack of initialSourceDir)
  • The whole EnvironmentUpdate is a bit hacky and we're not overly attached to it, wondering if there's an alternative where we actually don't even need to store the source dir in state?
  • Actually, this makes me question the whole EnvironmentUpdate ... right now with this change it means that if the user updates the environment, they go back in time to the initial source directory, right? e.g. from ubuntu ; touch foo ; from alpine --> foo is gone from the workdir.
  • So the only reason we're keeping initialSourceDir around is to support a questionable behavior.
  • Now I'm wondering if we shouldn't just take the base image at environment_create time and stop supporting it as environment_update, which is very weird in any case (it "forgets" anything the agent might have done up to that point)

@aluzzardi (Contributor) commented:
Unfinished chain of thoughts (and this can be done in a different work stream; we don't want to tie everything to this PR, this is more of a UX/architecture brainstorm).

  • we get rid of .container-use/environment.json -- or at least, the agent-maintained one. There is an optional one, user-maintained, to set up secrets, force a base image, etc. (a hypothetical example follows this list)
  • environment_create takes base_image and setup_commands. The agent passes those along whenever creating an environment. They're optional, we default to the user config in environment.json (and if not provided we error out to the agent saying we need them)
  • the only reason environment.json exists in the first place is so when it's merged into main, the next agent will pick up the same environment
  • whereas in this scenario, each agent will have to independently guess ... unless enforced by the user
  • random thought: with environment resuming working well now, I'm wondering if the next agent can somehow pick up the whole environment rather than creating a new one from the template?
  • random random thought: what if we flip the scenario around and environments were created and managed by the user (e.g. cu create), agents are explicitly told where to work, and can re-use environments from other agents?
  • That would change the UX quite a lot, but not in a bad way? It would remove the ambiguity about resuming work etc? maybe
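for concreteness, a hypothetical user-maintained environment.json along these lines; base_image and setup_commands come from this thread, but the exact schema is a guess:

{
  "base_image": "ubuntu:latest",
  "setup_commands": [
    "apt-get update",
    "apt-get install -y build-essential"
  ]
}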

@cwlbraa (Contributor Author) commented Jun 25, 2025

@aluzzardi

> State.InitialSourceDir is actually only ever used by EnvironmentUpdate -- the "main" state resumer just re-hydrates Container (which happens to have the source dir as the first operation, but should be entirely backward compatible)

ah you're right! that's a good call, means the hackery i was gonna do is both unnecessary and excessive.

> Actually, this makes me question the whole EnvironmentUpdate ... right now with this change it means that if the user updates the environment, they go back in time to the initial source directory, right? e.g. from ubuntu ; touch foo ; from alpine --> foo is gone from the workdir.

true, and that's worrisome.

> Now I'm wondering if we shouldn't just take the base image at environment_create time and stop supporting it as environment_update, which is very weird in any case (it "forgets" anything the agent might have done up to that point)

more incremental UX brainstorming, i wonder if there's some happy medium where environment update:

  1. verifies that the update actually works for new, from-scratch envs, but throws away the resulting container
  2. applies "additional" setup commands and services to the tip of the current container

like conceptually, it'd be more like environment_edit?

> random thought: with environment resuming working well now, I'm wondering if the next agent can somehow pick up the whole environment rather than creating a new one from the template?

why would i want to only or by default do it this way when I could just continue my current agent session to get this behavior?

> random random thought: what if we flip the scenario around and environments were created and managed by the user (e.g. cu create), agents are explicitly told where to work, and can re-use environments from other agents?

as a user, this sounds like a big faff to me. if every time i've gotta prompt the agent on which environment to use, i'm not gonna use this ever for my "baseline" case, only for parallel agents (which is not something i'm doing today), and even then i've gotta figure out how to script the prompting to mux my agents across pre-created envs.

@cwlbraa (Contributor Author) commented Jun 25, 2025

@aluzzardi also the dumbest possible change to fix environment_update is still an option: base from the up-to-date source dir instead of the original one.

we can even prompt-engineer addService, run_command, etc. to be like "to persist this change for future environment_create calls, use environment_update"

@cwlbraa (Contributor Author) commented Jun 25, 2025

soooo, possible TODOs before merging, partially because i wanna give @grouville a second to get the tests in:

A:

  • fix environment_update behavior by having it "squash" the container history onto current-worktree-as-base-sourcedir
  • use that to remove the extra state field

B:

  • remove knowledge of Worktree from environment.
  • remove environment update, add baseImage and setupCommands as optional arguments to environment_create

with B, i think there are problems around what's in the initial context window. like, we can have reasonable defaults (llm-provided -> environment.json -> ubuntu:latest) BUT the agent's not gonna know that the default will work, so i suspect it'll usually try to override?

A & B are in conflict with each other, unfortunately, unless we add another method to Repository to support the update case...

edit: took a bike ride, i think i can break the tradeoff here.

@aluzzardi (Contributor) commented:
C: Merge as is, live another day to figure out what to do? :)

@aluzzardi (Contributor) commented:
> with B, i think there are problems around what's in the initial context window. like, we can have reasonable defaults (llm-provided -> environment.json -> ubuntu:latest) BUT the agent's not gonna know that the default will work, so i suspect it'll usually try to override?

From what I saw, LLMs are lazy and will do their best to never provide optional arguments :)

- stop saving initialSourceDir - it's saved in the LLB.
- have environment_update rebuild the base from the workdir
- improve prompt engineering around internal/external ports
- improve run_command error handling so that failed commands can be saved into the container state

Signed-off-by: Connor Braa <connor@dagger.io>
@cwlbraa (Contributor Author) commented Jun 25, 2025

@aluzzardi C wasn't a real option because the run_command error handling was way too broken.

but i fixed it 😈 b1e042c

cwlbraa added 5 commits June 25, 2025 16:22 (all Signed-off-by: Connor Braa <connor@dagger.io>; one commit title truncated: "…agic")
@cwlbraa cwlbraa requested a review from aluzzardi June 26, 2025 16:01
newState := env.container().WithExec(args, dagger.ContainerWithExecOpts{
    UseEntrypoint: useEntrypoint,
    Expect:        dagger.ReturnTypeAny, // Don't treat non-zero exit as error
@aluzzardi (Contributor) commented:
note: I think this can have weird caching side effects by caching failed operations.

Particularly annoying for transient failures that won't get busted by content (e.g. pip install foo --> connection timeout --> will always return connection timeout from cache without trying).

Although not 100% sure the engine doesn't have workarounds already in place for that (/cc @sipsma ?)

@cwlbraa (Contributor Author) replied:
container.WithExec("pip install foo").WithExec("pip install foo")

because we're now saving the container on failures, too, doesn't it re-run because the second withexec is against a different container?

@aluzzardi (Contributor) replied:
Right! ...Maybe? Continue-on-failure is uncharted territory to me ... For instance, I'm not sure what would happen on cache bust: if some operation in the chain fails, would it just push through anyway?

e.g.

container.WithExec("pip install foo").WithExec("pip install foo")

on cache bust, this will actually run install foo twice, right?

container.WithExec("pip install foo").WithExec("touch foo")

on cache bust, if pip install foo fails, this will actually return success and not bubble up any error to the LLM, right?

@aluzzardi (Contributor) added:
As in -- on an operation-by-operation level it's all simple -- we manually check ExitCode and return accordingly.

However, when resuming from the Container state blob, we can't manually check ExitCodes for each exec in the chain, and are relying on 1) the cache not re-running execs, OR 2) if it's not cached, the engine re-execing and bubbling up exec errors along the way, if any.

I think dagger.ReturnTypeAny will prevent that, though. Emphasis on think.

@cwlbraa (Contributor Author) replied:
in my mental model, yes, this gets kinda broken in the case of new failures on cache bust.

// Re-build the base image from the worktree
container, err := env.buildBase(ctx)
// Get current working directory from container to preserve changes
currentWorkdir := env.container().Directory(env.Config.Workdir)
@aluzzardi (Contributor) commented:
LOVE this. So simple.

@cwlbraa (Contributor Author) replied:
dumb is sometimes best :)

this approach does mean that update no longer flattens the persisted container history, which was an accidental feature before - now it can get big and there's no way to collapse it. in the future we can maybe use the same AsGit().Ref().Tree() approach for update that we use for create.
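a minimal sketch of how those two pieces fit together, assuming dagger's Container.WithDirectory; the trailing apply call is a guess:

// overlay the preserved workdir onto the freshly rebuilt base so update
// doesn't travel back in time to the initial source directory
container = container.WithDirectory(env.Config.Workdir, currentWorkdir)
env.apply(container) // hypothetical: record the rebuilt container as the new state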

@cwlbraa cwlbraa merged commit ac76bf2 into main Jun 26, 2025
2 checks passed
@aluzzardi aluzzardi deleted the real-container-resume-redux-redux branch June 26, 2025 20:26