
real container resume redux redux #129

Merged
cwlbraa merged 10 commits into main from real-container-resume-redux-redux on Jun 26, 2025

Conversation

@cwlbraa (Contributor) commented Jun 24, 2025

it's working!

this PR implements "real" container resume: we store each Run's WithExec in the container LLB definition and shore up the base of that container by saving off InitialSourceDir in state. InitialSourceDir is made consistent and re-hydratable by using dag.Host().Directory(forkRepoPath).AsGit().Ref(initialWorktreeHead).Tree(). because we can rely on that SHA always pointing at the exact right base content, and on the forkRepo always having that SHA, we can source the base sourcedir without keeping a copy of the first commit on disk.
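a minimal sketch of that rehydration, assuming the dagger Go SDK with a client dag in scope; the wrapping function and parameter names are hypothetical:

func initialSourceDir(forkRepoPath, initialWorktreeHead string) *dagger.Directory {
    // the SHA always points at the exact right base content, and the fork
    // repo always has that SHA, so no copy of the first commit lives on disk
    return dag.Host().
        Directory(forkRepoPath).
        AsGit().
        Ref(initialWorktreeHead).
        Tree()
}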

6/25: this PR does have a big piece of "magic" in it. we template in the .git file to be gitdir: repos/name/worktrees/id so that, on export, we don't constantly overwrite and reset the git history.
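for concreteness, a hedged sketch of that templating via the dagger Go SDK's WithNewFile; repoName and worktreeID are hypothetical stand-ins for the real values:

// write a gitfile (a plain file named .git) into the exported sourcedir,
// pointing it at the worktree's git dir as described above
gitFile := fmt.Sprintf("gitdir: repos/%s/worktrees/%s", repoName, worktreeID)
sourceDir = sourceDir.WithNewFile(".git", gitFile)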

Other stuff buried in here:

  • env.apply no longer takes arguments that give the false impression it's gonna make a commit. it just takes newState (the container). see the sketch after this list.
  • env.run calls apply
    • run_background still doesn't. to make those restorable we've gotta do some additional thinking around how we data-model service persistence.
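a hypothetical sketch of the shape those bullets imply; everything beyond the apply/run names is a guess:

// apply only records the new container state; it makes no commit
func (env *Environment) apply(newState *dagger.Container) {
    env.Container = newState
}

// run executes the command and applies the resulting container
func (env *Environment) run(args []string) {
    env.apply(env.container().WithExec(args))
}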

additionally on 6/25 cc @aluzzardi:

  • better error handling in the event of failed commands. we now no longer fail the withexec for failed commands, but still propagate error information to the git notes and the tool responses. this is necessary for saving failed commands into the container state. see the sketch after this list.
  • better prompt engineering around endpoints. the agent was getting confused and using localhost to try to talk to things it had run_backgrounded.
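a sketch of that error handling, assuming the dagger Go SDK; everything around the Expect option is inferred from this thread, not verbatim from the PR:

newState := env.container().WithExec(args, dagger.ContainerWithExecOpts{
    Expect: dagger.ReturnTypeAny, // a failing command no longer fails the WithExec
})
exitCode, err := newState.ExitCode(ctx)
if err != nil {
    return err // an engine error, not a command failure
}
env.apply(newState) // persist the container state even when the command failed
if exitCode != 0 {
    stderr, _ := newState.Stderr(ctx)
    // propagate failure details to the git notes and the tool response
    return fmt.Errorf("command failed with exit code %d: %s", exitCode, stderr)
}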

closes #115.

cwlbraa added 4 commits June 24, 2025 14:12 (all Signed-off-by: Connor Braa <connor@dagger.io>)
@cwlbraa cwlbraa force-pushed the real-container-resume-redux-redux branch from d1a9b33 to 8baab20 on June 24, 2025 21:13
@cwlbraa cwlbraa requested a review from aluzzardi June 24, 2025 22:01
@cwlbraa cwlbraa marked this pull request as ready for review June 24, 2025 22:01
@aluzzardi (Contributor) left a comment

Looks good!

Just to make sure I have the right mental model: worktrees are now only used as a staging area to dump data to be committed. Reading data is performed directly on the forked repo.

Slight but important change.

nit without a better suggestion: initialSourceDirID took me a minute to understand. Now it makes sense but I don't have a better name to offer.


ID string
// TODO(braa): I think we only need this for Export, now. remove and pass explicitly.
@aluzzardi (Contributor) replied:

Yep, AFAIK only Export() and Config.Locked() need this.

We could remove Config.Locked (undocumented, unused feature) and make this explicit on Export().

This would be much cleaner, since now repository is the only thing dealing with worktrees.

@cwlbraa (Contributor Author) replied:
yeah, i intend to try this in a follow-up.

@cwlbraa (Contributor Author) commented Jun 25, 2025

> nit without a better suggestion: initialSourceDirID took me a minute to understand. Now it makes sense but I don't have a better name to offer.

Strongly agree with this, I couldn't decide on a name so I went verbose. Other options include BaseSourceDirID, InitialCommitDirID, and variations... having typed it out, maybe now I'm partial to BaseSourceDirID?

@aluzzardi (Contributor) commented:
BREAKING CHANGE: currently this breaks existing environments; if they don't have an InitialCommitDir, they can no longer be opened.

Random collection of thoughts:

  • State.InitialSourceDir is actually only ever used by EnvironmentUpdate -- the "main" state resumer just re-hydrates Container (which happens to have the source dir as the first operation, but should be entirely backward compatible)
  • At the very least, with a bit of error handling, we should be able to resume old environments (with environment_update explicitly erroring out because of the lack of initialSourceDir)
  • The whole EnvironmentUpdate is a bit hacky and we're not overly attached to it, wondering if there's an alternative where we actually don't even need to store the source dir in state?
  • Actually, this makes me question the whole EnvironmentUpdate ... right now with this change it means that if the user updates the environment, they go back in time to the initial source directory, right? e.g. from ubuntu ; touch foo ; from alpine --> foo is gone from the workdir.
  • So the only reason we're keeping initialSourceDir around is to support a questionable behavior.
  • Now I'm wondering if we shouldn't just take the base image at environment_create time and stop supporting it as environment_update, which is very weird in any case (it "forgets" anything the agent might have done up to that point)

@aluzzardi (Contributor) commented:
Unfinished chain of thoughts (and this can be done in a different work stream; we don't want to tie everything to this PR, this is more of a UX/architecture brainstorm).

  • we get rid of .container-use/environment.json -- or at least, the agent-maintained one. There is an optional one, user-maintained, to set up secrets, force a base image, etc. (a hypothetical example follows this list)
  • environment_create takes base_image and setup_commands. The agent passes those along whenever creating an environment. They're optional, we default to the user config in environment.json (and if not provided we error out to the agent saying we need them)
  • the only reason environment.json exists in the first place is so when it's merged into main, the next agent will pick up the same environment
  • whereas in this scenario, each agent will have to independently guess ... unless enforced by the user
  • random thought: with environment resuming working well now, I'm wondering if the next agent can somehow pick up the whole environment rather than creating a new one from the template?
  • random random thought: what if we flip the scenario around and environments were created and managed by the user (e.g. cu create), agents are explicitly told where to work, and can re-use environments from other agents?
  • That would change the UX quite a lot, but not in a bad way? It would remove the ambiguity about resuming work etc? maybe
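for concreteness, a hypothetical user-maintained environment.json along these lines; base_image and setup_commands come from this thread, but the exact schema is a guess:

{
  "base_image": "ubuntu:latest",
  "setup_commands": [
    "apt-get update",
    "apt-get install -y build-essential"
  ]
}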

@cwlbraa (Contributor Author) commented Jun 25, 2025

@aluzzardi

> State.InitialSourceDir is actually only ever used by EnvironmentUpdate -- the "main" state resumer just re-hydrates Container (which happens to have the source dir as the first operation, but should be entirely backward compatible)

ah you're right! that's a good call, means the hackery i was gonna do is both unnecessary and excessive.

> Actually, this makes me question the whole EnvironmentUpdate ... right now with this change it means that if the user updates the environment, they go back in time to the initial source directory, right? e.g. from ubuntu ; touch foo ; from alpine --> foo is gone from the workdir.

true, and that's worrisome.

> Now I'm wondering if we shouldn't just take the base image at environment_create time and stop supporting it as environment_update, which is very weird in any case (it "forgets" anything the agent might have done up to that point)

more incremental UX brainstorming, i wonder if there's some happy medium where environment update:

  1. verifies that the update actually works for new, from-scratch envs, but throws away the resulting container
  2. applies "additional" setup commands and services to the tip of the current container

like conceptually, it'd be more like environment_edit?

> random thought: with environment resuming working well now, I'm wondering if the next agent can somehow pick up the whole environment rather than creating a new one from the template?

why would i want to only or by default do it this way when I could just continue my current agent session to get this behavior?

> random random thought: what if we flip the scenario around and environments were created and managed by the user (e.g. cu create), agents are explicitly told where to work, and can re-use environments from other agents?

as a user, this sounds like a big faff to me. if every time i've gotta prompt the agent on which environment to use, i'm not gonna use this ever for my "baseline" case, only for parallel agents (which is not something i'm doing today), and even then i've gotta figure out how to script the prompting to mux my agents across pre-created envs.

@cwlbraa (Contributor Author) commented Jun 25, 2025

@aluzzardi also the dumbest possible change to fix environment_update is still an option: base from the up-to-date source dir instead of the original one.

we can even prompt-engineer addService, run_command, etc. to be like "to persist this change for future environment_create calls, use environment_update"

@cwlbraa (Contributor Author) commented Jun 25, 2025

soooo, possible TODOs before merging, partially because i wanna give @grouville a second to get the tests in:

A:

  • fix environment_update behavior by having it "squash" the container history onto current-worktree-as-base-sourcedir
  • use that to remove the extra state field

B:

  • remove knowledge of Worktree from environment.
  • remove environment update, add baseImage and setupCommands as optional arguments to environment_create

with B, i think there are problems around what's in the initial context window. like, we can have reasonable defaults (llm-provided -> environment.json -> ubuntu:latest) BUT the agent's not gonna know that the default will work, so i suspect it'll usually try to override?

A & B are in conflict with each other, unfortunately, unless we add another method to Repository to support the update case...

edit: took a bike ride, i think i can break the tradeoff here.

@aluzzardi (Contributor) commented:
C: Merge as is, live another day to figure out what to do? :)

@aluzzardi (Contributor) commented:
> with B, i think there are problems around what's in the initial context window. like, we can have reasonable defaults (llm-provided -> environment.json -> ubuntu:latest) BUT the agent's not gonna know that the default will work, so i suspect it'll usually try to override?

From what I saw, LLMs are lazy and will do their best to never provide optional arguments :)

- stop saving initialSourceDir - it's saved in the LLB.
- have environment_update rebuild the base from the workdir
- improve prompt engineering around internal/external ports
- improve run_command error handling so that failed commands can be saved into the container state

Signed-off-by: Connor Braa <connor@dagger.io>
@cwlbraa (Contributor Author) commented Jun 25, 2025

@aluzzardi C wasn't a real option because the run_command error handling was way too broken.

but i fixed it 😈 b1e042c

cwlbraa added 5 commits June 25, 2025 16:22 (all Signed-off-by: Connor Braa <connor@dagger.io>; one commit title truncated: "…agic")
@cwlbraa cwlbraa requested a review from aluzzardi June 26, 2025 16:01
newState := env.container().WithExec(args, dagger.ContainerWithExecOpts{
    UseEntrypoint: useEntrypoint,
    Expect:        dagger.ReturnTypeAny, // Don't treat non-zero exit as error
@aluzzardi (Contributor) commented:
note: I think this can have weird caching side effects by caching failed operations.

Particularly annoying for transient failures that won't get busted by content (e.g. pip install foo --> connection timeout --> will always return connection timeout from cache without trying).

Although not 100% sure the engine doesn't have workarounds already in place for that (/cc @sipsma ?)

@cwlbraa (Contributor Author) replied:
container.WithExec("pip install foo").WithExec("pip install foo")

because we're now saving the container on failures, too, doesn't it re-run because the second withexec is against a different container?

@aluzzardi (Contributor) replied:
Right! ...Maybe? Continue-on-failure is uncharted territory to me ... For instance, I'm not sure what would happen on cache bust: if some operation in the chain fails, would it just push through anyway?

e.g.

container.WithExec("pip install foo").WithExec("pip install foo")

on cache bust, this will actually run install foo twice, right?

container.WithExec("pip install foo").WithExec("touch foo")

on cache bust, if pip install foo fails, this will actually return success and not bubble up any error to the LLM, right?

@aluzzardi (Contributor) added:
As in -- on an operation-by-operation level it's all simple -- we manually check ExitCode and return accordingly.

However, when resuming from the Container state blob, we can't manually check ExitCodes for each exec in the chain, and are relying on 1) the cache not re-running execs, OR 2) if it's not cached, the engine re-execing and bubbling up exec errors along the way, if any.

I think dagger.ReturnTypeAny will prevent that, though. Emphasis on think.

@cwlbraa (Contributor Author) replied:
in my mental model, yes, this gets kinda broken in the case of new failures on cache bust.

// Re-build the base image from the worktree
container, err := env.buildBase(ctx)
// Get current working directory from container to preserve changes
currentWorkdir := env.container().Directory(env.Config.Workdir)
@aluzzardi (Contributor) commented:
LOVE this. So simple.

@cwlbraa (Contributor Author) replied:
dumb is sometimes best :)

this approach does mean that update no longer flattens the persisted container history, which was an accidental feature before - now it can get big and there's no way to collapse it. in the future we can maybe use the same AsGit().Ref().Tree() approach for update that we use for create.
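a minimal sketch of how those two pieces fit together, assuming dagger's Container.WithDirectory; the trailing apply call is a guess:

// overlay the preserved workdir onto the freshly rebuilt base so update
// doesn't travel back in time to the initial source directory
container = container.WithDirectory(env.Config.Workdir, currentWorkdir)
env.apply(container) // hypothetical: record the rebuilt container as the new state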

@cwlbraa cwlbraa merged commit ac76bf2 into main Jun 26, 2025
2 checks passed
@aluzzardi aluzzardi deleted the real-container-resume-redux-redux branch June 26, 2025 20:26