fix(daemon): refresh staged binaries when the bundle's copy changes#306
fix(daemon): refresh staged binaries when the bundle's copy changes#306AprilNEA wants to merge 1 commit into
Conversation
seed_from_bundle and ensure_agent_binary copied the agent, runtime
binaries, and boot assets only when the destination was *absent*, so after
an app update the daemon kept running stale staged copies under
~/.arcbox/bin and ~/.arcbox/runtime — most visibly a stale guest agent that
fails to come up against a newer kernel/rootfs ("timeout waiting for agent").
Add copy_if_changed(), which refreshes a file when it is missing or differs
from the source by size or mtime, mirroring the source mtime so later
startups detect an up-to-date copy with a cheap stat (no content read — the
runtime tree is hundreds of MB). Route agent, runtime, and boot seeding
through it. std::fs::copy carries over the source permission bits.
There was a problem hiding this comment.
✅ No new issues found.
Reviewed changes — a fix that makes the daemon's app-bundle/boot-cache seeding refresh staged binaries when the bundle's copy changes, instead of only copying when the destination is absent.
- Add
copy_if_changed— copiessrc→dstwhendstis missing or differs by size or mtime, mirrors the source mtime onto the destination, and returns whether a copy occurred. Relies onstd::fs::copycarrying over source permission bits, which lets the old manual exec-bit handling be removed. - Route all seed paths through it —
seed_from_bundle(agent),seed_dir_files(boot assets),seed_dir_recursive(runtime tree), andensure_agent_binary(boot-cache fallback) now refresh changed files rather than skipping any present destination. - Dependencies + tests — adds
filetimeand atempfiledev-dep; three unit tests cover copy-then-skip-when-identical, recopy-on-source-change, and exec-bit preservation.
ℹ️ Re-copy avoidance depends on mtime round-tripping exactly
The cheap-stat optimization assumes the value written by set_file_mtime reads back exactly equal on the next startup (assets.rs:99-100). This holds on the project's target filesystems — APFS, ext4, btrfs, xfs all store nanosecond mtimes, and source and destination both live on the same volume.
It is worth being aware of the failure mode on coarse-granularity filesystems (FAT, some network mounts): there set_file_mtime's value would be truncated, the equality check would fail every run, and the daemon would re-copy the entire hundreds-of-MB runtime tree on every start. Those aren't supported platforms for ~/.arcbox, so this is awareness only, not a requested change.
Technical details
# Re-copy avoidance depends on mtime round-tripping exactly
## Affected sites
- `app/arcbox-daemon/src/startup/assets.rs:96-110` — `copy_if_changed`: after `std::fs::copy`, `filetime::set_file_mtime(dst, src_mtime)` is the only signal that lets the next run's `dst_meta.mtime == src_mtime` comparison short-circuit. Strict equality means any precision loss on the destination FS defeats it.
## Required outcome
- None required for the supported platforms (APFS/ext4/btrfs/xfs round-trip nanosecond mtimes). Documented here so a future port to a coarse-mtime store doesn't silently regress warm-boot time by re-copying the full runtime tree every start.
## Suggested approach (optional)
- If broader-FS robustness is ever needed, compare with a tolerance (e.g. floor both mtimes to whole seconds) or treat `dst_mtime >= src_mtime && size matches` as up-to-date, rather than strict `==`.Claude Opus | 𝕏
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 242e83d5a1
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| std::fs::create_dir_all(agent_dest.parent().unwrap()) | ||
| .context("Failed to create agent bin directory")?; | ||
| std::fs::copy(&agent_src, &agent_dest).context("Failed to copy agent binary")?; | ||
| if copy_if_changed(&agent_src, &agent_dest).context("Failed to install agent binary")? { |
There was a problem hiding this comment.
Prefer bundle agent over boot-cache fallback
When launching from the app bundle, startup/mod.rs calls seed_from_bundle(&data_dir) and then ensure_agent_binary(&data_dir) in the same blocking task. If ~/.arcbox/boot/{version}/arcbox-agent already exists and differs from the bundle agent, this unconditional fallback copy immediately overwrites the just-refreshed Contents/Resources/bin/arcbox-agent, so an app update can still run an older cached agent and create daemon/guest protocol skew. The fallback should skip when the bundle supplied the agent, or otherwise prefer the bundle copy.
Useful? React with 👍 / 👎.
Greptile SummaryThis PR fixes the daemon's stale-binary problem by replacing the "copy only when destination is absent" logic with a new
Confidence Score: 3/5Safe to merge only if the boot-asset version key is always bumped whenever the agent binary changes; if agent-only releases are possible with the same version key, merging as-is can reproduce the stale-agent failure the PR set out to cure. The
Important Files Changed
Sequence DiagramsequenceDiagram
participant M as mod.rs / init_runtime
participant SFB as seed_from_bundle
participant EAB as ensure_agent_binary
participant EBA as ensure_boot_assets
participant CIC as copy_if_changed
M->>SFB: "seed_from_bundle(&data_dir)"
SFB->>CIC: copy_if_changed(bundle/bin/arcbox-agent, bin/arcbox-agent)
CIC-->>SFB: copied (new bundle agent staged)
SFB->>CIC: "copy_if_changed(bundle/runtime/**, runtime/**)"
CIC-->>SFB: copied N files
M->>EAB: "ensure_agent_binary(&data_dir)"
Note over EAB: boot/{version}/arcbox-agent may be OLDER than just-staged bundle agent
EAB->>CIC: "copy_if_changed(boot/{ver}/arcbox-agent, bin/arcbox-agent)"
alt Boot cache agent differs from bundle agent
CIC-->>EAB: copied — overwrites new bundle agent with old boot-cache agent
else Boot cache agent matches
CIC-->>EAB: skipped
end
M->>EBA: ensure_boot_assets (async)
EBA-->>M: download new assets if not cached
Reviews (1): Last reviewed commit: "fix(daemon): refresh staged binaries whe..." | Re-trigger Greptile |
| @@ -231,17 +232,66 @@ pub(super) fn ensure_agent_binary(data_dir: &Path) -> Result<()> { | |||
| return Ok(()); | |||
| } | |||
|
|
|||
| std::fs::create_dir_all(agent_dest.parent().unwrap()) | |||
| .context("Failed to create agent bin directory")?; | |||
| std::fs::copy(&agent_src, &agent_dest).context("Failed to copy agent binary")?; | |||
| if copy_if_changed(&agent_src, &agent_dest).context("Failed to install agent binary")? { | |||
| info!("Agent binary installed from boot cache"); | |||
| } | |||
| Ok(()) | |||
There was a problem hiding this comment.
ensure_agent_binary can silently downgrade the bundle-staged agent
The old if agent_dest.exists() { return Ok(()); } guard made ensure_agent_binary a no-op after seed_from_bundle had already staged the agent. Removing it means ensure_agent_binary now compares boot/{version}/arcbox-agent (the old boot cache) against bin/arcbox-agent (just updated from the bundle) and, if they differ in size or mtime, overwrites the freshly-staged bundle agent with the older boot cache copy.
Concrete failure path: user installs an app update that ships a new arcbox-agent in the bundle but keeps the same boot-asset version key (kernel/rootfs unchanged). seed_from_bundle correctly refreshes bin/arcbox-agent to the new version. ensure_agent_binary then sees a mtime mismatch between the old boot/{version}/arcbox-agent and the newly written bin/arcbox-agent, and copies the old one back — producing the same stale-agent symptom the PR set out to fix.
The docstring already acknowledges the design intent ("this is the fallback path for non-bundle installs"), but the call site in mod.rs invokes both unconditionally: seed_from_bundle(&data_dir) then ensure_agent_binary(&data_dir). Adding an early-return when seed_from_bundle succeeded (e.g. by checking whether the bundle path is present) or only calling ensure_agent_binary when find_bundle_contents() returns None would restore the correct priority.
| fn copy_if_changed(src: &Path, dst: &Path) -> std::io::Result<bool> { | ||
| let src_meta = std::fs::metadata(src)?; | ||
| let src_mtime = filetime::FileTime::from_last_modification_time(&src_meta); | ||
| if let Ok(dst_meta) = std::fs::metadata(dst) { | ||
| if dst_meta.len() == src_meta.len() | ||
| && filetime::FileTime::from_last_modification_time(&dst_meta) == src_mtime | ||
| { | ||
| return Ok(false); | ||
| } | ||
| } | ||
| if let Some(parent) = dst.parent() { | ||
| std::fs::create_dir_all(parent)?; | ||
| } | ||
| std::fs::copy(src, dst)?; | ||
| filetime::set_file_mtime(dst, src_mtime)?; | ||
| Ok(true) | ||
| } |
There was a problem hiding this comment.
Non-atomic overwrite leaves a window of partially-written binaries
std::fs::copy writes directly to dst. If the daemon process is killed (OOM, SIGKILL, power loss) after the copy begins but before it completes, dst is left truncated. The old code only ever wrote to non-existent files, so the risk was limited to first-install; now every app update triggers an overwrite of live staged binaries.
The self-healing property (copy_if_changed detects the size mismatch on the next start and retries) limits the exposure window to one failed startup, but during that window the daemon will fail to launch any binary that was partially overwritten. A write-to-temp-then-rename approach would make each individual file replacement atomic and eliminate this window entirely. Given that these are executable binaries the daemon depends on at boot, the tradeoff is worth noting even if the expected failure rate is low.

Problem
seed_from_bundleandensure_agent_binarycopied the agent, runtime binaries, and boot assets into~/.arcbox/only when the destination was absent (!dst.exists()). Because the agent (bin/arcbox-agent) and runtime tree (runtime/) are not versioned by path, once staged they were never refreshed — so after an app update the daemon kept running stale staged binaries.This bit us in the field: a stale guest agent (from a previous version) persisted across an update and failed to come up against the newer kernel/rootfs, leaving the daemon crash-looping on
timeout waiting for agent. The bundled runtime binaries (dockerd/containerd/runc/FEX/…) had the same latent staleness.(Boot assets dodged it only because they live under a version-keyed path
boot/{version}/.)Fix
Add
copy_if_changed(src, dst): copies whendstis missing or differs fromsrcby size or mtime, and mirrors the source mtime onto the destination so later startups detect an up-to-date copy with a cheapstat— no content read, which matters because seeding runs on every daemon start and the runtime tree is hundreds of MB.std::fs::copycarries over the source permission bits (incl. the exec bit). Route agent, runtime, and boot seeding through it; this also removes the now-redundant manual exec-bit handling.Verification
copy_if_changed): copies-when-missing then skips-when-identical, recopies-on-source-change, preserves-exec-bit — all green.cargo clippy -p arcbox-daemon --all-targetsclean;cargo fmtclean;cargo test -p arcbox-daemon9 passed.Context: this is the latent bug behind the agent incident also addressed operationally alongside arcbox-desktop#234 / arcbox#304.