Skip to content

fix(daemon): refresh staged binaries when the bundle's copy changes#306

Open
AprilNEA wants to merge 1 commit into
masterfrom
fix/agent-update-mechanism
Open

fix(daemon): refresh staged binaries when the bundle's copy changes#306
AprilNEA wants to merge 1 commit into
masterfrom
fix/agent-update-mechanism

Conversation

@AprilNEA

Copy link
Copy Markdown
Member

Problem

seed_from_bundle and ensure_agent_binary copied the agent, runtime binaries, and boot assets into ~/.arcbox/ only when the destination was absent (!dst.exists()). Because the agent (bin/arcbox-agent) and runtime tree (runtime/) are not versioned by path, once staged they were never refreshed — so after an app update the daemon kept running stale staged binaries.

This bit us in the field: a stale guest agent (from a previous version) persisted across an update and failed to come up against the newer kernel/rootfs, leaving the daemon crash-looping on timeout waiting for agent. The bundled runtime binaries (dockerd/containerd/runc/FEX/…) had the same latent staleness.

(Boot assets dodged it only because they live under a version-keyed path boot/{version}/.)

Fix

Add copy_if_changed(src, dst): copies when dst is missing or differs from src by size or mtime, and mirrors the source mtime onto the destination so later startups detect an up-to-date copy with a cheap stat — no content read, which matters because seeding runs on every daemon start and the runtime tree is hundreds of MB. std::fs::copy carries over the source permission bits (incl. the exec bit). Route agent, runtime, and boot seeding through it; this also removes the now-redundant manual exec-bit handling.

Verification

  • New unit tests (copy_if_changed): copies-when-missing then skips-when-identical, recopies-on-source-change, preserves-exec-bit — all green.
  • cargo clippy -p arcbox-daemon --all-targets clean; cargo fmt clean; cargo test -p arcbox-daemon 9 passed.

Context: this is the latent bug behind the agent incident also addressed operationally alongside arcbox-desktop#234 / arcbox#304.

seed_from_bundle and ensure_agent_binary copied the agent, runtime
binaries, and boot assets only when the destination was *absent*, so after
an app update the daemon kept running stale staged copies under
~/.arcbox/bin and ~/.arcbox/runtime — most visibly a stale guest agent that
fails to come up against a newer kernel/rootfs ("timeout waiting for agent").

Add copy_if_changed(), which refreshes a file when it is missing or differs
from the source by size or mtime, mirroring the source mtime so later
startups detect an up-to-date copy with a cheap stat (no content read — the
runtime tree is hundreds of MB). Route agent, runtime, and boot seeding
through it. std::fs::copy carries over the source permission bits.

@pullfrog pullfrog Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✅ No new issues found.

Reviewed changes — a fix that makes the daemon's app-bundle/boot-cache seeding refresh staged binaries when the bundle's copy changes, instead of only copying when the destination is absent.

  • Add copy_if_changed — copies srcdst when dst is missing or differs by size or mtime, mirrors the source mtime onto the destination, and returns whether a copy occurred. Relies on std::fs::copy carrying over source permission bits, which lets the old manual exec-bit handling be removed.
  • Route all seed paths through itseed_from_bundle (agent), seed_dir_files (boot assets), seed_dir_recursive (runtime tree), and ensure_agent_binary (boot-cache fallback) now refresh changed files rather than skipping any present destination.
  • Dependencies + tests — adds filetime and a tempfile dev-dep; three unit tests cover copy-then-skip-when-identical, recopy-on-source-change, and exec-bit preservation.

ℹ️ Re-copy avoidance depends on mtime round-tripping exactly

The cheap-stat optimization assumes the value written by set_file_mtime reads back exactly equal on the next startup (assets.rs:99-100). This holds on the project's target filesystems — APFS, ext4, btrfs, xfs all store nanosecond mtimes, and source and destination both live on the same volume.

It is worth being aware of the failure mode on coarse-granularity filesystems (FAT, some network mounts): there set_file_mtime's value would be truncated, the equality check would fail every run, and the daemon would re-copy the entire hundreds-of-MB runtime tree on every start. Those aren't supported platforms for ~/.arcbox, so this is awareness only, not a requested change.

Technical details
# Re-copy avoidance depends on mtime round-tripping exactly

## Affected sites
- `app/arcbox-daemon/src/startup/assets.rs:96-110``copy_if_changed`: after `std::fs::copy`, `filetime::set_file_mtime(dst, src_mtime)` is the only signal that lets the next run's `dst_meta.mtime == src_mtime` comparison short-circuit. Strict equality means any precision loss on the destination FS defeats it.

## Required outcome
- None required for the supported platforms (APFS/ext4/btrfs/xfs round-trip nanosecond mtimes). Documented here so a future port to a coarse-mtime store doesn't silently regress warm-boot time by re-copying the full runtime tree every start.

## Suggested approach (optional)
- If broader-FS robustness is ever needed, compare with a tolerance (e.g. floor both mtimes to whole seconds) or treat `dst_mtime >= src_mtime && size matches` as up-to-date, rather than strict `==`.

Pullfrog  | Fix it ➔View workflow run | Using Claude Opus𝕏

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 242e83d5a1

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

std::fs::create_dir_all(agent_dest.parent().unwrap())
.context("Failed to create agent bin directory")?;
std::fs::copy(&agent_src, &agent_dest).context("Failed to copy agent binary")?;
if copy_if_changed(&agent_src, &agent_dest).context("Failed to install agent binary")? {

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Prefer bundle agent over boot-cache fallback

When launching from the app bundle, startup/mod.rs calls seed_from_bundle(&data_dir) and then ensure_agent_binary(&data_dir) in the same blocking task. If ~/.arcbox/boot/{version}/arcbox-agent already exists and differs from the bundle agent, this unconditional fallback copy immediately overwrites the just-refreshed Contents/Resources/bin/arcbox-agent, so an app update can still run an older cached agent and create daemon/guest protocol skew. The fallback should skip when the bundle supplied the agent, or otherwise prefer the bundle copy.

Useful? React with 👍 / 👎.

@greptile-apps

greptile-apps Bot commented Jun 15, 2026

Copy link
Copy Markdown

Greptile Summary

This PR fixes the daemon's stale-binary problem by replacing the "copy only when destination is absent" logic with a new copy_if_changed helper that copies whenever the destination is missing or differs from the source by size or mtime, then mirrors the source mtime onto the destination for cheap subsequent checks. The fix is applied to the agent binary, the runtime directory tree, and the individual boot-asset files.

  • copy_if_changed(src, dst) is the core change: it uses a size+mtime heuristic (same approach as rsync --archive) to decide whether to copy, then sets the destination mtime to match the source so the check is a stat-only operation on repeat starts — important because the runtime tree is hundreds of MB.
  • seed_from_bundle, seed_dir_files, and seed_dir_recursive are all updated to route through copy_if_changed; the old manual exec-bit-setting code is removed because std::fs::copy already carries permission bits.
  • ensure_agent_binary had its if agent_dest.exists() { return Ok(()); } early-return removed so it too refreshes on change — but this guard was also what prevented ensure_agent_binary (the boot-cache path) from overwriting a bundle-staged agent after seed_from_bundle ran, which now creates a potential priority inversion between the two staging paths.

Confidence Score: 3/5

Safe to merge only if the boot-asset version key is always bumped whenever the agent binary changes; if agent-only releases are possible with the same version key, merging as-is can reproduce the stale-agent failure the PR set out to cure.

The copy_if_changed helper and its wiring into seed_from_bundle and seed_dir_recursive are correct and well-tested. The risk is in ensure_agent_binary: removing the old early-return guard means that after seed_from_bundle writes the new bundle agent, ensure_agent_binary immediately compares it against the (potentially older) boot-cache copy and may overwrite it. Whether this fires depends on whether the boot-asset version key ever stays the same across an agent-only update — if it does, the staged agent regresses to the old version on every restart, the exact symptom the PR was fixing.

app/arcbox-daemon/src/startup/assets.rs — specifically the interaction between seed_from_bundle and ensure_agent_binary in init_runtime, and whether find_bundle_contents() returning Some should suppress the boot-cache install path.

Important Files Changed

Filename Overview
app/arcbox-daemon/src/startup/assets.rs Introduces copy_if_changed (size+mtime heuristic) and routes all binary seeding through it; fixes the core staleness bug but removes an early-return guard in ensure_agent_binary that prevented the boot cache from overwriting a bundle-staged agent
app/arcbox-daemon/Cargo.toml Adds filetime (runtime) and tempfile (dev) dependencies to support mtime-mirroring in copy_if_changed and the new unit tests; straightforward dependency additions
Cargo.lock Lockfile updated for filetime 0.2.29 (from 0.2.27) and tempfile; also drops now-unused libredox, plain, redox_syscall 0.7.3 transitive dependencies — all expected churn from the new direct deps

Sequence Diagram

sequenceDiagram
    participant M as mod.rs / init_runtime
    participant SFB as seed_from_bundle
    participant EAB as ensure_agent_binary
    participant EBA as ensure_boot_assets
    participant CIC as copy_if_changed

    M->>SFB: "seed_from_bundle(&data_dir)"
    SFB->>CIC: copy_if_changed(bundle/bin/arcbox-agent, bin/arcbox-agent)
    CIC-->>SFB: copied (new bundle agent staged)
    SFB->>CIC: "copy_if_changed(bundle/runtime/**, runtime/**)"
    CIC-->>SFB: copied N files

    M->>EAB: "ensure_agent_binary(&data_dir)"
    Note over EAB: boot/{version}/arcbox-agent may be OLDER than just-staged bundle agent
    EAB->>CIC: "copy_if_changed(boot/{ver}/arcbox-agent, bin/arcbox-agent)"
    alt Boot cache agent differs from bundle agent
        CIC-->>EAB: copied — overwrites new bundle agent with old boot-cache agent
    else Boot cache agent matches
        CIC-->>EAB: skipped
    end

    M->>EBA: ensure_boot_assets (async)
    EBA-->>M: download new assets if not cached
Loading

Reviews (1): Last reviewed commit: "fix(daemon): refresh staged binaries whe..." | Re-trigger Greptile

Comment on lines 222 to +238
@@ -231,17 +232,66 @@ pub(super) fn ensure_agent_binary(data_dir: &Path) -> Result<()> {
return Ok(());
}

std::fs::create_dir_all(agent_dest.parent().unwrap())
.context("Failed to create agent bin directory")?;
std::fs::copy(&agent_src, &agent_dest).context("Failed to copy agent binary")?;
if copy_if_changed(&agent_src, &agent_dest).context("Failed to install agent binary")? {
info!("Agent binary installed from boot cache");
}
Ok(())

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 ensure_agent_binary can silently downgrade the bundle-staged agent

The old if agent_dest.exists() { return Ok(()); } guard made ensure_agent_binary a no-op after seed_from_bundle had already staged the agent. Removing it means ensure_agent_binary now compares boot/{version}/arcbox-agent (the old boot cache) against bin/arcbox-agent (just updated from the bundle) and, if they differ in size or mtime, overwrites the freshly-staged bundle agent with the older boot cache copy.

Concrete failure path: user installs an app update that ships a new arcbox-agent in the bundle but keeps the same boot-asset version key (kernel/rootfs unchanged). seed_from_bundle correctly refreshes bin/arcbox-agent to the new version. ensure_agent_binary then sees a mtime mismatch between the old boot/{version}/arcbox-agent and the newly written bin/arcbox-agent, and copies the old one back — producing the same stale-agent symptom the PR set out to fix.

The docstring already acknowledges the design intent ("this is the fallback path for non-bundle installs"), but the call site in mod.rs invokes both unconditionally: seed_from_bundle(&data_dir) then ensure_agent_binary(&data_dir). Adding an early-return when seed_from_bundle succeeded (e.g. by checking whether the bundle path is present) or only calling ensure_agent_binary when find_bundle_contents() returns None would restore the correct priority.

Comment on lines +95 to +111
fn copy_if_changed(src: &Path, dst: &Path) -> std::io::Result<bool> {
let src_meta = std::fs::metadata(src)?;
let src_mtime = filetime::FileTime::from_last_modification_time(&src_meta);
if let Ok(dst_meta) = std::fs::metadata(dst) {
if dst_meta.len() == src_meta.len()
&& filetime::FileTime::from_last_modification_time(&dst_meta) == src_mtime
{
return Ok(false);
}
}
if let Some(parent) = dst.parent() {
std::fs::create_dir_all(parent)?;
}
std::fs::copy(src, dst)?;
filetime::set_file_mtime(dst, src_mtime)?;
Ok(true)
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Non-atomic overwrite leaves a window of partially-written binaries

std::fs::copy writes directly to dst. If the daemon process is killed (OOM, SIGKILL, power loss) after the copy begins but before it completes, dst is left truncated. The old code only ever wrote to non-existent files, so the risk was limited to first-install; now every app update triggers an overwrite of live staged binaries.

The self-healing property (copy_if_changed detects the size mismatch on the next start and retries) limits the exposure window to one failed startup, but during that window the daemon will fail to launch any binary that was partially overwritten. A write-to-temp-then-rename approach would make each individual file replacement atomic and eliminate this window entirely. Given that these are executable binaries the daemon depends on at boot, the tradeoff is worth noting even if the expected failure rate is low.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant