Skip to content

Comments

fix(acp): auto-retry npx without --prefer-offline on cache failure#939

Merged
IceyLiu merged 2 commits intomainfrom
zynx/fix/acp-npx-cache-retry
Feb 22, 2026
Merged

fix(acp): auto-retry npx without --prefer-offline on cache failure#939
IceyLiu merged 2 commits intomainfrom
zynx/fix/acp-npx-cache-retry

Conversation

@piorpua
Copy link
Contributor

@piorpua piorpua commented Feb 22, 2026

Summary

  • 两阶段 npx 启动策略:Claude 和 CodeBuddy 的 ACP 连接现在先以 --prefer-offline 快速启动,失败时自动去掉该 flag 重试,强制刷新 registry 缓存
  • 添加 --yes flag:Claude npx 调用补齐 --yes(与 CodeBuddy 对齐),防止 stdin piped 时交互提示挂起
  • 提取辅助方法spawnAndSetupClaude / spawnAndSetupCodebuddy 封装 spawn 逻辑,支持 preferOffline 参数控制

背景

PR #911 将 Claude ACP bridge 从 @zed-industries/claude-code-acp 迁移到 @zed-industries/claude-agent-acp,commit 2d3f824 锁定版本为 @0.18.0。升级用户因 npm 缓存中残留的旧 registry 元数据导致 ETARGET/ERESOLVE 错误,此前只能手动执行 npm cache clean --force 解决。

行为变化

场景 之前 之后
正常启动 --prefer-offline 快速启动 相同,无额外开销
缓存损坏/升级后 ETARGET 错误,需手动清缓存 自动重试,首次稍慢,后续恢复快速
npx 交互提示 (Claude) 可能挂起 --yes 自动跳过

Test plan

  • 正常启动 Claude 对话,确认 --prefer-offline 模式正常工作
  • 正常启动 CodeBuddy 对话,确认行为一致
  • 模拟缓存损坏(破坏 ~/.npm/_npx/ 缓存数据),确认自动重试成功
  • 检查控制台日志,确认重试时输出 warn 级别日志

When users upgrade from pre-PR#911 versions, stale npm cache metadata
causes ETARGET/ERESOLVE errors with --prefer-offline. Instead of
requiring manual `npm cache clean --force`, implement a two-phase
startup: try --prefer-offline first for fast startup, then retry
without it to refresh the cache automatically.

Also adds --yes flag to Claude npx invocation (matching CodeBuddy)
to prevent interactive prompts when stdin is piped.
Copy link

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

HIGH Issues

1. 新增 console.warn 违反 debug hygiene(且可能进入生产日志)

File: src/agent/acp/AcpConnection.ts:268-276

console.warn('[ACP] --prefer-offline failed, retrying with fresh registry lookup:', firstError instanceof Error ? firstError.message : String(firstError));

Problem: 项目 ESLint 规则里 no-console 是 warn,并且 review 维度明确要求“新增 console.log/console.debug 要重点标记”。这里新增了 console.warn(同属 console 家族),会在正常用户环境中输出,且该路径在缓存异常/升级后是“常见可复现”的场景,容易造成生产日志噪音与隐私/环境信息泄露风险(错误信息可能包含路径、registry 细节等)。

Fix: 用项目统一 logger(若已有)替代,或至少 gated 到 ACP_PERF_LOG / debug flag。示例:

if (ACP_PERF_LOG) {
  console.warn(
    '[ACP] --prefer-offline failed, retrying with fresh registry lookup:',
    firstError instanceof Error ? firstError.message : String(firstError)
  );
}

或改为内部 logger:logger.warn(...)(按项目现有日志体系落地)。


2. CodeBuddy 重试路径未清理 isDetached 可能导致后续 disconnect 走错分支

File: src/agent/acp/AcpConnection.ts:321-343

// Phase 1: Try with --prefer-offline for fast startup
try {
  await this.spawnAndSetupCodebuddy(spawnCommand, cleanEnv, workingDir, isWindows, extraArgs, true);
} catch (firstError) {
  // ...
  this.child = null;
  this.isSetupComplete = false;
  this.isDetached = false;

  await this.spawnAndSetupCodebuddy(spawnCommand, cleanEnv, workingDir, isWindows, extraArgs, false);
}

Problem: 这里 catch 中把 this.child = null 直接覆盖,但没有确保“第一次 spawn 出来的进程树”被终止。更关键的是:spawnAndSetupCodebuddy 内部会设置 this.isDetached = !isWindowsunref();如果第一次 spawn 成功创建了 detached 进程但在 setup/initialize 阶段失败,catch 里把 this.child 置空后,外部再也无法通过 disconnect() 的逻辑去 kill 该进程组,可能遗留后台 CodeBuddy/npx 进程(资源泄漏、端口占用、重复实例)。

这在“缓存损坏导致 npx 失败/退出”场景下非常现实:进程可能已启动但很快退出/或卡在某个阶段,当前代码没有可靠的清理路径。

Fix: 在重试前显式清理第一次 child(如果存在且有 pid),而不是仅置空引用。建议抽一个“terminateChildIfAny()”并在两处重试共用:

private async terminateChildIfAny(): Promise<void> {
  if (!this.child) return;
  await this.disconnect(); // 复用现有跨平台 kill 逻辑
}

catch (firstError) {
  if (ACP_PERF_LOG) { /* log */ }
  await this.terminateChildIfAny();
  // disconnect 已经会 reset state;如需保留部分状态再单独设置
  await this.spawnAndSetupCodebuddy(..., false);
}

Claude 的重试同理(虽然 Claude 非 detached,但仍可能遗留子进程/占用 stdin/stdout 句柄)。


MEDIUM Issues

3. spawnAndSetup* 参数列表过长,且 isWindows 可在方法内部计算

File: src/agent/acp/AcpConnection.ts

当前签名:

private async spawnAndSetupClaude(
  spawnCommand: string,
  cleanEnv: Record<string, string | undefined>,
  workingDir: string,
  isWindows: boolean,
  preferOffline: boolean
): Promise<void>

Description: isWindows 在多个调用点重复计算并传参,增加了调用复杂度,也更容易在未来重构时传错参数顺序(尤其是多个 boolean 参数)。同样问题也存在于 spawnAndSetupCodebuddy(...)(还额外带 extraArgs)。

Recommendation: 在方法内部计算 const isWindows = process.platform === 'win32';,并把参数改为对象参数以避免 boolean 位置错误:

private async spawnAndSetupClaude(opts: {
  spawnCommand: string;
  cleanEnv: Record<string, string | undefined>;
  workingDir: string;
  preferOffline: boolean;
}): Promise<void> { /* ... */ }

Summary

Level Count
CRITICAL 0
HIGH 2
MEDIUM 1

🤖 This review was generated by AI and may contain inaccuracies. Please focus on issues you agree with and feel free to disregard any that seem incorrect. Thank you for your contribution!

The retry catch blocks previously set this.child = null without killing
the spawned process, which could leave orphaned npx/CLI processes if
the first attempt failed after spawn (e.g. initialize timeout).

Extract terminateChild() from disconnect() to deduplicate the
platform-aware kill logic (Windows taskkill tree kill, POSIX detached
process group kill, standard SIGTERM). Both disconnect() and retry
paths now share this single method.
@piorpua
Copy link
Contributor Author

piorpua commented Feb 22, 2026

Review Verification

对 Code Review 中的 3 个 issue 逐一验证。

Code Review Issues Verification

HIGH #1: 新增 console.warn 违反 debug hygiene — ❌ False Positive

文件中已有大量未被 ACP_PERF_LOG 门控的 console.warn / console.error

  • L159: console.warn('[ACP] Node.js version check skipped...')
  • L173: console.warn('[ACP] PATH corrected...')
  • L246: console.error('[ACP] Using NPX approach...')
  • L931: console.warn('[ACP] Failed to normalize cwd...')
  • L1031: console.warn('[ACP] taskkill /F failed...')

项目惯例:只有性能指标 (ACP-PERF) 使用 ACP_PERF_LOG 门控,操作性警告(异常路径、重要状态变更)直接使用 console.warn/error。新增的 console.warn 只在缓存失败触发重试时输出(升级后首次对话的罕见路径),符合现有模式。


HIGH #2: 重试路径未终止旧子进程 — ✅ Confirmed & Fixed

确认存在问题:catch 块中 this.child = null 仅清空引用而未 kill 进程。当 initialize() 超时(60s)但子进程仍存活时,会导致:

  1. 孤儿进程泄漏(资源占用)
  2. 旧子进程的 exit handler 引用 this.isSetupComplete 共享状态,可能干扰第二次连接

修复(0dedb18d):提取 terminateChild() 方法,覆盖所有平台(Windows taskkill tree kill / POSIX detached group kill / 标准 SIGTERM)。同时重构 disconnect() 复用该方法,消除重复代码。


MEDIUM #3: 参数列表过长 — ⚠️ Minor

有效观察,但属于代码风格优化,在 bug fix PR 中不宜引入额外的 API 重构。可作为后续改进。


Summary

# Severity Issue Verdict Action
1 HIGH console.warn debug hygiene ❌ False Positive 忽略
2 HIGH 重试路径未终止旧子进程 ✅ Confirmed 已修复 (0dedb18)
3 MEDIUM 参数列表过长 ⚠️ Minor 后续优化

@IceyLiu IceyLiu merged commit 4a923f3 into main Feb 22, 2026
8 checks passed
@piorpua piorpua deleted the zynx/fix/acp-npx-cache-retry branch February 22, 2026 09:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants