
fix(workflow): skip un-resumed items during nested batch/loop interrupt-resume#2644

Merged
shentongmartin merged 4 commits intomainfrom
fix/batch_sub_exe_id
Mar 25, 2026


@shentongmartin shentongmartin commented Mar 25, 2026

Problem

When a batch/loop node contains a sub-workflow with an interruptible node, resuming one interrupted item causes the other (un-resumed) items to re-execute unnecessarily. The resume itself succeeds — the functional result is correct — but the re-execution of un-resumed items regenerates their sub-execute-IDs, breaking execution history.

Example: in batch[items: a, b] → sub_workflow → QA_node, both items interrupt. The user resumes item a. Item b should be skipped because it has not been resumed yet, but instead it re-runs and re-interrupts, generating a new sub-execute-ID. The execution history of item b's original run is now orphaned.

Root Cause

There are two option-passing paths for resume:

  1. Direct path (toResumeIndexes): Used when the interrupting node is a direct child of the batch/loop
  2. Nested path (optsForIndexed): Used when the interrupting node is inside a sub-workflow within the batch/loop

The batch/loop skip logic only checked path (1). When resume came through path (2), un-resumed items weren't recognized as "should skip" — they re-executed, re-interrupted, and got new sub-execute-IDs.
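
To make the gap concrete, here is a minimal sketch of the pre-fix check. It assumes a stripped-down `resumeOptions` struct; the field names `toResumeIndexes` and `optsForIndexed` come from the description above, while the struct itself, the `map[int]bool` types, and `skipsBeforeFix` are hypothetical and not taken from the repository.

```go
package main

import "fmt"

// resumeOptions is a hypothetical, stripped-down model of the two option
// containers. The real NodeOptions type is not shown in this PR.
type resumeOptions struct {
	toResumeIndexes map[int]bool // direct path: item index -> resume requested
	optsForIndexed  map[int]bool // nested path: item index -> has inner options
}

// skipsBeforeFix models the old behaviour for an item that was interrupted
// earlier: only the direct path is consulted when deciding whether to skip.
func skipsBeforeFix(i int, o resumeOptions) bool {
	if len(o.toResumeIndexes) > 0 {
		return !o.toResumeIndexes[i]
	}
	// The nested path is ignored, so the item is not skipped and runs again.
	return false
}

func main() {
	// In both cases the user resumes item a (index 0) only.
	direct := resumeOptions{toResumeIndexes: map[int]bool{0: true}}
	nested := resumeOptions{optsForIndexed: map[int]bool{0: true}}

	fmt.Println("direct path, item b skipped:", skipsBeforeFix(1, direct))
	fmt.Println("nested path, item b skipped:", skipsBeforeFix(1, nested))
}
```

In this model item b (index 1) is skipped on the direct path but not on the nested path, which matches the re-execution described above.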

Solution

Add HasIndexedOpts() and HasOptsForIndex(i) methods to NodeOptions, and wire them into the batch/loop skip logic as a second check alongside the existing GetResumeIndexes() check.

// batch.go / loop.go — inside the per-item loop
if existingCState.Index2InterruptInfo[i] != nil {
    if len(options.GetResumeIndexes()) > 0 {
        if _, ok := options.GetResumeIndexes()[i]; !ok {
            return // skip: direct path, not this item
        }
    } else if options.HasIndexedOpts() {
        if !options.HasOptsForIndex(i) {
            return // skip: nested path, not this item
        }
    }
}
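
The snippet above can be exercised in isolation with a small stand-in for NodeOptions. This is a sketch under stated assumptions: only the method names `GetResumeIndexes`, `HasIndexedOpts`, and `HasOptsForIndex` come from this PR; the `nodeOptions` struct, its field types, and the `shouldSkip` helper are hypothetical.

```go
package main

import "fmt"

// nodeOptions is a simplified stand-in for NodeOptions. The field types are
// assumptions; only the method names below come from the PR description.
type nodeOptions struct {
	toResumeIndexes map[int]struct{} // direct path
	optsForIndexed  map[int][]string // nested path: opaque inner options per item
}

func (o *nodeOptions) GetResumeIndexes() map[int]struct{} {
	return o.toResumeIndexes
}

// HasIndexedOpts reports whether any item received nested resume options.
func (o *nodeOptions) HasIndexedOpts() bool {
	return len(o.optsForIndexed) > 0
}

// HasOptsForIndex reports whether item i received nested resume options.
func (o *nodeOptions) HasOptsForIndex(i int) bool {
	_, ok := o.optsForIndexed[i]
	return ok
}

// shouldSkip (hypothetical helper) mirrors the check above: an item that was
// interrupted earlier is skipped unless the current resume targets it.
func shouldSkip(interrupted bool, i int, o *nodeOptions) bool {
	if !interrupted {
		return false
	}
	if len(o.GetResumeIndexes()) > 0 {
		_, ok := o.GetResumeIndexes()[i]
		return !ok // direct path
	}
	if o.HasIndexedOpts() {
		return !o.HasOptsForIndex(i) // nested path
	}
	return false
}

func main() {
	// Items a (index 0) and b (index 1) both interrupted inside a sub-workflow.
	// Only item a is resumed, so the options arrive through the nested path.
	opts := &nodeOptions{optsForIndexed: map[int][]string{0: {"resume a"}}}
	for i, name := range []string{"a", "b"} {
		fmt.Printf("item %s skipped: %v\n", name, shouldSkip(true, i, opts))
	}
}
```

With these inputs the sketch reports item a as not skipped and item b as skipped, which is the behaviour the fix is after.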

Key Insight

The resume option chain is built inside-out by walking the interrupt event's NodePath. Each nesting layer peels off one wrapping at runtime:

| Layer type | Wrapping | Unwrapping |
|---|---|---|
| Sub-workflow | WithOptsForNested(inner) | GetOptsForNested() → pass to inner Runner |
| Composite (innermost) | WithResumeIndex(i, modifier) | GetResumeIndexes()[i] → apply modifier |
| Composite (intermediate) | WithOptsForIndexed(i, inner) | GetOptsForIndexed(i) → pass to inner invoke |

The fix specifically addresses the "intermediate composite" case — the layer that passes resume options through to a deeper sub-workflow. This works at arbitrary nesting depth.
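
The inside-out construction can be illustrated with a toy builder that renders the chain as text. This is only a sketch: the `layer` type and `buildResumeChain` are hypothetical, the real NodePath type is not shown here, and the rule used to tell an innermost composite from an intermediate one (whether anything deeper has already been wrapped) is an assumption inferred from the table above.

```go
package main

import "fmt"

// layer is one hop between the root workflow and the interrupted node.
// Hypothetical model; the interrupted node itself is not part of the path.
type layer struct {
	composite bool // true for batch/loop, false for sub-workflow
	index     int  // item index; only meaningful for composite layers
}

// buildResumeChain walks the path from the innermost layer outwards and
// returns a textual rendering of the wrapped option chain.
func buildResumeChain(path []layer) string {
	chain := "modifier" // stands in for the modifier that carries resume data
	wrapped := false    // whether a deeper layer has already wrapped the chain
	for i := len(path) - 1; i >= 0; i-- {
		l := path[i]
		switch {
		case !l.composite:
			chain = fmt.Sprintf("WithOptsForNested(%s)", chain)
		case !wrapped:
			// Innermost composite: the interrupted node is its direct child.
			chain = fmt.Sprintf("WithResumeIndex(%d, %s)", l.index, chain)
		default:
			// Intermediate composite: pass the inner chain through by index.
			chain = fmt.Sprintf("WithOptsForIndexed(%d, %s)", l.index, chain)
		}
		wrapped = true
	}
	return chain
}

func main() {
	// batch[0] -> sub_workflow -> QA_node
	fmt.Println(buildResumeChain([]layer{{composite: true, index: 0}, {composite: false}}))
	// sub_workflow -> loop[2] -> QA_node
	fmt.Println(buildResumeChain([]layer{{composite: false}, {composite: true, index: 2}}))
}
```

For the first path the batch is an intermediate composite, so the sketch wraps with WithOptsForIndexed; for the second the loop is the innermost composite, so it wraps with WithResumeIndex.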

Summary

| Problem | Solution |
|---|---|
| Un-resumed batch/loop items re-execute on nested resume, replacing sub-execute-IDs | Add HasIndexedOpts/HasOptsForIndex checks to skip un-resumed items in the nested path |


…e nodes

When a sub-workflow inside a batch/loop node interrupts, the resume
mechanism uses optsForIndexed (not toResumeIndexes). Add HasIndexedOpts
and HasOptsForIndex checks so batch/loop skip un-resumed items correctly.

Also fix tests to use realistic node configurations:
- Lambda nodes now pass WithLambdaType for proper NodeType propagation
- Test configs implement RequireCheckpoint for proper checkpoint enablement
- Loop test mock uses single-entry interrupt state (matching real behavior)
@shentongmartin shentongmartin changed the title from "fix(workflow): handle nested interrupt-resume for batch/loop composite nodes" to "fix(workflow): skip un-resumed items during nested batch/loop interrupt-resume" on Mar 25, 2026
@codecov-commenter

Codecov Report

❌ Patch coverage is 50.00000% with 9 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...ackend/domain/workflow/internal/execute/context.go | 0.00% | 3 Missing ⚠️ |
| ...ackend/domain/workflow/internal/nodes/loop/loop.go | 25.00% | 1 Missing and 2 partials ⚠️ |
| backend/domain/workflow/internal/nodes/option.go | 71.42% | 1 Missing and 1 partial ⚠️ |
| ...kend/domain/workflow/internal/nodes/batch/batch.go | 75.00% | 0 Missing and 1 partial ⚠️ |

| Files with missing lines | Coverage Δ |
|---|---|
| ...kend/domain/workflow/internal/nodes/batch/batch.go | 65.55% <75.00%> (+13.12%) ⬆️ |
| backend/domain/workflow/internal/nodes/option.go | 70.76% <71.42%> (+24.21%) ⬆️ |
| ...ackend/domain/workflow/internal/execute/context.go | 66.51% <0.00%> (+15.81%) ⬆️ |
| ...ackend/domain/workflow/internal/nodes/loop/loop.go | 70.71% <25.00%> (+9.70%) ⬆️ |

... and 13 files with indirect coverage changes


@shentongmartin shentongmartin added this pull request to the merge queue Mar 25, 2026
Merged via the queue into main with commit 4f786d6 Mar 25, 2026
9 checks passed
@shentongmartin shentongmartin deleted the fix/batch_sub_exe_id branch March 25, 2026 07:59
3 participants