
fix(workflow-engine): replay completed void steps on restart #4186

Draft
NathanFlurry wants to merge 1 commit into main from 02-12-fix-workflow-void-step-replay

Conversation

@NathanFlurry
Member

Description

Please include a summary of the changes and the related issue. Please also include relevant motivation and context.

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

How Has This Been Tested?

Please describe the tests that you ran to verify your changes.

Checklist:

  • My code follows the style guidelines of this project
  • I have performed a self-review of my code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

@railway-app

railway-app bot commented Feb 12, 2026

🚅 Deployed to the rivet-pr-4186 environment in rivet-frontend

| Service | Status | Web | Updated (UTC) |
|---|---|---|---|
| website | 😴 Sleeping (View Logs) | Web | Feb 13, 2026 at 9:49 pm |
| frontend-inspector | ❌ Build Failed (View Logs) | Web | Feb 12, 2026 at 10:51 am |
| frontend-cloud | ❌ Build Failed (View Logs) | Web | Feb 12, 2026 at 10:51 am |
| mcp-hub | ✅ Success (View Logs) | Web | Feb 12, 2026 at 10:51 am |
| ladle | ❌ Build Failed (View Logs) | Web | Feb 12, 2026 at 10:50 am |

Member Author


How to use the Graphite Merge Queue

Add the label merge-queue to this PR to add it to the merge queue.

You must have a Graphite account in order to use the merge queue. Sign up using this link.

An organization admin has enabled the Graphite Merge Queue in this repository.

Please do not merge from GitHub as this will restart CI on PRs being processed by the merge queue.

This stack of pull requests is managed by Graphite. Learn more about stacking.

@claude

claude bot commented Feb 12, 2026

Code Review

Summary

This PR fixes a critical bug in the workflow engine where steps that return undefined (void steps) were being incorrectly re-executed on workflow restart. The fix properly checks the step's metadata status instead of relying solely on the presence of output data.

Positive Aspects ✅

  1. Root Cause Analysis: The fix correctly identifies that JSON serialization omits undefined values, making stepData.output !== undefined an unreliable completion check for void steps.

  2. Clear Documentation: The added comment at context.ts:385-388 provides excellent context explaining why this check is necessary and the technical rationale.

  3. Proper Metadata Usage: Using metadata.status === "completed" is the correct approach, as metadata is the source of truth for step execution state.

  4. Test Coverage: The new test case thoroughly validates the fix by:

    • Creating a void step
    • Running it twice
    • Verifying the step only executes once (callCount remains 1)
    • Testing with both workflow modes

  5. Consistent Logging: The debug log at context.ts:396-400 follows the existing logging pattern and will aid debugging.
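
The serialization behavior described in item 1 can be reproduced in isolation. The following is a minimal sketch, assuming step entries are round-tripped through JSON; `StepEntry` and `roundTrip` are hypothetical names for illustration and are not taken from the workflow engine.

```typescript
// Hypothetical shape of a persisted step entry; not the engine's real type.
interface StepEntry {
  name: string;
  output?: unknown;
}

// Round-trip an entry through JSON, as a storage layer that serializes to JSON might.
function roundTrip(entry: StepEntry): StepEntry {
  return JSON.parse(JSON.stringify(entry)) as StepEntry;
}

const voidStep = roundTrip({ name: "send-email", output: undefined });
const valueStep = roundTrip({ name: "compute", output: 42 });

// JSON.stringify drops the "output" key entirely when its value is undefined.
console.log("output" in voidStep); // false
// So an output-based completion check cannot tell "completed void step"
// apart from "step that never produced a result".
console.log(voidStep.output !== undefined); // false
console.log(valueStep.output !== undefined); // true
```

After the round trip, a completed void step is indistinguishable from an unfinished one when only `output` is inspected, which is why a separate status field is needed.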

Potential Concerns 🔍

1. Performance Consideration (Minor)

Location: context.ts:389-393

The fix loads metadata for every replayed step whose stepData.output is undefined. Steps with defined outputs still return through the fast path and do not incur an extra metadata lookup.

Current flow:

// Line 378: Fast path for steps with defined output
if (stepData.output !== undefined) {
    return stepData.output as T;
}

// Line 389: Loads metadata for ALL remaining steps
const metadata = await loadMetadata(...);

// Line 395: Checks completed status for void steps
if (metadata.status === "completed") {
    return stepData.output as T;
}

Consideration: The metadata load (await loadMetadata()) happens for every step that doesn't have a defined output, which could include:

  • Failed steps (needing retry checks)
  • Void steps (the bug being fixed)

Question: Is there a concern about the performance impact of loading metadata for void steps? The loadMetadata function does cache results in storage.entryMetadata, so subsequent calls for the same step should be fast. However, on first access after restart, this adds a driver read operation for each void step.
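
The caching behavior mentioned in the question can be modeled roughly as follows. This is a sketch under stated assumptions, not the engine's code: `Driver`, `createMetadataLoader`, and the default `"pending"` status are all hypothetical, and the in-memory `Map` stands in for `storage.entryMetadata`.

```typescript
// All names below are hypothetical stand-ins, not the engine's real API.
interface StepMetadata {
  status: "pending" | "running" | "completed" | "failed";
}

interface Driver {
  read(key: string): Promise<StepMetadata | undefined>;
}

// Returns a loader that reads from the driver once per step and then
// serves later lookups from an in-memory cache.
function createMetadataLoader(driver: Driver) {
  const cache = new Map<string, StepMetadata>();
  return async function loadMetadata(stepId: string): Promise<StepMetadata> {
    const cached = cache.get(stepId);
    if (cached !== undefined) return cached;
    const stored = await driver.read(stepId);
    // Assumption: missing metadata is replaced by a fresh non-completed record.
    const metadata: StepMetadata = stored ?? { status: "pending" };
    cache.set(stepId, metadata);
    return metadata;
  };
}

// Loads the same step's metadata twice and reports how many driver reads occurred.
async function demo(): Promise<number> {
  let driverReads = 0;
  const persisted = new Map<string, StepMetadata>([
    ["send-email", { status: "completed" }],
  ]);
  const driver: Driver = {
    async read(key) {
      driverReads++;
      return persisted.get(key);
    },
  };
  const loadMetadata = createMetadataLoader(driver);
  await loadMetadata("send-email"); // first access after restart: driver read
  await loadMetadata("send-email"); // second access: served from cache
  return driverReads;
}
```

In this model the cost is one driver read per void step on the first replay after a restart, and nothing afterwards.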

Potential optimization (if needed):

// Fast path: output exists AND not explicitly undefined in serialization
if (stepData.output !== undefined) {
    return stepData.output as T;
}

// Load metadata once for both void step check and retry logic
const metadata = await loadMetadata(...);

// Check if completed (handles void steps)
if (metadata.status === "completed") {
    return stepData.output as T;
}

// Continue with retry logic...

This is already what the PR does, so the implementation is optimal given the constraint. ✅
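
To make the ordering concrete, here is a simplified, synchronous model of the replay decision. It is an illustrative sketch only: `decideReplay` and `ReplayDecision` are hypothetical names, and the real `loadMetadata` is awaited and takes arguments that are elided in this review.

```typescript
// Hypothetical types for illustration; not the engine's real definitions.
interface StepData {
  output?: unknown;
}

interface StepMetadata {
  status: "pending" | "running" | "completed" | "failed";
}

type ReplayDecision =
  | { action: "return-output"; output: unknown; metadataLoaded: boolean }
  | { action: "execute"; metadataLoaded: boolean };

// Decide whether a replayed step returns its stored output or runs again.
function decideReplay(
  stepData: StepData,
  loadMetadata: () => StepMetadata,
): ReplayDecision {
  // Fast path: a defined output proves completion without touching metadata.
  if (stepData.output !== undefined) {
    return { action: "return-output", output: stepData.output, metadataLoaded: false };
  }
  // Slow path: the output is undefined, so the status field decides.
  const metadata = loadMetadata();
  if (metadata.status === "completed") {
    return { action: "return-output", output: undefined, metadataLoaded: true };
  }
  // Anything else falls through to execution (retry handling omitted here).
  return { action: "execute", metadataLoaded: true };
}

const completed = (): StepMetadata => ({ status: "completed" });
const failed = (): StepMetadata => ({ status: "failed" });

console.log(decideReplay({ output: 42 }, completed).action); // "return-output"
console.log(decideReplay({}, completed).action); // "return-output"
console.log(decideReplay({}, failed).action); // "execute"
```

Only steps with an undefined output reach the metadata lookup in this model.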

2. Edge Case: What if metadata is missing?

Location: context.ts:389-402

The loadMetadata function creates new metadata if none exists (based on the partial code I saw). If metadata creation returns a non-"completed" status for a step that actually completed, the void step would re-execute.

Question: Are there any scenarios where:

  • A step completes successfully
  • Its history entry is persisted with output: undefined
  • But its metadata is missing or has status !== "completed"?

This seems unlikely in normal operation, but could occur if:

  • There's a partial write failure (history saved but metadata not)
  • Metadata is manually deleted
  • There's a bug in metadata persistence

Recommendation: Consider if this scenario needs defensive handling or is considered an invalid/corrupted state that should fail loudly.
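
If the missing-metadata case is treated as corrupted state, one way to fail loudly is a guard like the one below. This is a hypothetical sketch: `requireMetadataForReplay` does not exist in the codebase, and it assumes the caller can distinguish "no metadata stored" from "metadata stored with a non-completed status", which the current `loadMetadata` may not expose.

```typescript
// Hypothetical type for illustration; not the engine's real definition.
interface StepMetadata {
  status: "pending" | "running" | "completed" | "failed";
}

// Hypothetical guard: a step with a persisted history entry but no stored
// metadata is treated as corrupted state rather than silently re-executed.
function requireMetadataForReplay(
  stepName: string,
  hasHistoryEntry: boolean,
  metadata: StepMetadata | undefined,
): StepMetadata | undefined {
  if (hasHistoryEntry && metadata === undefined) {
    throw new Error(
      `Corrupted workflow state: step "${stepName}" has a history entry but no metadata`,
    );
  }
  return metadata;
}

try {
  requireMetadataForReplay("send-email", true, undefined);
} catch (error) {
  console.log((error as Error).message);
}
```

The alternative is to keep the current behavior and re-execute, which is only safe when void steps are idempotent.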

3. Test Coverage Gap (Minor)

The new test validates restart behavior, but doesn't explicitly test:

  • A void step that fails and retries
  • A void step that exhausts retries
  • Mixed workflows with both void and non-void steps

Recommendation: Consider adding tests for:

it("should retry failed void steps", async () => {
  let attempts = 0;
  const workflow = async (ctx: WorkflowContextInterface) => {
    await ctx.step("failing-void", async () => {
      attempts++;
      if (attempts < 2) throw new Error("fail");
      // return void on success
    });
  };
  // ... test that it retries and eventually succeeds
});

Code Style ✅

  • Comment style follows CLAUDE.md guidelines (complete sentences)
  • Logging follows structured logging patterns
  • Test naming is clear and descriptive

Security & Correctness ✅

  • No security concerns identified
  • The fix is logically sound and addresses the root cause
  • No risk of breaking existing functionality (non-void steps continue to use fast path)

Recommendation

Approve with minor suggestions

The fix is correct and well-implemented. The performance impact is minimal due to metadata caching. Consider:

  1. Adding test coverage for void step retry scenarios (low priority)
  2. Documenting expected behavior when metadata is missing (low priority)

The PR is ready to merge as-is. The bug fix is important and the implementation is sound.


Review generated by Claude Code - Learn more
