Collaborator

@nicktrn nicktrn commented Sep 26, 2025

This fixes batch resumes after a run stalls in PENDING_EXECUTING: the batchId is now carried over when the run is requeued. Existing runs that hit this bug and are already EXECUTING again will have to be replayed.

changeset-bot bot commented Sep 26, 2025

⚠️ No Changeset found

Latest commit: 6ca4c3c

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types

Contributor

coderabbitai bot commented Sep 26, 2025

Walkthrough

  • Introduces batchId propagation in run execution flows.
  • index.ts: Passes batchId to tryNackAndRequeue through completion input in stalled-snapshot handling for PENDING_EXECUTING.
  • runAttemptSystem.ts: Adds optional batchId to startRunAttempt and tryNackAndRequeue method signatures; forwards batchId when creating execution snapshots and during requeue operations.
  • No other logic, error handling, or exported/public API changes beyond the two method signatures.
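The signature change described above can be illustrated with a minimal sketch. The types and function bodies here are simplified stand-ins, not the actual run-engine code: only the names tryNackAndRequeue, createExecutionSnapshot, and batchId come from the PR, and everything else (SnapshotInput, the id counter, the sample run and batch ids) is assumed for illustration.

```typescript
// Hypothetical, simplified types; the real run-engine types are richer.
type SnapshotInput = {
  runId: string;
  executionStatus: "QUEUED" | "PENDING_EXECUTING" | "EXECUTING";
  batchId?: string;
};

type Snapshot = SnapshotInput & { id: string };

let nextId = 0;

// Stand-in for createExecutionSnapshot: it stores whatever it is given.
function createExecutionSnapshot(input: SnapshotInput): Snapshot {
  nextId += 1;
  return { id: `snapshot_${nextId}`, ...input };
}

// Stand-in for tryNackAndRequeue with the new optional batchId parameter.
function tryNackAndRequeue(params: { runId: string; batchId?: string }): Snapshot {
  return createExecutionSnapshot({
    runId: params.runId,
    executionStatus: "QUEUED",
    // Forwarded so the requeued snapshot keeps its batch association.
    batchId: params.batchId,
  });
}

const withBatch = tryNackAndRequeue({ runId: "run_1", batchId: "batch_1" });
const withoutBatch = tryNackAndRequeue({ runId: "run_2" });

console.log(withBatch.batchId, withoutBatch.batchId);
```

Because the parameter is optional, a call site that omits it still compiles and silently produces a snapshot with no batch context, which is the failure mode this PR addresses for the stalled PENDING_EXECUTING path.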

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Description Check | ⚠️ Warning | The pull request description does not follow the repository’s required template and omits key sections such as the issue reference, checklist, testing steps, changelog, and screenshots, making it incomplete and noncompliant. | Please update the description to use the provided template by adding “Closes #”, completing the checklist items, describing the testing steps you performed, summarizing the changelog entry, and including any relevant screenshots. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title Check | ✅ Passed | The title clearly and concisely summarizes the primary change by indicating that the fix carries over the batchId after a PENDING_EXECUTING stall, which directly reflects the main purpose of the pull request. |
| Docstring Coverage | ✅ Passed | No functions found in the changes. Docstring coverage check skipped. |

@matt-aitken matt-aitken merged commit 9aedda2 into main Sep 26, 2025
28 of 29 checks passed
@matt-aitken matt-aitken deleted the fix/batch-resume-stall branch September 26, 2025 13:14
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
internal-packages/run-engine/src/engine/systems/runAttemptSystem.ts (2)

1007-1027: Missing batchId when requeuing after retry

In the retry path that nacks and requeues, batchId isn’t forwarded, so the new snapshot can lose batch context. Pass latestSnapshot.batchId here too.

Apply:

                 const nackResult = await this.tryNackAndRequeue({
                   run,
                   environment: run.runtimeEnvironment,
                   orgId: run.runtimeEnvironment.organizationId,
                   projectId: run.runtimeEnvironment.project.id,
                   timestamp: retryAt.getTime(),
                   error: {
                     type: "INTERNAL_ERROR",
                     code: "TASK_RUN_DEQUEUED_MAX_RETRIES",
                     message: `We tried to dequeue the run the maximum number of times but it wouldn't start executing`,
                   },
+                  batchId: latestSnapshot.batchId ?? undefined,
                   tx: prisma,
                 });
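
The `?? undefined` in the suggestion above presumably bridges a nullable stored field and an optional parameter. A minimal sketch of that conversion, assuming the snapshot's batchId is typed `string | null` (as nullable database columns commonly are) and the new parameter is `batchId?: string`; StoredSnapshot, RequeueParams, and toRequeueParams are hypothetical names used only for illustration:

```typescript
// Assumed shape: a nullable column surfaces as `string | null`.
type StoredSnapshot = { id: string; batchId: string | null };

// Assumed shape of the optional parameter added in this PR.
type RequeueParams = { runId: string; batchId?: string };

// Hypothetical helper: `null` is not assignable to `batchId?: string`,
// so it is normalized to `undefined` with the nullish coalescing operator.
function toRequeueParams(runId: string, latestSnapshot: StoredSnapshot): RequeueParams {
  return { runId, batchId: latestSnapshot.batchId ?? undefined };
}

const standalone = toRequeueParams("run_1", { id: "snapshot_1", batchId: null });
const batched = toRequeueParams("run_2", { id: "snapshot_2", batchId: "batch_1" });

console.log(standalone.batchId, batched.batchId);
```

Without the `?? undefined`, passing `latestSnapshot.batchId` directly would be a type error under these assumed types, since `null` and `undefined` are distinct in TypeScript.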

1133-1164: Thread batchId through all tryNackAndRequeue call sites

  • internal-packages/run-engine/src/engine/systems/runAttemptSystem.ts:1015 – add batchId: run.batchId
  • internal-packages/run-engine/src/engine/systems/dequeueSystem.ts:634 – add batchId: run.batchId
  • internal-packages/run-engine/src/engine/index.ts:1447 – add batchId: run.batchId
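
One way to keep call sites like these from being missed in future is sketched below. This is not what the PR does (the PR adds an optional parameter); it is a hypothetical alternative in which the batchId key is required but may be undefined, so the compiler flags any caller that omits it. The RequeueParams type and the simplified function body are assumptions for illustration.

```typescript
// Hypothetical parameter type: the key must be present, but may be undefined.
type RequeueParams = {
  runId: string;
  batchId: string | undefined;
};

// Simplified stand-in for tryNackAndRequeue; it just echoes its inputs.
function tryNackAndRequeue(params: RequeueParams): RequeueParams {
  return { runId: params.runId, batchId: params.batchId };
}

// Compiles: the caller states explicitly that there is no batch.
const standalone = tryNackAndRequeue({ runId: "run_1", batchId: undefined });

// Compiles: batch context is forwarded.
const batched = tryNackAndRequeue({ runId: "run_2", batchId: "batch_1" });

// @ts-expect-error batchId is missing, so the compiler rejects this call site.
const forgotten = tryNackAndRequeue({ runId: "run_3" });

console.log(standalone.batchId, batched.batchId, forgotten.batchId);
```

The trade-off is that every existing caller has to be updated at once, which is a larger change than the optional parameter used here.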
🧹 Nitpick comments (1)
internal-packages/run-engine/src/engine/systems/runAttemptSystem.ts (1)

1039-1056: Consider propagating batchId on immediate (short‑delay) retry snapshots

For consistency, include batchId when creating a new EXECUTING snapshot without requeue, so downstream logic relying on batch context remains intact.

Apply:

               const newSnapshot = await this.executionSnapshotSystem.createExecutionSnapshot(
                 prisma,
                 {
                   run,
                   snapshot: {
                     executionStatus: "EXECUTING",
                     description: "Attempt failed with a short delay, starting a new attempt",
                   },
                   previousSnapshotId: latestSnapshot.id,
                   environmentId: latestSnapshot.environmentId,
                   environmentType: latestSnapshot.environmentType,
                   projectId: latestSnapshot.projectId,
                   organizationId: latestSnapshot.organizationId,
+                  batchId: latestSnapshot.batchId ?? undefined,
                   workerId,
                   runnerId,
                 }
               );
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 743b8db and 6ca4c3c.

📒 Files selected for processing (2)
  • internal-packages/run-engine/src/engine/index.ts (1 hunks)
  • internal-packages/run-engine/src/engine/systems/runAttemptSystem.ts (3 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.{ts,tsx}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.{ts,tsx}: Always prefer using isomorphic code like fetch, ReadableStream, etc. instead of Node.js specific code
For TypeScript, we usually use types over interfaces
Avoid enums
No default exports, use function declarations

Files:

  • internal-packages/run-engine/src/engine/index.ts
  • internal-packages/run-engine/src/engine/systems/runAttemptSystem.ts
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (23)
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (7, 8)
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (1, 8)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (4, 8)
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (4, 8)
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (3, 8)
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (5, 8)
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (8, 8)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (2, 8)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (7, 8)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (8, 8)
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (2, 8)
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (6, 8)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (3, 8)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (5, 8)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (6, 8)
  • GitHub Check: e2e / 🧪 CLI v3 tests (windows-latest - pnpm)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (1, 8)
  • GitHub Check: e2e / 🧪 CLI v3 tests (windows-latest - npm)
  • GitHub Check: units / packages / 🧪 Unit Tests: Packages (1, 1)
  • GitHub Check: e2e / 🧪 CLI v3 tests (ubuntu-latest - pnpm)
  • GitHub Check: e2e / 🧪 CLI v3 tests (ubuntu-latest - npm)
  • GitHub Check: typecheck / typecheck
  • GitHub Check: Analyze (javascript-typescript)
🔇 Additional comments (2)
internal-packages/run-engine/src/engine/systems/runAttemptSystem.ts (1)

1198-1213: Good: snapshot includes batchId on requeue

Passing batchId into createExecutionSnapshot for the QUEUED snapshot preserves batch context after nack/requeue.

internal-packages/run-engine/src/engine/index.ts (1)

1453-1464: Good: pass batchId when requeuing stalled PENDING_EXECUTING

This ensures the requeued QUEUED snapshot retains the batch association.

2 participants