Skip to content

Conversation

@MasterPtato
Copy link
Contributor

@MasterPtato MasterPtato commented Jan 12, 2026

Fixes RVT-5330

Copy link
Contributor Author

MasterPtato commented Jan 12, 2026

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.
Learn more


How to use the Graphite Merge Queue

Add the label merge-queue to this PR to add it to the merge queue.

You must have a Graphite account in order to use the merge queue. Sign up using this link.

An organization admin has enabled the Graphite Merge Queue in this repository.

Please do not merge from GitHub as this will restart CI on PRs being processed by the merge queue.

This stack of pull requests is managed by Graphite. Learn more about stacking.

@linear
Copy link

linear bot commented Jan 12, 2026

@claude
Copy link

claude bot commented Jan 12, 2026

PR Review: fix(pb): have restart policy actors sleep after pending for too long

Summary

This PR addresses RVT-5330 by adding a timeout mechanism for Rivet Actors with CrashPolicy::Restart that are stuck in pending state. After 60 seconds of pending, these actors will transition to sleep state instead of waiting indefinitely for allocation.

Changes

  1. Added a new AllocationOverride::PendingTimeout variant with a required timeout parameter
  2. Introduced RESTART_PENDING_TIMEOUT_MS constant (60 seconds)
  3. Updated handle_stopped to use PendingTimeout override when rescheduling actors with CrashPolicy::Restart
  4. Fixed the match statement in spawn_actor to handle the new variant

✅ Positive Observations

  1. Good architecture: The new PendingTimeout variant is semantically clearer than overloading DontSleep for this use case.

  2. Correct timeout handling: The match expression in runtime.rs:766-777 properly handles the new variant by applying the timeout to the listen_with_timeout call.

  3. Proper state handling: The change correctly handles the SpawnActorOutput::Sleep case that can now be returned (line mod.rs:1069), fixing the previously incorrect comment.

  4. Reasonable timeout: 60 seconds is a sensible default for detecting allocation failures without being too aggressive.


🔍 Issues Found

🚨 Critical: Missing force_allocate logic for new variant

Location: runtime.rs:617-620 and runtime.rs:629-632

The force_allocate parameter is only set to true for AllocationOverride::DontSleep, but not for the new AllocationOverride::PendingTimeout variant:

force_allocate: matches\!(
    &allocation_override,
    AllocationOverride::DontSleep { .. }
),

Problem: When using PendingTimeout, actors might be immediately put to sleep by the allocation logic if they have CrashPolicy::Sleep, bypassing the intended pending timeout behavior.

Fix needed: Update the match to include the new variant:

force_allocate: matches\!(
    &allocation_override,
    AllocationOverride::DontSleep { .. } | AllocationOverride::PendingTimeout { .. }
),

This should be applied in both locations (V1 and V2 allocation calls).

⚠️ Minor: Inconsistent comment style

Location: mod.rs:17-18

The comment uses "should" which is slightly imprecise:

/// How long an actor with crash_policy Restart should wait pending before setting itself to sleep.

Suggestion: Make it more definitive per CLAUDE.md guidelines:

/// Maximum time an actor with crash_policy Restart will wait in pending state before transitioning to sleep.

⚠️ Minor: Magic number could be documented

Location: mod.rs:18

While 60 seconds is reasonable, there's no explanation of why this specific value was chosen.

Suggestion: Add a brief comment explaining the rationale:

/// Maximum time an actor with crash_policy Restart will wait in pending state before transitioning to sleep.
/// 60 seconds provides sufficient time for typical allocation while preventing indefinite pending.
const RESTART_PENDING_TIMEOUT_MS: i64 = util::duration::seconds(60);

📋 Additional Considerations

  1. Test coverage: No tests were added for this new behavior. Consider adding tests that verify:

    • Actors with CrashPolicy::Restart transition to sleep after 60s of pending
    • The timeout mechanism doesn't interfere with successful allocations
    • The SpawnActorOutput::Sleep return value is handled correctly
  2. Observability: The existing log at runtime.rs:878 ("timed out before actor allocated") will help track this behavior, which is good.

  3. Performance impact: Minimal - the change only affects the code path for crashed actors with restart policy.

  4. Security: No security implications identified.


🎯 Recommendations

Must fix before merge:

  • Fix the force_allocate logic to include AllocationOverride::PendingTimeout

Nice to have:

  • Update the comment to be more definitive
  • Add test coverage for the new timeout behavior
  • Document the timeout value rationale

Verdict

The PR correctly implements the core logic for timing out pending restart-policy actors, but has a critical bug in the force_allocate logic that could cause incorrect behavior. Once the force_allocate issue is fixed, this change should work as intended.


Review generated by Claude Code

@pkg-pr-new
Copy link

pkg-pr-new bot commented Jan 12, 2026

More templates

@rivetkit/cloudflare-workers

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/cloudflare-workers@3837

@rivetkit/db

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/db@3837

@rivetkit/framework-base

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/framework-base@3837

@rivetkit/next-js

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/next-js@3837

@rivetkit/react

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/react@3837

rivetkit

pnpm add https://pkg.pr.new/rivet-dev/rivet/rivetkit@3837

@rivetkit/sql-loader

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/sql-loader@3837

@rivetkit/virtual-websocket

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/virtual-websocket@3837

@rivetkit/engine-runner

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/engine-runner@3837

@rivetkit/engine-runner-protocol

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/engine-runner-protocol@3837

commit: e96badf

This was referenced Jan 12, 2026
@MasterPtato MasterPtato force-pushed the 01-12-fix_set_failure_reason_regardless_of_crash_policy branch from b613527 to b4daf93 Compare January 13, 2026 00:27
@MasterPtato MasterPtato force-pushed the 01-12-fix_pb_have_restart_policy_actors_sleep_after_pending_for_too_long branch 2 times, most recently from 363b7a5 to 3f06e1b Compare January 13, 2026 00:44
@MasterPtato MasterPtato force-pushed the 01-12-fix_set_failure_reason_regardless_of_crash_policy branch from b4daf93 to 0b9ca2f Compare January 13, 2026 00:44
@MasterPtato MasterPtato force-pushed the 01-12-fix_pb_have_restart_policy_actors_sleep_after_pending_for_too_long branch from 3f06e1b to 8cd75d7 Compare January 13, 2026 01:01
@MasterPtato MasterPtato force-pushed the 01-12-fix_set_failure_reason_regardless_of_crash_policy branch from 0b9ca2f to 1a08758 Compare January 13, 2026 01:02
@MasterPtato MasterPtato force-pushed the 01-12-fix_pb_have_restart_policy_actors_sleep_after_pending_for_too_long branch from 8cd75d7 to 75ce8e5 Compare January 13, 2026 01:29
@MasterPtato MasterPtato force-pushed the 01-12-fix_set_failure_reason_regardless_of_crash_policy branch from 1a08758 to 5b1c724 Compare January 13, 2026 01:29
@MasterPtato MasterPtato force-pushed the 01-12-fix_set_failure_reason_regardless_of_crash_policy branch from 5b1c724 to 8152462 Compare January 14, 2026 02:05
@MasterPtato MasterPtato force-pushed the 01-12-fix_pb_have_restart_policy_actors_sleep_after_pending_for_too_long branch 2 times, most recently from 50ba096 to b9a9c50 Compare January 14, 2026 03:06
@MasterPtato MasterPtato force-pushed the 01-12-fix_set_failure_reason_regardless_of_crash_policy branch from 8152462 to 82ed36d Compare January 14, 2026 03:06
@MasterPtato MasterPtato force-pushed the 01-12-fix_set_failure_reason_regardless_of_crash_policy branch from 82ed36d to f5a4f68 Compare January 14, 2026 19:47
@MasterPtato MasterPtato force-pushed the 01-12-fix_pb_have_restart_policy_actors_sleep_after_pending_for_too_long branch from b9a9c50 to 9a4ab5a Compare January 14, 2026 19:47
@MasterPtato MasterPtato force-pushed the 01-12-fix_set_failure_reason_regardless_of_crash_policy branch from f5a4f68 to 9e4ae6f Compare January 14, 2026 20:01
@MasterPtato MasterPtato force-pushed the 01-12-fix_pb_have_restart_policy_actors_sleep_after_pending_for_too_long branch from 9a4ab5a to 33e8961 Compare January 14, 2026 20:01
@MasterPtato MasterPtato force-pushed the 01-12-fix_pb_have_restart_policy_actors_sleep_after_pending_for_too_long branch from 33e8961 to 633544a Compare January 14, 2026 22:52
@MasterPtato MasterPtato force-pushed the 01-12-fix_set_failure_reason_regardless_of_crash_policy branch from 9e4ae6f to 595d7b8 Compare January 14, 2026 22:52
@MasterPtato MasterPtato force-pushed the 01-12-fix_pb_have_restart_policy_actors_sleep_after_pending_for_too_long branch from 633544a to e073155 Compare January 14, 2026 23:02
@MasterPtato MasterPtato force-pushed the 01-12-fix_set_failure_reason_regardless_of_crash_policy branch from 595d7b8 to 250667d Compare January 14, 2026 23:02
@MasterPtato MasterPtato force-pushed the 01-12-fix_set_failure_reason_regardless_of_crash_policy branch from 250667d to 980dd0a Compare January 14, 2026 23:07
@MasterPtato MasterPtato force-pushed the 01-12-fix_pb_have_restart_policy_actors_sleep_after_pending_for_too_long branch from e073155 to e96badf Compare January 14, 2026 23:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants