Skip to content

Conversation

@MasterPtato
Copy link
Contributor

No description provided.

@MasterPtato MasterPtato force-pushed the 01-08-fix_remove_pending_actors_metric_fix_actor_error_tracker_engine_runner_error_print branch from d0fc897 to 3f0e751 Compare January 9, 2026 00:00
Copy link
Contributor Author

MasterPtato commented Jan 9, 2026

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.
Learn more


How to use the Graphite Merge Queue

Add the label merge-queue to this PR to add it to the merge queue.

You must have a Graphite account in order to use the merge queue. Sign up using this link.

An organization admin has enabled the Graphite Merge Queue in this repository.

Please do not merge from GitHub as this will restart CI on PRs being processed by the merge queue.

This stack of pull requests is managed by Graphite. Learn more about stacking.

@MasterPtato MasterPtato force-pushed the 01-08-fix_gas_dont_include_error_iters_in_metric branch from ce9397d to a6a1a86 Compare January 9, 2026 00:00
@claude
Copy link

claude bot commented Jan 9, 2026

Code Review

This PR makes several small but important fixes. Overall the changes look good, but I have a few observations and suggestions:

✅ Positive Changes

  1. Actor Error Tracking Fix (actor/mod.rs:1061-1065) - The change to only set crash failure reason for StopCode::Error is correct. Previously, the code checked code != StopCode::Ok, which was redundant since StopCode only has two variants (Ok and Error). The new pattern match is clearer and more explicit about the intent.

  2. Removed Pending Actors Metric - Removing the ACTOR_PENDING_ALLOCATION metric tracking makes sense if it wasn't being used. The metric isn't referenced in any Grafana dashboards, so this cleanup is safe.

  3. TypeScript Error Logging (runner/src/mod.ts:876) - Using stringifyError() instead of direct string interpolation is good practice for consistent error formatting across the codebase.

  4. Added Instrumentation (pegboard-runner/src/conn.rs:257) - Adding #[tracing::instrument(skip_all)] to handle_init improves observability.

  5. Removed Noisy Log (runner2.rs:848-855) - Removing the drain_older_versions activity log makes sense if it was too verbose and not providing value.

🔍 Observations

  1. Metric Definition Still Exists - While the metric is no longer being set, the ACTOR_PENDING_ALLOCATION definition remains in engine/packages/pegboard/src/metrics.rs:4-9. Consider removing the metric definition entirely if it's no longer used anywhere.

  2. Test May Need Review - There's a test named actor_pending_allocation_no_runners in engine/packages/engine/tests/actors_lifecycle.rs:786. While the test itself doesn't appear to check the metric value, the test name suggests it was related to this metric functionality. Verify that the test still provides value without the metric.

  3. Duplicate Changes - The Grafana dashboard title change ("Rivet Guard" → "Guard") appears in 6 files. Per CLAUDE.md, these should be regenerated from the template. Consider verifying that the template at docker/template/ was updated and the configs were regenerated properly rather than manually editing each file.

🐛 Potential Issue

Error in runner2.rs Metric Name (metrics.rs:26-31) - This appears to be a pre-existing bug, but worth noting:

pub static ref SERVERLESS_OUTBOUND_REQ_TOTAL: IntCounterVec = register_int_counter_vec_with_registry!(
    "pegboard_runner_version_upgrade_drain_total",  // ❌ Wrong metric name
    "Count of serverless outbound requests made.",
    &["namespace_id", "runner_name"],
    *REGISTRY
).unwrap();

The metric name is duplicated from RUNNER_VERSION_UPGRADE_DRAIN_TOTAL. Should be pegboard_serverless_outbound_req_total.

📝 Suggestions

  1. Remove the unused ACTOR_PENDING_ALLOCATION metric definition from metrics.rs
  2. Consider renaming or updating the test actor_pending_allocation_no_runners to better reflect what it's testing (pending actor allocation behavior, not the metric)
  3. Fix the SERVERLESS_OUTBOUND_REQ_TOTAL metric name bug (separate PR if needed)

Summary

The changes are well-targeted and improve code clarity. The error tracking fix is particularly good as it makes the crash handling logic more explicit. Just consider the cleanup suggestions above to fully remove traces of the unused metric.

@claude
Copy link

claude bot commented Jan 9, 2026

Code Review - PR #3797

Summary

This PR makes three focused improvements:

  1. Removes the ACTOR_PENDING_ALLOCATION metric and its tracking logic
  2. Fixes the actor error tracking to only set crash failure reasons for actual errors
  3. Improves error message formatting in the TypeScript runner

Detailed Feedback

✅ Strengths

1. Metric Removal (runner.rs & runner2.rs)

  • Clean removal of the pending_actor_count tracking logic
  • Simplifies the allocation transaction by removing unnecessary state tracking
  • The metric is properly removed from all usage sites

2. Actor Error Tracking Fix (actor/mod.rs:1061-1065)

  • Good fix! The previous logic was checking for non-Ok codes but then failing with ensure!, which would cause crashes
  • The new pattern match specifically targets StopCode::Error, which is the correct condition for setting crash failure reasons
  • This aligns with the comment "Runner failure reasons are already set at the start of handle_stopped"

3. TypeScript Error Formatting (runner/src/mod.ts:876)

  • Using stringifyError ensures proper error serialization instead of potentially getting [object Object]
  • Good consistency with other error logging in the file

4. Instrumentation Addition (pegboard-runner/src/conn.rs:257)

  • Adding #[tracing::instrument(skip_all)] to handle_init improves observability

⚠️ Concerns & Suggestions

1. Orphaned Metric Definition

The ACTOR_PENDING_ALLOCATION metric is still defined in engine/packages/pegboard/src/metrics.rs:4-9 but is no longer used anywhere. This should be removed entirely.

// Remove this from metrics.rs
pub static ref ACTOR_PENDING_ALLOCATION: IntGaugeVec = register_int_gauge_vec_with_registry!(
    "pegboard_actor_pending_allocation",
    "Total actors waiting for availability.",
    &["namespace_id", "runner_name"],
    *REGISTRY
).unwrap();

2. Removed Log Statement Context

In runner2.rs:848-855, the drain_older_versions activity had its tracing info statement removed. While removing unnecessary logging can be good, consider whether this was providing useful debugging information. If this was just noisy logging, the removal is fine. If it was helpful for debugging draining operations, consider keeping it or documenting why it was removed.

3. Grafana Dashboard Changes

The title change from "Rivet Guard" to "Guard" is applied across multiple environments. This appears to be an unrelated change to the main PR purpose. Consider:

  • Is this intentional or accidental?
  • Should this be in a separate PR for clarity?
  • Based on CLAUDE.md, these files should be regenerated from docker/template/. Did you regenerate them properly?

4. Missing Test Coverage

The PR doesn't include tests for:

  • The fixed actor error tracking logic (verifying that only StopCode::Error triggers crash failure reasons)
  • The removal of the pending actor count metric (ensuring metrics are still correctly tracked)

Security & Performance

No security concerns identified.

Performance: The removal of pending actor count tracking should provide a minor performance improvement by reducing state management in the allocation loop.

Code Style & Best Practices

Minor issues:

  • The orphaned metric definition should be cleaned up
  • Consider whether the Grafana changes belong in this PR

Verdict

Approve with minor changes recommended:

  1. Remove the unused ACTOR_PENDING_ALLOCATION metric definition from metrics.rs
  2. Verify the Grafana dashboard changes are intentional and properly regenerated
  3. Consider adding tests for the actor error tracking fix

The core fixes are solid and improve the codebase. The actor error tracking fix in particular addresses what appears to be a real bug.

@pkg-pr-new
Copy link

pkg-pr-new bot commented Jan 9, 2026

More templates

@rivetkit/cloudflare-workers

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/cloudflare-workers@3797

@rivetkit/db

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/db@3797

@rivetkit/framework-base

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/framework-base@3797

@rivetkit/next-js

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/next-js@3797

@rivetkit/react

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/react@3797

rivetkit

pnpm add https://pkg.pr.new/rivet-dev/rivet/rivetkit@3797

@rivetkit/sql-loader

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/sql-loader@3797

@rivetkit/virtual-websocket

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/virtual-websocket@3797

@rivetkit/engine-runner

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/engine-runner@3797

@rivetkit/engine-runner-protocol

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/engine-runner-protocol@3797

commit: 22e1cee

@claude
Copy link

claude bot commented Jan 9, 2026

PR Review: fix: remove pending actors metric, fix actor error tracker, engine runner error print

Summary

This PR makes three improvements to the Pegboard actor system:

  1. Removes the ACTOR_PENDING_ALLOCATION metric tracking
  2. Fixes the actor crash failure reason tracking
  3. Improves error logging in the TypeScript runner SDK

Code Quality: ✅ Good

Strengths:

  • Clean removal of unnecessary metric tracking code
  • Consistent changes across both runner implementations (runner.rs and runner2.rs)
  • Proper use of pattern matching for the crash handling fix
  • Follows existing error handling patterns in the TypeScript SDK

Detailed Review

1. Metric Removal (runner.rs, runner2.rs)

Location: engine/packages/pegboard/src/workflows/{runner.rs,runner2.rs}:allocate_pending_actors

Changes:

  • Removed pending_actor_count tracking variable
  • Removed ACTOR_PENDING_ALLOCATION metric updates

Analysis:

  • The metric is still defined in metrics.rs but no longer used
  • Recommendation: Consider removing the metric definition from engine/packages/pegboard/src/metrics.rs:4-9 to clean up unused code

2. Actor Crash Handling Fix (actor/mod.rs:1058-1073)

Location: engine/packages/pegboard/src/workflows/actor/mod.rs:1061-1065

Before:

if let StoppedVariant::Normal { code, message } = &variant {
    ensure!(
        *code != protocol::mk2::StopCode::Ok,
        "expected non-Ok stop code in crash handler, got Ok"
    );
    // Set failure reason...
}

After:

if let StoppedVariant::Normal {
    code: protocol::mk2::StopCode::Error,
    message,
} = &variant {
    // Set failure reason...
}

Analysis: ✅ Excellent improvement

  • More precise: Now only sets the Crashed failure reason for actual errors (StopCode::Error)
  • Better semantics: The old code used a negative check (!= Ok) with an ensure! that would panic. The new code uses positive pattern matching
  • Safer: Removes the panic path - if the code is not Error, it simply doesn't set the failure reason
  • Correct behavior: According to the comment, "Runner failure reasons are already set at the start of handle_stopped", so this only needs to handle actor crashes (Error code), not runner failures

This is particularly important in the (CrashPolicy::Sleep, false) branch where we're handling crash scenarios.

3. Runner Error Logging (mod.ts:876)

Location: engine/sdks/typescript/runner/src/mod.ts:876

Change:

// Before
msg: `WebSocket error: ${ev.error}`
// After
msg: `WebSocket error: ${stringifyError(ev.error)}`

Analysis: ✅ Good consistency

  • Aligns with error logging patterns used throughout the file (lines 265, 345, 796, 859, 896, 959, 984, 1002, 1793)
  • The stringifyError helper properly formats Error objects with stack traces
  • Prevents [object Object] in logs when errors are complex objects

4. Additional Changes

Tracing Instrumentation (conn.rs:257):

#[tracing::instrument(skip_all)]
pub async fn handle_init(...)

✅ Good addition for observability

Removed Debug Log (runner2.rs:850-856):

  • Removed verbose drain_older_versions activity log
    ✅ Appropriate - the activity name already appears in traces

Grafana Dashboard Titles:

  • Changed "Rivet Guard" → "Guard" across all environments
    ✅ Minor cleanup, consistent naming

Potential Issues

⚠️ Incomplete Cleanup

Severity: Low

The ACTOR_PENDING_ALLOCATION metric definition remains in metrics.rs but is no longer used anywhere:

// engine/packages/pegboard/src/metrics.rs:4-9
pub static ref ACTOR_PENDING_ALLOCATION: IntGaugeVec = register_int_gauge_vec_with_registry!(
    "pegboard_actor_pending_allocation",
    "Total actors waiting for availability.",
    &["namespace_id", "runner_name"],
    *REGISTRY
).unwrap();

Recommendation: Remove this metric definition to avoid confusion and reduce registry overhead.

💡 Unrelated Metric Bug

Severity: Low

While reviewing, I noticed an existing bug in metrics.rs:26-31:

pub static ref SERVERLESS_OUTBOUND_REQ_TOTAL: IntCounterVec = register_int_counter_vec_with_registry!(
    "pegboard_runner_version_upgrade_drain_total",  // ← Wrong name!
    "Count of serverless outbound requests made.",
    &["namespace_id", "runner_name"],
    *REGISTRY
).unwrap();

The metric name duplicates RUNNER_VERSION_UPGRADE_DRAIN_TOTAL. This is a pre-existing issue, not introduced by this PR.

Performance Considerations

✅ No concerns

  • Removing the metric tracking reduces overhead in the hot allocation path
  • The crash handling change is equally performant (pattern matching vs ensure!)

Security Concerns

✅ No issues identified

Test Coverage

⚠️ Recommendation: Add test coverage for:

  1. Verify that StopCode::Ok does NOT set a Crashed failure reason
  2. Verify that StopCode::Error DOES set a Crashed failure reason with the correct message
  3. Ensure runner failures still get their failure reasons set correctly at the start of handle_stopped

Architecture Notes

The crash handling fix aligns well with the separation of concerns:

  • Runner failures: Set at the start of handle_stopped
  • Actor crashes: Set in the CrashPolicy::Sleep branch only for Error stop codes

This is a cleaner design than the previous approach.

Summary Recommendations

Required:

  • None - PR is ready to merge as-is

Nice to have:

  1. Remove unused ACTOR_PENDING_ALLOCATION metric from metrics.rs
  2. Add test coverage for the crash handling logic
  3. File separate issue for the SERVERLESS_OUTBOUND_REQ_TOTAL metric name bug

Verdict: ✅ APPROVE

This PR makes solid improvements to the actor crash handling logic and removes unnecessary metric tracking. The changes are well-structured and follow the codebase patterns. The actor error tracking fix is particularly good - it replaces a negative assertion that could panic with positive pattern matching.

@MasterPtato MasterPtato changed the base branch from 01-08-fix_gas_dont_include_error_iters_in_metric to graphite-base/3797 January 9, 2026 00:38
@MasterPtato MasterPtato force-pushed the 01-08-fix_remove_pending_actors_metric_fix_actor_error_tracker_engine_runner_error_print branch from 3f0e751 to 1ba3107 Compare January 9, 2026 00:38
@MasterPtato MasterPtato changed the base branch from graphite-base/3797 to main January 9, 2026 00:38
@MasterPtato MasterPtato changed the base branch from main to graphite-base/3797 January 10, 2026 02:00
@MasterPtato MasterPtato force-pushed the 01-08-fix_remove_pending_actors_metric_fix_actor_error_tracker_engine_runner_error_print branch from 1ba3107 to 22e1cee Compare January 10, 2026 02:00
@vercel
Copy link

vercel bot commented Jan 10, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Review Updated (UTC)
rivet-cloud Ready Ready Preview, Comment Jan 10, 2026 2:02am
rivetkit-serverless Error Error Jan 10, 2026 2:02am
2 Skipped Deployments
Project Deployment Review Updated (UTC)
rivet-inspector Ignored Ignored Preview Jan 10, 2026 2:02am
rivet-site Ignored Ignored Preview Jan 10, 2026 2:02am

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants