
Conversation

@MasterPtato
Contributor

No description provided.

@vercel

vercel bot commented Nov 18, 2025

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Preview Comments Updated (UTC)
rivetkit-serverless Error Error Nov 22, 2025 2:45am
3 Skipped Deployments
Project Deployment Preview Comments Updated (UTC)
rivet-cloud Ignored Ignored Nov 22, 2025 2:45am
rivet-inspector Ignored Ignored Preview Nov 22, 2025 2:45am
rivet-site Ignored Ignored Preview Nov 22, 2025 2:45am

Contributor Author

MasterPtato commented Nov 18, 2025

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.
Learn more


How to use the Graphite Merge Queue

Add the label merge-queue to this PR to add it to the merge queue.

You must have a Graphite account in order to use the merge queue. Sign up using this link.

An organization admin has enabled the Graphite Merge Queue in this repository.

Please do not merge from GitHub as this will restart CI on PRs being processed by the merge queue.

This stack of pull requests is managed by Graphite. Learn more about stacking.

@pkg-pr-new

pkg-pr-new bot commented Nov 18, 2025

More templates

@rivetkit/cloudflare-workers

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/cloudflare-workers@3483

@rivetkit/db

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/db@3483

@rivetkit/framework-base

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/framework-base@3483

@rivetkit/next-js

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/next-js@3483

@rivetkit/react

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/react@3483

rivetkit

pnpm add https://pkg.pr.new/rivet-dev/rivet/rivetkit@3483

@rivetkit/sql-loader

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/sql-loader@3483

@rivetkit/engine-runner

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/engine-runner@3483

@rivetkit/engine-runner-protocol

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/engine-runner-protocol@3483

commit: fa56065

@claude

claude bot commented Nov 18, 2025

Pull Request Review

Overview

This PR simplifies the runner workflow by removing its command/event handling logic and moving event forwarding directly to actor workflows. This is a significant refactoring that removes roughly 500 lines of code by eliminating intermediate state management.

Key Changes

  1. Event Forwarding Architecture Change

    • Before: Events were sent to the runner workflow, stored in state, then forwarded to actor workflows
    • After: Events are forwarded directly from ws_to_tunnel_task.rs to actor workflows using the actor_id tag
    • This eliminates the need for the runner workflow to track last_event_idx, last_event_ack_idx, and event storage
  2. Removed State Management (runner2.rs:29-40)

    • Removed last_event_idx, last_command_idx, and commands vector from State
    • Simplified LifecycleState by removing event tracking fields
    • This reduces memory overhead and workflow complexity
  3. Removed Activities

    • InsertDb: Database initialization logic removed (~160 lines)
    • ProcessInit: Init packet processing removed (~80 lines)
    • InsertEvents: Event storage removed (~25 lines)
    • InsertCommands: Command storage and batching removed (~35 lines)
  4. Signal Changes

    • Removed Command signal entirely
    • Kept Forward, CheckQueue, and Stop signals
    • Forward now only handles ToServerStopping and rejects other message types

Code Quality & Best Practices

Good:

  • Follows error handling conventions with anyhow and custom RivetError types
  • Proper use of structured logging (tracing::warn!(?actor_id, ...))
  • Consistent naming conventions (snake_case, past tense for timestamps)
  • Good use of workspace dependencies
  • Clear separation of concerns by moving event handling closer to where it's needed

⚠️ Concerns:

  1. Missing Actor ID in Event Forwarding (ws_to_tunnel_task.rs:341-356)

    protocol::ToServer::ToServerEvents(events) => {
        let res = ctx.signal(pegboard::workflows::runner2::Forward {
            inner: protocol::ToServer::try_from(msg)
                .context("failed to convert message for workflow forwarding")?,
        })
        .tag("actor_id", actor_id)  // ❌ ERROR: 'actor_id' not defined

    Issue: The variable actor_id is not defined in this scope. You need to extract the actor_id from the events first.

    Fix: Extract actor_id from the events before using it:

    protocol::ToServer::ToServerEvents(events) => {
        for event in &events.events {
            let actor_id = crate::utils::event_actor_id(&event.inner).to_string();
            let res = ctx.signal(pegboard::workflows::actor::Event {
                inner: event.inner.clone(),
            })
            .tag("actor_id", &actor_id)
            .graceful_not_found()
            .send()
            .await?;
            if res.is_none() {
                tracing::warn!(?actor_id, "failed to send signal to actor workflow, likely already stopped");
            }
        }
    }
  2. TODO Comment Without Implementation Plan (runner2.rs:79)

    // TODO: Ack events

    Question: How will events be acknowledged now that the runner workflow no longer tracks them? This could lead to unbounded event queues or lost events if not handled properly.

  3. Potential Lost Init Handling (runner2.rs:98-108)
    The workflow now rejects ToServerInit messages, but there's no clear indication of where init logic has moved to. The init activity only writes to RunnerByKeyKey, which is much simpler than before.

    Question: Was the metadata and prepopulated actor names handling removed intentionally, or does it need to be reimplemented elsewhere?

  4. Variable Naming Improvement (conn.rs:79,89)
    ✅ Good change renaming packet to init_packet for clarity

Potential Bugs

CRITICAL: Compilation Error

  • Line 346 in ws_to_tunnel_task.rs: actor_id is undefined

⚠️ Possible Logic Issues:

  1. Event Acknowledgment: Without event tracking in the runner workflow, how are events acknowledged back to the client? This could cause the client to resend events unnecessarily.

  2. Command Flow: The Command signal was removed entirely. Need to verify that actor start commands are still being sent through an alternative path.

  3. Race Condition Handling: The old code had logic to prevent scheduling actors to draining runners (removed code around line 110-140 of the diff). Verify this is now handled elsewhere.

Performance Considerations

Improvements:

  • Reduced workflow state size (no more storing commands/events in memory)
  • Eliminated intermediate event storage and forwarding overhead
  • Simpler signal processing loop should have better throughput

⚠️ Potential Concerns:

  • Events are now forwarded individually rather than batched - could increase signal overhead if many events arrive simultaneously
  • Consider implementing batching in the new architecture if event volume is high
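
For illustration, a minimal, self-contained sketch of the batching idea (not the PR's API): events are grouped by actor id first so that each actor workflow would receive one signal per batch rather than one per event. The (String, u64) pairs stand in for the protocol's actor ids and events, and the place where the signal would be sent is only marked with a comment.

use std::collections::HashMap;

// Group events by actor id so one signal per actor can carry the whole batch.
fn group_by_actor(events: Vec<(String, u64)>) -> HashMap<String, Vec<u64>> {
    let mut by_actor: HashMap<String, Vec<u64>> = HashMap::new();
    for (actor_id, event) in events {
        by_actor.entry(actor_id).or_default().push(event);
    }
    by_actor
}

fn main() {
    let events = vec![
        ("actor-a".to_string(), 1),
        ("actor-b".to_string(), 2),
        ("actor-a".to_string(), 3),
    ];
    for (actor_id, batch) in group_by_actor(events) {
        // In the real code this is where one signal carrying `batch` would be sent,
        // tagged with `actor_id`, instead of one signal per event.
        println!("{actor_id}: {} events", batch.len());
    }
}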

Security Concerns

✅ No new security issues introduced

  • Existing validation logic for actor ownership is preserved in KV request handling
  • Error messages don't leak sensitive information

Test Coverage

⚠️ Recommendations:

  1. Add integration tests for the new event forwarding path to ensure events reach actor workflows correctly
  2. Test the timeout/expiry behavior to ensure actors are properly marked as lost when runners disconnect
  3. Verify the draining behavior works correctly without the intermediate command handling
  4. Test the init flow to ensure runners can still connect and initialize properly

Additional Questions

  1. Was the InsertDb activity's functionality moved elsewhere, or is runner initialization now handled differently?
  2. How are commands being sent to runners now that the Command signal and command batching logic are removed?
  3. What's the plan for event acknowledgment mentioned in the TODO comment?

Summary

This is a valuable simplification that removes significant complexity from the runner workflow. However, there's a critical compilation error that needs to be fixed before this can be merged. Additionally, the missing event acknowledgment logic needs to be clarified or implemented.

Recommendation: Request changes to fix the compilation error and clarify the event acknowledgment and command flow architecture.


Review generated following CLAUDE.md conventions for error handling, logging patterns, and code style.

@claude

claude bot commented Nov 19, 2025

Code Review

Overview

This PR significantly simplifies the runner workflow by removing event and command tracking logic. The changes reduce the workflow state and delegate more responsibility to actor workflows.

Key Changes Analysis

1. Removed Event/Command State Management

The PR removes:

  • Event tracking and acknowledgment logic (last_event_idx, last_event_ack_idx)
  • Command buffering and indexing (last_command_idx, commands vector)
  • Associated activities: ProcessInit, InsertEvents, InsertCommands, InsertDb

Positive: This is a good simplification that reduces state size and complexity.

2. Event Forwarding Moved to WebSocket Handler

In ws_to_tunnel_task.rs:347-362:

protocol::ToServer::ToServerEvents(events) => {
    let res = ctx.signal(pegboard::workflows::runner2::Forward {
        inner: protocol::ToServer::try_from(msg)
            .context("failed to convert message for workflow forwarding")?,
    })
    .tag("actor_id", actor_id)
    .graceful_not_found()
    // ...
}

Issue Found: This code references actor_id on lines 352 and 357, but actor_id is not defined in scope. This will cause a compilation error.

Looking at the protocol, ToServerEvents likely contains a list of events, and you need to extract the actor_id from each event. The original code had crate::utils::event_actor_id(&event.inner).

Recommendation: You need to iterate over the events and extract actor_id properly:

protocol::ToServer::ToServerEvents(events) => {
    for event in &events.events {
        let actor_id = crate::utils::event_actor_id(&event.inner);
        let res = ctx.signal(pegboard::workflows::actor::Event {
            inner: event.inner.clone(),
        })
        .to_workflow::<pegboard::workflows::actor::Workflow>()
        .tag("actor_id", &actor_id.to_string())
        .graceful_not_found()
        .send()
        .await
        .with_context(|| {
            format!("failed to forward signal to actor workflow: {}", actor_id)
        })?;
        if res.is_none() {
            tracing::warn!(?actor_id, "failed to send signal to actor workflow, likely already stopped");
        }
    }
}

3. Removed Init Message Handling 🔍

The runner workflow no longer processes ToServerInit packets, which previously:

  • Populated actor names
  • Stored runner metadata
  • Inserted runner into database indexes
  • Sent ToClientInit response with configuration

Concerns:

  1. Where does database insertion happen now? The InsertDb activity that wrote critical runner state (namespace_id, name, key, version, slots, indexes) has been removed. The Init activity only writes to RunnerByKeyKey.
  2. Runner allocation index: The removed code populated RunnerAllocIdxKey which is used by AllocatePendingActors. Without this, new runners may not be discoverable for actor allocation.
  3. Missing client response: Runners expect a ToClientInit response. Where is this sent now?

Recommendation: Verify that runner initialization is complete elsewhere, or this may break runner connectivity and actor allocation.

4. Simplified Runner Workflow State

LifecycleState now only tracks:

struct LifecycleState {
    draining: bool,
}

Positive: Much cleaner state management.

5. Variable Naming Improvement

In conn.rs:67-77: Renamed packet to init_packet for clarity. Good improvement.

6. Removed Signal Type

Removed Command signal variant since commands are no longer buffered by the runner workflow.

Potential Issues

  1. 🔴 CRITICAL: Compilation Error

    • Line 352 & 357 in ws_to_tunnel_task.rs: actor_id is undefined
    • This will fail to compile
  2. 🟡 WARNING: Missing Runner Initialization

    • Database insertion logic removed but unclear where it is replaced
    • Runner allocation index population missing
    • ToClientInit response not sent
  3. 🟡 WARNING: Event Acknowledgment

    • Line 79 has // TODO: Ack events comment
    • Without acking, runners may resend events indefinitely
  4. 🟢 INFO: Event Batch Size Constant

    • EVENT_ACK_BATCH_SIZE constant on line 16 is now unused (can be removed)

Testing Recommendations

  1. Compile test: Verify the code compiles (currently blocked by undefined actor_id)
  2. Integration test: Test full runner connection flow:
    • Runner connects and sends ToServerInit
    • Verify runner appears in allocation indexes
    • Verify actors can be allocated to the runner
  3. Event forwarding test: Verify actor events are correctly forwarded to actor workflows
  4. Graceful degradation: Test what happens when actor workflow does not exist

Security Considerations

No security issues identified. The changes maintain proper actor ownership validation in KV operations.

Performance Considerations

Positive: Removing state buffering reduces memory usage and workflow replay time. The simplified state should improve workflow performance.

Code Quality

  • Follow CLAUDE.md conventions ✅
  • Proper error handling with anyhow ✅
  • Structured logging with tracing ✅
  • Lowercase log messages ✅

Summary

This is a good architectural simplification, but has critical compilation errors that must be fixed before merging. The undefined actor_id variable needs to be resolved, and the initialization flow needs verification to ensure runners are properly registered in the database.

Recommendation: Fix the compilation error, verify the init flow is complete, and add tests before merging.

@claude

claude bot commented Nov 19, 2025

Critical Bug Found: The variable actor_id is referenced on line 352 of ws_to_tunnel_task.rs but is never defined. This will cause a compilation error in the ToServerEvents match arm. The events need to be iterated individually and each event actor_id extracted to forward to the correct actor workflow.

@claude

claude bot commented Nov 19, 2025

PR Review Summary - Part 1:

This PR significantly simplifies the runner workflow by removing ~500 lines of code. Major changes include:

MUST FIX:

  1. Compilation error with undefined actor_id variable in ws_to_tunnel_task.rs:352
  2. Remove unused EVENT_ACK_BATCH_SIZE constant (line 15-16 of runner2.rs)
  3. Remove unused CommandRow struct (line 43-48 of runner2.rs)

CODE QUALITY ISSUES:

  • Missing event acknowledgment implementation (TODO at line 79)
  • Events not being acknowledged to runners could cause message buildup
  • Loss of backpressure mechanisms

ARCHITECTURAL CONCERNS:

  • Removed tracking of event/command indices (last_event_idx, last_event_ack_idx, last_command_idx)
  • Removed storage of commands and events in workflow state
  • Removed InsertDb, ProcessInit, InsertEvents, InsertCommands activities (300+ lines)
  • Removed Command signal from Main join_signal

Questions to address:

  • Where is event/command state now tracked?
  • How do runners recover from disconnections?
  • How is ordering guaranteed?
  • Is this a breaking change?

@claude

claude bot commented Nov 19, 2025

PR Review Summary - Part 2:

PERFORMANCE CONSIDERATIONS:

Positive:

  • Significantly reduced workflow state size
  • Fewer database transactions per message
  • Better separation of concerns (events route directly to actors)

Concerns:

  • Event forwarding will need to iterate events individually
  • Missing batch acknowledgment logic

TEST COVERAGE:
No test changes are included. For a refactoring this large, the following should be added:

  1. Integration tests for new event routing path
  2. Runner reconnection scenario tests
  3. Command delivery verification
  4. Concurrent event handling tests

POSITIVE CHANGES:

  • Good naming improvement: packet renamed to init_packet (conn.rs:67)
  • Cleaner separation between runner and actor workflows
  • Simplified workflow logic

Overall: This is a valuable simplification but needs the compilation fix and testing before merge. The architectural changes need documentation to help understand the new event flow and migration path for existing runners.

@MasterPtato force-pushed the 11-18-fix_pb_simplify_runner_wf branch from fefb6f0 to c088d42 on November 20, 2025 03:07
@MasterPtato force-pushed the 11-18-fix_gas_fix_loop_forgotten_bug_due_to_concurrency branch from 4294775 to 5458e50 on November 20, 2025 03:07
@vercel bot temporarily deployed to Preview – rivetkit-serverless on November 20, 2025 03:07 (Inactive)
@claude

claude bot commented Nov 20, 2025

Code Review

I've reviewed this PR which simplifies the runner workflow. Here's my detailed feedback:

🎯 Overall Assessment

This is a substantial refactoring that introduces a new protocol version (v4/mk2) and simplifies the runner workflow architecture. The changes are generally well-structured, but there are several areas that need attention.


Strengths

  1. Good Protocol Versioning Strategy: The separation between mk1 and mk2 protocols using is_mk2() helper is clean and allows for graceful migration.

  2. Event Batching: The new ActorEventDemuxer properly batches events (up to 1024), which should improve performance - good optimization (see the sketch after this list).

  3. Proper Shutdown Handling: The shutdown() method on ActorEventDemuxer properly awaits all spawned tasks, preventing potential data loss.


🔴 Critical Issues

1. Compilation Errors in tunnel_to_ws_task.rs

The refactored code has critical issues that will prevent compilation:

Lines ~45-65: The recv_msg() and handle_message_mk2() functions reference undefined variables:

  • tunnel_sub, eviction_sub, tunnel_to_ws_abort_rx are not in scope
  • ctx, conn, init_tx, event_demuxer are not in scope in the refactored functions
  • Function signatures don't match their usage

The functions appear to have been extracted but weren't properly parameterized. Example:

async fn recv_msg() -> Result<std::result::Result<Option<()>, LifecycleResult>> {
    let tunnel_msg = tokio::select! {
        res = tunnel_sub.next() => { // ❌ tunnel_sub not in scope

Fix: Add proper parameters to these functions or inline them back into the main task loop.

2. Missing Signal Type in conn.rs:149

ctx.signal(Init);  // ❌ What is Init?

This signal type doesn't appear to be defined anywhere in the visible code. This will fail to compile.

3. Undefined Helper Function in lib.rs

Line 6 references protocol::is_new() but the protocol module only exports is_mk2(). This function doesn't exist:

let workflow_id = if protocol::is_new(protocol_version) {  // ❌ Undefined function

Should probably be protocol::is_mk2().

4. Incorrect Return Type in tunnel_to_ws_task.rs

async fn recv_msg() -> Result<std::result::Result<Option<()>, LifecycleResult>> {
    // ...
    Ok(Some(data))  // ❌ Returns data but signature says Option<()>
}

The function signature says it returns Option<()> but it's actually returning data (a Vec<u8>).


⚠️ High Priority Issues

5. Potential Memory Leak in ActorEventDemuxer

File: actor_event_demuxer.rs:36-67

The last_seen timestamp is set when channels are created but never updated when events are ingested. This means:

  • Active channels will be garbage collected after 30 seconds even if they're receiving events
  • The TODO comment on line 77 about aborting being safe suggests uncertainty

Fix:

pub fn ingest(&mut self, actor_id: Id, event: protocol::Event) {
    if let Some(channel) = self.channels.get_mut(&actor_id) {
        channel.last_seen = Instant::now();  // ← Add this
        let _ = channel.tx.send(event);
    } else {
        // ... existing channel-creation path unchanged
    }
}

6. Silent Error Handling

File: actor_event_demuxer.rs:38

let _ = channel.tx.send(event);  // ❌ Silently ignores send errors

If the channel is full or closed, events are silently dropped. This could lead to hard-to-debug issues.

Recommendation: At minimum, log when sends fail:

if let Err(err) = channel.tx.send(event) {
    tracing::warn!(?actor_id, ?err, "failed to send event to actor channel");
}

7. Missing Protocol Version in runner2.rs InsertDbInput

File: runner2.rs:289-299

The InsertDbInput struct includes protocol_version field (line 295), but when this struct is instantiated in the workflow (lines 74-82), the protocol_version field is not provided. This will cause a compilation error.

Fix: Add protocol_version: input.protocol_version to the struct initialization.

8. Incorrect Variable Reference in ping_task.rs

File: ping_task.rs:38

let wf = if protocol::is_mk2(conn.protocol_version) {  // ✓ Correct
    // ...
} else {
    // ...
};

// Later on line 75:
if protocol::is_mk2(conn.protocol_version) {  // ❌ Should be consistent

The check uses conn.protocol_version but the function definition suggests it should use the workflow's stored protocol version. Verify this is intentional.


📋 Code Quality Issues

9. Inconsistent Error Handling Pattern

The error handling varies between graceful degradation and hard failures without clear rationale:

  • actor_event_demuxer.rs:119: Logs warning but continues
  • tunnel_to_ws_task.rs: Returns error immediately on parse failure (after fix)

Recommendation: Document the error handling strategy - when to fail fast vs. gracefully degrade.

10. Missing Documentation

The new ActorEventDemuxer is a core component but lacks module-level documentation explaining:

  • Why batching is beneficial
  • The GC strategy and its implications
  • Thread safety considerations

🔍 Potential Bugs

11. Race Condition in Initialization

File: ws_to_tunnel_task.rs:46-55 and tunnel_to_ws_task.rs:23-32

For mk2 protocol, tunnel_to_ws task waits for init_rx signal before processing, while ws_to_tunnel sends it. However:

  • What happens if the init signal is sent before the tunnel_to_ws task starts waiting?
  • watch::channel(()) keeps only the latest value; a receiver created alongside the channel still observes an earlier send via changed(), but a receiver obtained later (e.g. via subscribe()) marks the current value as seen and would miss it

Recommendation: Verify the initialization sequence is deterministic or use a different synchronization primitive.
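
For the narrow case where the receiver is created together with the channel and the send happens before the receiver starts waiting, a small standalone check of tokio's watch semantics suggests the notification is not lost, because changed() compares version counters rather than relying on a live wakeup. This is a sketch of that assumption, not of the PR's actual handshake:

use tokio::sync::watch;

#[tokio::main]
async fn main() {
    let (init_tx, mut init_rx) = watch::channel(());
    // The send happens before anyone is waiting on the receiver.
    init_tx.send(()).unwrap();
    // The receiver still observes the change; this call returns immediately.
    init_rx.changed().await.unwrap();
    println!("init observed");
}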

12. Incomplete Migration Path

The code supports both mk1 (v3) and mk2 (v4) protocols but:

  • No migration strategy documented
  • No metrics/logging to track which protocol versions are in use
  • No deprecation timeline for mk1

Recommendation: Add telemetry to track protocol version usage.


🧪 Testing Concerns

  • Missing Test Coverage: This is a major refactor with no apparent test additions
  • Protocol Compatibility: No evidence of tests validating mk1/mk2 compatibility
  • Error Paths: Edge cases like GC during active event processing aren't tested

Recommendation: Add integration tests covering:

  1. Protocol version negotiation
  2. ActorEventDemuxer GC behavior
  3. Graceful shutdown sequences

📝 Minor Issues

  1. Function Naming: is_mk2 could be more descriptive, e.g. is_protocol_v4_or_later
  2. Magic Numbers: EVENT_ACK_BATCH_SIZE = 500 and batch size 1024 in ActorEventDemuxer should be constants with explanatory comments
  3. Inconsistent Naming: mk2 vs v4 - pick one naming convention

🎬 Action Items

Before merging:

  1. ✅ Fix all compilation errors (issues 1-4 above)
  2. ✅ Fix the memory leak in ActorEventDemuxer (issue 5)
  3. ✅ Add protocol_version to InsertDbInput instantiation (issue 7)
  4. ⚠️ Improve error handling for silent failures (issue 6)
  5. ⚠️ Add tests for new functionality
  6. ⚠️ Verify initialization race condition (issue 11)

Future improvements:

  • Add documentation for ActorEventDemuxer
  • Add protocol version metrics
  • Create migration guide for mk1 → mk2

📊 Overall Recommendation

Do not merge yet - there are critical compilation errors that need to be fixed first. Once the compilation issues are resolved and the memory leak is addressed, this refactor looks promising. The architecture simplification is a good direction, but the implementation needs polish.

Please address the critical issues and consider adding tests before merging.

@claude

claude bot commented Nov 20, 2025

PR Review: fix(pb): simplify runner wf

I've reviewed this PR which introduces a new protocol version (v4/mk2) for the pegboard runner workflow and adds an event demuxer for actor events.

Critical Issues

1. Compilation Errors in tunnel_to_ws_task.rs

  • Line 21: conn.protocol should be conn.protocol_version
  • Lines 36-44, 62-76: recv_msg() and handle_message_mk2() are called without required parameters
  • Line 51: Function signature declares no parameters but references tunnel_sub, eviction_sub, and tunnel_to_ws_abort_rx
  • Line 79: Function signature declares no parameters but uses tunnel_msg, ctx, and conn

These helper functions need to either accept the necessary parameters in their signatures OR be moved inside the main function as closures.
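
A minimal, self-contained illustration of the first option, where the helper receives the channels it selects over as parameters instead of reading variables from the caller's scope; the mpsc/watch channels and String messages are stand-ins for the real subscriber and message types:

use tokio::sync::{mpsc, watch};

async fn recv_msg(
    tunnel_rx: &mut mpsc::Receiver<String>,
    abort_rx: &mut watch::Receiver<()>,
) -> Option<String> {
    // Everything the select reads from is now an explicit parameter.
    tokio::select! {
        msg = tunnel_rx.recv() => msg,
        _ = abort_rx.changed() => None,
    }
}

#[tokio::main]
async fn main() {
    let (tx, mut rx) = mpsc::channel(8);
    let (_abort_tx, mut abort_rx) = watch::channel(());
    tx.send("hello".to_string()).await.unwrap();
    if let Some(msg) = recv_msg(&mut rx, &mut abort_rx).await {
        println!("received: {msg}");
    }
}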

2. Compilation Errors in ws_to_tunnel_task.rs
Similar scoping issues with recv_msg() and handle_message_mk2() not receiving proper parameters.

3. Missing Signal Handling in conn.rs
Line 177 uses ctx.signal(Init) but Init is not defined, imported, or given a workflow target.

High Priority Issues

4. Potential Memory Leak in ActorEventDemuxer
The last_seen field is never updated after initial creation (line 64), so GC will incorrectly remove channels after 30 seconds. Use get_mut to update timestamp on each ingest.

5. Error Handling Silently Drops Send Failures
Line 38 in actor_event_demuxer.rs silently drops events if send fails. Should log warnings.

6. Protocol Version Checking Inconsistencies
conn.rs:150 uses protocol::is_new() which is not defined. Should be is_mk2?

Medium Priority Issues

7. Missing Protocol Version Field
InsertDbInput struct defines protocol_version field but never sets it when created (lines 74-82 of runner2.rs).

8. Incorrect Message Type Usage
Line 185 of runner2.rs uses protocol::ToRunner::ToRunnerClose but for mk2 should be protocol::mk2::ToRunner::ToRunnerClose.

Code Quality

Good:

  • Event batching up to 1024 events is excellent
  • Graceful shutdown waits for in-flight events
  • Proper structured logging

Suggestions:

  • Add metrics for demuxer channel count
  • Add tests for protocol version branching

Summary

Cannot merge as-is due to compilation errors. Priority fixes:

  1. Fix function signatures in tunnel_to_ws_task.rs and ws_to_tunnel_task.rs
  2. Fix or define Init signal in conn.rs
  3. Update last_seen timestamp in demuxer
  4. Add missing protocol_version field to InsertDbInput
  5. Define or remove protocol::is_new()

Once addressed, this is a solid simplification!

@MasterPtato force-pushed the 11-18-fix_pb_simplify_runner_wf branch from c088d42 to 657b813 on November 21, 2025 19:57
@MasterPtato force-pushed the 11-18-fix_gas_fix_loop_forgotten_bug_due_to_concurrency branch from 5458e50 to 8431b53 on November 21, 2025 19:57
@claude

claude bot commented Nov 21, 2025

PR Review: fix(pb): simplify runner wf

Summary

This PR introduces a new protocol version (mk2/v4) for the pegboard runner workflow while maintaining backwards compatibility with the existing mk1 protocol. The goal appears to be simplifying the runner workflow architecture by moving certain processing logic from the workflow to the WebSocket task handlers.


🔴 Critical Issues - Code Won't Compile

The PR has several critical compilation errors that must be fixed before merging:

1. Functions with missing/incorrect parameters (tunnel_to_ws_task.rs)

The helper functions recv_msg(), handle_message_mk2(), and handle_message_mk1() are defined without proper parameters but reference variables from the outer scope (tunnel_sub, eviction_sub, tunnel_to_ws_abort_rx, ctx, conn). These functions need to either:

  • Accept these as parameters
  • Be closures instead of standalone functions
  • Be refactored to properly capture context
// Line 52-78: recv_msg() references tunnel_sub, eviction_sub, tunnel_to_ws_abort_rx but takes no parameters
async fn recv_msg() -> Result<std::result::Result<(), LifecycleResult>> {
    let tunnel_msg = tokio::select! {
        res = tunnel_sub.next() => { // tunnel_sub is not in scope!
// Line 80: handle_message_mk2() has no parameters but references tunnel_msg, ctx, conn
async fn handle_message_mk2() -> Result<()> {
// Line 166: handle_message_mk1() has same issue plus uses `continue` outside a loop
async fn handle_message_mk1() -> Result<()> {
    // ...
    continue; // Line 201 - ERROR: `continue` used outside of loop

2. Functions with missing/incorrect parameters (ws_to_tunnel_task.rs)

Similar issues in ws_to_tunnel_task.rs:

// Line 79: Returns wrong type, takes no parameters but needs ws_rx, eviction_sub2, etc.
async fn recv_msg() -> Result<std::result::Result<Option<()>, LifecycleResult>> {
// Line 119-125: Function signature has parameters but doesn't use them correctly, msg param is ()
async fn handle_message_mk2(
    ctx: &StandaloneCtx,
    conn: &Arc<Conn>,
    init_tx: &watch::Sender<()>,
    event_demuxer: &mut ActorEventDemuxer,
    msg: (), // Should be the message data, not unit type
) -> Result<()> {
    let msg = match versioned::ToServerMk2::deserialize(&data, ...) // `data` is not defined
// Line 467: Similar issues
async fn handle_message_mk1(ctx: &StandaloneCtx, conn: &Arc<Conn>, msg: ()) -> Result<()> {

3. Undefined variables referenced

Multiple places reference data which is never defined:

  • tunnel_to_ws_task.rs:82 - tunnel_msg.payload should be used
  • ws_to_tunnel_task.rs:128 - data is undefined
  • ws_to_tunnel_task.rs:470 - data is undefined
  • ws_to_tunnel_task.rs:479 - data is undefined

4. continue outside loop context

  • tunnel_to_ws_task.rs:201 - Uses continue in a function that returns from a match arm

5. Incomplete function implementations

// ws_to_tunnel_task.rs:761-768 - Function body is just comments
async fn ack_commands(ctx: &StandaloneCtx) -> Result<()> {
    // ctx.udb()?.run(|| {
    // ...
    // }).await?;
}

6. Missing signal type and undefined symbols

  • conn.rs:177-178 - ctx.signal(Init) - Init struct is referenced but not imported/fully qualified
  • conn.rs:130 - protocol::is_new function doesn't exist (should be protocol::is_mk2)
  • tunnel_to_ws_task.rs:22 - References conn.protocol but field is conn.protocol_version
  • tunnel_to_ws_task.rs:96,101 - protocol::ToGateway::ToGatewayPong vs protocol::mk2::ToGateway::ToGatewayPong mixing

7. Undefined types/modules referenced

  • ws_to_tunnel_task.rs:775 - protocol::mk2::ToServerTunnelMessage - Need to verify mk2 module exports this
  • ws_to_tunnel_task.rs:785 - protocol::mk2::compat::version_needs_tunnel_ack - Module may not exist
  • ws_to_tunnel_task.rs:807 - tunnel_id::parse_message_id - Module not imported

🟡 Code Quality Issues

1. Code duplication

The KV handling code in handle_message_mk2() and handle_message_mk1() is nearly identical (lines 150-375 and 510-735 in ws_to_tunnel_task.rs). Consider extracting this into a shared helper function.

2. Missing return statement (ws_to_tunnel_task.rs:759)

async fn handle_message_mk1(...) -> Result<()> {
    // ... match arms handle all cases but function doesn't explicitly return
}

3. Inconsistent error messages

In ws_to_tunnel_task.rs, many error contexts say "failed to serialize KV delete response" even when handling init or commands packets (lines 415-434).


🟡 Architecture Concerns

1. ActorEventDemuxer cleanup logic

In actor_event_demuxer.rs:36-67, when a new event arrives for an actor_id that already has a channel, the last_seen timestamp is never updated:

pub fn ingest(&mut self, actor_id: Id, event: protocol::Event) {
    if let Some(channel) = self.channels.get(&actor_id) {
        let _ = channel.tx.send(event);
        // BUG: last_seen is never updated here!
    } else {
        // ... creates new channel with last_seen: Instant::now()
    }

This means active channels could be garbage collected prematurely.

2. Unbounded channel risk

The ActorEventDemuxer uses mpsc::unbounded_channel(). If the dispatch task falls behind or hangs, this could lead to memory exhaustion. Consider using a bounded channel with backpressure.

3. Error handling in demuxer

The actor event processor breaks on any error (line 53-55), which means a transient error could permanently stop event processing for an actor.


🟢 Positive Aspects

  1. Good backward compatibility approach - The mk1/mk2 protocol split allows gradual migration
  2. Proper shutdown handling in ActorEventDemuxer::shutdown() that awaits all tasks
  3. Good use of tracing instrumentation throughout the new code
  4. Protocol versioning is well-structured with the new v4.bare schema

📋 Recommendations

  1. Fix all compilation errors before review can continue - the code in its current state cannot be tested
  2. Add unit tests for the new ActorEventDemuxer component
  3. Extract shared KV handling code to reduce duplication
  4. Consider bounded channels in the demuxer to prevent memory issues
  5. Fix the last_seen update bug in the demuxer
  6. Clean up commented code in actor/mod.rs (lines 322-346, 352-370) - either implement the TODO or remove

Files Requiring Attention

File Priority Issue
tunnel_to_ws_task.rs 🔴 Critical Won't compile - function scoping issues
ws_to_tunnel_task.rs 🔴 Critical Won't compile - undefined variables, wrong types
conn.rs 🔴 Critical Undefined Init signal, wrong function name
actor_event_demuxer.rs 🟡 Medium Logic bug with last_seen
runner2.rs 🟢 Low Missing protocol_version in InsertDbInput

🤖 Generated with Claude Code

@claude

claude bot commented Nov 21, 2025

Code Review: fix(pb): simplify runner wf

Summary

This PR introduces a significant refactoring of the Pegboard runner workflow system, adding a "mk2" protocol version (v4) while maintaining backward compatibility with mk1 (v1-v3). The changes simplify the runner workflow by moving more event handling logic from the workflow itself to the WebSocket task layer.

Critical Issues

1. Compilation Errors - Code Will Not Build

Multiple files have severe issues that will prevent compilation:

tunnel_to_ws_task.rs:

  • Lines 37-48: recv_msg() and handle_message_mk2()/handle_message_mk1() are called without required arguments. These functions need ctx, conn, etc.
  • Line 52: recv_msg() function has no parameters but references tunnel_sub, eviction_sub, tunnel_to_ws_abort_rx which are not in scope
  • Lines 80-163: handle_message_mk2() takes no parameters but references tunnel_msg, ctx, conn - undefined variables
  • Lines 166-252: Same issue with handle_message_mk1() - uses continue outside of loop context (line 201)

ws_to_tunnel_task.rs:

  • Lines 62-75: Same pattern - recv_msg() called without context, handle_message_mk2() and handle_message_mk1() called without required arguments
  • Lines 79-116: recv_msg() references ws_rx, eviction_sub2, ws_to_tunnel_abort_rx which are not in scope
  • Lines 119-463: handle_message_mk2() function signature has msg: () but body tries to use data variable
  • Lines 395-435: References undefined variables input, runner_lost_threshold, last_command_idx, prepopulate_actor_names, metadata
  • Lines 761-768: ack_commands() function has an empty body with only comments

conn.rs:

  • Line 178: ctx.signal(Init); - Init is not imported and signal is not sent (missing .send().await)

2. Missing Import - is_new Function

  • conn.rs:130 calls protocol::is_new(protocol_version) but this function is not defined in lib.rs. Only is_mk2 is defined.

3. actor_event_demuxer.rs - Logic Bug

  • Lines 36-67: When a new actor_id is encountered, a channel is created and the spawn happens, but the initial event is never sent to the channel. The tx.send(event) only happens if the channel already exists.

Code Quality Issues

4. Code Duplication

The mk1/mk2 split introduces significant code duplication:

  • handle_message_mk1 and handle_message_mk2 in both tunnel_to_ws_task.rs and ws_to_tunnel_task.rs have nearly identical KV handling code
  • handle_tunnel_message_mk1 and handle_tunnel_message_mk2 duplicate tunnel message handling logic

Consider extracting shared logic into helper functions.

5. Incomplete Signal Definition

runner2.rs defines signals but there's no Init signal defined in the file:

join_signal!(Main {
    Forward(Forward),
    CheckQueue,
    Stop,
});

Where is Init used in the match ctx.listen_with_timeout::<Main> at line 72?

6. InsertDbInput Missing Field

runner2.rs:289-299: InsertDbInput has protocol_version: u16 field but the activity call at line 74-83 doesn't pass this field.

Performance Considerations

7. Unbounded Channel in ActorEventDemuxer

Using mpsc::unbounded_channel() could lead to memory issues if events arrive faster than they can be processed. Consider using a bounded channel with backpressure.
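
A minimal sketch of the bounded alternative, assuming nothing about the demuxer beyond what is described here; the capacity of 2 is deliberately tiny to exercise the overflow path, and try_send is used because ingest is a synchronous method in the quoted code:

use tokio::sync::mpsc;

#[tokio::main]
async fn main() {
    let (tx, mut rx) = mpsc::channel::<u64>(2);

    for event in 0..4u64 {
        match tx.try_send(event) {
            Ok(()) => {}
            // Surfacing overload explicitly replaces unbounded memory growth;
            // real code would log via tracing and decide whether to drop or wait for capacity.
            Err(mpsc::error::TrySendError::Full(ev)) => eprintln!("channel full, dropped event {ev}"),
            Err(mpsc::error::TrySendError::Closed(ev)) => eprintln!("channel closed, dropped event {ev}"),
        }
    }

    while let Ok(event) = rx.try_recv() {
        println!("processed event {event}");
    }
}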

8. GC Implementation

The GC in actor_event_demuxer.rs never updates last_seen after initial insert. The channel will always be GC'd after MAX_LAST_SEEN regardless of activity:

if let Some(channel) = self.channels.get(&actor_id) {
    let _ = channel.tx.send(event);  // last_seen never updated here
}

Security Considerations

9. Error Messages Leak Internal Details

Multiple places return internal error messages to clients:

message: err.to_string(),  // TODO comments note this

The TODO comments acknowledge this should be addressed.

Test Coverage

No test files were included in this PR. Given the complexity of the protocol version handling and the critical nature of the runner workflow, tests should be added for:

  • Protocol version detection and routing
  • Event demuxing behavior
  • Backward compatibility with mk1 clients

Recommendations

  1. Fix compilation errors - The code needs significant fixes before it can even build
  2. Add the missing Init signal definition and ensure it's properly handled
  3. Fix the actor_event_demuxer to send the initial event when creating a new channel
  4. Consider bounded channels for the event demuxer to prevent memory issues
  5. Add unit tests for the new protocol handling logic
  6. Reduce duplication by extracting common code between mk1/mk2 handlers

Overall Assessment

This PR introduces important architectural changes but appears to be incomplete. The code has multiple compilation errors and logic bugs that need to be addressed before merge.

@claude

claude bot commented Nov 22, 2025

PR Review: fix(pb): simplify runner wf

This PR introduces significant architectural changes to the Pegboard runner workflow, including a new protocol version ("mk2") and refactoring of the connection handling and event dispatching. Here's my detailed review:


🔴 Critical Issues

1. Compilation Errors - Missing Variables in Scope

Several functions reference variables that are not in scope, indicating the code is incomplete or won't compile:

tunnel_to_ws_task.rs:

  • recv_msg() (line 52-78) references tunnel_sub, eviction_sub, tunnel_to_ws_abort_rx but these are not passed as parameters
  • handle_message_mk2() (line 80) and handle_message_mk1() (line 166) reference tunnel_msg, ctx, conn without parameters
  • Line 201 in handle_message_mk1() uses continue outside a loop context

ws_to_tunnel_task.rs:

  • recv_msg() (line 79-117) references ws_rx, eviction_sub2, ws_to_tunnel_abort_rx without them being parameters
  • handle_message_mk2() (line 120-125) has incorrect parameter msg: () type but uses data variable inside
  • handle_message_mk1() (line 467) references data which is not a parameter
  • Line 395-403: handle_message_mk2 references undefined input, last_command_idx, prepopulate_actor_names, metadata, runner_lost_threshold
  • Lines 415-434: Uses res_msg and res_msg_serialized instead of msg and msg_serialized

2. Incomplete Implementation

  • ack_commands() function (line 761-768 in ws_to_tunnel_task.rs) has an empty body containing only commented-out code
  • Multiple commented-out // TODO: Send message to tunnel blocks in actor/mod.rs and actor/runtime.rs indicate unfinished migration

3. Missing Signal Struct

  • conn.rs line 177-178: ctx.signal(Init); - The Init signal is referenced but the struct import/definition is not shown. Should likely be ctx.signal(pegboard::workflows::runner2::Init {}).

🟡 Potential Bugs

1. ActorEventDemuxer - Event Loss on First Ingest

In actor_event_demuxer.rs line 36-67, when a new actor channel is created, the initial event is never sent to the new channel:

pub fn ingest(&mut self, actor_id: Id, event: protocol::Event) {
    if let Some(channel) = self.channels.get(&actor_id) {
        let _ = channel.tx.send(event);  // Existing channel - event sent
    } else {
        // New channel created... but `event` is never sent!
        let (tx, mut rx) = mpsc::unbounded_channel();
        // ...
        self.channels.insert(actor_id, Channel { tx, handle, last_seen: Instant::now() });
    }
    // ... gc code
}

The first event for each actor will be dropped.

2. GC Never Updates last_seen

The last_seen field in Channel struct is set at creation time but never updated when events are received. This means active channels may be incorrectly garbage collected after 30 seconds.

3. Missing ToRunnerClose Rename

In pegboard-serverless/src/lib.rs line 523, the code was changed to use ToRunnerClose but this appears to be a protocol rename. Need to verify the protocol definition matches.


🟠 Code Quality Issues

1. Massive Code Duplication

handle_message_mk1 and handle_message_mk2 in ws_to_tunnel_task.rs share ~90% identical code (KV operations, error handling). Consider extracting common logic into shared functions.

2. Inconsistent Error Message Context

In handle_message_mk2 lines 415-434, error contexts say "failed to serialize KV delete response" for non-KV-delete operations:

let msg_serialized = res_msg
    .serialize(conn.protocol_version)
    .context("failed to serialize KV delete response")?;  // Wrong context!

3. Protocol Version Checking Functions

Multiple functions like is_mk2(), is_new() are used inconsistently throughout the codebase. Consider documenting the protocol version scheme more clearly.
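
One way to pin the scheme down is a short doc comment on the helper itself. The >= 4 cutoff below is an assumption based on the v1-v3 (mk1) vs v4 (mk2) split described in these reviews, not confirmed against the actual implementation:

/// Protocol versions 4 and above use the new ("mk2") message path;
/// versions 1-3 remain on the original ("mk1") path.
pub fn is_mk2(protocol_version: u16) -> bool {
    protocol_version >= 4
}

fn main() {
    assert!(!is_mk2(3));
    assert!(is_mk2(4));
}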


🔵 Architecture Observations

  1. Protocol Versioning Strategy: The PR introduces a parallel protocol path (mk1 vs mk2). This is a reasonable approach for backwards compatibility, but ensure there's a clear deprecation timeline.

  2. Event Demuxer Pattern: The ActorEventDemuxer with batched event dispatch is a good pattern for reducing signal overhead.

  3. Init Synchronization: Using watch::channel for init synchronization between tasks is appropriate.


📋 Recommendations

  1. Fix all compilation errors before merging - the functions need proper parameter passing
  2. Complete the ack_commands() implementation or remove it if not needed
  3. Fix the ActorEventDemuxer to send the first event and update last_seen on each ingest
  4. Add tests for the new protocol version handling and event demuxer
  5. Extract shared KV handling code to reduce duplication
  6. Address all TODO comments for sending messages to tunnel

Summary

This PR has significant architectural value in simplifying the runner workflow, but is not ready to merge due to compilation errors and incomplete implementations. The code structure is good, but needs completion of the migration work.

🤖 Generated with Claude Code

@claude

claude bot commented Nov 22, 2025

PR Review: fix(pb): simplify runner wf

Summary

This PR introduces significant changes to the Pegboard runner system, introducing a new protocol version (v4/mk2) alongside maintaining backwards compatibility with v1-v3 (mk1). The main goal appears to be simplifying the runner workflow by moving more logic into the pegboard-runner service rather than the workflow itself.

Critical Issues

1. Compilation Errors - Missing Variables in Closures

Several functions in tunnel_to_ws_task.rs and ws_to_tunnel_task.rs reference variables like ctx, conn, tunnel_msg, data that are not passed as parameters or captured:

tunnel_to_ws_task.rs:52-77 - recv_msg() references tunnel_sub, eviction_sub, tunnel_to_ws_abort_rx without them being passed:

async fn recv_msg() -> Result<std::result::Result<(), LifecycleResult>> {
    let tunnel_msg = tokio::select! {
        res = tunnel_sub.next() => { // tunnel_sub not in scope

tunnel_to_ws_task.rs:80-163 - handle_message_mk2() references ctx, conn, tunnel_msg without parameters:

async fn handle_message_mk2() -> Result<()> {
    // Parse message
    let msg = match versioned::ToRunner2::deserialize_with_embedded_version(&tunnel_msg.payload) {
        // tunnel_msg not in scope, ctx not in scope

ws_to_tunnel_task.rs:79-117 - Same issue with recv_msg() referencing ws_rx, eviction_sub2, ws_to_tunnel_abort_rx

ws_to_tunnel_task.rs:119-464 - handle_message_mk2() has function signature msg: () but references data, conn, ctx, init_tx, etc.

ws_to_tunnel_task.rs:467-759 - handle_message_mk1() similarly references undefined variables

2. Incomplete Function Body

ws_to_tunnel_task.rs:761-768 - ack_commands() has an empty/incomplete body:

async fn ack_commands(ctx: &StandaloneCtx) -> Result<()> {
    // ctx.udb()?.run(|| {
    //     let last_ack = ;
    //     let stream = .read_ranges_keyvalues({
    //         limit:
    //     });
    // }).await?;
}

This function is called but does nothing and doesn't return properly.

3. Missing Signal Definition

conn.rs:177-178 - Init signal is referenced but appears undefined:

if protocol::is_new(protocol_version) {
    ctx.signal(Init); // Init is not defined anywhere visible

4. Use of continue in Non-Loop Context

tunnel_to_ws_task.rs:201 - continue statement inside a function, not a loop:

protocol::ToRunner::ToRunnerPing(ping) => {
    // ...
    continue; // This is inside handle_message_mk1, not a loop
}

Code Quality Issues

5. Missing last_seen Update in ActorEventDemuxer

actor_event_demuxer.rs:36-67 - When an event is ingested for an existing channel, last_seen is never updated, which could cause premature GC:

pub fn ingest(&mut self, actor_id: Id, event: protocol::Event) {
    if let Some(channel) = self.channels.get(&actor_id) {
        let _ = channel.tx.send(event);
        // Missing: channel.last_seen = Instant::now();
    }

6. Event Not Sent on First Ingest

actor_event_demuxer.rs:36-67 - When a new channel is created, the initial event is never sent to it. The code creates the channel but doesn't send the first event:

} else {
    let (tx, mut rx) = mpsc::unbounded_channel();
    // ... spawn task ...
    self.channels.insert(actor_id, Channel { tx, handle, last_seen: Instant::now() });
    // Missing: tx.send(event) for the initial event
}
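
A self-contained sketch of an ingest that addresses both issues: the first event is forwarded when the channel is created, and last_seen is refreshed on every event. The String/u64 types and the inline spawn are stand-ins for the PR's Id, protocol::Event, and per-actor dispatch task:

use std::collections::HashMap;
use std::time::Instant;
use tokio::sync::mpsc;

struct Channel {
    tx: mpsc::UnboundedSender<u64>,
    // The real Channel also stores the spawned task's JoinHandle for shutdown.
    last_seen: Instant,
}

struct Demuxer {
    channels: HashMap<String, Channel>,
}

impl Demuxer {
    fn ingest(&mut self, actor_id: String, event: u64) {
        let now = Instant::now();
        if let Some(channel) = self.channels.get_mut(&actor_id) {
            // Refresh last_seen so active channels are not garbage collected.
            channel.last_seen = now;
            let _ = channel.tx.send(event);
        } else {
            let (tx, mut rx) = mpsc::unbounded_channel();
            let id = actor_id.clone();
            tokio::spawn(async move {
                while let Some(ev) = rx.recv().await {
                    println!("{id}: event {ev}");
                }
            });
            // Forward the first event before inserting so it is not dropped.
            let _ = tx.send(event);
            self.channels.insert(actor_id, Channel { tx, last_seen: now });
        }
    }
}

#[tokio::main]
async fn main() {
    let mut demuxer = Demuxer { channels: HashMap::new() };
    demuxer.ingest("actor-a".into(), 1);
    demuxer.ingest("actor-a".into(), 2);
    // Give the spawned task a moment to drain before this toy example exits.
    tokio::time::sleep(std::time::Duration::from_millis(50)).await;
    for (actor_id, channel) in &demuxer.channels {
        println!("{actor_id} last seen {:?} ago", channel.last_seen.elapsed());
    }
}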

7. Return Type Mismatch

tunnel_to_ws_task.rs:52 - recv_msg() returns Result<std::result::Result<(), LifecycleResult>> but line 77 returns Ok(Ok(tunnel_msg)) where tunnel_msg is not ().

ws_to_tunnel_task.rs:79 - Similar issue - returns Result<std::result::Result<Option<()>, LifecycleResult>> but line 106 returns Ok(Some(data)) where data is Bytes.

8. Inconsistent Protocol Function Naming

lib.rs:14-16 uses is_mk2() but conn.rs:130 uses is_new():

// lib.rs
pub fn is_mk2(protocol_version: u16) -> bool { ... }

// conn.rs 
if protocol::is_new(protocol_version) { // is_new doesn't exist

Missing Return Statements

ws_to_tunnel_task.rs:758-759 - handle_message_mk1 doesn't return anything despite being declared as returning Result<()>

Protocol Design Concerns

9. Missing ToClientAckEvents and ToClientKvResponse in mk2

The v4 schema includes these in ToClient, but handle_message_mk2 in tunnel_to_ws_task.rs only handles ToClientCommands and ToClientTunnelMessage. The mk2 protocol may need handlers for ack events and KV responses.

Recommendations

  1. Fix compilation errors - The extracted helper functions need their required context passed as parameters
  2. Complete ack_commands() implementation or remove the call
  3. Add the Init signal definition or import
  4. Fix last_seen updates in the demuxer to prevent premature garbage collection
  5. Send initial event when creating a new channel in the demuxer
  6. Fix return types to match actual return values
  7. Consider adding tests - This is a significant refactor with new protocol version; tests would help verify behavior

Overall Assessment

This PR introduces important architectural changes but appears to be in an incomplete state. The code has significant compilation issues that would prevent it from building. I recommend addressing the critical issues before merging.


🤖 Generated with Claude Code
