Skip to content

Conversation

@MasterPtato
Copy link
Contributor

No description provided.

Copy link
Contributor Author

MasterPtato commented Jan 13, 2026

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.
Learn more


How to use the Graphite Merge Queue

Add the label merge-queue to this PR to add it to the merge queue.

You must have a Graphite account in order to use the merge queue. Sign up using this link.

An organization admin has enabled the Graphite Merge Queue in this repository.

Please do not merge from GitHub as this will restart CI on PRs being processed by the merge queue.

This stack of pull requests is managed by Graphite. Learn more about stacking.

@pkg-pr-new
Copy link

pkg-pr-new bot commented Jan 13, 2026

More templates

@rivetkit/cloudflare-workers

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/cloudflare-workers@3862

@rivetkit/db

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/db@3862

@rivetkit/framework-base

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/framework-base@3862

@rivetkit/next-js

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/next-js@3862

@rivetkit/react

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/react@3862

rivetkit

pnpm add https://pkg.pr.new/rivet-dev/rivet/rivetkit@3862

@rivetkit/sql-loader

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/sql-loader@3862

@rivetkit/virtual-websocket

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/virtual-websocket@3862

@rivetkit/engine-runner

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/engine-runner@3862

@rivetkit/engine-runner-protocol

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/engine-runner-protocol@3862

commit: 920b1d9

@claude
Copy link

claude bot commented Jan 13, 2026

PR Review: fix(guard): clean up guard

This PR removes the request context tracking system and simplifies guard architecture by removing per-actor tracking. Overall, the changes improve code maintainability by removing unused functionality. Here are my findings:

Strengths

  1. Significant code reduction - Removes 184 lines from request_context.rs and 199 net lines from proxy_service.rs, improving maintainability
  2. Cleaner API surface - The CustomServeTrait now uses explicit ray_id/req_id parameters instead of a stateful RequestContext
  3. Simplified rate limiting - Changed from per-actor rate limiting to per-IP, which is more straightforward
  4. Removed unused ClickHouse dependency - Eliminates the analytics insertion code that was not being used

Code Quality Issues

  1. Missing error context in StandaloneCtx::with_ray (gasoline/src/ctx/standalone.rs:70)

    • The new with_ray method calls StandaloneCtx::new(...)? but does not add context to the error
    • Should use .context for better debugging
  2. Inconsistent metric labels (guard/src/routing/mod.rs)

    • Line 130: Uses uppercase GATEWAY
    • Line 147: Uses uppercase API
    • Line 75, 85, 114: Uses lowercase gateway, runner
    • Recommendation: Standardize all metric labels to lowercase for consistency
  3. Hardcoded default values in middleware (guard/src/middleware.rs:20-34)

    • The middleware function returns hardcoded values (10000 requests, 2000 concurrent, etc.)
    • The comment says "In a real implementation, you would look up actor-specific middleware settings"
    • Question: Is this intentional for this PR or should this be configurable?
  4. Rate limiting logic change may affect production (proxy_service.rs:510-548)

    • Changed from (actor_id, ip_addr) cache key to just ip_addr
    • This means rate limits are now only per-IP, not per-actor-per-IP
    • Security concern: This could allow clients to hit the same actor from different IPs to bypass limits, or block legitimate traffic to different actors from the same IP
    • The default values also changed significantly: Old actor-specific: 100 req/60s, 20 concurrent. New global: 10000 req/60s, 2000 concurrent
    • Recommendation: Verify this is the intended behavior change and document it in the PR description
  5. Removed actor_id from error messages (errors.rs:6-14)

    • The RateLimit error no longer includes actor_id in the message
    • This makes debugging rate limit issues harder
    • Recommendation: Consider keeping the actor info in logs even if not in the error struct

Potential Bugs

  1. Routing function signature mismatch (guard/src/routing/mod.rs:40)

    • The closure calls ctx.with_ray(ray_id, req_id).unwrap()
    • Using .unwrap() in production routing code is dangerous - it will panic on failure
    • Fix: Change to proper error handling
  2. No WebSocket close frame change verification (proxy_service.rs:2405)

    • Changed from Option to Id for ray_id parameter
    • The function err_to_close_frame now always includes ray_id in the close reason
    • Verify this does not break WebSocket clients that expect the old format

Performance Considerations

  1. Cache key simplification - Changing from tuple keys (Id, IpAddr) to just IpAddr should improve cache performance
  2. Removed analytics overhead - No more spawning tasks to insert ClickHouse events on every request is a good performance win

Security Concerns

  1. Rate limiting scope change - Per-IP instead of per-actor-per-IP could have security implications
  2. Error information leakage - Verify that removing actor_id from errors does not expose information about what actors exist

Test Coverage

  • Missing: No new tests added for the with_ray method
  • Missing: No tests verifying the new rate limiting behavior
  • Recommendation: Add tests to verify rate limits work correctly with just IP addresses, the with_ray method preserves from_workflow flag, and WebSocket close frames include ray_id correctly

Documentation

  • Missing: PR description does not explain why the request context tracking was removed or what the implications are
  • Missing: No migration notes for any breaking API changes
  • Recommendation: Add a brief explanation of the architectural change and its benefits

Style and Conventions

  1. Per CLAUDE.md, comments should be complete sentences - most comments follow the guidelines
  2. No glob imports detected - good adherence to CLAUDE.md guidelines

Summary

This is a solid cleanup PR that removes unused functionality. The main concerns are:

  1. Critical: Fix the .unwrap() in the routing function (routing/mod.rs:40)
  2. Important: Verify the rate limiting behavior change is intentional and document it
  3. Nice to have: Standardize metric labels, add error context, improve test coverage

The architectural simplification is good, but make sure the rate limiting changes align with product requirements before merging.

@MasterPtato MasterPtato force-pushed the 01-12-fix_guard_clean_up_guard branch from c079c96 to 9e8cf3d Compare January 14, 2026 02:05
@MasterPtato MasterPtato force-pushed the 01-12-fix_add_ready_chunks_to_worker_bumps_bumps_per_tick_metric branch from 3df8f27 to a2b521a Compare January 14, 2026 02:05
@MasterPtato MasterPtato force-pushed the 01-12-fix_guard_clean_up_guard branch from 9e8cf3d to 3008b28 Compare January 14, 2026 02:53
@MasterPtato MasterPtato force-pushed the 01-12-fix_add_ready_chunks_to_worker_bumps_bumps_per_tick_metric branch from a2b521a to 061a672 Compare January 14, 2026 02:53
Copy link
Contributor Author

@NathanFlurry make sure to double check that re-enabling rate limiters and route cache here (now that its not per actor id but just IP) is fine

@MasterPtato MasterPtato force-pushed the 01-12-fix_guard_clean_up_guard branch from 3008b28 to 70900c0 Compare January 14, 2026 03:06
@MasterPtato MasterPtato force-pushed the 01-12-fix_add_ready_chunks_to_worker_bumps_bumps_per_tick_metric branch from 061a672 to 535af14 Compare January 14, 2026 03:06
Copy link
Member

@NathanFlurry NathanFlurry left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

crank limit to nutso levels

@MasterPtato MasterPtato force-pushed the 01-12-fix_guard_clean_up_guard branch from 70900c0 to 5f50433 Compare January 14, 2026 19:47
@MasterPtato MasterPtato force-pushed the 01-12-fix_add_ready_chunks_to_worker_bumps_bumps_per_tick_metric branch from 535af14 to 24b4824 Compare January 14, 2026 19:47
@MasterPtato MasterPtato force-pushed the 01-12-fix_guard_clean_up_guard branch from 5f50433 to d4e5e20 Compare January 14, 2026 20:01
@MasterPtato MasterPtato force-pushed the 01-12-fix_add_ready_chunks_to_worker_bumps_bumps_per_tick_metric branch from 24b4824 to 45cdaf0 Compare January 14, 2026 20:01
Copy link
Member

@NathanFlurry NathanFlurry left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rate limit looks unchanged??

@MasterPtato MasterPtato force-pushed the 01-12-fix_guard_clean_up_guard branch from d4e5e20 to f831199 Compare January 14, 2026 22:45
Copy link
Contributor Author

I'm not sure where the changes went, should be there now

@MasterPtato MasterPtato force-pushed the 01-12-fix_guard_clean_up_guard branch from f831199 to c90f0b8 Compare January 14, 2026 22:47
@claude
Copy link

claude bot commented Jan 14, 2026

Pull Request Review: fix(guard): clean up guard

Summary

This PR significantly simplifies the guard proxy service by removing the request analytics/context tracking system and refactoring the middleware and routing layers to pass ray_id and req_id directly instead of maintaining per-actor state. The changes result in a net reduction of 532 lines of code.


✅ Strengths

  1. Excellent Code Reduction: Removing 532 lines while adding only 166 shows effective simplification
  2. Cleaner Interface: Passing ray_id/req_id explicitly to CustomServeTrait methods is clearer than implicit context
  3. Simplified Rate Limiting: Removing per-actor rate limiting in favor of IP-based limiting reduces complexity
  4. Good Use of New Context Method: The StandaloneCtx::with_ray() method (gasoline/src/ctx/standalone.rs:70-84) is a clean addition

🔴 Critical Issues

1. Removed Analytics Without Migration Path (proxy_service.rs)

The entire RequestContext and ClickHouse analytics system was removed:

  • Deleted request_context.rs (184 lines)
  • Removed clickhouse_inserter parameter from ProxyState
  • No migration notes or feature flag

Impact: Complete loss of request analytics data. If this data was used for monitoring, debugging, or billing, this is a breaking change.

Recommendation:

  • Document why analytics were removed in the PR description
  • If analytics are still needed, consider re-implementing them in a simpler form
  • Add a TODO or issue reference if this will be replaced

@claude
Copy link

claude bot commented Jan 14, 2026

🔴 Critical Issues (continued)

2. Middleware Function Signature Breaking Change (proxy_service.rs:210-217)

The middleware function signature changed, removing the actor_id parameter.

Impact: This breaks all existing middleware implementations. While guard/src/middleware.rs was updated, any external implementations will break.

Recommendation: Document this breaking change in commit message or PR description

3. Rate Limiting Behavior Change (proxy_service.rs:510-548)

Previously, rate limiting was per-actor + IP. Now it's only per-IP.

Impact:

  • Users hitting multiple actors from the same IP now share a rate limit pool
  • Could allow one user to DOS multiple actors
  • Removes ability to have per-actor rate limit policies

Recommendation: Reconsider if this is the desired behavior. If per-actor limits are needed, you may need to keep actor_id in the cache key.

4. Default Middleware Config Change (proxy_service.rs:455-488)

The fallback middleware config changed significantly:

  • Rate limit: 100 → 10000 requests per 60s (100x increase)
  • Max in-flight: 20 → 2000 (100x increase)
  • Max retry attempts: 3 → 7
  • Initial retry interval: 100ms → 150ms

Impact: This is a 100x increase in rate limits and concurrent requests. While more permissive, it may expose services to higher load than expected and could mask performance issues during development.

Recommendation: Document this change and ensure it aligns with intended behavior.

@claude
Copy link

claude bot commented Jan 14, 2026

⚠️ Important Issues

5. Error Message Breaking Change (errors.rs:6-15)

Removed actor_id from RateLimit error message.

Impact: Reduces debuggability. When investigating rate limit issues, knowing which actor was targeted is valuable.

Recommendation: Consider keeping actor_id in the error if it's available from context.

6. Inconsistent Metric Labels (routing/mod.rs:75, 130, 147)

Different cases used for the same metric:

  • Line 75: "gateway" (lowercase)
  • Line 130: "GATEWAY" (uppercase)
  • Line 147: "API" (uppercase)
  • Line 138: "api" (lowercase)

Impact: Metrics will be split across different labels, making monitoring harder.

Recommendation: Standardize to lowercase (e.g., "gateway", "api", "runner").

7. Missing Error Handling in Routing (routing/mod.rs:40)

Uses .unwrap() when creating context with ray IDs.

Impact: Using .unwrap() in production code can cause panics. If with_ray() can fail, this should be handled gracefully.

Recommendation: Use ? operator and return a proper error.

8. Potential Denial of Service (proxy_service.rs:345)

The in-flight request cache has a very high limit of 10,000,000 entries.

Impact: 10 million entries could consume significant memory. Each entry stores at minimum a request ID (likely 16-32 bytes) plus cache overhead.

Recommendation: Consider if this limit is realistic or if it should be lower with proper backpressure.

@MasterPtato MasterPtato force-pushed the 01-12-fix_guard_clean_up_guard branch from c90f0b8 to 9d72300 Compare January 14, 2026 22:52
@claude
Copy link

claude bot commented Jan 14, 2026

💡 Suggestions & 🧪 Testing Concerns

9. Documentation Needed for with_ray (gasoline/src/ctx/standalone.rs:70)

The new with_ray() method lacks doc comments explaining:

  • When to use it vs new()
  • Why from_workflow is preserved
  • What scenarios might cause it to fail

10. Comment Style Improvements (custom_serve.rs:38-39)

Per CLAUDE.md, write comments as complete sentences. Avoid fragmented structures.

🧪 Testing Concerns

No test coverage found for guard-core changes. Given the significant behavioral changes to:

  • Rate limiting (per-actor → per-IP)
  • Middleware configuration (100x limit increase)
  • Retry behavior (3 → 7 attempts)

Recommendation: Add integration tests covering:

  • Rate limiting behavior with multiple IPs
  • Middleware config application
  • Retry logic with the new defaults
  • Error handling for with_ray() failures

📋 Style & Conventions

Generally follows CLAUDE.md guidelines:

  • ✅ Uses workspace dependencies correctly
  • ✅ Structured logging with tracing
  • ✅ Proper error handling with RivetError derive
  • ⚠️ Comment style could be improved in a few places
  • ⚠️ Some unwraps should use ? operator

@MasterPtato MasterPtato force-pushed the 01-12-fix_add_ready_chunks_to_worker_bumps_bumps_per_tick_metric branch from 45cdaf0 to 445bb05 Compare January 14, 2026 22:52
@claude
Copy link

claude bot commented Jan 14, 2026

Pull Request Review: fix(guard): clean up guard

Summary

This PR performs a significant cleanup of the guard system by removing the RequestContext analytics tracking infrastructure and simplifying the middleware and routing systems. The changes remove actor-specific rate limiting in favor of IP-based rate limiting and pass ray_id/req_id directly through the call chain.

Positive Changes

Simplified Architecture: Removing the RequestContext reduces complexity and removes unused analytics infrastructure

Cleaner API Surface: The CustomServeTrait now has a cleaner interface with explicit ray_id and req_id parameters

Consistent Middleware: Middleware configuration is now uniform and not actor-specific

Better Default Limits: Increased default rate limits (100→10000 requests/min) and in-flight limits (20→2000) are more reasonable

Issues & Concerns

🔴 Critical Issues

1. Removed Actor-Level Rate Limiting (proxy_service.rs:513-542)

  • Previously: Rate limiting was per-actor AND per-IP: (actor_id, ip_addr)
  • Now: Only per-IP: ip_addr
  • Impact: Multiple actors from the same IP (e.g., behind NAT) now share rate limits
  • Security Risk: One misbehaving actor can rate-limit all actors from the same IP
  • Recommendation: If actor-level isolation is no longer needed, this should be explicitly documented in the PR description

2. Removed ClickHouse Analytics (proxy_service.rs:217, 327)

  • Completely removed clickhouse_inserter and all analytics event tracking
  • Impact: Loss of observability - no more tracking of request/response sizes, timings, status codes, etc.
  • Questions: Is this analytics data used elsewhere in the system? Are there alternative monitoring solutions in place? Should this removal be in a separate PR with migration plan?

3. Middleware Function Signature Change (proxy_service.rs:212-219)

  • Removed actor_id parameter from MiddlewareFn
  • Impact: Middleware can no longer make actor-specific decisions
  • Breaking Change: Any custom middleware implementations will break

⚠️ Medium Priority Issues

4. Error Message Regression (errors.rs:6-9)

  • Removed actor_id field from RateLimit error
  • Impact: Reduced debuggability - cannot tell which actor is being rate-limited from error message alone

5. Middleware Default Configuration Duplication (middleware.rs:20-35, proxy_service.rs:469-480)

  • The same default middleware config appears in two places
  • Issue: Violates DRY principle, prone to drift

6. RouteTarget Simplification (proxy_service.rs:112)

  • Removed actor_id field from RouteTarget
  • Impact: Downstream code can no longer correlate requests to actors

🟡 Minor Issues

7. Inconsistent Metric Labels (routing/mod.rs:130 vs 147)

  • Line 130: .with_label_values(&["GATEWAY"]) (uppercase)
  • Line 147: .with_label_values(&["API"]) (uppercase)
  • Line 76: .with_label_values(&["gateway"]) (lowercase)
  • Issue: Inconsistent casing will create separate metric series

8. Missing Error Logging Context (proxy_service.rs:458)

  • When get_middleware_config returns NotFound, it uses hardcoded defaults silently

9. Comment Quality (gasoline/ctx/standalone.rs:69)

  • New function with_ray lacks documentation comment

10. Potential Panic (routing/mod.rs:40)

  • ctx.with_ray(ray_id, req_id).unwrap() will panic on error
  • Fix: Replace with ? operator and handle error properly

Code Quality Observations

Good Practices ✅

  • Proper use of tracing instrumentation
  • Consistent error handling with anyhow::Result
  • Clean removal of dead code paths
  • Maintained proper async/await patterns

Testing Recommendations

Please verify:

  1. Rate limiting works correctly at IP level
  2. Multiple actors from same IP do not interfere with each other unexpectedly
  3. Middleware defaults are applied correctly
  4. All websocket retry logic still works without RequestContext
  5. Ray ID and request ID are properly propagated through all code paths

Questions for Author

  1. Analytics Removal: What is the plan for the removed analytics? Is there a replacement system or is this data no longer needed?
  2. Actor Rate Limiting: Was per-actor rate limiting intentionally removed? If so, what is the new strategy for preventing one actor from impacting others?
  3. Breaking Changes: Are there external consumers of the guard library that need to be updated?
  4. Migration Path: Should this be split into multiple PRs (analytics removal, API changes, rate limiting changes)?

Architectural Concerns

The PR mixes several types of changes:

  • Refactoring: Removing RequestContext
  • Feature Changes: Removing per-actor rate limiting
  • Breaking API Changes: CustomServeTrait signature changes
  • Infrastructure Changes: Removing ClickHouse integration

Recommendation: Consider splitting into smaller, focused PRs for easier review and safer rollback if issues arise.

Verdict

While the code quality is generally good and the cleanup is worthwhile, this PR makes significant behavioral changes that need careful consideration:

⚠️ Request Changes before merging:

  1. Fix the uppercase/lowercase metric label inconsistency
  2. Fix the unwrap() that could panic in routing function
  3. Add documentation explaining the analytics and rate limiting changes

💭 Recommend Discussion on:

  1. Impact of removing per-actor rate limiting
  2. Analytics data loss and alternatives
  3. Whether this should be split into multiple PRs

The cleanup itself is valuable, but the scope and impact need to be clearly understood by the team before merging.

@MasterPtato MasterPtato force-pushed the 01-12-fix_add_ready_chunks_to_worker_bumps_bumps_per_tick_metric branch from 445bb05 to a084bd6 Compare January 14, 2026 23:02
@MasterPtato MasterPtato force-pushed the 01-12-fix_guard_clean_up_guard branch from 9d72300 to 875a7a4 Compare January 14, 2026 23:02
@MasterPtato MasterPtato force-pushed the 01-12-fix_add_ready_chunks_to_worker_bumps_bumps_per_tick_metric branch from a084bd6 to fab1f2d Compare January 14, 2026 23:07
@MasterPtato MasterPtato force-pushed the 01-12-fix_guard_clean_up_guard branch from 875a7a4 to 920b1d9 Compare January 14, 2026 23:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants