Problem Statement
When a Restate node restarts (or gains new partition leaders during failover), all in-flight invocations on those partitions are re-invoked. This can cause significant memory spikes ("memory ballooning") due to several factors:
- Journal stream materialization - Journals are fully materialized in memory when reading from storage, causing memory spikes proportional to journal sizes
- No global memory budget for the invoker - The invoker has no mechanism to limit overall memory usage; it only has a concurrency limit which doesn't account for actual memory consumption
- HTTP/2 connection buffer memory - The invoker does not respect the per-stream window size and blindly pushes records into a bounded channel
- No service endpoint backpressure - Service endpoints cannot report their available capacity back to Restate, potentially causing both Restate and the endpoint to become overwhelmed
Current State
The invoker currently has these memory-related controls:
- concurrent_invocations_limit (default: 1000) - Limits concurrent invocations but doesn't account for memory
- in_memory_queue_length_limit (default: 66,049) - Per-partition limit before spilling to disk
- message_size_limit - Limits individual journal message sizes
- Token bucket throttling - Rate limiting for invocations/actions (optional)
However, these controls are insufficient:
- Concurrency limits don't correlate with memory usage (some invocations use MBs, others use KBs)
- Per-partition queue limits can multiply across many partitions (#partitions × limit × size ≈ potentially GBs)
- HTTP/2 adaptive window is enabled, but there's no per-connection memory cap
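To make the queue multiplication above concrete, here is a minimal sketch of the worst-case arithmetic. The function names are hypothetical, and the partition count (24) and average queued entry size (1 KiB) in the usage note are assumptions chosen for illustration; only the 66,049 default comes from this issue.

```rust
/// Worst-case bytes held by the in-memory invoker queues across all partitions:
/// partitions x per-partition queue limit x average entry size.
/// Returns None if the product overflows u64.
fn worst_case_queue_bytes(
    partitions: u64,
    queue_length_limit: u64,
    avg_entry_bytes: u64,
) -> Option<u64> {
    partitions
        .checked_mul(queue_length_limit)?
        .checked_mul(avg_entry_bytes)
}

/// Converts a byte count to GiB for reporting.
fn bytes_to_gib(bytes: u64) -> f64 {
    bytes as f64 / (1024.0 * 1024.0 * 1024.0)
}
```

With the assumed 24 partitions and 1 KiB per entry, `worst_case_queue_bytes(24, 66_049, 1024)` yields 1,623,220,224 bytes, roughly 1.5 GiB, before any journal or HTTP buffer memory is counted.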
Proposed Solution
This umbrella issue tracks the work needed to implement proper memory management for the invoker. The goal is to allow configuring a memory budget that the invoker respects when scheduling invocations.
Sub-issues / Tasks
1. Avoid journal stream materialization (#275)
Keep the storage transaction open inside InvocationTask while reading the journal stream, avoiding full materialization in memory.
Related: #275
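The difference between materializing and streaming a journal can be sketched with plain iterators. This is an illustration only: it models journal entries as `Vec<u8>` and does not use Restate's storage API. Both function names are hypothetical.

```rust
/// Peak bytes resident at once when the whole journal is collected first.
fn peak_bytes_materialized<I>(journal: I) -> usize
where
    I: Iterator<Item = Vec<u8>>,
{
    // Every entry is held in memory at the same time.
    let entries: Vec<Vec<u8>> = journal.collect();
    entries.iter().map(|entry| entry.len()).sum()
}

/// Peak bytes resident at once when entries are forwarded one by one.
fn peak_bytes_streamed<I, F>(journal: I, mut forward: F) -> usize
where
    I: Iterator<Item = Vec<u8>>,
    F: FnMut(&[u8]),
{
    let mut peak = 0;
    for entry in journal {
        peak = peak.max(entry.len());
        forward(&entry);
        // `entry` is dropped here, before the next one is read.
    }
    peak
}
```

In this model the materialized peak grows with the total journal size, while the streamed peak is bounded by the largest single entry, which message_size_limit already caps.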
2. Implement memory-aware invocation scheduling
- Track estimated memory usage per invocation task
- Implement a memory budget/quota that controls when new invocations can be started
- Consider memory consumption from: journal data, HTTP buffers, response data
- Integrate with a potential global node memory limit (#2402: Make InvokerOptions.in_memory_queue_length_limit respect global memory limit of a node)
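One possible shape for such a budget is a shared counter with RAII reservations, sketched below. `MemoryBudget` and `Reservation` are hypothetical names, not existing Restate types, and the sketch assumes a per-invocation size estimate is available before the invocation starts.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

/// Hypothetical budget shared by all invocation tasks on a node.
#[derive(Clone)]
struct MemoryBudget {
    capacity: usize,
    used: Arc<AtomicUsize>,
}

/// RAII guard: hands its bytes back to the budget when dropped.
struct Reservation {
    bytes: usize,
    used: Arc<AtomicUsize>,
}

impl MemoryBudget {
    fn new(capacity: usize) -> Self {
        Self { capacity, used: Arc::new(AtomicUsize::new(0)) }
    }

    /// Tries to reserve `bytes`; returns None if that would exceed the capacity.
    fn try_reserve(&self, bytes: usize) -> Option<Reservation> {
        let mut current = self.used.load(Ordering::Acquire);
        loop {
            let next = current.checked_add(bytes)?;
            if next > self.capacity {
                return None;
            }
            // Retry if another task changed the counter in the meantime.
            match self.used.compare_exchange_weak(
                current,
                next,
                Ordering::AcqRel,
                Ordering::Acquire,
            ) {
                Ok(_) => {
                    return Some(Reservation { bytes, used: Arc::clone(&self.used) })
                }
                Err(observed) => current = observed,
            }
        }
    }

    /// Bytes currently reserved.
    fn used_bytes(&self) -> usize {
        self.used.load(Ordering::Acquire)
    }
}

impl Drop for Reservation {
    fn drop(&mut self) {
        self.used.fetch_sub(self.bytes, Ordering::AcqRel);
    }
}
```

A scheduler built on this would start an invocation only when `try_reserve` succeeds and keep the `Reservation` alive for the lifetime of the task, so memory is returned automatically when the task completes or fails.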
3. HTTP/2 connection memory configuration
- Configure proper connection and stream window sizes for the HttpClient if possible
- If not possible, consider alternative approaches (e.g., limiting concurrent streams per connection more aggressively)
- Let InvocationTask respect the limits
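What "respecting the limits" could mean for a sender is sketched below as a toy model of HTTP/2 flow-control credit. It does not use any real HTTP client API; `SendWindow` and `send_within_window` are hypothetical, and the `Vec<u8>` stands in for the wire.

```rust
/// Hypothetical per-stream send window, mirroring HTTP/2 flow-control credit.
struct SendWindow {
    available: usize,
}

impl SendWindow {
    fn new(initial: usize) -> Self {
        Self { available: initial }
    }

    /// Consumes credit for up to `wanted` bytes; returns how many may be sent now.
    fn consume(&mut self, wanted: usize) -> usize {
        let granted = wanted.min(self.available);
        self.available -= granted;
        granted
    }

    /// Adds credit, as a WINDOW_UPDATE frame from the peer would.
    fn grant(&mut self, bytes: usize) {
        self.available = self.available.saturating_add(bytes);
    }
}

/// Writes as much of `record` as the window allows and returns the unsent rest.
fn send_within_window<'a>(
    window: &mut SendWindow,
    record: &'a [u8],
    wire: &mut Vec<u8>,
) -> &'a [u8] {
    let n = window.consume(record.len());
    wire.extend_from_slice(&record[..n]);
    &record[n..]
}
```

For example, with the HTTP/2 default initial window of 65,535 bytes, a 100,000-byte record is cut off after 65,535 bytes, and the remaining 34,465 bytes wait until the peer grants more credit instead of piling up in a buffer.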
4. Service endpoint capacity reporting (protocol extension)
- Extend the service protocol to allow endpoints to report their available capacity
- Restate can use this information to pace invocations to endpoints
- This helps prevent overwhelming service endpoints during bursts
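Since the protocol extension does not exist yet, the following is only a sketch of how Restate might use reported capacity to pace invocations. `EndpointPacer` and its methods are hypothetical, capacity is assumed to be a count of concurrent invocations, and endpoints that never reported are left unpaced by assumption.

```rust
use std::collections::HashMap;

/// Hypothetical pacer: tracks in-flight invocations per endpoint against the
/// capacity that endpoint last reported.
#[derive(Default)]
struct EndpointPacer {
    reported_capacity: HashMap<String, usize>,
    in_flight: HashMap<String, usize>,
}

impl EndpointPacer {
    /// Records the capacity an endpoint reported (hypothetical protocol message).
    fn report_capacity(&mut self, endpoint: &str, capacity: usize) {
        self.reported_capacity.insert(endpoint.to_owned(), capacity);
    }

    /// Returns true and counts the invocation if the endpoint has spare capacity.
    /// Endpoints that never reported are not paced (assumption).
    fn try_start(&mut self, endpoint: &str) -> bool {
        let running = self.in_flight.entry(endpoint.to_owned()).or_insert(0);
        match self.reported_capacity.get(endpoint) {
            Some(&capacity) if *running >= capacity => false,
            _ => {
                *running += 1;
                true
            }
        }
    }

    /// Marks one invocation on the endpoint as finished.
    fn finish(&mut self, endpoint: &str) {
        if let Some(running) = self.in_flight.get_mut(endpoint) {
            *running = running.saturating_sub(1);
        }
    }
}
```

Invocations rejected by `try_start` would stay queued and be retried once `finish` frees a slot or the endpoint reports a higher capacity.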
Success Criteria
- A restarting node with 100k+ pending invocations with significant state and journals should not OOM
- Memory usage during restart should be bounded and configurable
- Service endpoints should not be overwhelmed during restart scenarios
- Observable metrics for memory usage and throttling behavior
Related Issues
- #275 - Avoid journal stream materialization by keeping storage transaction open
- #2402 - Make InvokerOptions.in_memory_queue_length_limit respect global memory limit of a node