Skip to content

Invoker memory management: Prevent memory ballooning during node restart/leader failover #4311

@tillrohrmann

Description

@tillrohrmann

Problem Statement

When a Restate node restarts (or gains new partition leaders during failover), all currently invoked invocations on those partitions will be re-invoked. This can cause significant memory spikes ("memory ballooning") due to several factors:

  1. Journal stream materialization - Journals are fully materialized in memory when reading from storage, causing memory spikes proportional to journal sizes
  2. No global memory budget for the invoker - The invoker has no mechanism to limit overall memory usage; it only has a concurrency limit which doesn't account for actual memory consumption
  3. HTTP/2 connection buffer memory - The invoker does not respect the per stream window size and blindly pushes records into a bounded channel
  4. No service endpoint backpressure - Service endpoints cannot report their available capacity back to Restate, potentially causing both Restate and the endpoint to become overwhelmed

Current State

The invoker currently has these memory-related controls:

  • concurrent_invocations_limit (default: 1000) - Limits concurrent invocations but doesn't account for memory
  • in_memory_queue_length_limit (default: 66,049) - Per-partition limit before spilling to disk
  • message_size_limit - Limits individual journal message sizes
  • Token bucket throttling - Rate limiting for invocations/actions (optional)

However, these controls are insufficient:

  • Concurrency limits don't correlate with memory usage (some invocations use MBs, others use KBs)
  • Per-partition queue limits can multiply across many partitions (#partitions × limit × size ≈ potentially GBs)
  • HTTP/2 adaptive window is enabled, but there's no per-connection memory cap

Proposed Solution

This umbrella issue tracks the work needed to implement proper memory management for the invoker. The goal is to allow configuring a memory budget that the invoker respects when scheduling invocations.

Sub-issues / Tasks

1. Avoid journal stream materialization (#275)

Keep the storage transaction open inside InvocationTask while reading the journal stream, avoiding full materialization in memory.

Related: #275

2. Implement memory-aware invocation scheduling

3. HTTP/2 connection memory configuration

  • Configure proper connection and stream window sizes for the HttpClient if possible
  • If not possible, consider alternative approaches (e.g., limiting concurrent streams per connection more aggressively)
  • Let InvocationTask respect the limits

4. Service endpoint capacity reporting (protocol extension)

  • Extend the service protocol to allow endpoints to report their available capacity
  • Restate can use this information to pace invocations to endpoints
  • This helps prevent overwhelming service endpoints during bursts

Success Criteria

  • A restarting node with 100k+ pending invocations with significant state and journals should not OOM
  • Memory usage during restart should be bounded and configurable
  • Service endpoints should not be overwhelmed during restart scenarios
  • Observable metrics for memory usage and throttling behavior

Related Issues

Sub-issues

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions