
Issue #4425: Use a standard 'killed' error with cause when onMaxAttempts: 'kill' exhausts retries #4425 #4560

Open
neerajchowdary889 wants to merge 11 commits into restatedev:main from neerajchowdary889:issue-#4425


@neerajchowdary889

Use a standard 'killed' error with cause when onMaxAttempts: 'kill' exhausts retries #4425

issue: #4425

Problem

When an invocation is killed — either via the Admin API or after exhausting all retries
with onMaxAttempts: 'kill' — no journal event was written to sys_journal_events.

This caused two problems:

  1. Admin API kill: no event at all, so the UI had no way to show the invocation was
    deliberately killed.
  2. Kill-after-max-retries: the final error stored was the raw service error (e.g.
    HTTP 500). There was no way to distinguish a genuine failure from a kill.

Closes #4425

Solution

Add a Killed journal event type written in both kill paths.

  • Admin API kill (kill_invoked_invocation, kill_suspended_or_paused_invocation):
    writes KilledEvent { last_failure: None } after the invocation is ended.
  • Kill-after-max-retries (new KilledAfterMaxAttempts invoker effect): the invoker
    packages the last transient error into a KilledEvent { last_failure: Some(...) } and
    sends it as a new effect kind. The partition processor writes it after end_invocation.
    The final invocation result is now always KILLED_INVOCATION_ERROR (code ABORTED,
    message "killed") instead of the raw service error.

The event is always written after end_invocation returns, so it survives the
do_drop_journal cleanup that runs inside end_invocation.
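
The ordering argument above can be illustrated with a minimal sketch. All types and method names here are hypothetical stand-ins rather than Restate's real ones, and the sketch assumes that the cleanup inside end_invocation removes every journal event recorded so far.

```rust
// Hypothetical, simplified model of the kill path. None of these types are
// Restate's actual definitions; they only illustrate the write ordering.

#[derive(Debug, Clone, PartialEq)]
pub struct LastFailure {
    pub error_code: u16,
    pub error_message: String,
}

#[derive(Debug, Clone, PartialEq)]
pub enum JournalEvent {
    /// Placeholder for any event recorded before the kill.
    TransientError(LastFailure),
    /// `last_failure` is `None` for an Admin API kill and `Some(..)` for
    /// kill-after-max-retries.
    Killed { last_failure: Option<LastFailure> },
}

#[derive(Debug, Default)]
pub struct Invocation {
    pub events: Vec<JournalEvent>,
    pub ended: bool,
}

impl Invocation {
    /// Stand-in for end_invocation. Assumption: its cleanup step drops all
    /// journal events written up to this point.
    fn end_invocation(&mut self) {
        self.ended = true;
        self.events.clear();
    }

    /// Ends the invocation first, then records the Killed event, so the
    /// event is not removed by the cleanup.
    pub fn kill(&mut self, last_failure: Option<LastFailure>) {
        self.end_invocation();
        self.events.push(JournalEvent::Killed { last_failure });
    }
}
```

In this model, writing the Killed event before calling end_invocation would leave the journal empty, which is the situation the PR description says the ordering avoids.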

Changes

| File | Change |
| --- | --- |
| crates/types/protobuf/restate/journal_events.proto | Add KilledEvent proto message |
| crates/types/src/journal_events/mod.rs | Add Killed = 3 to EventType, Killed(KilledEvent) to Event, KilledEvent struct |
| crates/types/src/journal_events/raw.rs | Encode/decode and proto conversions for KilledEvent |
| crates/invoker-api/src/effects.rs | Add KilledAfterMaxAttempts { killed_event: RawEvent } to EffectKind |
| crates/invoker-impl/src/lib.rs | Send KilledAfterMaxAttempts instead of Failed on OnTaskError::Kill |
| crates/worker/src/partition/state_machine/mod.rs | Handle KilledAfterMaxAttempts effect; write KilledEvent in both kill functions |
| crates/storage-query-datafusion/src/journal_events/schema.rs | Document Killed as a valid event_type value |
| crates/worker/.../tests/kill_cancel.rs | Two new tests covering both kill paths |
| release-notes/unreleased/4425-killed-journal-event.md | Release note |

Querying killed invocations

SELECT id, event_json
FROM sys_journal_events
WHERE event_type = 'Killed'

The event_json column contains:

{ "ty": "Killed", "last_failure": { "error_code": 500, "error_message": "..." } }

last_failure is absent when the invocation was killed via the Admin API.
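
A consumer of this payload could tell the two kill paths apart by checking whether last_failure is present. The sketch below assumes the JSON has already been decoded into an Option; FailureInfo, KillOrigin, and kill_origin are illustrative names, not part of Restate.

```rust
/// Hypothetical mirror of the `last_failure` object in `event_json`.
#[derive(Debug, Clone, PartialEq)]
pub struct FailureInfo {
    pub error_code: u16,
    pub error_message: String,
}

/// Which kill path produced the Killed event.
#[derive(Debug, Clone, PartialEq)]
pub enum KillOrigin {
    /// `last_failure` absent: killed via the Admin API.
    AdminApi,
    /// `last_failure` present: killed after exhausting retries.
    MaxAttemptsExhausted(FailureInfo),
}

/// Classifies a Killed event by the presence of its last failure.
pub fn kill_origin(last_failure: Option<FailureInfo>) -> KillOrigin {
    match last_failure {
        None => KillOrigin::AdminApi,
        Some(failure) => KillOrigin::MaxAttemptsExhausted(failure),
    }
}
```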

Breaking change

For onMaxAttempts: 'kill' invocations, the final stored error changes from the raw
service error to KILLED_INVOCATION_ERROR (code ABORTED, message "killed"). Code
that inspects the error payload to detect kills should migrate to querying
sys_journal_events for event_type = 'Killed'.
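
As a rough illustration of the migration, client code that already holds rows fetched from sys_journal_events could detect kills from the event type alone instead of from the error payload. JournalEventRow and killed_invocation_ids are hypothetical names, and the in-memory filter only mirrors the WHERE clause of the query shown earlier.

```rust
/// Hypothetical in-memory view of a row from sys_journal_events.
#[derive(Debug, Clone, PartialEq)]
pub struct JournalEventRow {
    pub id: String,
    pub event_type: String,
}

/// Returns the ids of rows whose event type is "Killed", mirroring
/// `WHERE event_type = 'Killed'`.
pub fn killed_invocation_ids(rows: &[JournalEventRow]) -> Vec<&str> {
    rows.iter()
        .filter(|row| row.event_type == "Killed")
        .map(|row| row.id.as_str())
        .collect()
}
```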

Testing

cargo nextest run -p restate-worker kill_cancel
cargo nextest run -p restate-invoker-impl

All 11 kill/cancel tests pass, all 31 invoker-impl tests pass.

Introduce a new EffectKind variant, KilledAfterMaxAttempts, to differentiate between invocation failures and kills after exceeding retry attempts. This includes the addition of a KilledEvent structure to capture the last transient error before being killed. Update relevant logic in the state machine and journal event handling to accommodate this new event type.
Refactor the handling of Killed invocations to include a new KilledEvent structure that captures the last transient error. Update the state machine and journal event tests to ensure correct behavior for both immediate kills and those triggered after exceeding maximum attempts. This change improves the clarity and reliability of invocation termination processes.
Consolidate the handling of killed invocations by moving the end_invocation call to occur after the journal event is written. This change ensures that journal cleanup does not inadvertently delete important entries. The update improves the clarity and reliability of the invocation termination logic.
…r max attempts

Refactor the test to ensure that after exceeding maximum attempts, the invocation status is correctly marked as inactive rather than completed with a killed error. This change enhances the clarity of the test's intent and improves the reliability of the invocation state verification.
…clean up error imports in mod.rs. This enhances code clarity and maintains consistency in error handling across the test suite.
…e release notes to clarify that invocations killed after max attempts will now generate a `Killed` journal event, improving observability and distinguishing between genuine failures and deliberate kills. Include migration guidance for users to adapt to the new event handling.
…om googletest. This change enhances the test suite's assertion capabilities, ensuring more robust verification of invocation states.

@claude claude bot left a comment

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

neerajchowdary889 and others added 3 commits April 3, 2026 19:54
… of invocation effects. This change improves code readability and maintains consistent formatting within the state machine logic.
Fix indentation in StateMachineApplyContext to ensure proper handling…