Issue #4425 : Use a standard 'killed' error with cause when onMaxAttempts: 'kill' exhausts retries #4425#4560
Open
neerajchowdary889 wants to merge 11 commits intorestatedev:mainfrom
Open
Conversation
Introduce a new EffectKind variant, KilledAfterMaxAttempts, to differentiate between invocation failures and kills after exceeding retry attempts. This includes the addition of a KilledEvent structure to capture the last transient error before being killed. Update relevant logic in the state machine and journal event handling to accommodate this new event type.
Refactor the handling of Killed invocations to include a new KilledEvent structure that captures the last transient error. Update the state machine and journal event tests to ensure correct behavior for both immediate kills and those triggered after exceeding maximum attempts. This change improves the clarity and reliability of invocation termination processes.
…ons in kill_cancel tests
Consolidate the handling of killed invocations by moving the end_invocation call to occur after the journal event is written. This change ensures that journal cleanup does not inadvertently delete important entries. The update improves the clarity and reliability of the invocation termination logic.
…r max attempts Refactor the test to ensure that after exceeding maximum attempts, the invocation status is correctly marked as inactive rather than completed with a killed error. This change enhances the clarity of the test's intent and improves the reliability of the invocation state verification.
…clean up error imports in mod.rs. This enhances code clarity and maintains consistency in error handling across the test suite.
…e release notes to clarify that invocations killed after max attempts will now generate a `Killed` journal event, improving observability and distinguishing between genuine failures and deliberate kills. Include migration guidance for users to adapt to the new event handling.
…om googletest. This change enhances the test suite's assertion capabilities, ensuring more robust verification of invocation states.
… of invocation effects. This change improves code readability and maintains consistent formatting within the state machine logic.
Fix indentation in StateMachineApplyContext to ensure proper handling…
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Use a standard 'killed' error with cause when onMaxAttempts: 'kill' exhausts retries #4425
issue: #4425
Problem
When an invocation is killed — either via the Admin API or after exhausting all retries
with
onMaxAttempts: 'kill'— no journal event was written tosys_journal_events.This caused two problems:
deliberately killed.
HTTP 500). There was no way to distinguish a genuine failure from a kill.
Closes #4425
Solution
Add a
Killedjournal event type written in both kill paths.kill_invoked_invocation,kill_suspended_or_paused_invocation):writes
KilledEvent { last_failure: None }after the invocation is ended.KilledAfterMaxAttemptsinvoker effect): the invokerpackages the last transient error into a
KilledEvent { last_failure: Some(...) }andsends it as a new effect kind. The partition processor writes it after
end_invocation.The final invocation result is now always
KILLED_INVOCATION_ERROR(codeABORTED,message
"killed") instead of the raw service error.The event is always written after
end_invocationreturns, so it survives thedo_drop_journalcleanup that runs insideend_invocation.Changes
crates/types/protobuf/restate/journal_events.protoKilledEventproto messagecrates/types/src/journal_events/mod.rsKilled = 3toEventType,Killed(KilledEvent)toEvent,KilledEventstructcrates/types/src/journal_events/raw.rsKilledEventcrates/invoker-api/src/effects.rsKilledAfterMaxAttempts { killed_event: RawEvent }toEffectKindcrates/invoker-impl/src/lib.rsKilledAfterMaxAttemptsinstead ofFailedonOnTaskError::Killcrates/worker/src/partition/state_machine/mod.rsKilledAfterMaxAttemptseffect; writeKilledEventin both kill functionscrates/storage-query-datafusion/src/journal_events/schema.rsKilledas a validevent_typevaluecrates/worker/.../tests/kill_cancel.rsrelease-notes/unreleased/4425-killed-journal-event.mdQuerying killed invocations