feat: implement retryable state and preserve message context on failure #161

erskingardner · 2026-01-22T17:23:24Z

Summary

This PR implements better handling for retryable message states and ensures message context (like message_event_id) is preserved when messages fail processing.

Changes

Updated process_message in mdk-core to look up existing records and preserve message_event_id, epoch, and mls_group_id when creating a Failed state record.
Added logic to allow reprocessing of messages marked as Retryable.
Updated storage traits and implementations to support these state transitions.
Fixed tests to align with the new behavior.

Reasoning

Previously, failing messages might lose context about their origin, making it difficult to retry them correctly or correlate them with the original send event. This change ensures that even if processing fails, we keep the link to the original message event.

This PR implements message context preservation for failed message processing and introduces a new Retryable message state to the MDK protocol. When message processing fails, the code now retains the original message_event_id, epoch, and mls_group_id so that retries and correlation with original send events remain possible. This enables the system to reprocess messages marked as Retryable without losing critical metadata.

What changed:

Core message processing logic (mdk-core) now preserves message context by fetching and reusing existing processed message records when failures occur across various error paths (decryption failures, epoch mismatches, group ID mismatches, use-after-eviction, etc.).
Retryable handling is enhanced to attempt recovery using cached message content when messages are in Retryable state, updating both Message and ProcessedMessage records.
New Retryable variant added to ProcessedMessageState enum with corresponding string serialization support (mdk-storage-traits).
Storage implementations updated: mdk-memory-storage and mdk-sqlite-storage both implement new mark_processed_message_retryable() method to transition failed messages to Retryable state while preserving failure context.
New NotFound error variant added to MessageError enum for cases where a message does not exist or is not in the expected state (mdk-storage-traits).
Tests expanded to verify that processing failures preserve original message identifiers and validate retryable state transitions.
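
For readers who have not opened the diff, the new state and its string round-trip can be pictured roughly as follows. This is a minimal sketch, not the code from `mdk-storage-traits`: the variant list is reduced to three, and the string values (`"processed"`, `"failed"`, `"retryable"`) and the `ParseStateError` type are assumptions made for illustration.

```rust
use std::str::FromStr;

/// Reduced stand-in for the processed-message state enum (assumed variant list).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ProcessedMessageState {
    Processed,
    Failed,
    Retryable,
}

impl ProcessedMessageState {
    /// String form used when persisting the state; the exact strings are assumptions.
    pub fn as_str(&self) -> &'static str {
        match self {
            Self::Processed => "processed",
            Self::Failed => "failed",
            Self::Retryable => "retryable",
        }
    }
}

/// Hypothetical error type carrying the unrecognized input.
#[derive(Debug, PartialEq, Eq)]
pub struct ParseStateError(pub String);

impl FromStr for ProcessedMessageState {
    type Err = ParseStateError;

    /// Inverse of `as_str`; unknown strings are rejected rather than defaulted.
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "processed" => Ok(Self::Processed),
            "failed" => Ok(Self::Failed),
            "retryable" => Ok(Self::Retryable),
            other => Err(ParseStateError(other.to_string())),
        }
    }
}
```

Keeping `as_str` and `FromStr` as exact inverses means a stored `Retryable` record reads back as `Retryable` instead of falling through to some default state.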

API surface:

New public enum variant: ProcessedMessageState::Retryable (mdk-storage-traits).
New trait method: MessageStorage::mark_processed_message_retryable(&self, event_id: &EventId) -> Result<(), MessageError> (mdk-storage-traits).
New error variant: MessageError::NotFound for messages that are missing or not in the expected state (mdk-storage-traits).
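
The contract of the new trait method (succeed only for `Failed` records, return `NotFound` otherwise, keep the failure reason) can be sketched against a toy in-memory store. Everything below is illustrative: `MemoryStore`, the `String` key, and `&mut self` are simplifications, whereas the real method takes `&self` and an `&EventId`.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum State {
    Processed,
    Failed,
    Retryable,
}

#[derive(Debug, PartialEq, Eq)]
pub enum MessageError {
    /// Record is missing or not in the expected state.
    NotFound,
}

#[derive(Debug, Clone)]
pub struct ProcessedMessage {
    pub state: State,
    pub failure_reason: Option<String>,
}

/// Hypothetical store keyed by an event id string (the real code uses `EventId`).
#[derive(Default)]
pub struct MemoryStore {
    pub processed: HashMap<String, ProcessedMessage>,
}

impl MemoryStore {
    /// Transition Failed -> Retryable; any other situation is reported as NotFound.
    pub fn mark_processed_message_retryable(
        &mut self,
        event_id: &str,
    ) -> Result<(), MessageError> {
        match self.processed.get_mut(event_id) {
            Some(record) if record.state == State::Failed => {
                // Only the state changes; failure_reason is deliberately left intact.
                record.state = State::Retryable;
                Ok(())
            }
            _ => Err(MessageError::NotFound),
        }
    }
}
```

Folding "missing" and "wrong state" into one error keeps the caller's handling simple, at the cost of not distinguishing the two cases.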

Testing:

Tests added to verify that saving a failed processed message preserves the original message_event_id, epoch, and mls_group_id.
Tests validate that mark_processed_message_retryable() succeeds only for messages in Failed state and returns NotFound for missing or non-Failed messages.
Tests verify retryable message behavior and context preservation across failure and recovery paths.

coderabbitai · 2026-01-22T17:23:45Z

📝 Walkthrough

This PR introduces a Retryable state for processed messages and enhances message failure handling to preserve contextual metadata (message_event_id, mls_group_id, epoch) across error paths and recovery scenarios.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Storage Trait Definitions `crates/mdk-storage-traits/src/messages/types.rs`, `crates/mdk-storage-traits/src/messages/mod.rs`, `crates/mdk-storage-traits/src/messages/error.rs` | Added new `ProcessedMessageState::Retryable` variant with string serialization (as_str, FromStr, JSON serde); added new trait method `mark_processed_message_retryable(event_id)` for transitioning failed messages to retryable; added new error variant `MessageError::NotFound` |
| Storage Implementation `crates/mdk-memory-storage/src/messages.rs`, `crates/mdk-sqlite-storage/src/messages.rs` | Implemented `mark_processed_message_retryable()` method on both memory and SQLite storage backends; updates failed processed messages to retryable state and preserves failure_reason |
| Core Message Processing `crates/mdk-core/src/messages.rs` | Enhanced failure handling to preserve message context (message_event_id, mls_group_id, epoch) when transitioning to Failed state; added Retryable state handling to recover via cached message content; added tests validating context preservation across failure/recovery paths |

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

Fix: Persist failed message processing state to prevent DoS via repeated reprocessing #116 — Adds early persistence and deduplication of failed processed messages, providing the foundation that this PR builds upon for context preservation.
Fix: Return Unprocessable instead of error for previously failed messages #156 — Modifies failure path handling and state propagation for previously-failed messages in core message processing.

Suggested labels

storage, breaking-change

Suggested reviewers

dannym-arx
mubarakcoded
jgmontoya

🚥 Pre-merge checks | ✅ 4

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The PR title accurately and concisely summarizes the main changes: implementing a Retryable state and preserving message context during failures. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |
| No Sensitive Identifier Leakage | ✅ Passed | Pull request does not introduce leakage of sensitive identifiers in tracing macros, format strings, or Debug implementations. |

github-actions · 2026-01-22T17:26:57Z

✅ Coverage: 88.74% → 89.99% (+1.25%)

- Preserve message_event_id, epoch, and mls_group_id when transitioning to Failed state
- Allow reprocessing of messages in Retryable state
- Update message processing logic to handle retry scenarios
- Fix message state persistence in storage implementations

github-actions · 2026-01-22T17:33:56Z

✅ Coverage: 88.74% → 90.68% (+1.94%)

coderabbitai

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

crates/mdk-core/src/messages.rs (1)

1787-1804: Preserve epoch/mls_group_id from the existing record when transitioning to Failed.
Right now only message_event_id is preserved; epoch and mls_group_id are overwritten with the current group values. For wrong-epoch/mismatch paths this records the wrong context and defeats the preservation goal. Consider reusing existing values with fallback to the current group.

💡 Suggested pattern (apply to the failure branches above)

-                let message_event_id = existing_record.as_ref().and_then(|r| r.message_event_id);
+                let message_event_id = existing_record.as_ref().and_then(|r| r.message_event_id);
+                let mls_group_id = existing_record
+                    .as_ref()
+                    .and_then(|r| r.mls_group_id.clone())
+                    .or_else(|| Some(group.mls_group_id.clone()));
+                let epoch = existing_record
+                    .as_ref()
+                    .and_then(|r| r.epoch)
+                    .or_else(|| Some(group.epoch));
...
-                    epoch: Some(group.epoch),
-                    mls_group_id: Some(group.mls_group_id.clone()),
+                    epoch,
+                    mls_group_id,

Also applies to: 1839-1856, 1868-1885, 1897-1914, 1939-1956
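
Since the same fallback would be repeated in each failure branch listed above, one option is to factor it into a helper. The sketch below is self-contained and uses stand-in types; `failure_context`, `FailureContext`, `ExistingRecord`, and the field types are hypothetical and only mirror the field names from the suggestion.

```rust
/// Stand-in for the MLS group identifier type.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct GroupId(pub Vec<u8>);

/// Stand-in for a previously stored processed-message record.
pub struct ExistingRecord {
    pub message_event_id: Option<[u8; 32]>,
    pub epoch: Option<u64>,
    pub mls_group_id: Option<GroupId>,
}

/// Stand-in for the current group state.
pub struct Group {
    pub epoch: u64,
    pub mls_group_id: GroupId,
}

/// Context to store on the Failed record.
#[derive(Debug, PartialEq, Eq)]
pub struct FailureContext {
    pub message_event_id: Option<[u8; 32]>,
    pub epoch: Option<u64>,
    pub mls_group_id: Option<GroupId>,
}

/// Prefer values from the existing record; fall back to the current group
/// only when the record is absent or the field was never set.
pub fn failure_context(existing: Option<&ExistingRecord>, group: &Group) -> FailureContext {
    FailureContext {
        // There is no group-level fallback for the message event id.
        message_event_id: existing.and_then(|r| r.message_event_id),
        epoch: existing.and_then(|r| r.epoch).or(Some(group.epoch)),
        mls_group_id: existing
            .and_then(|r| r.mls_group_id.clone())
            .or_else(|| Some(group.mls_group_id.clone())),
    }
}
```

With a helper of this shape, each failure branch would call it once and a wrong-epoch failure would record the epoch the message was originally tied to, not the group's current one.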

🤖 Fix all issues with AI agents

In `@crates/mdk-core/src/messages.rs`:
- Around line 1615-1656: When marking a Retryable processed_message as
MessageState::Processed (in the retry branch around
processed_message.message_event_id handling), also clear/overwrite its prior
failure metadata: set processed_message.failure_reason to None (or empty) and
update processed_message.processed_at to the current timestamp before calling
self.storage().save_processed_message(processed_message.clone()). Likewise
ensure the corresponding stored message state is updated to
MessageState::Processed (already done via message.state) and persisted with
save_message; use the same timestamp source you use elsewhere (e.g., Utc::now())
so downstream consumers do not see stale failure_reason/processed_at when
ProcessedMessageState::Processed is stored.

🧹 Nitpick comments (1)

crates/mdk-memory-storage/src/messages.rs (1)

751-847: Move use statements to module scope (guideline).
In-function imports at Lines 753, 800, and 814 violate the “use statements at top of scope” rule. Consider hoisting them to the test module level.

♻️ Proposed refactor

 #[cfg(test)]
 mod tests {
     use std::collections::BTreeSet;

     use mdk_storage_traits::groups::GroupStorage;
     use mdk_storage_traits::groups::types::{Group, GroupState};
+    use mdk_storage_traits::messages::error::MessageError;
+    use mdk_storage_traits::messages::types::ProcessedMessage;
     use nostr::Keys;

     use super::*;
@@
     #[test]
     fn test_mark_processed_message_retryable() {
-        use mdk_storage_traits::messages::types::ProcessedMessage;
-
         let storage = MdkMemoryStorage::new();
@@
     #[test]
     fn test_mark_nonexistent_message_retryable_fails() {
-        use mdk_storage_traits::messages::error::MessageError;
-
         let storage = MdkMemoryStorage::new();
@@
     #[test]
     fn test_mark_non_failed_message_retryable_fails() {
-        use mdk_storage_traits::messages::error::MessageError;
-        use mdk_storage_traits::messages::types::ProcessedMessage;
-
         let storage = MdkMemoryStorage::new();

As per coding guidelines, keep `use` statements at the top of their scope.

coderabbitai · 2026-01-22T17:41:46Z

crates/mdk-core/src/messages.rs

+                    message_types::ProcessedMessageState::Retryable => {
+                        // Retryable messages are ones that previously failed due to wrong epoch keys
+                        // but have been marked for retry after a rollback. For our own messages,
+                        // we should have cached content - try to retrieve and return it.
+                        tracing::debug!(target: "mdk_core::messages::process_message", "Retrying own message after rollback");
+
+                        if let Some(message_event_id) = processed_message.message_event_id
+                            && let Ok(Some(mut message)) =
+                                self.get_message(&group.mls_group_id, &message_event_id)
+                        {
+                            // Update states to mark as successfully processed
+                            message.state = message_types::MessageState::Processed;
+                            self.storage()
+                                .save_message(message)
+                                .map_err(|e| Error::Message(e.to_string()))?;
+
+                            processed_message.state =
+                                message_types::ProcessedMessageState::Processed;
+                            self.storage()
+                                .save_processed_message(processed_message.clone())
+                                .map_err(|e| Error::Message(e.to_string()))?;
+
+                            tracing::info!(
+                                target: "mdk_core::messages::process_message",
+                                "Successfully retried own cached message after rollback"
+                            );
+                            let message = self
+                                .get_message(&group.mls_group_id, &message_event_id)?
+                                .ok_or(Error::MessageNotFound)?;
+                            return Ok(MessageProcessingResult::ApplicationMessage(message));
+                        }
+
+                        // No cached content available - this shouldn't happen for our own messages,
+                        // but if it does, we can't recover
+                        tracing::warn!(
+                            target: "mdk_core::messages::process_message",
+                            "Retryable own message has no cached content - cannot recover"
+                        );
+                        Ok(MessageProcessingResult::Unprocessable {
+                            mls_group_id: group.mls_group_id.clone(),
+                        })
+                    }


⚠️ Potential issue | 🟡 Minor

Clear failure metadata when a retry succeeds.
When a Retryable message is marked Processed, the prior failure_reason is left intact (and processed_at stays stale). That leaves contradictory state for downstream consumers.

💡 Suggested fix

-                            processed_message.state =
-                                message_types::ProcessedMessageState::Processed;
+                            processed_message.state =
+                                message_types::ProcessedMessageState::Processed;
+                            processed_message.failure_reason = None;
+                            processed_message.processed_at = Timestamp::now();
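
The invariant behind this suggestion (a `Processed` record should carry no failure reason and a fresh timestamp) can be checked in isolation. The following is a minimal sketch with stand-in types: `mark_retry_succeeded` and `unix_now` are hypothetical helpers, and `processed_at` is modeled as Unix seconds in place of the timestamp type used in the codebase.

```rust
use std::time::{SystemTime, UNIX_EPOCH};

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum State {
    Failed,
    Retryable,
    Processed,
}

#[derive(Debug, Clone)]
pub struct ProcessedMessage {
    pub state: State,
    pub failure_reason: Option<String>,
    /// Unix seconds; a simplification of the real timestamp type.
    pub processed_at: u64,
}

/// Flip a record to Processed and drop the stale failure metadata in one step,
/// so the record is never persisted in a contradictory state.
pub fn mark_retry_succeeded(record: &mut ProcessedMessage, now: u64) {
    record.state = State::Processed;
    record.failure_reason = None;
    record.processed_at = now;
}

/// Current time in Unix seconds; falls back to 0 if the clock is before the epoch.
pub fn unix_now() -> u64 {
    SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .map(|d| d.as_secs())
        .unwrap_or(0)
}
```

Passing `now` in as a parameter keeps the helper deterministic in tests; production code would pass `unix_now()` or its equivalent.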

erskingardner force-pushed the fix-retryable-state branch from fd4fc49 to 067ca95 Compare January 22, 2026 17:31

coderabbitai bot requested changes Jan 22, 2026

erskingardner marked this pull request as draft January 22, 2026 17:44
