@erskingardner erskingardner commented Jan 22, 2026

Summary

This PR implements better handling for retryable message states and ensures message context (like message_event_id) is preserved when messages fail processing.

Changes

  • Updated process_message in mdk-core to look up existing records and preserve message_event_id, epoch, and mls_group_id when creating a Failed state record.
  • Added logic to allow reprocessing of messages marked as Retryable.
  • Updated storage traits and implementations to support these state transitions.
  • Fixed tests to align with the new behavior.

Reasoning

Previously, a message that failed processing could lose the context identifying its origin (message_event_id, epoch, and mls_group_id), which made it difficult to retry correctly or to correlate with the original send event. This change keeps the link to the original message event even when processing fails.

This PR implements message context preservation for failed message processing and introduces a new Retryable message state to the MDK protocol. When message processing fails, the code now retains the original message_event_id, epoch, and mls_group_id so that retries and correlation with original send events remain possible. This enables the system to reprocess messages marked as Retryable without losing critical metadata.

What changed:

  • Core message processing logic (mdk-core) now preserves message context by fetching and reusing existing processed message records when failures occur across various error paths (decryption failures, epoch mismatches, group ID mismatches, use-after-eviction, etc.).
  • Retryable handling is enhanced to attempt recovery using cached message content when messages are in Retryable state, updating both Message and ProcessedMessage records.
  • New Retryable variant added to ProcessedMessageState enum with corresponding string serialization support (mdk-storage-traits).
  • Storage implementations updated: mdk-memory-storage and mdk-sqlite-storage both implement the new mark_processed_message_retryable() method, which transitions failed messages to the Retryable state while preserving failure context.
  • New NotFound error variant added to MessageError enum for cases where a message does not exist or is not in the expected state (mdk-storage-traits).
  • Tests expanded to verify that processing failures preserve original message identifiers and validate retryable state transitions.

API surface:

  • New public enum variant: ProcessedMessageState::Retryable (mdk-storage-traits).
  • New trait method: MessageStorage::mark_processed_message_retryable(&self, event_id: &EventId) -> Result<(), MessageError> (mdk-storage-traits).
  • New error variant: MessageError::NotFound for messages that are missing or not in the expected state (mdk-storage-traits).
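
For readers who want to picture the new variant and its string serialization, here is a minimal sketch. It is not the mdk-storage-traits source: only the Retryable, Failed, and Processed variant names and the as_str/FromStr surface come from this PR; the lowercase string forms and the ParseStateError type are assumptions made for illustration.

```rust
use std::str::FromStr;

/// Illustrative mirror of the processed-message state enum.
/// The real enum may have more variants; only these three are named in the PR.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ProcessedMessageState {
    Processed,
    Failed,
    Retryable,
}

/// Hypothetical parse error, introduced only for this sketch.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct ParseStateError(pub String);

impl ProcessedMessageState {
    /// String form used for storage. The exact strings are an assumption.
    pub fn as_str(&self) -> &'static str {
        match self {
            Self::Processed => "processed",
            Self::Failed => "failed",
            Self::Retryable => "retryable",
        }
    }
}

impl FromStr for ProcessedMessageState {
    type Err = ParseStateError;

    /// Inverse of `as_str`; unknown strings are rejected rather than defaulted.
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "processed" => Ok(Self::Processed),
            "failed" => Ok(Self::Failed),
            "retryable" => Ok(Self::Retryable),
            other => Err(ParseStateError(other.to_string())),
        }
    }
}
```

Keeping as_str and from_str as exact inverses means a stored Retryable row round-trips without a migration-time special case.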

Testing:

  • Tests added to verify that saving a failed processed message preserves the original message_event_id, epoch, and mls_group_id.
  • Tests validate that mark_processed_message_retryable() succeeds only for messages in Failed state and returns NotFound for missing or non-Failed messages.
  • Tests verify retryable message behavior and context preservation across failure and recovery paths.
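
The transition rule those tests describe can be sketched with a toy in-memory store. This is an illustration under stated assumptions, not the mdk-memory-storage implementation: EventId is a stand-in byte array, MemoryStore and its insert/get helpers are hypothetical, and the method takes &mut self for simplicity where the real trait method takes &self.

```rust
use std::collections::HashMap;

/// Stand-in for the real event ID type (assumption for this sketch).
pub type EventId = [u8; 32];

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ProcessedMessageState {
    Processed,
    Failed,
    Retryable,
}

/// Reduced record: only the fields this sketch needs.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct ProcessedMessage {
    pub state: ProcessedMessageState,
    pub failure_reason: Option<String>,
}

#[derive(Debug, PartialEq, Eq)]
pub enum MessageError {
    NotFound,
}

/// Hypothetical in-memory store keyed by event ID.
#[derive(Default)]
pub struct MemoryStore {
    records: HashMap<EventId, ProcessedMessage>,
}

impl MemoryStore {
    pub fn insert(&mut self, id: EventId, record: ProcessedMessage) {
        self.records.insert(id, record);
    }

    pub fn get(&self, id: &EventId) -> Option<&ProcessedMessage> {
        self.records.get(id)
    }

    /// Failed -> Retryable only. A missing record and a record in any other
    /// state both yield NotFound, and failure_reason is left untouched.
    pub fn mark_processed_message_retryable(
        &mut self,
        id: &EventId,
    ) -> Result<(), MessageError> {
        let record = self.records.get_mut(id).ok_or(MessageError::NotFound)?;
        if record.state != ProcessedMessageState::Failed {
            return Err(MessageError::NotFound);
        }
        record.state = ProcessedMessageState::Retryable;
        Ok(())
    }
}
```

Calling the method a second time on the same record returns NotFound, because the record is then Retryable rather than Failed.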


coderabbitai bot commented Jan 22, 2026

📝 Walkthrough

This PR introduces a Retryable state for processed messages and enhances message failure handling to preserve contextual metadata (message_event_id, mls_group_id, epoch) across error paths and recovery scenarios.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Storage Trait Definitions: crates/mdk-storage-traits/src/messages/types.rs, crates/mdk-storage-traits/src/messages/mod.rs, crates/mdk-storage-traits/src/messages/error.rs | Added new ProcessedMessageState::Retryable variant with string serialization (as_str, FromStr, JSON serde); added new trait method mark_processed_message_retryable(event_id) for transitioning failed messages to retryable; added new error variant MessageError::NotFound |
| Storage Implementation: crates/mdk-memory-storage/src/messages.rs, crates/mdk-sqlite-storage/src/messages.rs | Implemented mark_processed_message_retryable() method on both memory and SQLite storage backends; updates failed processed messages to retryable state and preserves failure_reason |
| Core Message Processing: crates/mdk-core/src/messages.rs | Enhanced failure handling to preserve message context (message_event_id, mls_group_id, epoch) when transitioning to Failed state; added Retryable state handling to recover via cached message content; added tests validating context preservation across failure/recovery paths |

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

Suggested labels

storage, breaking-change

Suggested reviewers

  • dannym-arx
  • mubarakcoded
  • jgmontoya
🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The PR title accurately and concisely summarizes the main changes: implementing a Retryable state and preserving message context during failures. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |
| No Sensitive Identifier Leakage | ✅ Passed | Pull request does not introduce leakage of sensitive identifiers in tracing macros, format strings, or Debug implementations. |

@github-actions

✅ Coverage: 88.74% → 89.99% (+1.25%)

- Preserve message_event_id, epoch, and mls_group_id when transitioning to Failed state
- Allow reprocessing of messages in Retryable state
- Update message processing logic to handle retry scenarios
- Fix message state persistence in storage implementations
@github-actions

✅ Coverage: 88.74% → 90.68% (+1.94%)

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
crates/mdk-core/src/messages.rs (1)

1787-1804: Preserve epoch/mls_group_id from the existing record when transitioning to Failed.
Right now only message_event_id is preserved; epoch and mls_group_id are overwritten with the current group values. For wrong-epoch/mismatch paths this records the wrong context and defeats the preservation goal. Consider reusing existing values with fallback to the current group.

💡 Suggested pattern (apply to the failure branches above)
-                let message_event_id = existing_record.as_ref().and_then(|r| r.message_event_id);
+                let message_event_id = existing_record.as_ref().and_then(|r| r.message_event_id);
+                let mls_group_id = existing_record
+                    .as_ref()
+                    .and_then(|r| r.mls_group_id.clone())
+                    .or_else(|| Some(group.mls_group_id.clone()));
+                let epoch = existing_record
+                    .as_ref()
+                    .and_then(|r| r.epoch)
+                    .or_else(|| Some(group.epoch));
...
-                    epoch: Some(group.epoch),
-                    mls_group_id: Some(group.mls_group_id.clone()),
+                    epoch,
+                    mls_group_id,

Also applies to: 1839-1856, 1868-1885, 1897-1914, 1939-1956
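
To make the suggested fallback concrete, here is a self-contained sketch of the same Option chaining. The types are reduced stand-ins for illustration, not the mdk-core definitions: GroupId, Group, the trimmed ProcessedMessage, and the helper name failure_context are all hypothetical.

```rust
/// Stand-in for the MLS group identifier (assumption for this sketch).
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct GroupId(pub Vec<u8>);

/// Reduced view of the current group state.
pub struct Group {
    pub mls_group_id: GroupId,
    pub epoch: u64,
}

/// Reduced processed-message record holding only the context fields.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct ProcessedMessage {
    pub message_event_id: Option<[u8; 32]>,
    pub mls_group_id: Option<GroupId>,
    pub epoch: Option<u64>,
}

/// Builds the context for a Failed record: values from the existing record
/// win, and the current group is used only where the record has nothing.
pub fn failure_context(
    existing: Option<&ProcessedMessage>,
    group: &Group,
) -> ProcessedMessage {
    ProcessedMessage {
        // There is no fallback for the event ID: the group cannot supply one.
        message_event_id: existing.and_then(|r| r.message_event_id),
        mls_group_id: existing
            .and_then(|r| r.mls_group_id.clone())
            .or_else(|| Some(group.mls_group_id.clone())),
        epoch: existing.and_then(|r| r.epoch).or(Some(group.epoch)),
    }
}
```

On a wrong-epoch path, this keeps the epoch under which the message was first recorded instead of overwriting it with the group's current epoch.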

🤖 Fix all issues with AI agents
In `@crates/mdk-core/src/messages.rs`:
- Around line 1615-1656: When marking a Retryable processed_message as
ProcessedMessageState::Processed (in the retry branch around
processed_message.message_event_id handling), also clear its prior failure
metadata: set processed_message.failure_reason to None and update
processed_message.processed_at to the current timestamp before calling
self.storage().save_processed_message(processed_message.clone()). Likewise
ensure the corresponding stored message state is updated to
MessageState::Processed (already done via message.state) and persisted with
save_message; use the same timestamp source you use elsewhere (e.g.,
Timestamp::now()) so downstream consumers do not see stale
failure_reason/processed_at when ProcessedMessageState::Processed is stored.
🧹 Nitpick comments (1)
crates/mdk-memory-storage/src/messages.rs (1)

751-847: Move use statements to module scope (guideline).
In-function imports at Lines 753, 800, and 814 violate the “use statements at top of scope” rule. Consider hoisting them to the test module level.

♻️ Proposed refactor
 #[cfg(test)]
 mod tests {
     use std::collections::BTreeSet;

     use mdk_storage_traits::groups::GroupStorage;
     use mdk_storage_traits::groups::types::{Group, GroupState};
+    use mdk_storage_traits::messages::error::MessageError;
+    use mdk_storage_traits::messages::types::ProcessedMessage;
     use nostr::Keys;

     use super::*;
@@
     #[test]
     fn test_mark_processed_message_retryable() {
-        use mdk_storage_traits::messages::types::ProcessedMessage;
-
         let storage = MdkMemoryStorage::new();
@@
     #[test]
     fn test_mark_nonexistent_message_retryable_fails() {
-        use mdk_storage_traits::messages::error::MessageError;
-
         let storage = MdkMemoryStorage::new();
@@
     #[test]
     fn test_mark_non_failed_message_retryable_fails() {
-        use mdk_storage_traits::messages::error::MessageError;
-        use mdk_storage_traits::messages::types::ProcessedMessage;
-
         let storage = MdkMemoryStorage::new();
As per coding guidelines, keep `use` statements at the top of their scope.

Comment on lines +1615 to +1656
message_types::ProcessedMessageState::Retryable => {
    // Retryable messages are ones that previously failed due to wrong epoch keys
    // but have been marked for retry after a rollback. For our own messages,
    // we should have cached content - try to retrieve and return it.
    tracing::debug!(target: "mdk_core::messages::process_message", "Retrying own message after rollback");

    if let Some(message_event_id) = processed_message.message_event_id
        && let Ok(Some(mut message)) =
            self.get_message(&group.mls_group_id, &message_event_id)
    {
        // Update states to mark as successfully processed
        message.state = message_types::MessageState::Processed;
        self.storage()
            .save_message(message)
            .map_err(|e| Error::Message(e.to_string()))?;

        processed_message.state =
            message_types::ProcessedMessageState::Processed;
        self.storage()
            .save_processed_message(processed_message.clone())
            .map_err(|e| Error::Message(e.to_string()))?;

        tracing::info!(
            target: "mdk_core::messages::process_message",
            "Successfully retried own cached message after rollback"
        );
        let message = self
            .get_message(&group.mls_group_id, &message_event_id)?
            .ok_or(Error::MessageNotFound)?;
        return Ok(MessageProcessingResult::ApplicationMessage(message));
    }

    // No cached content available - this shouldn't happen for our own messages,
    // but if it does, we can't recover
    tracing::warn!(
        target: "mdk_core::messages::process_message",
        "Retryable own message has no cached content - cannot recover"
    );
    Ok(MessageProcessingResult::Unprocessable {
        mls_group_id: group.mls_group_id.clone(),
    })
}

⚠️ Potential issue | 🟡 Minor

Clear failure metadata when a retry succeeds.
When a Retryable message is marked Processed, the prior failure_reason is left intact (and processed_at stays stale). That leaves contradictory state for downstream consumers.

💡 Suggested fix
-                            processed_message.state =
-                                message_types::ProcessedMessageState::Processed;
+                            processed_message.state =
+                                message_types::ProcessedMessageState::Processed;
+                            processed_message.failure_reason = None;
+                            processed_message.processed_at = Timestamp::now();
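
The intent of the suggested fix can be shown in isolation with a small sketch. It assumes a reduced ProcessedMessage with a plain u64 Unix timestamp in place of the crate's Timestamp type, and mark_retry_succeeded is a hypothetical helper name rather than anything in mdk-core.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ProcessedMessageState {
    Processed,
    Failed,
    Retryable,
}

/// Reduced record; `processed_at` is Unix seconds here (assumption),
/// standing in for the Timestamp used in the suggested fix.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct ProcessedMessage {
    pub state: ProcessedMessageState,
    pub failure_reason: Option<String>,
    pub processed_at: u64,
}

/// Applies all three updates together so a Processed record never carries
/// a stale failure_reason or an out-of-date processed_at.
pub fn mark_retry_succeeded(record: &mut ProcessedMessage, now: u64) {
    record.state = ProcessedMessageState::Processed;
    record.failure_reason = None;
    record.processed_at = now;
}
```

Passing the timestamp in as a parameter keeps the helper deterministic in tests; the caller supplies the current time from whatever source the crate already uses.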

@erskingardner erskingardner marked this pull request as draft January 22, 2026 17:44