Skip to content

fix: strategy for repairing broken mls groups - WPB-23166#4258

Open
johnxnguyen wants to merge 15 commits intofix/commit-pending-proposals-WPB-23159from
fix/decryption-service-wpb-23166
Open

fix: strategy for repairing broken mls groups - WPB-23166#4258
johnxnguyen wants to merge 15 commits intofix/commit-pending-proposals-WPB-23159from
fix/decryption-service-wpb-23166

Conversation

@johnxnguyen
Copy link
Collaborator

@johnxnguyen johnxnguyen commented Feb 3, 2026

TaskWPB-23166 [iOS] Cleanup recovery strategy for mlsService

Issue

When a broken mls group is detected ("wrong epoch" error when decrypting a message), the MLS service will attempt to perform an incremental sync and then repair the group. Triggering syncs as a recovery mechanism is no longer correct, instead the group ID should be noted and the group then repair explicitly later in time.

This PR removes the automatic sync behavior and introduces the handling to track the broken group id and schedule a repair work item.

Testing

?


Checklist

  • Title contains a reference JIRA issue number like [WPB-XXX].
  • Description is filled and free of optional paragraphs.
  • Adds/updates automated tests.

UI accessibility checklist

If your PR includes UI changes, please utilize this checklist:

  • Make sure you use the API for UI elements that support large fonts.
  • All colors are taken from WireDesign.ColorTheme or constructed using WireDesign.BaseColorPalette.
  • New UI elements have Accessibility strings for VoiceOver.

@netbe netbe self-assigned this Feb 4, 2026
@netbe netbe self-requested a review February 4, 2026 16:30
@netbe netbe changed the base branch from release/cycle-4.15 to fix/commit-pending-proposals-WPB-23159 February 5, 2026 10:30
@netbe netbe requested a review from Copilot February 5, 2026 13:16
@netbe netbe marked this pull request as ready for review February 5, 2026 13:16
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR refactors the strategy for repairing broken MLS (Messaging Layer Security) groups by removing the synchronous delegate pattern and replacing it with a reactive publisher-based approach using Combine and a work item system.

Changes:

  • Removed MLSSyncDelegate protocol and replaced synchronous repair triggering with a reactive PassthroughSubject that notifies about broken groups detected during live sync
  • Introduced RepairBrokenMLSGroupsItem work item to handle repairs asynchronously through the work agent system
  • Simplified MLSService.fetchAndRepairGroup by removing the incremental sync parameter and automatic repair on decrypt error

Reviewed changes

Copilot reviewed 20 out of 21 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
wire-ios-data-model/Source/MLS/MLSSyncDelegate.swift Deleted protocol file that is no longer needed
wire-ios-data-model/Source/MLS/MLSService.swift Removed sync delegate, simplified repair methods, improved logging with safeAttributes helper
wire-ios-data-model/Source/MLS/MLSServiceInterface.swift Removed setSyncDelegate method and simplified fetchAndRepairGroup signature
wire-ios-data-model/Source/MLS/MLSActionExecutor.swift Made safeAttributes extension public for use in MLSService
wire-ios-sync-engine/Source/Synchronization/SyncAgent.swift Removed MLSSyncDelegate extension implementation
wire-ios-sync-engine/Source/UserSession/ZMUserSession/ZMUserSession.swift Added liveMLSBrokenGroupsCancellable and setup call, removed setSyncDelegate call
WireDomain/Sources/WireDomain/Synchronization/IncrementalSync.swift Added liveBrokenGroupSubject parameter and sends broken group notifications
WireDomain/Sources/WireDomain/Synchronization/IncrementalSyncV2.swift Added liveBrokenGroupSubject parameter, sends broken group notifications, adds DeveloperFlag for debugging
WireDomain/Sources/WireDomain/Synchronization/MLSGroupRepairAgent.swift Removed shouldPerformIncrementalSync parameter from fetchAndRepairGroup call
WireDomain/Sources/WireDomain/WorkAgent/WorkItem/RepairBrokenMLSGroupsItem.swift New work item for repairing broken MLS groups asynchronously
WireDomain/Sources/WireDomain/Components/ClientSessionComponent.swift Added liveBrokenGroupSubject and setupLiveMLSBrokenGroups method to trigger repairs via work agent
wire-ios-data-model/Tests/MLS/MLSServiceTests.swift Removed tests for old sync delegate behavior and automatic repair on decrypt error
wire-ios-sync-engine/Tests/Source/UserSession/ZMUserSessionTestsBase.swift Removed mock setup for deleted setSyncDelegate method
WireDomain/Tests/WireDomainTests/Synchronization/* Updated test mocks and added new tests for live broken group notification
wire-ios-data-model/WireDataModel.xcodeproj/project.pbxproj Removed MLSSyncDelegate.swift from project file
wire-ios-data-model/Support/Sourcery/generated/AutoMockable.generated.swift Removed generated mock for MLSSyncDelegate

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines 1194 to 1199
logger
.info(
"cannot resolve conversation epoch \(String(describing: error)) for (\(groupID.safeForLoggingDescription))"
.error(
"cannot resolve conversation epoch \(String(describing: error))",
attributes: groupID.safeAttributes,
[.conversationId: conversationID?.safeForLoggingDescription ?? "<nil>"]
)
Copy link

Copilot AI Feb 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The syntax used here for passing multiple LogAttributes appears unusual. It seems like the intention is to merge two attribute dictionaries: groupID.safeAttributes and an additional dictionary with conversationId.

Verify that this logging call works as expected. The typical pattern for merging LogAttributes would be to combine them into a single dictionary before passing them to the logger, or use a dictionary merging operator if the logger doesn't support variadic attributes.

Copilot uses AI. Check for mistakes.
Comment on lines 888 to 889
Task { [weak self] in
await self?.workAgent.submitItem(item)
Copy link

Copilot AI Feb 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The self is already captured as weak in the outer closure on line 884. Capturing it again as weak in the nested Task block is redundant because the Task is not creating a strong reference cycle—it's a detached concurrent task that will complete independently.

Consider simplifying to just use self without the capture list in the Task block, since the outer closure already prevents a retain cycle.

Suggested change
Task { [weak self] in
await self?.workAgent.submitItem(item)
Task {
await self.workAgent.submitItem(item)

Copilot uses AI. Check for mistakes.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

@github-actions
Copy link
Contributor

github-actions bot commented Feb 5, 2026

Test Results

    4 files    620 suites   6m 34s ⏱️
4 961 tests 4 960 ✅ 0 💤 1 ❌
4 962 runs  4 962 ✅ 0 💤 0 ❌

For more details on these failures, see this check.

Results for commit 3769fba.

♻️ This comment has been updated with latest results.

Summary: workflow run #21752119172
Allure report (download zip): html-report-27641-fix_decryption-service-wpb-23166

@netbe netbe requested a review from samwyndham February 5, 2026 14:42
Comment on lines +327 to +334
if DeveloperFlag.ignoreIncomingEvents.isOn {
logger.warn(
"ignore incoming events",
attributes: .incrementalSyncV3 + [.eventEnvelopeID: envelope.id]
)
await acknowledgeUntilEnvelope(envelope, through: pushChannel, batchSize: 1)
continue
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question: What is this for? Testing?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, it was there before in legacy sync. It skips events so when you set it off again you're out of sync in mls conversations

// MARK: High level syncs

public private(set) lazy var syncStateSubject = CurrentValueSubject<SyncState, Never>(.idle)
private(set) lazy var liveBrokenGroupSubject = PassthroughSubject<Set<String>, Never>()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sanity check: is there is any issue with this being a passthrough subject vs a current value subject 🤔. Does it matter that new subscribers won't get the existing IDs?

I assume that the setup is called before we have any broken groups have been discovered.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, we only want to get live events so the current value does not matter, it's just a signal to repair groups when the websocket is opened

Copy link
Collaborator Author

@johnxnguyen johnxnguyen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I started this PR but @netbe finished it. I can't approve it explicitly but I have reviewed it and do approve.

@sonarqubecloud
Copy link

sonarqubecloud bot commented Feb 6, 2026

Quality Gate Failed Quality Gate failed

Failed conditions
13.7% Duplication on New Code (required ≤ 3%)

See analysis details on SonarQube Cloud

@netbe netbe requested a review from samwyndham February 6, 2026 14:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants