fix: strategy for repairing broken mls groups - WPB-23166#4258
fix: strategy for repairing broken mls groups - WPB-23166#4258johnxnguyen wants to merge 15 commits intofix/commit-pending-proposals-WPB-23159from
Conversation
There was a problem hiding this comment.
Pull request overview
This PR refactors the strategy for repairing broken MLS (Messaging Layer Security) groups by removing the synchronous delegate pattern and replacing it with a reactive publisher-based approach using Combine and a work item system.
Changes:
- Removed
MLSSyncDelegateprotocol and replaced synchronous repair triggering with a reactivePassthroughSubjectthat notifies about broken groups detected during live sync - Introduced
RepairBrokenMLSGroupsItemwork item to handle repairs asynchronously through the work agent system - Simplified
MLSService.fetchAndRepairGroupby removing the incremental sync parameter and automatic repair on decrypt error
Reviewed changes
Copilot reviewed 20 out of 21 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| wire-ios-data-model/Source/MLS/MLSSyncDelegate.swift | Deleted protocol file that is no longer needed |
| wire-ios-data-model/Source/MLS/MLSService.swift | Removed sync delegate, simplified repair methods, improved logging with safeAttributes helper |
| wire-ios-data-model/Source/MLS/MLSServiceInterface.swift | Removed setSyncDelegate method and simplified fetchAndRepairGroup signature |
| wire-ios-data-model/Source/MLS/MLSActionExecutor.swift | Made safeAttributes extension public for use in MLSService |
| wire-ios-sync-engine/Source/Synchronization/SyncAgent.swift | Removed MLSSyncDelegate extension implementation |
| wire-ios-sync-engine/Source/UserSession/ZMUserSession/ZMUserSession.swift | Added liveMLSBrokenGroupsCancellable and setup call, removed setSyncDelegate call |
| WireDomain/Sources/WireDomain/Synchronization/IncrementalSync.swift | Added liveBrokenGroupSubject parameter and sends broken group notifications |
| WireDomain/Sources/WireDomain/Synchronization/IncrementalSyncV2.swift | Added liveBrokenGroupSubject parameter, sends broken group notifications, adds DeveloperFlag for debugging |
| WireDomain/Sources/WireDomain/Synchronization/MLSGroupRepairAgent.swift | Removed shouldPerformIncrementalSync parameter from fetchAndRepairGroup call |
| WireDomain/Sources/WireDomain/WorkAgent/WorkItem/RepairBrokenMLSGroupsItem.swift | New work item for repairing broken MLS groups asynchronously |
| WireDomain/Sources/WireDomain/Components/ClientSessionComponent.swift | Added liveBrokenGroupSubject and setupLiveMLSBrokenGroups method to trigger repairs via work agent |
| wire-ios-data-model/Tests/MLS/MLSServiceTests.swift | Removed tests for old sync delegate behavior and automatic repair on decrypt error |
| wire-ios-sync-engine/Tests/Source/UserSession/ZMUserSessionTestsBase.swift | Removed mock setup for deleted setSyncDelegate method |
| WireDomain/Tests/WireDomainTests/Synchronization/* | Updated test mocks and added new tests for live broken group notification |
| wire-ios-data-model/WireDataModel.xcodeproj/project.pbxproj | Removed MLSSyncDelegate.swift from project file |
| wire-ios-data-model/Support/Sourcery/generated/AutoMockable.generated.swift | Removed generated mock for MLSSyncDelegate |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| logger | ||
| .info( | ||
| "cannot resolve conversation epoch \(String(describing: error)) for (\(groupID.safeForLoggingDescription))" | ||
| .error( | ||
| "cannot resolve conversation epoch \(String(describing: error))", | ||
| attributes: groupID.safeAttributes, | ||
| [.conversationId: conversationID?.safeForLoggingDescription ?? "<nil>"] | ||
| ) |
There was a problem hiding this comment.
The syntax used here for passing multiple LogAttributes appears unusual. It seems like the intention is to merge two attribute dictionaries: groupID.safeAttributes and an additional dictionary with conversationId.
Verify that this logging call works as expected. The typical pattern for merging LogAttributes would be to combine them into a single dictionary before passing them to the logger, or use a dictionary merging operator if the logger doesn't support variadic attributes.
| Task { [weak self] in | ||
| await self?.workAgent.submitItem(item) |
There was a problem hiding this comment.
The self is already captured as weak in the outer closure on line 884. Capturing it again as weak in the nested Task block is redundant because the Task is not creating a strong reference cycle—it's a detached concurrent task that will complete independently.
Consider simplifying to just use self without the capture list in the Task block, since the outer closure already prevents a retain cycle.
| Task { [weak self] in | |
| await self?.workAgent.submitItem(item) | |
| Task { | |
| await self.workAgent.submitItem(item) |
Test Results 4 files 620 suites 6m 34s ⏱️ For more details on these failures, see this check. Results for commit 3769fba. ♻️ This comment has been updated with latest results. Summary: workflow run #21752119172 |
| if DeveloperFlag.ignoreIncomingEvents.isOn { | ||
| logger.warn( | ||
| "ignore incoming events", | ||
| attributes: .incrementalSyncV3 + [.eventEnvelopeID: envelope.id] | ||
| ) | ||
| await acknowledgeUntilEnvelope(envelope, through: pushChannel, batchSize: 1) | ||
| continue | ||
| } |
There was a problem hiding this comment.
question: What is this for? Testing?
There was a problem hiding this comment.
yes, it was there before in legacy sync. It skips events so when you set it off again you're out of sync in mls conversations
| // MARK: High level syncs | ||
|
|
||
| public private(set) lazy var syncStateSubject = CurrentValueSubject<SyncState, Never>(.idle) | ||
| private(set) lazy var liveBrokenGroupSubject = PassthroughSubject<Set<String>, Never>() |
There was a problem hiding this comment.
sanity check: is there is any issue with this being a passthrough subject vs a current value subject 🤔. Does it matter that new subscribers won't get the existing IDs?
I assume that the setup is called before we have any broken groups have been discovered.
There was a problem hiding this comment.
yes, we only want to get live events so the current value does not matter, it's just a signal to repair groups when the websocket is opened
johnxnguyen
left a comment
There was a problem hiding this comment.
I started this PR but @netbe finished it. I can't approve it explicitly but I have reviewed it and do approve.
|


Issue
When a broken mls group is detected ("wrong epoch" error when decrypting a message), the MLS service will attempt to perform an incremental sync and then repair the group. Triggering syncs as a recovery mechanism is no longer correct, instead the group ID should be noted and the group then repair explicitly later in time.
This PR removes the automatic sync behavior and introduces the handling to track the broken group id and schedule a repair work item.
Testing
?
Checklist
[WPB-XXX].UI accessibility checklist
If your PR includes UI changes, please utilize this checklist: