Cherry-pick #4174 and #4186 from 2.7 to main by pcnudde · Pull Request #4193 · NVIDIA/NVFlare

pcnudde · 2026-02-13T17:22:06Z

Summary

- Cherry-pick of "[2.7] smaller lock in produce item" (#4174): reduce lock scope in Cacheable._get_item. produce_item now runs outside the lock so concurrent receivers aren't blocked.
- Cherry-pick of "[2.7] Avoid self-message deadlock for local swarm result submission" (#4186): avoid self-message deadlock when swarm trainer submits learn result to itself. Local submission bypasses broadcast_and_wait; adds unit test coverage.

### Description

Do not hold the lock around produce_item. The lock is not needed there, and the operation can be slow; we do not want or need to hold up everything else during that time.

### Types of changes

- [x] Non-breaking change (fix or new feature that would not break existing functionality).
- [ ] Breaking change (fix or new feature that would cause existing functionality to change).
- [ ] New tests added to cover the changes.
- [ ] Quick tests passed locally by running `./runtest.sh`.
- [ ] In-line docstrings updated.
- [ ] Documentation updated.
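The lock-scope change described above can be sketched as follows. This is a minimal illustration, not the code in cacheable.py: the class `ItemCache`, its `get_item` method, and the dictionary-based storage are hypothetical names chosen for the example. Only the idea (look up under the lock, produce outside it, re-check after re-acquiring it) comes from the PR description.

```python
import threading
from typing import Callable, Dict


class ItemCache:
    """Hypothetical cache used for illustration; not NVFlare's Cacheable."""

    def __init__(self, produce_item: Callable[[int], bytes]):
        self._produce_item = produce_item
        self._lock = threading.Lock()
        self._items: Dict[int, bytes] = {}

    def get_item(self, index: int) -> bytes:
        # Fast path: hold the lock only for the dictionary lookup.
        with self._lock:
            item = self._items.get(index)
        if item is not None:
            return item

        # Slow path: produce without the lock so other callers are not blocked.
        produced = self._produce_item(index)

        # Re-acquire the lock. If another thread cached this index in the
        # meantime, keep its copy and discard ours.
        with self._lock:
            return self._items.setdefault(index, produced)


cache = ItemCache(lambda i: f"chunk-{i}".encode())
first = cache.get_item(3)
second = cache.get_item(3)  # served from the cache, produce_item is not called again
```

The trade-off in this sketch is that two threads may occasionally produce the same item at the same time; the duplicate work is accepted in exchange for never blocking other receivers during a slow production.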

…VIDIA#4186)

## Summary

- avoid synchronous self-message path when trainer submits learn result to itself (aggr == self.me)
- process local submission via _process_learn_result with local peer context, while keeping remote path unchanged
- add unit coverage to verify local self-aggregation submission does not call broadcast_and_wait

## Problem

PR NVIDIA#4141 fixed self-message deadlock in _scatter, but result submission in do_learn_task still used broadcast_and_wait(targets=[aggr]). When aggr == self.me with tensor streaming enabled, this can deadlock in synchronous self-message processing.

## Test Plan

- added focused unit test in tests/unit_test/app_common/ccwf/test_swarm_self_message_deadlock.py
- validated syntax locally for modified files
- full pytest not run in this environment (pytest not available)
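The local-versus-remote dispatch described above might look roughly like the sketch below. It is an assumption-laden simplification, not the code in swarm_client_ctl.py: `TrainerSketch`, `submit_result`, the `messenger` object, and the keyword arguments passed to `broadcast_and_wait` are all hypothetical, and the FL context cloning and peer-context setup of the real change are reduced here to a plain `peer_name` argument. The mock-based check mirrors the intent of the added unit test (the messaging call is skipped for self-submission) without reproducing it.

```python
from unittest.mock import MagicMock


class TrainerSketch:
    """Hypothetical stand-in for a swarm trainer; not the NVFlare controller."""

    def __init__(self, me: str, messenger):
        self.me = me
        # Assumed to expose broadcast_and_wait(targets=..., request=...);
        # the real NVFlare signature is not shown in this PR page.
        self.messenger = messenger
        self.received = []

    def _process_learn_result(self, result: dict, peer_name: str) -> str:
        # Aggregator-side handler: record who submitted which result.
        self.received.append((peer_name, result))
        return "OK"

    def submit_result(self, aggr: str, result: dict) -> str:
        if aggr == self.me:
            # Local path: call the handler directly, naming ourselves as the peer,
            # so no synchronous message is sent to ourselves.
            return self._process_learn_result(result, peer_name=self.me)
        # Remote path: unchanged, still goes through the messaging layer.
        replies = self.messenger.broadcast_and_wait(targets=[aggr], request=result)
        return replies[aggr]


messenger = MagicMock()
messenger.broadcast_and_wait.return_value = {"site-2": "OK"}
trainer = TrainerSketch("site-1", messenger)

local_reply = trainer.submit_result("site-1", {"round": 1})
messenger.broadcast_and_wait.assert_not_called()  # self-submission bypasses messaging

remote_reply = trainer.submit_result("site-2", {"round": 1})
messenger.broadcast_and_wait.assert_called_once_with(targets=["site-2"], request={"round": 1})
```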

pcnudde · 2026-02-13T17:23:09Z

/build

greptile-apps · 2026-02-13T17:24:25Z

Greptile Overview

Greptile Summary

This PR cherry-picks two critical concurrency fixes from the 2.7 branch: reducing lock scope in Cacheable._get_item to prevent blocking concurrent receivers during item production, and avoiding deadlock when swarm trainer submits results to itself by bypassing synchronous broadcast_and_wait in favor of direct local submission.

Key Changes:

- cacheable.py: Moved the produce_item() call outside the lock so concurrent receivers aren't blocked. Handles the race condition where two receivers might produce the same item by checking, after re-acquiring the lock, whether another thread has already cached it.
- swarm_client_ctl.py: Added conditional logic to detect when the aggregator is self (aggr == self.me) and call _process_learn_result directly with a properly cloned context instead of using broadcast_and_wait, which would cause a synchronous self-message deadlock.
- Test coverage: Added TestSwarmResultSubmissionFix with a test that verifies broadcast_and_wait is not called when submitting to self, and that the FL context is set up properly with a peer context.

Confidence Score: 5/5

- This PR is safe to merge with minimal risk: both fixes address real concurrency issues with well-tested solutions.
- Both changes are cherry-picks from the 2.7 branch that address documented concurrency issues. The lock scope reduction in cacheable.py correctly handles the race condition where multiple threads might produce the same item. The deadlock fix in swarm_client_ctl.py properly clones the FL context and sets up peer context for local submission. The new unit test covers the self-submission scenario and validates that broadcast_and_wait is bypassed. Code is clean, well-commented, and follows existing patterns.
- No files require special attention.

Important Files Changed

| Filename | Overview |
|---|---|
| nvflare/app_common/ccwf/swarm_client_ctl.py | Added local submission path when aggregator is self to avoid deadlock from synchronous self-message through `broadcast_and_wait` |
| nvflare/fuel/f3/streaming/cacheable.py | Reduced lock scope by moving `produce_item` outside lock to prevent blocking concurrent receivers, handles concurrent production correctly |
| tests/unit_test/app_common/ccwf/test_swarm_self_message_deadlock.py | Added test coverage for local result submission fix, verifies `broadcast_and_wait` is bypassed and proper context setup |

Sequence Diagram

sequenceDiagram
    participant Trainer as SwarmClientController (Trainer)
    participant Engine as FL Engine
    participant Aggregator as SwarmClientController (Aggregator)

    Note over Trainer: Scenario: Self-submission (aggr == self.me)
    
    Trainer->>Engine: Request submission permission
    Engine-->>Trainer: Permission granted
    
    alt Before Fix: Using broadcast_and_wait
        Trainer->>Trainer: broadcast_and_wait([self])
        Note over Trainer: DEADLOCK: Synchronous self-message<br/>blocks waiting for own response
    end
    
    alt After Fix: Local submission
        Trainer->>Trainer: Detect aggr == self.me
        Trainer->>Trainer: Clone FL context
        Trainer->>Trainer: Set peer context
        Trainer->>Trainer: _process_learn_result(result, local_fl_ctx)
        Note over Trainer: Direct method call,<br/>no message passing
        Trainer-->>Trainer: Reply (OK)
    end

    Note over Trainer,Aggregator: Scenario: Remote submission (aggr != self.me)
    
    Trainer->>Engine: Request submission permission
    Engine-->>Trainer: Permission granted
    Trainer->>Aggregator: broadcast_and_wait([aggr])
    Aggregator->>Aggregator: _process_learn_result()
    Aggregator-->>Trainer: Reply (OK)

Last reviewed commit: 15d616e

greptile-apps

3 files reviewed, no comments


pcnudde · 2026-02-13T21:55:30Z

/build

greptile-apps

3 files reviewed, no comments


pcnudde added 2 commits February 13, 2026 09:21

pcnudde requested a review from YuanTingHsieh February 13, 2026 17:23

greptile-apps bot reviewed Feb 13, 2026


Merge branch 'main' into cherrypick/4174-4186-to-main

15d616e

greptile-apps bot reviewed Feb 13, 2026


pcnudde wants to merge 3 commits into NVIDIA:main from pcnudde:cherrypick/4174-4186-to-main
