Cherry-pick #4174 and #4186 from 2.7 to main#4193

Open
pcnudde wants to merge 3 commits into NVIDIA:main from pcnudde:cherrypick/4174-4186-to-main

Conversation

@pcnudde
Collaborator

@pcnudde pcnudde commented Feb 13, 2026

Summary

### Description

Do not hold the lock around produce_item. Holding it is not needed, and this operation can be slow; we do not want to hold up everything while it runs.
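The pattern behind this change can be sketched as follows. This is a minimal illustration of the reduced-lock idiom, not the actual `Cacheable` implementation; the class and method names here are placeholders:

```python
import threading


class Cacheable:
    """Minimal sketch of the reduced-lock pattern. produce_item stands
    in for the real (potentially slow) production call."""

    def __init__(self):
        self._lock = threading.Lock()
        self._cache = {}

    def produce_item(self, key):
        # Stand-in for a slow operation (e.g. materializing a streamed item).
        return f"item-{key}"

    def get_item(self, key):
        with self._lock:
            if key in self._cache:
                return self._cache[key]
        # Produce OUTSIDE the lock so concurrent receivers are not blocked
        # behind a slow production.
        item = self.produce_item(key)
        with self._lock:
            # Another thread may have produced the same item meanwhile;
            # keep the first cached copy so all callers share one object.
            if key not in self._cache:
                self._cache[key] = item
            return self._cache[key]
```

The double-check after re-acquiring the lock is what makes the race between two concurrent producers benign: at worst the item is produced twice, but only one copy is ever cached and returned.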


### Types of changes
<!--- Put an `x` in all the boxes that apply, and remove the not
applicable items -->
- [x] Non-breaking change (fix or new feature that would not break
existing functionality).
- [ ] Breaking change (fix or new feature that would cause existing
functionality to change).
- [ ] New tests added to cover the changes.
- [ ] Quick tests passed locally by running `./runtest.sh`.
- [ ] In-line docstrings updated.
- [ ] Documentation updated.
…VIDIA#4186)

## Summary
- avoid synchronous self-message path when trainer submits learn result
to itself (aggr == self.me)
- process local submission via _process_learn_result with local peer
context, while keeping remote path unchanged
- add unit coverage to verify local self-aggregation submission does not
call broadcast_and_wait

## Problem
PR NVIDIA#4141 fixed self-message deadlock in _scatter, but result submission
in do_learn_task still used broadcast_and_wait(targets=[aggr]). When
aggr == self.me with tensor streaming enabled, this can deadlock in
synchronous self-message processing.
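The fixed dispatch can be sketched with a toy controller. The class below is an illustrative stand-in for SwarmClientController, modeling only the submission branch; it is not the actual NVFlare API:

```python
class Controller:
    """Toy stand-in for SwarmClientController, modeling only the
    result-submission dispatch between local and remote paths."""

    def __init__(self, me):
        self.me = me
        self.calls = []  # records which path was taken, for illustration

    def _process_learn_result(self, result, fl_ctx):
        # Local path: a plain method call, so no message passing and no
        # blocking wait for our own reply.
        self.calls.append(("local", result))
        return "OK"

    def broadcast_and_wait(self, targets, result):
        # Remote path: in the real controller this sends a message and
        # blocks for the reply, which can deadlock if the target is self.
        self.calls.append(("remote", result))
        return "OK"

    def submit_learn_result(self, result, aggr, fl_ctx=None):
        if aggr == self.me:
            # Self-submission: handle locally instead of messaging self.
            return self._process_learn_result(result, fl_ctx)
        return self.broadcast_and_wait(targets=[aggr], result=result)
```

In the real fix the local branch also clones the FL context and sets up a peer context before calling _process_learn_result, so the handler sees the same context shape it would on the remote path.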

## Test Plan
- added focused unit test in
tests/unit_test/app_common/ccwf/test_swarm_self_message_deadlock.py
- validated syntax locally for modified files
- full pytest not run in this environment (pytest not available)
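The shape of such a test can be sketched with unittest.mock. The names below are illustrative; the real test exercises SwarmClientController directly:

```python
from unittest.mock import MagicMock


def test_local_submission_skips_broadcast():
    # Mocked controller standing in for SwarmClientController; only the
    # dispatch decision is modeled here.
    ctl = MagicMock()
    ctl.me = "site-1"
    ctl._process_learn_result.return_value = "OK"

    def submit(result, aggr):
        # Mirror of the fixed dispatch: direct local call when aggr == self.me.
        if aggr == ctl.me:
            return ctl._process_learn_result(result, None)
        return ctl.broadcast_and_wait(targets=[aggr])

    assert submit("weights", "site-1") == "OK"
    ctl._process_learn_result.assert_called_once()
    # The key assertion: self-submission must never message itself.
    ctl.broadcast_and_wait.assert_not_called()
```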
@pcnudde
Collaborator Author

pcnudde commented Feb 13, 2026

/build

@greptile-apps
Contributor

greptile-apps bot commented Feb 13, 2026

Greptile Overview

Greptile Summary

This PR cherry-picks two critical concurrency fixes from the 2.7 branch: reducing lock scope in Cacheable._get_item to prevent blocking concurrent receivers during item production, and avoiding deadlock when swarm trainer submits results to itself by bypassing synchronous broadcast_and_wait in favor of direct local submission.

Key Changes:

  • cacheable.py: Moved produce_item() call outside the lock so concurrent receivers aren't blocked. Handles race condition where two receivers might produce the same item by checking if another thread already cached it after re-acquiring the lock.
  • swarm_client_ctl.py: Added conditional logic to detect when aggregator is self (aggr == self.me) and call _process_learn_result directly with properly cloned context instead of using broadcast_and_wait, which would cause synchronous self-message deadlock.
  • Test coverage: Added TestSwarmResultSubmissionFix with comprehensive test that verifies broadcast_and_wait is not called when submitting to self, and proper FL context setup with peer context.

Confidence Score: 5/5

  • This PR is safe to merge with minimal risk - both fixes address real concurrency issues with well-tested solutions
  • Both changes are cherry-picks from the 2.7 branch that address documented concurrency issues. The lock scope reduction in cacheable.py correctly handles the race condition where multiple threads might produce the same item. The deadlock fix in swarm_client_ctl.py properly clones the FL context and sets up peer context for local submission. The new unit test provides excellent coverage of the self-submission scenario and validates that broadcast_and_wait is bypassed. Code is clean, well-commented, and follows existing patterns.
  • No files require special attention

Important Files Changed

Filename Overview
nvflare/app_common/ccwf/swarm_client_ctl.py Added local submission path when aggregator is self to avoid deadlock from synchronous self-message through broadcast_and_wait
nvflare/fuel/f3/streaming/cacheable.py Reduced lock scope by moving produce_item outside lock to prevent blocking concurrent receivers, handles concurrent production correctly
tests/unit_test/app_common/ccwf/test_swarm_self_message_deadlock.py Added comprehensive test coverage for local result submission fix, verifies broadcast_and_wait is bypassed and proper context setup

Sequence Diagram

sequenceDiagram
    participant Trainer as SwarmClientController (Trainer)
    participant Engine as FL Engine
    participant Aggregator as SwarmClientController (Aggregator)

    Note over Trainer: Scenario: Self-submission (aggr == self.me)
    
    Trainer->>Engine: Request submission permission
    Engine-->>Trainer: Permission granted
    
    alt Before Fix: Using broadcast_and_wait
        Trainer->>Trainer: broadcast_and_wait([self])
        Note over Trainer: DEADLOCK: Synchronous self-message<br/>blocks waiting for own response
    end
    
    alt After Fix: Local submission
        Trainer->>Trainer: Detect aggr == self.me
        Trainer->>Trainer: Clone FL context
        Trainer->>Trainer: Set peer context
        Trainer->>Trainer: _process_learn_result(result, local_fl_ctx)
        Note over Trainer: Direct method call,<br/>no message passing
        Trainer-->>Trainer: Reply (OK)
    end

    Note over Trainer,Aggregator: Scenario: Remote submission (aggr != self.me)
    
    Trainer->>Engine: Request submission permission
    Engine-->>Trainer: Permission granted
    Trainer->>Aggregator: broadcast_and_wait([aggr])
    Aggregator->>Aggregator: _process_learn_result()
    Aggregator-->>Trainer: Reply (OK)

Last reviewed commit: 15d616e

Contributor

@greptile-apps greptile-apps bot left a comment

3 files reviewed, no comments

Edit Code Review Agent Settings | Greptile

@pcnudde
Copy link
Collaborator Author

pcnudde commented Feb 13, 2026

/build

Contributor

@greptile-apps greptile-apps bot left a comment


3 files reviewed, no comments

Edit Code Review Agent Settings | Greptile

