Skip to content

Conversation

@ZiyueXu77
Copy link
Collaborator

@ZiyueXu77 ZiyueXu77 commented Aug 27, 2025

Description

Currently we control the number of model/aggregator versions on server and on clients only with "max_model_history", as model versions scale, this can be both:

  • limiting as it can be hard to set the history to a proper value
  • wasting resource as some later versions can become obsolete before the earlier versions due to device latency variations

Therefore, updated the behavior of device selection dict to keep record of model versions to reflect the current active model versions, only those active will be kept, and others will be considered obsolete and removed since no device is working on them anymore. This saves resource while keep all updates' contribution without making hard-cut removal. The max_model_history can still be set, with the default updated to inf (apply all updates to global model).

Types of changes

  • Non-breaking change (fix or new feature that would not break existing functionality).
  • Breaking change (fix or new feature that would cause existing functionality to change).
  • New tests added to cover the changes.
  • Quick tests passed locally by running ./runtest.sh.
  • In-line docstrings updated.
  • Documentation updated.

@ZiyueXu77 ZiyueXu77 changed the title Add auto version cleaning feature for aggr and global model version keeping Add auto version cleaning for aggr and global model version keeping Aug 27, 2025
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR implements an auto version cleaning feature for managing model and aggregator versions more efficiently. Instead of relying solely on hard-coded history limits, the system now tracks active model versions based on devices currently working on them and only keeps those versions in memory.

  • Replaces fixed max_model_history limits with intelligent version tracking based on active device assignments
  • Changes default values from finite integers to float("inf") for max_model_history and max_num_active_model_versions
  • Adds device wait timeout functionality to handle insufficient device scenarios gracefully

Reviewed Changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
nvflare/edge/updaters/emd.py Updates aggregator version cleanup logic to use active model versions from device selection
nvflare/edge/tools/edge_recipe.py Changes ModelManagerConfig defaults to infinity and adds validation for float/int parameters
nvflare/edge/tools/edge_job.py Updates configure_client method to accept infinity values for max_model_versions
nvflare/edge/executors/edge_model_executor.py Updates EdgeModelExecutor constructor to support infinity values
nvflare/edge/assessors/model_update.py Adds device wait timeout functionality and integrates active version tracking
nvflare/edge/assessors/model_manager.py Adds abstract keep_model_versions method to ModelManager interface
nvflare/edge/assessors/device_manager.py Adds used_devices tracking to base DeviceManager class
nvflare/edge/assessors/buff_model_manager.py Implements keep_model_versions method and updates version cleanup logic
nvflare/edge/assessors/buff_device_manager.py Updates device selection logic and improves device availability checks
examples/advanced/edge/jobs/pt_job.py Removes hardcoded max_model_history values to use new defaults

Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.

Copy link
Collaborator

@yanchengnv yanchengnv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See my comments.

YuanTingHsieh
YuanTingHsieh previously approved these changes Sep 10, 2025
@ZiyueXu77
Copy link
Collaborator Author

/build

@ZiyueXu77 ZiyueXu77 enabled auto-merge (squash) September 10, 2025 21:51
@ZiyueXu77
Copy link
Collaborator Author

/build

@ZiyueXu77 ZiyueXu77 merged commit a87c7d3 into NVIDIA:main Sep 10, 2025
20 checks passed
@ZiyueXu77 ZiyueXu77 deleted the multi_updr branch September 11, 2025 13:20
chesterxgchen pushed a commit to chesterxgchen/NVFlare that referenced this pull request Sep 30, 2025
…VIDIA#3642)

### Description

Currently we control the number of model/aggregator versions on server
and on clients only with "max_model_history", as model versions scale,
this can be both:
- limiting as it can be hard to set the history to a proper value
- wasting resource as some later versions can become obsolete before the
earlier versions due to device latency variations

Therefore, updated the behavior of device selection dict to keep record
of model versions to reflect the current active model versions, only
those active will be kept, and others will be considered obsolete and
removed since no device is working on them anymore. This saves resource
while keep all updates' contribution without making hard-cut removal.
The max_model_history can still be set, with the default updated to inf
(apply all updates to global model).

### Types of changes
<!--- Put an `x` in all the boxes that apply, and remove the not
applicable items -->
- [x] Non-breaking change (fix or new feature that would not break
existing functionality).
- [ ] Breaking change (fix or new feature that would cause existing
functionality to change).
- [ ] New tests added to cover the changes.
- [ ] Quick tests passed locally by running `./runtest.sh`.
- [ ] In-line docstrings updated.
- [ ] Documentation updated.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants