Add auto version cleaning for aggr and global model version keeping #3642

ZiyueXu77 · 2025-08-27T15:30:46Z

Description

Currently we control the number of model/aggregator versions on the server and on clients only with "max_model_history". As model versions scale, this can be both:

- limiting, as it can be hard to set the history to a proper value
- wasteful, as some later versions can become obsolete before earlier versions due to device latency variations

Therefore, this PR updates the device selection dict to keep a record of model versions, so that it reflects the currently active model versions. Only active versions are kept; all others are considered obsolete and removed, since no device is working on them anymore. This saves resources while keeping the contribution of every update, without a hard cutoff. max_model_history can still be set; its default is now inf (apply all updates to the global model).
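
The cleanup idea described above can be sketched roughly as follows. This is an illustration only, not the code in this PR: the function names, the layout of `device_selection` (device ID mapped to the model version that device is working on), and the `versions` dict are all assumptions made for the example.

```python
from typing import Any, Dict, Set


def active_versions(device_selection: Dict[str, int]) -> Set[int]:
    """Return the model versions that at least one selected device is working on."""
    return set(device_selection.values())


def prune_inactive_versions(
    versions: Dict[int, Any], device_selection: Dict[str, int]
) -> Dict[int, Any]:
    """Keep only the versions still referenced by a selected device."""
    keep = active_versions(device_selection)
    return {v: state for v, state in versions.items() if v in keep}


# Hypothetical data: three stored versions, three devices in flight.
versions = {1: "aggr-1", 2: "aggr-2", 3: "aggr-3"}
device_selection = {"dev-a": 1, "dev-b": 3, "dev-c": 3}

remaining = prune_inactive_versions(versions, device_selection)
print(sorted(remaining))  # [1, 3]
```

In this example version 2 is dropped even though the older version 1 is kept, because a slow device is still working on version 1. A fixed history length could not express that.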

Types of changes

- [x] Non-breaking change (fix or new feature that would not break existing functionality).
- [ ] Breaking change (fix or new feature that would cause existing functionality to change).
- [ ] New tests added to cover the changes.
- [ ] Quick tests passed locally by running `./runtest.sh`.
- [ ] In-line docstrings updated.
- [ ] Documentation updated.

Copilot

Pull Request Overview

This PR implements an auto version cleaning feature for managing model and aggregator versions more efficiently. Instead of relying solely on hard-coded history limits, the system now tracks active model versions based on devices currently working on them and only keeps those versions in memory.

- Replaces fixed max_model_history limits with intelligent version tracking based on active device assignments
- Changes default values from finite integers to float("inf") for max_model_history and max_num_active_model_versions
- Adds device wait timeout functionality to handle insufficient device scenarios gracefully
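
As a rough illustration of why float("inf") works as a default limit, the sketch below validates a limit that may be either a positive int or infinity, then compares a count against it. The helper names `check_limit` and `exceeds_limit` are hypothetical and are not taken from this PR.

```python
import math
from typing import Union

Limit = Union[int, float]


def check_limit(name: str, value: Limit) -> Limit:
    """Accept a positive int or float("inf"); reject anything else."""
    message = f"{name} must be a positive int or float('inf'), got {value!r}"
    # bool is a subclass of int, so rule it out explicitly.
    if isinstance(value, bool):
        raise ValueError(message)
    if isinstance(value, int) and value > 0:
        return value
    if isinstance(value, float) and math.isinf(value) and value > 0:
        return value
    raise ValueError(message)


def exceeds_limit(count: int, limit: Limit) -> bool:
    """With limit = inf this is always False, so nothing is cut by count alone."""
    return count > limit


unlimited = check_limit("max_model_history", float("inf"))
print(exceeds_limit(1000, unlimited))  # False
print(exceeds_limit(4, check_limit("max_model_history", 3)))  # True
```

Because any finite count compares as less than infinity, code that checks `count > limit` needs no special case for the unlimited default.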

Reviewed Changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 4 comments.

File	Description
nvflare/edge/updaters/emd.py	Updates aggregator version cleanup logic to use active model versions from device selection
nvflare/edge/tools/edge_recipe.py	Changes ModelManagerConfig defaults to infinity and adds validation for float/int parameters
nvflare/edge/tools/edge_job.py	Updates configure_client method to accept infinity values for max_model_versions
nvflare/edge/executors/edge_model_executor.py	Updates EdgeModelExecutor constructor to support infinity values
nvflare/edge/assessors/model_update.py	Adds device wait timeout functionality and integrates active version tracking
nvflare/edge/assessors/model_manager.py	Adds abstract keep_model_versions method to ModelManager interface
nvflare/edge/assessors/device_manager.py	Adds used_devices tracking to base DeviceManager class
nvflare/edge/assessors/buff_model_manager.py	Implements keep_model_versions method and updates version cleanup logic
nvflare/edge/assessors/buff_device_manager.py	Updates device selection logic and improves device availability checks
examples/advanced/edge/jobs/pt_job.py	Removes hardcoded max_model_history values to use new defaults
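
The device wait timeout listed for model_update.py could take a shape like the sketch below. This is a generic polling pattern under stated assumptions, not the PR's implementation: `wait_for_devices`, `get_available`, and the parameter names are hypothetical, and the real assessor may be event-driven rather than a blocking loop.

```python
import time
from typing import Callable


def wait_for_devices(
    get_available: Callable[[], int],
    needed: int,
    timeout: float,
    poll_interval: float = 0.01,
) -> bool:
    """Poll until enough devices are available; return False if the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        available = get_available()
        if available >= needed:
            return True
        if time.monotonic() >= deadline:
            print(f"timed out: {available}/{needed} devices available")
            return False
        time.sleep(poll_interval)


# Hypothetical usage: the device count grows across successive polls.
counts = iter([0, 1, 3])
enough = wait_for_devices(lambda: next(counts), needed=3, timeout=1.0)
```

Returning a boolean instead of raising lets the caller decide whether an insufficient device pool should end the job or only delay the next round.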

nvflare/edge/assessors/buff_model_manager.py

nvflare/edge/assessors/model_update.py

nvflare/edge/tools/edge_job.py

nvflare/edge/assessors/buff_device_manager.py

yanchengnv

See my comments.

ZiyueXu77 · 2025-09-10T21:51:32Z

/build

ZiyueXu77 · 2025-09-10T21:59:32Z

/build

ZiyueXu77 added 7 commits August 25, 2025 11:52

add wait info print and timeout

e7af5da

handle both initial and during fl waiting

dc8e0da

polish the logic

5698d6d

bug correction

73af52f

updates for max model version

6e37c65

Merge branch 'main' into aggr_upd

5f67852

add auto version cleaning feature for aggr and global model version keeping

0710a55

ZiyueXu77 requested review from Copilot and yanchengnv August 27, 2025 15:30

ZiyueXu77 changed the title ~~Add auto version cleaning feature for aggr and global model version keeping~~ Add auto version cleaning for aggr and global model version keeping Aug 27, 2025

Copilot AI reviewed Aug 27, 2025

nvflare/edge/assessors/buff_model_manager.py (resolved)

nvflare/edge/assessors/model_update.py (outdated, resolved)

nvflare/edge/tools/edge_job.py (outdated, resolved)

nvflare/edge/assessors/buff_device_manager.py (resolved)

ZiyueXu77 added 3 commits August 27, 2025 11:40

bug fix

4c4778a

update docstring and add check

c64969c

format update

0d1b5d1

yanchengnv reviewed Sep 2, 2025

ZiyueXu77 added 3 commits September 10, 2025 12:21

address merge conflicts

152edd4

address merge conflicts

7dd2243

Merge branch 'main' into multi_updr

fd8d615

YuanTingHsieh reviewed Sep 10, 2025

ZiyueXu77 added 3 commits September 10, 2025 14:20

address comments

198bbd5

add back prune logic

1c23238

Merge branch 'main' into multi_updr

0a4d132

YuanTingHsieh reviewed Sep 10, 2025

ZiyueXu77 added 2 commits September 10, 2025 15:55

further updates on comments

94210e3

format updates

20b528a

YuanTingHsieh previously approved these changes Sep 10, 2025

update docstring to give device selection dict info

0d13b27

ZiyueXu77 dismissed YuanTingHsieh’s stale review via 0d13b27 September 10, 2025 21:48

Fix typo: change 'model_ids' to 'selection_ids'

b61f111

ZiyueXu77 enabled auto-merge (squash) September 10, 2025 21:51

YuanTingHsieh approved these changes Sep 10, 2025

Merge branch 'main' into multi_updr

7273a5b

ZiyueXu77 merged commit a87c7d3 into NVIDIA:main Sep 10, 2025
20 checks passed

ZiyueXu77 deleted the multi_updr branch September 11, 2025 13:20
