Add auto version cleaning for aggr and global model version keeping #3642
Pull Request Overview
This PR implements an auto version cleaning feature for managing model and aggregator versions more efficiently. Instead of relying solely on hard-coded history limits, the system now tracks active model versions based on devices currently working on them and only keeps those versions in memory.
- Replaces fixed `max_model_history` limits with intelligent version tracking based on active device assignments
- Changes default values from finite integers to `float("inf")` for `max_model_history` and `max_num_active_model_versions`
- Adds device wait timeout functionality to handle insufficient device scenarios gracefully
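The active-version tracking described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the shape of the device selection dict (device ID mapped to assigned model version) and the function signatures are assumptions, and only the name `keep_model_versions` comes from the changed files.

```python
from typing import Dict, Set


def active_versions(device_selection: Dict[str, int]) -> Set[int]:
    """Return the model versions that at least one selected device still works on.

    Assumption: device_selection maps a device ID to its assigned model version.
    """
    return set(device_selection.values())


def keep_model_versions(models: Dict[int, object], keep: Set[int]) -> Dict[int, object]:
    """Return only the stored versions listed in `keep`; all others are obsolete."""
    return {version: model for version, model in models.items() if version in keep}


# Hypothetical state: three devices, five stored model versions.
device_selection = {"dev-a": 3, "dev-b": 5, "dev-c": 5}
models = {version: f"weights-v{version}" for version in range(1, 6)}

# dev-a is slow, so version 3 is still active while the later version 4 is not.
kept = keep_model_versions(models, active_versions(device_selection))
print(sorted(kept))  # [3, 5]
```

Note that version 4 is dropped before version 3. A fixed history window cannot express this, because it always evicts the oldest version first.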
Reviewed Changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| nvflare/edge/updaters/emd.py | Updates aggregator version cleanup logic to use active model versions from device selection |
| nvflare/edge/tools/edge_recipe.py | Changes ModelManagerConfig defaults to infinity and adds validation for float/int parameters |
| nvflare/edge/tools/edge_job.py | Updates configure_client method to accept infinity values for max_model_versions |
| nvflare/edge/executors/edge_model_executor.py | Updates EdgeModelExecutor constructor to support infinity values |
| nvflare/edge/assessors/model_update.py | Adds device wait timeout functionality and integrates active version tracking |
| nvflare/edge/assessors/model_manager.py | Adds abstract keep_model_versions method to ModelManager interface |
| nvflare/edge/assessors/device_manager.py | Adds used_devices tracking to base DeviceManager class |
| nvflare/edge/assessors/buff_model_manager.py | Implements keep_model_versions method and updates version cleanup logic |
| nvflare/edge/assessors/buff_device_manager.py | Updates device selection logic and improves device availability checks |
| examples/advanced/edge/jobs/pt_job.py | Removes hardcoded max_model_history values to use new defaults |
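Since `max_model_history` and `max_num_active_model_versions` now accept either a finite integer or `float("inf")`, the float/int validation mentioned for `edge_recipe.py` might look something like the sketch below. The helper name `validate_history_limit` and its exact rules are hypothetical; the actual checks in the PR may differ.

```python
import math
from typing import Union


def validate_history_limit(name: str, value: Union[int, float]) -> Union[int, float]:
    """Accept a positive int or float('inf'); reject everything else.

    Hypothetical helper for illustration only.
    """
    # bool is a subclass of int, so it has to be rejected explicitly.
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        raise TypeError(f"{name} must be an int or float('inf'), got {type(value).__name__}")
    # The only float allowed is infinity; this also rejects NaN and values like 2.5.
    if isinstance(value, float) and not math.isinf(value):
        raise ValueError(f"{name} must be an int or float('inf'), got {value}")
    # Rejects zero, negative ints, and float('-inf').
    if value <= 0:
        raise ValueError(f"{name} must be positive, got {value}")
    return value


validate_history_limit("max_model_history", float("inf"))  # unlimited history
validate_history_limit("max_num_active_model_versions", 3)  # finite limit
```

Rejecting finite floats keeps the parameter usable as a count, while `float("inf")` still compares correctly against any integer version count.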
yanchengnv
left a comment
See my comments.
/build
/build
Description
Currently, the number of model/aggregator versions kept on the server and on clients is controlled only by `max_model_history`. As the number of model versions scales, this can be both:
- limiting, as it can be hard to set the history to a proper value
- wasteful of resources, as some later versions can become obsolete before earlier versions due to device latency variations

Therefore, this PR updates the device selection dict to keep a record of model versions, so that it reflects the currently active model versions. Only active versions are kept; all others are considered obsolete and removed, since no device is working on them anymore. This saves resources while keeping the contribution of every update, without a hard-cut removal. `max_model_history` can still be set, with the default updated to inf (apply all updates to the global model).
Types of changes
- [x] Non-breaking change (fix or new feature that would not break existing functionality).
- [ ] Breaking change (fix or new feature that would cause existing functionality to change).
- [ ] New tests added to cover the changes.
- [ ] Quick tests passed locally by running `./runtest.sh`.
- [ ] In-line docstrings updated.
- [ ] Documentation updated.