fix: restrict policy var save for distributed setup #491
base: master
Conversation
Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). View this failed invocation of the CLA check for more information. For the most up to date status, view the checks section at the bottom of the pull request.
Hi @jatinsharechat, thanks for your contribution! The CLA needs to be signed; please follow the guidance: https://github.com/tensorflow/recommenders-addons/pull/491/checks?check_run_id=39908185786. cc @jq @MoFHeka
I've signed the CLA and the rescan is green.
Trigger CI
The code format check is failing; you may run yapf.
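(A typical in-place run, assumed here rather than quoted from this thread, would be yapf -i -r tensorflow_recommenders_addons/; check the repo's lint config for the pinned yapf version.)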
def _save_de_model(self, filepath):

def _maybe_save_restrict_policy_params(de_var, proc_size=1, proc_rank=0):
use one _maybe_save_restrict_policy_params?
Since the code is pretty minimal and just calls de_var.save_to_file_system under the hood, I thought it might be okay to replicate the same function. Any suggestions on where to move the util function so it can be shared between the two? Should callbacks.py import _maybe_save_restrict_policy_params from tensorflow_recommenders_addons.dynamic_embedding.python.keras.models, or something else?
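One way to share it, as a minimal sketch: this assumes the signature shown in the diff above, that the restrict policy exposes its status table as policy.status, and that the status variable supports the same save_to_file_system call; the dirpath parameter is added here only so the sketch is self-contained, and none of this is confirmed by the thread.

    def _maybe_save_restrict_policy_params(de_var, dirpath, proc_size=1, proc_rank=0):
      """Save de_var's restrict-policy status variable, if one is attached."""
      policy = getattr(de_var, "restrict_policy", None)
      if policy is None:
        return  # no restrict policy; nothing extra to save
      # The policy status is itself a dynamic_embedding variable, so it can be
      # sharded and saved the same way as the embedding variable.
      policy.status.save_to_file_system(dirpath,
                                        proc_size=proc_size,
                                        proc_rank=proc_rank)

Defining this once (in models.py or a small shared util module) and importing it from callbacks.py would avoid the duplication flagged above.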
Description
In a distributed (Horovod) setup, the restrict policy variable is currently saved only on rank == 0. On workers with rank != 0, the restrict_var is not restored, leading to an unsynchronized state between the embedding variable and restrict_var.

This PR:
- Adds a _maybe_save_restrict_policy_params function that saves the restrict policy params only when de_var has a restrict_policy.
- Updates _traverse_emb_layers_and_save to call _maybe_save_restrict_policy_params for each distributed embedding variable (de_var), passing the process size and rank (hvd.rank()); see the sketch after this list.
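A minimal sketch of the save path described above (the traversal helper, dirpath plumbing, and exact save_to_file_system arguments are assumptions for illustration; hvd is horovod.tensorflow, and the helper is called with the signature shown in the review diff):

    import horovod.tensorflow as hvd

    def _traverse_emb_layers_and_save(model, dirpath):
      # _iter_de_variables is a hypothetical stand-in for the real traversal.
      for de_var in _iter_de_variables(model):
        # Each rank saves its own shard of the embedding variable ...
        de_var.save_to_file_system(dirpath,
                                   proc_size=hvd.size(),
                                   proc_rank=hvd.rank())
        # ... and, with this PR, its restrict-policy variable as well,
        # keeping the two in sync across ranks.
        _maybe_save_restrict_policy_params(de_var,
                                           proc_size=hvd.size(),
                                           proc_rank=hvd.rank())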
Type of change
Checklist:
How Has This Been Tested?
Added tensorflow_recommenders_addons/dynamic_embedding/python/kernel_tests/horovod_embedding_restrict_save_test.py, which trains a dummy model for some steps and then saves the model. Ran it under horovodrun for a CPU-based distributed setup.
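(A typical invocation, with the worker count chosen purely for illustration: horovodrun -np 2 python tensorflow_recommenders_addons/dynamic_embedding/python/kernel_tests/horovod_embedding_restrict_save_test.py.)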