
Conversation

jatinsharechat

Description

  • Issue
    • The current saving logic only saves restrict policy parameters for rank == 0.
    • For rank != 0, the restrict_var is not restored, leading to an unsynchronized state between the embedding variable and restrict_var.
    • As a result, the restrict policy does not work correctly in distributed training.
  • Fixes
    • Added _maybe_save_restrict_policy_params function:
      • Checks if de_var has a restrict_policy.
      • Saves the associated restrict policy variable to the file system for each rank.
    • Updated _traverse_emb_layers_and_save to:
      • Call _maybe_save_restrict_policy_params for each distributed embedding variable (de_var).
      • Ensure restrict policy parameters are saved and restored for all Horovod ranks (hvd.rank()).

Type of change

  • Bug fix
  • New Tutorial
  • Updated or additional documentation
  • Additional Testing
  • New Feature

Checklist:

  • I've properly formatted my code according to the guidelines
    • By running yapf
    • By running clang-format
  • This PR addresses an already submitted issue for TensorFlow Recommenders-Addons
  • I have made corresponding changes to the documentation
  • I have added tests that prove my fix is effective or that my feature works

How Has This Been Tested?

  • Wrote a test case, tensorflow_recommenders_addons/dynamic_embedding/python/kernel_tests/horovod_embedding_restrict_save_test.py, that trains a dummy model for a few steps and then saves it.
  • The model's embedding table is created with a restrict policy.
  • The test case checks that the restrict policy is saved for each rank.
  • To simulate a distributed environment, the test case was run with horovodrun in a CPU-based distributed setup:
    $ horovodrun -np 2 python tensorflow_recommenders_addons/dynamic_embedding/python/kernel_tests/horovod_distributed_restrict_policy_save.py
    
  • Output of saved-model from above test-case:
    $ ls /tmp/hvd_distributed_restrict_policy_save_timestamp210/variables/TFRADynamicEmbedding/
    all2all_emb-parameter_DynamicEmbedding_all2all_emb-shadow_m_mht_1of1_rank0_size2-keys    all2all_emb-parameter_mht_1of1_rank0_size2-keys
    all2all_emb-parameter_DynamicEmbedding_all2all_emb-shadow_m_mht_1of1_rank0_size2-values  all2all_emb-parameter_mht_1of1_rank0_size2-values
    all2all_emb-parameter_DynamicEmbedding_all2all_emb-shadow_m_mht_1of1_rank1_size2-keys    all2all_emb-parameter_mht_1of1_rank1_size2-keys
    all2all_emb-parameter_DynamicEmbedding_all2all_emb-shadow_m_mht_1of1_rank1_size2-values  all2all_emb-parameter_mht_1of1_rank1_size2-values
    all2all_emb-parameter_DynamicEmbedding_all2all_emb-shadow_v_mht_1of1_rank0_size2-keys    all2all_emb-parameter_timestamp_mht_1of1_rank0_size2-keys
    all2all_emb-parameter_DynamicEmbedding_all2all_emb-shadow_v_mht_1of1_rank0_size2-values  all2all_emb-parameter_timestamp_mht_1of1_rank0_size2-values
    all2all_emb-parameter_DynamicEmbedding_all2all_emb-shadow_v_mht_1of1_rank1_size2-keys    all2all_emb-parameter_timestamp_mht_1of1_rank1_size2-keys
    all2all_emb-parameter_DynamicEmbedding_all2all_emb-shadow_v_mht_1of1_rank1_size2-values  all2all_emb-parameter_timestamp_mht_1of1_rank1_size2-values
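
The per-rank check the test case performs can be sketched with the stdlib alone: given the save directory listed above, assert that the restrict policy's status files (the `timestamp` keys/values pair in this listing) exist for every rank. The file-name pattern below is inferred from the `ls` output and is illustrative only.

```python
# Minimal sketch of the per-rank assertion: every rank must have written
# a -keys and a -values file for the restrict policy's timestamp table.
# The glob pattern is inferred from the saved-model listing above.
import glob
import os


def restrict_policy_saved_for_all_ranks(save_dir, world_size):
  """Return True if every rank wrote its restrict-policy status files."""
  for rank in range(world_size):
    pattern = os.path.join(
        save_dir, f"*timestamp_mht_1of1_rank{rank}_size{world_size}-*")
    # Expect at least the -keys and -values files for this rank.
    if len(glob.glob(pattern)) < 2:
      return False
  return True
```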
    
    

@jatinsharechat jatinsharechat requested a review from rhdong as a code owner April 3, 2025 09:58

google-cla bot commented Apr 3, 2025

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

@jatinsharechat
Author

Hey @rhdong, @jq, @MoFHeka, if I can get some help with review of this would be great! Thanks

@rhdong rhdong requested review from jq and MoFHeka and removed request for rhdong April 10, 2025 18:39
@rhdong
Member

rhdong commented Apr 10, 2025

Hey @rhdong, @jq, @MoFHeka, if I can get some help with review of this would be great! Thanks

Hi @jatinsharechat , thanks for your contribution! The CLA needed to be signed; please follow the guidance: https://github.com/tensorflow/recommenders-addons/pull/491/checks?check_run_id=39908185786. cc @jq @MoFHeka

@jatinsharechat
Author

Hey @rhdong, @jq, @MoFHeka, if I can get some help with review of this would be great! Thanks

Hi @jatinsharechat , thanks for your contribution! The CLA needed to be signed; please follow the guidance: https://github.com/tensorflow/recommenders-addons/pull/491/checks?check_run_id=39908185786. cc @jq @MoFHeka

I've signed off the CLA and the rescan is green.

rhdong
rhdong previously approved these changes Apr 10, 2025
Member

@rhdong rhdong left a comment


Trigger CI

@jatinsharechat jatinsharechat requested a review from rhdong April 11, 2025 05:19
@jq
Collaborator

jq commented Apr 15, 2025

The code format check is failing; you may run yapf.


def _save_de_model(self, filepath):

def _maybe_save_restrict_policy_params(de_var, proc_size=1, proc_rank=0):
Collaborator


use one _maybe_save_restrict_policy_params?

Author


Since the code is pretty minimal and just calls de_var.save_to_file_system under the hood, I thought it might be okay to duplicate the function.

Any suggestions on where to move the util function so it can be shared between the two? Just import _maybe_save_restrict_policy_params from tensorflow_recommenders_addons.dynamic_embedding.python.keras.models in callbacks.py, or something else?

@jatinsharechat jatinsharechat requested a review from jq April 21, 2025 05:15
@jatinsharechat
Copy link
Author

Gentle ping on this @jq @rhdong
