
Commit 1c853ea

Fix trainable tokens with fsdp (#2681)
When using FSDP with trainable tokens, there was an error when retrieving the `state_dict` of the `TrainableTokensWrapper`. The reason is that the `state_dict` passed to `get_peft_model_state_dict` comes from the already unwrapped model, so its keys don't have the FSDP-specific prefix, whereas the module names used for the lookup still carry it. Since the PEFT code did not remove that prefix before looking up keys in said `state_dict`, the lookup failed. Now the prefix is removed, making the lookup succeed. The same logic applies to `set_peft_model_state_dict`.

I could successfully start training with FSDP and trainable tokens locally by adjusting the examples/sft script to include trainable tokens. Checkpoints could be successfully created and resumed from. The only change I needed to make was to configure `use_orig_params=True` for FSDP.
1 parent c11a9df commit 1c853ea
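
The following is a minimal, self-contained sketch (plain Python, no PEFT or FSDP needed) of the mismatch this commit fixes: under FSDP, module names reported by `named_modules()` on the wrapped model carry the `_fsdp_wrapped_module.` prefix, while the keys of the already unwrapped `state_dict` do not, so a lookup only succeeds once the prefix is stripped. The module name and state_dict key below are invented for illustration.

# Invented example names; actual keys depend on the model and the PEFT wrapper.
wrapped_name = "_fsdp_wrapped_module.model.embed_tokens"                 # as reported by model.named_modules()
state_dict_keys = {"model.embed_tokens.trainable_tokens_delta.default"}  # keys of the unwrapped model's state_dict
suffix = "trainable_tokens_delta.default"

# Building the lookup key from the FSDP-prefixed name misses ...
assert f"{wrapped_name}.{suffix}" not in state_dict_keys

# ... while stripping the FSDP-specific prefix first, as the commit does, makes the lookup succeed.
clean_name = wrapped_name.removeprefix("_fsdp_wrapped_module.")  # str.removeprefix requires Python >= 3.9
assert f"{clean_name}.{suffix}" in state_dict_keys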

File tree

2 files changed: +11 −4 lines changed

src/peft/tuners/lora/config.py
src/peft/utils/save_and_load.py


src/peft/tuners/lora/config.py

Lines changed: 3 additions & 4 deletions
@@ -283,7 +283,7 @@ class LoraConfig(PeftConfig):
             Either you specify a list of indices which will then target the model's input embedding layer (or, if not
             found, `embed_tokens`). Alternatively, you can specify a dictionary where the key is the name of the
             embedding module and the values are the list of token indices, e.g. `{'embed_tokens': [0, 1, ...]}`. Note
-            that training with FSDP/DeepSpeed might not yet be fully supported with this option enabled.
+            that training with FSDP requires `use_orig_params=True` to avoid issues with non-uniform `requires_grad`.
         loftq_config (`Optional[LoftQConfig]`):
             The configuration of LoftQ. If this is not None, then LoftQ will be used to quantize the backbone weights
             and initialize Lora layers. Also pass `init_lora_weights='loftq'`. Note that you should not pass a
@@ -465,9 +465,8 @@ class LoraConfig(PeftConfig):
                 "in two ways. Either you specify a list of indices which will then target the model's input embedding "
                 "layer (or, if not found, `embed_tokens`). Alternatively, you can specify a dictionary where the key "
                 "is the name of the embedding module and the values are the list of token indices, e.g. "
-                "`{'embed_tokens': [0, 1, ...]}`. "
-                "Note that training with FSDP/DeepSpeed might not yet be fully supported with this option enabled. "
-                "Also note that models using weight-tying are currently not supported."
+                "`{'embed_tokens': [0, 1, ...]}`. Note that training with FSDP requires `use_orig_params=True` to "
+                "avoid issues with non-uniform `requires_grad`."
             )
         },
     )
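
As a usage note for the docstring change above, here is a rough sketch (not the commit's examples/sft script) of combining `trainable_token_indices` with FSDP and `use_orig_params=True`. It assumes a distributed process group has already been initialized (e.g. via `torchrun`); the model name and target modules are placeholders.

# Rough sketch; assumes torch.distributed has been initialized (e.g. launched via torchrun).
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")  # placeholder model

config = LoraConfig(
    r=8,
    target_modules=["q_proj", "v_proj"],  # placeholder target modules
    # Train a few selected token embeddings (e.g. newly added tokens) alongside the LoRA weights.
    trainable_token_indices={"embed_tokens": [0, 1, 2]},
)
peft_model = get_peft_model(base_model, config)

# Only some embedding rows are trainable, so requires_grad is non-uniform within FSDP's
# flattened parameters; use_orig_params=True lets FSDP handle that.
fsdp_model = FSDP(peft_model, use_orig_params=True)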

src/peft/utils/save_and_load.py

Lines changed: 8 additions & 0 deletions
@@ -219,6 +219,10 @@ def renamed_dora_weights(k):
     # ADDITIONAL TRAINING MODULES / MODULES_TO_SAVE
     for name, module in model.named_modules():
         if isinstance(module, AuxiliaryTrainingWrapper):
+            if name.startswith("_fsdp_wrapped_module."):
+                # If FSDP is used, the state_dict is from the unwrapped model, which will result in a key mismatch if we
+                # don't remove the FSDP-specific prefix
+                name = name.removeprefix("_fsdp_wrapped_module.")
             # Compute the module-relative state dict to make it easier for the adapter to fetch the appropriate
             # keys that the module thinks need to be saved. We cannot rely on `.state_dict()` internally of the
             # module since accelerators like DeepSpeed require special handling which is done for the model
@@ -381,6 +385,10 @@ def set_peft_model_state_dict(
             # `modules_to_save.{adapter_name}.` prefix. This prefix must be restored when loading the model from the
             # saved state dict which is why we fetch a load key map from the wrapper.
             key_map = module.adapter_state_dict_load_map(adapter_name)
+            if name.startswith("_fsdp_wrapped_module."):
+                # If FSDP is used, the state_dict is from the unwrapped model, which will result in a key mismatch if we
+                # don't remove the FSDP-specific prefix
+                name = name.removeprefix("_fsdp_wrapped_module.")
             for k in key_map:
                 lookup_key = f"{name}.{k}"
                 store_key = f"{name}.{key_map[k]}"
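
For context on the loading path touched above, this small illustration shows how the cleaned module name is combined with the wrapper's load key map; the map contents below are invented, while in PEFT they come from `module.adapter_state_dict_load_map(adapter_name)`.

# Invented key map; the real one is returned by the AuxiliaryTrainingWrapper.
name = "_fsdp_wrapped_module.model.embed_tokens".removeprefix("_fsdp_wrapped_module.")
key_map = {"trainable_tokens_delta": "token_adapter.trainable_tokens_delta.default"}

for k, mapped in key_map.items():
    lookup_key = f"{name}.{k}"       # key as it appears in the saved adapter state_dict
    store_key = f"{name}.{mapped}"   # key under which the weight lives in the wrapped module
    print(lookup_key, "->", store_key)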
