I'm not sure if we really recommend it like that anywhere, but I think it's natural to write code like this:
```python
for p in net.parameters():
    p.weight_decay = 0.0001
```

I noticed this has several problems:
- What about auxiliary parameters? You probably don't want weight decay on them. The same goes for any integer or boolean parameters.
  - I think it would actually be ignored by RETURNN, so maybe it's not a problem? Or we could also just ignore it silently on the returnn-common side to allow for such code?
- Some variables maybe should not be decayed:
  - In `LayerNorm`, `WeightNorm` etc., the `scale` parameter, which is initialized at 1. Any decay should move it towards 1 and not towards 0. (Right?) In Lingvo, you actually find (here) that weight norm is reparameterized as `(1 + g)` instead of just `g`, to avoid this problem.
    - We could rewrite any such code to also use such a reparameterization. Which is maybe a good thing, but maybe not?
    - We could add some additional information, like `decay_center` or so, and the constraint would not be $w^2$ but $(w - c)^2$ instead, such that any configured weight decay would move the parameter towards the configured center. This would need some extra implementation on the RETURNN side as well. See the sketch after this list.
    - We could also add some flag `Parameter.ignore_weight_decay` on the returnn-common side, and if that is enabled (via the module, such as `LayerNorm`), it would ignore any writes to `weight_decay`.
  - I'm not sure if a decay on biases is good or not.
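To make the `LayerNorm`/`WeightNorm` point concrete: a minimal NumPy sketch (my own illustration, not existing RETURNN or returnn-common code) showing that plain L2 decay on `g` under the `(1 + g)` reparameterization gives exactly the same gradient as a decay with center $c = 1$ on the scale itself:

```python
import numpy as np

decay = 0.0001
scale = np.array([1.2, 0.8, 1.0])  # e.g. a LayerNorm scale, initialized around 1

# Plain L2 decay, gradient of 0.5 * decay * scale^2: pulls scale towards 0.
grad_plain = decay * scale

# Decay with a center c, gradient of 0.5 * decay * (scale - c)^2:
c = 1.0
grad_centered = decay * (scale - c)

# Reparameterization scale = 1 + g, with plain L2 decay on g:
g = scale - 1.0
grad_g = decay * g  # identical to grad_centered, so the decay moves scale towards 1

assert np.allclose(grad_centered, grad_g)
```

So the `decay_center` option and the reparameterization would be two implementations of the same constraint; the reparameterization just avoids extra changes on the RETURNN side.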
Many of the arguments are actually about allowing for the simple code above. Or maybe we don't want to allow such simple code? But what exactly would the canonical example of weight decay applied to some generic network look like then?
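For that last question, one possible shape of the canonical example, as a hedged sketch: keep the simple loop, but make the filtering from the points above explicit. The `dtype` check and the `ignore_weight_decay` attribute here are assumptions taken from this issue, not existing returnn-common API:

```python
# Hypothetical sketch, not existing returnn-common API.
for p in net.parameters():
    if not p.dtype.startswith("float"):
        continue  # skip integer/boolean auxiliary parameters
    if getattr(p, "ignore_weight_decay", False):
        continue  # e.g. set by LayerNorm for its scale parameter
    p.weight_decay = 0.0001
```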