Commit 794110e
gshard builder and layers for MoE lifelong pretraining
Now supports:
* "e_dim_old" argument in UniTransformer to expand new experts and gating dimensions from old ones.
* "epad_idx_old" argument in UniTransformer to mark extra experts and gating dimensions as inactive, for the case where the requested number of experts is not a power of two (2^n), e.g. requesting 28 experts leaves the 4 remaining experts muted.
* Merge/split of experts and gating dimensions, so that checkpoints can be loaded into a new MoE with expanded experts and gating.
* KL_div loss for MoE Lifelong Learning
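
The sketch below illustrates the ideas listed above in plain Python. It is not the code in this commit: the helper names (expand_gating, gate_probs, merge_probs, kl_div), the copy-based initialization of new gating columns, and the choice to compare old and new gating distributions by summing each new expert's probability back onto the old expert it was copied from are all assumptions made for illustration.

```python
import math
import random


def expand_gating(old_w, e_dim_new, rng, noise=0.0):
    """Widen a [model_dim][e_dim_old] gating matrix to e_dim_new columns.

    Assumption: each new column copies an old column (cycling through
    them), plus optional uniform noise so the copies can diverge later.
    """
    e_dim_old = len(old_w[0])
    extra = e_dim_new - e_dim_old
    return [
        row + [row[j % e_dim_old] + noise * rng.uniform(-1.0, 1.0)
               for j in range(extra)]
        for row in old_w
    ]


def gate_probs(x, w, muted=()):
    """Softmax over experts; muted (padding) experts get probability 0."""
    muted = set(muted)
    logits = [sum(x[i] * w[i][e] for i in range(len(x)))
              for e in range(len(w[0]))]
    m = max(l for e, l in enumerate(logits) if e not in muted)
    exps = [0.0 if e in muted else math.exp(l - m)
            for e, l in enumerate(logits)]
    z = sum(exps)
    return [v / z for v in exps]


def merge_probs(p_new, e_dim_old):
    """Sum each new expert's probability onto the old expert it came from."""
    merged = [0.0] * e_dim_old
    for k, p in enumerate(p_new):
        merged[k % e_dim_old] += p
    return merged


def kl_div(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log(pi / max(qi, eps))
               for pi, qi in zip(p, q) if pi > 0.0)


# Hypothetical setup: 4 old experts, 7 requested, padded to 8 (index 7 muted).
rng = random.Random(0)
model_dim, e_dim_old, e_dim_new = 3, 4, 8
old_w = [[rng.uniform(-1.0, 1.0) for _ in range(e_dim_old)]
         for _ in range(model_dim)]
new_w = expand_gating(old_w, e_dim_new, rng)

x = [0.5, -1.0, 2.0]
p_old = gate_probs(x, old_w)
p_new = gate_probs(x, new_w, muted={7})
loss = kl_div(p_old, merge_probs(p_new, e_dim_old))
```

Here the muted index plays the role that "epad_idx_old" is described as playing above, and the KL term penalizes the expanded gate for drifting away from the old gate's routing distribution. The actual loss in the commit may be defined over different quantities.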
PiperOrigin-RevId: 4919773101
Parent: 9445632
2 files changed, +750 -14 lines changed