Currently supports *mixed precision*; when turned on, the router and the sharded experts will be upcast.
- However, this is hard-coded to off at the moment.
- The FSDP mixed precision works independently of the MoE one (see the sketch below).
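
Below is a minimal sketch of how FSDP's own mixed precision could be configured alongside the MoE sharding; the model and dtype choices are placeholders, and it assumes the process group has already been initialized (e.g. via `torchrun`).

```python
import torch
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, MixedPrecision

# FSDP's mixed precision is configured on the wrapper itself and applies to
# whatever parameters FSDP manages, independently of any upcasting done for
# the MoE router / sharded experts.
mp_policy = MixedPrecision(
    param_dtype=torch.bfloat16,   # dtype used for forward/backward compute
    reduce_dtype=torch.float32,   # dtype used for gradient reduction
    buffer_dtype=torch.bfloat16,  # dtype used for buffers
)

model = nn.Linear(16, 16)  # stand-in for the MoE model after the experts are sharded

sharded_model = FSDP(
    model,
    mixed_precision=mp_policy,
    use_orig_params=True,  # commonly required when DTensor parameters are present
)
```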
Not all of the features of `megablocks` are incorporated; some restrictions of the current integration are listed below:
- the data-parallel `dp_mesh` is currently not passed to the `FSDP` constructor, so `FSDP` will always shard over the default process group (i.e., over the full world size); see the device-mesh sketch after this list.
- currently only supports loading *sharded* `safetensors` (non-GGUF) MoE checkpoints. This is a reasonable assumption, since MoE checkpoints typically exceed the size limit for a single checkpoint file; see the checkpoint-loading sketch after this list.
- only supports the *dropless sparse* MLPs in the `megablocks` package; other variants, such as non-dropless and grouped compute, are not currently integrated.
- `shard_moe` may not scale well to larger models, because the current implementation uses `torch.concat` to join all the expert weights together before passing them to `torch.distributed` to be sharded. This is done redundantly on every device, so it is inefficient; see the sharding sketch after this list.
- currently only supports `StateDictType.SHARDED_STATE_DICT`, because the implementation uses `DTensor`s, which have limited support for full state dicts. This is not a real limitation, since sharded state dicts are also the most efficient option; see the checkpoint-saving sketch after this list.
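
The bullets above reference a few sketches. First, a minimal device-mesh sketch of how a data-parallel mesh could be passed to `FSDP` so that it shards only over the `dp` dimension instead of the default process group. The mesh shape, dimension names, and model are assumptions, and the `device_mesh=` argument requires a recent PyTorch; the commented-out `process_group=` form is the older equivalent.

```python
import torch.nn as nn
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

# Assume torch.distributed is initialized (e.g. via torchrun) with world_size = 8,
# split into 4 data-parallel replicas x 2 expert-parallel shards.
mesh = init_device_mesh("cuda", (4, 2), mesh_dim_names=("dp", "ep"))
dp_mesh = mesh["dp"]

model = nn.Linear(16, 16)  # stand-in for the MoE model

# Sharding is restricted to the data-parallel sub-mesh instead of the
# default process group that spans the full world size.
sharded_model = FSDP(model, device_mesh=dp_mesh, use_orig_params=True)
# Older PyTorch equivalent:
# sharded_model = FSDP(model, process_group=dp_mesh.get_group(), use_orig_params=True)
```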
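
Next, a checkpoint-loading sketch of what reading a *sharded* `safetensors` checkpoint looks like. The checkpoint directory and the `"experts"` key pattern are hypothetical, but the `model.safetensors.index.json` / `weight_map` layout is the standard Hugging Face sharded format.

```python
import json
import os
from collections import defaultdict

from safetensors.torch import load_file

checkpoint_dir = "path/to/moe-checkpoint"  # hypothetical checkpoint directory

# The index file maps every weight name to the shard file that contains it.
with open(os.path.join(checkpoint_dir, "model.safetensors.index.json")) as f:
    weight_map = json.load(f)["weight_map"]

# Group the expert weights by shard so each shard file is read only once.
names_per_shard = defaultdict(list)
for name, shard_file in weight_map.items():
    if "experts" in name:  # hypothetical naming convention for expert weights
        names_per_shard[shard_file].append(name)

expert_state = {}
for shard_file, names in names_per_shard.items():
    shard = load_file(os.path.join(checkpoint_dir, shard_file))
    for name in names:
        expert_state[name] = shard[name]
```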
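
The sharding sketch below illustrates the scaling concern with `shard_moe` (shapes and dimension choices are illustrative only): every rank first materializes the full concatenated expert weight, and `distribute_tensor` then keeps only the local shard, so most of the allocated memory is thrown away.

```python
import torch
import torch.distributed as dist
from torch.distributed._tensor import Shard, distribute_tensor
from torch.distributed.device_mesh import init_device_mesh

# Assume torch.distributed is initialized (e.g. via torchrun).
device_mesh = init_device_mesh("cuda", (dist.get_world_size(),))

# Illustrative stand-in for the per-expert weights of a single MoE layer.
num_experts, hidden, ffn = 8, 16, 32
expert_weights = [torch.randn(hidden, ffn, device="cuda") for _ in range(num_experts)]

# Every rank builds the full concatenated tensor...
full_weight = torch.concat(expert_weights, dim=0)  # (num_experts * hidden, ffn)

# ...but only the local shard along dim 0 is kept after distribution, so the
# full allocation is redundant on every device.
sharded_weight = distribute_tensor(full_weight, device_mesh, placements=[Shard(0)])
```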
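
Finally, a checkpoint-saving sketch for `StateDictType.SHARDED_STATE_DICT`. It continues from the FSDP-wrapping sketches above (`sharded_model` is the FSDP-wrapped model), the output path is hypothetical, and `dcp.save` assumes a recent `torch.distributed.checkpoint`.

```python
import torch.distributed.checkpoint as dcp
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, StateDictType

# With DTensor-backed expert weights only the sharded form of the state dict
# is supported; each rank holds and saves just its own shards.
with FSDP.state_dict_type(sharded_model, StateDictType.SHARDED_STATE_DICT):
    state_dict = {"model": sharded_model.state_dict()}

# torch.distributed.checkpoint writes one file per rank rather than gathering
# the full state dict onto a single rank, which is also the most efficient path.
dcp.save(state_dict, checkpoint_id="path/to/save-dir")  # hypothetical output directory
```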