Replies: 1 comment
-
OK, I've realized that I don't have to set up a mesh globally. I can specify a mesh with each sharding instead, which lets me use different meshes for different computations.
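For anyone landing here later, a minimal sketch of what this might look like. It assumes that both meshes are built from the same 16 devices in the same flat order, and that each `NamedSharding` carries its own mesh. The axis names (`"a"`, `"b"`, `"c"`, `"d"`) and the function `f` are hypothetical, and the `XLA_FLAGS` line only emulates 16 devices on a CPU host so the snippet can run without real accelerators.

```python
import os

# Assumption: emulate 16 devices on a CPU host for illustration only.
# On real hardware, drop this line and use jax.devices() instead.
os.environ["XLA_FLAGS"] = (
    os.environ.get("XLA_FLAGS", "") + " --xla_force_host_platform_device_count=16"
)

import numpy as np
import jax
import jax.numpy as jnp
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

devices = np.array(jax.devices("cpu")[:16])

# Two views of the same 16 devices (same flat device order, different shapes).
mesh_8x2 = Mesh(devices.reshape(8, 2), axis_names=("a", "b"))
mesh_4x4 = Mesh(devices.reshape(4, 4), axis_names=("c", "d"))

# Each sharding names its own mesh, so no global mesh context is needed.
shard_8x2 = NamedSharding(mesh_8x2, P("a", "b"))
shard_4x4 = NamedSharding(mesh_4x4, P("c", "d"))


@jax.jit
def f(x):
    # Hypothetical computation: first part laid out as if the mesh were 8x2 ...
    y = jax.lax.with_sharding_constraint(x * 2.0, shard_8x2)
    # ... second part laid out as if the mesh were 4x4.
    z = jax.lax.with_sharding_constraint(y + 1.0, shard_4x4)
    return z


x = jnp.arange(16 * 16, dtype=jnp.float32).reshape(16, 16)
out = f(x)
```

The constraints are hints to the compiler about the intermediate layouts. Whether the resharding between the two layouts is actually cheaper for a given model is something to check with a profiler.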
-
Imagine the following hypothetical scenario. I have 16 devices and a jitted computation. I would like to shard some parts of it as if my mesh were 8x2, and other parts as if it were 4x4.
Is it true that there is currently no way to achieve this because jit assumes a single fixed device mesh? If so, what can I do to work around this limitation?
I can give some rationale for why I'd need to shard the computations in this fashion if needed, but basically it comes down to minimizing communications in a MoE-like model.