In this example why isn't pjit giving a greater impact? #9297
In your Colab you are not actually splitting the input batch (PartitionSpec = None). Partitioning the activations is generally more important than partitioning the weights. Usually you would split between batch and model parallelism in some way. As a side note: micro-benchmarks like this are tricky. The trivial partitioning of replicating all weights and using only data parallelism will usually be the fastest and simplest solution. Model parallelism is only necessary when you cannot fit the weights and/or the activations at batch_size=1 on a single device; in that case you accept the overhead of splitting weights and activations over multiple devices.
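As a minimal sketch of what "splitting the input batch" means, the snippet below shards only the batch dimension of the input across a 1-D device mesh while keeping the weights replicated. The model, shapes, and names (`mlp`, `params`, `x`) are hypothetical stand-ins for whatever the Colab uses, and depending on your JAX version the keywords may be `in_shardings`/`out_shardings` instead of `in_axis_resources`/`out_axis_resources`, with `Mesh`/`PartitionSpec` imported from `jax.experimental.maps`/`jax.experimental.pjit` in older releases.

```python
import numpy as np
import jax
import jax.numpy as jnp
from jax.sharding import Mesh, PartitionSpec as P
from jax.experimental.pjit import pjit

# Hypothetical two-layer MLP standing in for the one in the Colab.
def mlp(params, x):
    w1, w2 = params
    return jnp.tanh(x @ w1) @ w2

devices = np.array(jax.devices())            # e.g. 8 GPUs
mesh = Mesh(devices, axis_names=('data',))   # 1-D mesh: pure data parallelism

# Replicate the weights (None), but shard the batch dimension of the
# input and the output across the 'data' mesh axis.
pjit_mlp = pjit(
    mlp,
    in_axis_resources=(None, P('data', None)),
    out_axis_resources=P('data', None),
)

params = (jnp.ones((512, 1024)), jnp.ones((1024, 10)))
x = jnp.ones((4096, 512))                    # batch of 4096 examples

with mesh:
    y = pjit_mlp(params, x)                  # each device processes 4096 / n_devices rows
```

With `PartitionSpec = None` on the input, every device instead computes the full batch, so adding more devices cannot reduce the runtime.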
Pjit-ing a function ought to have a major impact on its runtime.
I created a Colab implementing a very simple MLP that gets evaluated under different PartitionSpecs.
However, the difference in runtime that I measure is marginal (tested on a Xeon machine with 8 Titan X GPUs). Could you help me understand whether there is an error in the implementation?
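One common reason for marginal measured differences in JAX benchmarks is timing the compilation or forgetting that dispatch is asynchronous. The helper below is a hedged sketch of how such a measurement could be set up; `pjit_mlp`, `params`, `x`, and `mesh` refer to the hypothetical sketch above, not the Colab's actual code.

```python
import time

def time_fn(fn, *args, n_iters=100):
    # Warm-up call: the first invocation triggers XLA compilation and
    # should not be included in the measurement.
    fn(*args).block_until_ready()
    start = time.perf_counter()
    for _ in range(n_iters):
        out = fn(*args)
    out.block_until_ready()   # flush JAX's asynchronous dispatch queue
    return (time.perf_counter() - start) / n_iters

# Using the hypothetical pjit_mlp / params / x / mesh from the sketch above.
with mesh:
    print("mean step time (s):", time_fn(pjit_mlp, params, x))
```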