vmap + pjit + with_sharding_constraint? #11798

dlwh · 2022-08-08T18:46:50Z

tl;dr: Is there any way to use with_sharding_constraint inside of a vmapped function to force partitioning of the batch axis?

I'm tracking down some resource use issues where I think that the partitioner is not sharding dropout masks across the batch axis. My code broadly looks like:

def transformer(...):
    # some stuff
    dropout_mask = random.bernoulli(key, 1 - dropout_prob, x.shape)

transformer = vmap(transformer, ...) # add batch dim

transformer = pjit(transformer, in_axis_resources=...)

The reason I suspect this is what's going on is that I get OOM errors saying it's trying to allocate precisely the amount of RAM it would need if it were not partitioning the batch dimension for the dropout mask. [1]

I can try to minimize and share actual code, but before I do, I wanted to ask:

Is there an easy way to see what decisions the partitioner is making for arrays created deep inside the computation, specifically random.bernoulli?
Is there a way to effectively use with_sharding_constraint to shard arrays like dropout_mask along the batch axis, if said axis is created by vmap?
Should I just be using xmap for the batching and pjit for the model partitioning? I think this would ensure that the batch axis is always sharded, even if it's a bit messy. Or maybe I should just not use vmap if I care about this?

The existing unit tests (https://github.com/google/jax/blob/480efcf0ee13e8c471c0b3e42a582028fcdccd3c/tests/pjit_test.py#L428) don't seem to test for anything like this, but maybe I'm just not understanding them.

[1] I stupidly lost the logs, but I get errors that it's trying to allocate 3.36GB, and the dropout mask in question is 128 * 25 * 1024 * 1024 which is 3.355 GB... could be a coincidence of course, but the OOM goes away when I remove dropout.
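
The arithmetic in the footnote can be checked directly. This is a small sketch, assuming one byte per mask element (an assumption; the post does not state the dtype, but it is the size that matches the quoted figure):

```python
# Element count quoted in the footnote: 128 * 25 * 1024 * 1024.
num_elements = 128 * 25 * 1024 * 1024

# Assumption: one byte per element, e.g. a boolean mask.
bytes_per_element = 1
mask_bytes = num_elements * bytes_per_element

print(mask_bytes)          # 3355443200
print(mask_bytes / 1e9)    # 3.3554432 (decimal GB)
print(mask_bytes / 2**30)  # 3.125 (GiB)
```

So an unpartitioned mask of that shape is about 3.355 GB in decimal units, which is in line with the roughly 3.36 GB allocation reported in the error.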

Answered by froystig

froystig · 2022-08-11T16:54:55Z

froystig
Aug 11, 2022
Maintainer

I got @pschuh to help me think on this one.

We recently added an experimental (intentionally not-yet-documented) option to vmap via the keyword argument spmd_axis_name that might be useful here. See #11807. What do you think?
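
For readers following along, here is a minimal sketch of what using spmd_axis_name might look like. It is not taken from #11807. It assumes a recent JAX release with jax.sharding and jax.lax.with_sharding_constraint rather than the pjit API used in the question, and the mesh axis name "data", the function layer, and the array shapes are all hypothetical:

```python
import jax
import jax.numpy as jnp
import numpy as np
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

# One-dimensional device mesh; the axis name "data" is a hypothetical choice.
mesh = Mesh(np.array(jax.devices()), axis_names=("data",))

def layer(key, x, dropout_prob=0.1):
    # Per-example code: x has no batch dimension here.
    keep = jax.random.bernoulli(key, 1 - dropout_prob, x.shape)
    # The constraint only describes per-example dimensions (replicated here).
    # With vmap(..., spmd_axis_name="data"), the batch dimension that vmap
    # adds is expected to be sharded over the "data" mesh axis as well.
    keep = jax.lax.with_sharding_constraint(keep, NamedSharding(mesh, P()))
    return jnp.where(keep, x / (1 - dropout_prob), 0.0)

batched = jax.jit(jax.vmap(layer, spmd_axis_name="data"))

batch = 4 * len(jax.devices())  # keep the batch divisible by the mesh size
keys = jax.random.split(jax.random.PRNGKey(0), batch)
x = jnp.ones((batch, 16))
out = batched(keys, x)
print(out.shape)
```

The per-example function never mentions the batch axis; the intent is that vmap threads the mesh axis name into any sharding constraint it batches over.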

> Is there an easy way to see what decisions the partitioner is making for arrays created deep inside the computation, specifically random.bernoulli?

In general, you can't inspect pjit's sharding specs because they're applied only downstream, at compilation time.

> should I just be using xmap for the batching and use pjit for the model partitioning? I think this would ensure that the batch axis is always sharded, even if it's a bit messy.

We're fairly certain that xmap will override sharding specs, so it's unclear that it would work here. Maybe @apaszke can confirm or correct.

> Or maybe should I just not use vmap if I care about this?

One way to avoid this question in today's world is indeed to rewrite your model to be batch-polymorphic so that vmap isn't required. We're pretty sure that Flax does that for similar reasons. Maybe @levskaya or @jekbradbury can confirm or correct.

And just thinking out loud: this might suggest that we consider enhancing custom batching (#9073) to also involve axis names. But custom batching is work in progress, on our queue to land even with its current scope. (fyi @mattjj)

Thanks, I should add! This is useful feedback.

5 replies

dlwh Aug 11, 2022
Author

Thanks @froystig, and @pschuh for implementing the feature right as I was asking about it! I think that's exactly what I need. I'll give it a shot and report back.

> > Is there an easy way to see what decisions the partitioner is making for arrays created deep inside the computation, specifically random.bernoulli?
>
> In general, you can't inspect pjit's sharding specs because they're applied only downstream, at compilation time.

That makes sense. Is there a way to see it at compilation time? I recently learned about --xla_dump_to and thought it might let me figure it out, but I haven't had a chance to play with it. (I don't need programmatic access I just want to be able to see what's going on).

> > Or maybe should I just not use vmap if I care about this?
>
> One way to avoid this question in today's world is indeed to rewrite your model to be batch-polymorphic so that vmap isn't required. We're pretty sure that Flax does that for similar reasons. Maybe @levskaya or @jekbradbury can confirm or correct.

That wouldn't surprise me. On a tangentially related note, I've noticed flax/t5x have sort of reimplemented a big chunk of xmap-style logical->physical resource mapping, but the key difference is that they work with the "global" view of the computation the whole way through (as opposed to xmap having you work with the "local" view). It makes me wonder if the sweet spot is something closer to "tensor considered harmful"-esque named tensors... It's a very half-baked musing though.

pschuh Aug 11, 2022
Collaborator

For the curious: XLA_FLAGS='--xla_dump_to=/tmp/output_folder/xla_dumps --xla_dump_hlo_pass_re=.*' (look for SPMD passes).
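
If setting the variable from the shell is inconvenient, the same flags can be set from Python. This is a sketch under the assumption that it runs before jax is imported, so that XLA sees the flags when it initializes; the dump directory is the one from the command above:

```python
import os

# Dump directory from the command above; any writable path should work.
dump_dir = "/tmp/output_folder/xla_dumps"

new_flags = [f"--xla_dump_to={dump_dir}", "--xla_dump_hlo_pass_re=.*"]

# Keep any flags that are already set and append the dump flags.
existing = os.environ.get("XLA_FLAGS", "")
os.environ["XLA_FLAGS"] = " ".join([existing, *new_flags]).strip()

# Import jax only after this point, e.g.:
# import jax
print(os.environ["XLA_FLAGS"])
```

After running the program, the dump directory should contain HLO snapshots per compiler pass; the SPMD partitioning passes are the ones to look for.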

You might also be running into a very recently introduced bug (007d651), which has since been fixed in #11852.

dlwh Aug 11, 2022
Author

Looks like just switching to git main was enough! Sorry, I should have tried that before.

spmd_axis_name didn't let me sneak any more into RAM (though I wasn't exhaustive in this), but it's still nice to have some certainty that that axis will get partitioned!

froystig Aug 12, 2022
Maintainer

No problem. It may be that you just barely caught #11852 on main. I agree – either way, guarantees are nice!

We might change how this spmd_axis_name bit looks as we think on it and come across other usage examples. It's good to know that it had some utility in principle here, even if temporarily.

dlwh Aug 17, 2022
Author

fyi, it appears that at least someone at Google has used xmap-in-pjit for precisely this reason:

https://github.com/google-research/google-research/blob/06e875dabc33f1204dc0bf10381013c214a5bb54/private_text_transformers/private_t5x/trainer.py#L186-L191
