Help interpreting OOM (possibly pjit bug, or vmap spmd_axis_name bug?) #11943

dlwh · 2022-08-16T18:11:47Z

dlwh
Aug 16, 2022

Following up on my last discussion (#11798) where it seemed that maybe pjit wasn't splitting over the batch axis, I increased my model size and started getting OOMs again. This time I saved some of the logs from xla (see below). I'm pretty sure this is a bug in either pjit or the new vmap spmd_axis_name, but would like to check.

As a reminder, my code looks vaguely like:

def transformer(...):
    # some stuff
    dropout_mask = random.bernoulli(key, 1 - dropout_prob, x.shape)

transformer = vmap(transformer, ...) # add batch dim

transformer = pjit(transformer, in_axis_resources=...)

For context, I'm using a V3-256, with a pjit mesh of shape [128, 2], with the first axis being the "DATA" or "batch" axis and the second being for model partitioning. The model is partitioned along the hidden states.

I'm pretty sure these are either dropout masks or RNG states that aren't getting sharded. in the below,

jaxlib.xla_extension.XlaRuntimeError: RESOURCE_EXHAUSTED: Ran out of memory in memory space hbm. Used 24.54G of 15.48G hbm. Exceeded hbm capacity by 9.06G.
Total hbm usage >= 25.06G:
    reserved        530.00M
    program          24.54G
    arguments            0B
Output size 0B; shares 0B with arguments.
Program hbm requirement 24.54G:
    global           372.0K
    scoped            4.87M
    HLO temp         24.54G (99.7% utilization: Unpadded (23.98G) Padded (24.04G), 2.0% fragmentation (509.60M))
  Largest program allocations in hbm:
  1. Size: 400.00M
     Operator: op_name="pjit(train_step)/jit(main)/while/body/add" source_file="/files/levanter/src/levanter/nn/static_dropout.py" source_line=59
     Shape: u32[128,819200]{0,1:T(8,128)}
     Unpadded size: 400.00M
     XLA label: fusion.33230 = fusion(get-tuple-element.47770, get-tuple-element.47771, get-tuple-element.47772, fusion.33084, ...(+15)), kind=kLoop, calls=fused_computation.11237.clone
     Allocation type: HLO temp

This is followed by 19 additional identical allocations, presumably corresponding to the layers of the transformer blocks. (Confusingly I do have gradient checkpointing turned on, so I would have thought that you wouldn't get more than one of these anyway...)

The line being identified is:

    def do_dropout(x: Array, p, key: "jax.random.PRNGKey" = None) -> Array:
        q = 1 - p
        mask = jrandom.bernoulli(key, q, x.shape)  # this line
        return jnp.where(mask, x / q, 0)

For the offending allocation, x semantically here has shape [seqlen, embed], where seqlen=1024, embed=1600, and the computation has been vmapped to a batch size of 128, via:

# model, data, key
compute_loss_vmap = vmap(compute_loss, in_axes=[None, 0, 0], spmd_axis_name=ResourceAxis.DATA)

and the model has been pjit'd so that the "embed" dim should be partitioned in half.

So, the tensors being allocated above are of shape [128, 819200], which I'm guessing is a flattening of [128, 1024, 800]: which indicates that the "embed" axis is being partitioned but the "data"/"batch" axis is not...

EDIT: I realize I wasn't clear what my question was. Is my interpretation of the shape of these tensors correct? And if so, I think this means it's at least a bug in pjit/vmap?

A secondary question is why it needs to allocate all these buffers if I'm using gradient checkpointing between every layer...

pschuh · 2022-08-16T23:34:53Z

pschuh
Aug 16, 2022
Collaborator

It may help to run this with the dump flags: XLA_FLAGS='--xla_dump_to=/tmp/output_folder/xla_dumps --xla_dump_hlo_pass_re=.*' and then look right after the spmd propagation pass for any unusual partition specs (most of them should have a leading data partitioning).

Since you mention embedding, I would note that there is a known performance issue with updating the embeddings which all-gathers the embedding updates before applying them because the scatter-add cannot be partitioned properly. The workaround normally for this is to wrap an xmap around the embedding layer with the flags: experimental_xmap_spmd_lowering and experimental_xmap_spmd_lowering_manual.

The oom message just prints out the top buffers regardless if their lifetime overlaps or not, so the same sized buffers for each layer are still created but with different lifetimes.

2 replies

dlwh Aug 17, 2022
Author

Thanks! I scoured it and it does seem like everything is properly sharded. I don't understand why I get such a massive overallocation with dropout on vs dropout off in this case, but I'll have to study it more carefully... (There's no easy way to log how much memory is going to be allocated to do an operation right?)

re: embeddings: I appreciate the tip, but I'm not totally sure I understand what you're saying. I get that I should turn on those flags. Are you saying that I should put an xmap inside of the pjitted code just for embeddings? I ran into some trouble early on with xmap plus gather not really working, and was told that xmap was dispreferred these days. (I looked at https://github.com/tensorflow/lingvo/ and they seem to be setting those flags without directly calling xmap anywhere.

pschuh Aug 17, 2022
Collaborator

In XLA, there is SPMDFullToShardShape and SPMDShardToFullShape which essentially disables and enables SPMD sharding for a section of the XLA code. You can trigger this using those flags + an xmap. There must be lingvo models somewhere that include an xmap deep inside the model. This allows writing only a critical piece of code using the xmap style explicitly manual spmd partitioning and then letting pjit take care of the rest of the model. Using xmap in this way may be made more difficult by your outer vmap.

It is possible that the RNG is not getting properly included in the remat. For example, the computation of the mask could be getting shared across the forward and backwards passes. This could blow up the memory.

One thing to consider is using lax.scan over your layers (if your layers are identical). This will make it much clearer what is getting carried between computations.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Help interpreting OOM (possibly pjit bug, or vmap spmd_axis_name bug?) #11943

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{editor}}'s edit

{{editor}}'s edit

Uh oh!

Replies: 1 comment 2 replies

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Select a reply

Uh oh!

Help interpreting OOM (possibly pjit bug, or vmap spmd_axis_name bug?) #11943

Uh oh!

Uh oh!

dlwh Aug 16, 2022

Replies: 1 comment · 2 replies

Uh oh!

pschuh Aug 16, 2022 Collaborator

Uh oh!

dlwh Aug 17, 2022 Author

Uh oh!

pschuh Aug 17, 2022 Collaborator

dlwh
Aug 16, 2022

Replies: 1 comment 2 replies

pschuh
Aug 16, 2022
Collaborator

dlwh Aug 17, 2022
Author

pschuh Aug 17, 2022
Collaborator