FSDP/ZeRO grad accumulation in the presence of vmapped/scanned layers #12799
-
Hi, I have a transformer implementation that is working reasonably well, except I can't quite get FSDP/ZeRO to behave the way I want. Things looked fine until I really started poking at gradient accumulation, where they don't quite work out. The issue is that JAX seems to want the gradients for all layers computed on each node before it's willing to reduce and scatter them, when ideally this would proceed layer-wise to keep memory use under control. It's of course a bit hard to decode everything that's going on, but I have, e.g., a parameter that has shape
I'm pretty sure this corresponds to a "logical array" size of

Now, I guess the real problem above is that I'm getting a 21.3x(!) blow-up for padding (other arrays with analogous shapes don't blow up nearly this much), but in an ideal world this array wouldn't exist at all: it's a temporary that gets reduced to

Is there a way to make this happen? Happy to share code or more context! Specifically, the current attempt is at https://github.com/stanford-crfm/levanter/blob/fsdp/src/levanter/modeling_utils.py#L66 (I have an in-progress named tensors library I'm using there, but it more or less does the obvious thing). An earlier attempt is at https://github.com/stanford-crfm/levanter/blob/main/src/levanter/modeling_utils.py#L65, which doesn't have the extreme blow-up problem and works fine at this scale, but it stops working at the next scale I'm targeting. (For the first linked implementation I'm cribbing off of t5x a bit: https://github.com/google-research/t5x/blob/main/t5x/trainer.py#L617 , but my attempts to copy it are what led to the massive array above. I'll also note that t5x generally seems to avoid vmapping/scanning layers, cf. https://github.com/google-research/t5x/blob/main/t5x/examples/decoder_only/network.py#L197 , which makes me think this doesn't work reliably.)
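(For reference, the microbatch gradient-accumulation pattern being described here, loosely following the linked t5x trainer code, looks roughly like the sketch below. This is not the actual levanter implementation: loss_fn, grad_specs (a pytree of PartitionSpecs mirroring params), and accumulate_grads are placeholder names, and the whole thing is assumed to run inside pjit on a mesh with a 'data' axis.)

import jax
import jax.numpy as jnp
from jax.experimental.pjit import with_sharding_constraint

def accumulate_grads(params, batch, loss_fn, num_micro_steps, grad_specs):
    # split the global batch into (num_micro_steps, microbatch_size, ...) and scan over it
    micro = batch.reshape((num_micro_steps, -1) + batch.shape[1:])

    def step(grad_accum, microbatch):
        grads = jax.grad(loss_fn)(params, microbatch)
        grad_accum = jax.tree_util.tree_map(jnp.add, grad_accum, grads)
        # ask the partitioner to keep the running gradient sharded between
        # microbatch steps so each device only holds its own shard
        grad_accum = jax.tree_util.tree_map(with_sharding_constraint, grad_accum, grad_specs)
        return grad_accum, None

    zeros = jax.tree_util.tree_map(jnp.zeros_like, params)
    grad_accum, _ = jax.lax.scan(step, zeros, micro)
    return jax.tree_util.tree_map(lambda g: g / num_micro_steps, grad_accum)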
-
Hey David! This is quite curious! For starters, it's surprising that XLA decided the memory-optimised layout for that tensor should put the dimension of size 6 at index 0, causing it to be padded out to 128 to match the TPU MXU tile size. To help us look into that, is there any chance you could send through the unoptimised HLO? You can get it by running jit(f).lower(*args).as_text().

I've quickly made a minimal reproduction, which I've put below. It behaves sensibly when I don't vmap per example inside the microbatch, but when I do, it induces a huge all-to-all before the gradients are computed. I'm still looking into exactly what's going on, but it looks quite similar to your issue. Can you check through it quickly and see whether it matches what you're trying to do in your computation? If you don't vmap per example within the microbatch, does it work?

Finally, a couple of quick thoughts/questions:
import jax
jax.config.update('jax_array', True) # required for jax<0.4.0
import jax.numpy as jnp
from jax.experimental.maps import Mesh
from jax.experimental.pjit import pjit, with_sharding_constraint
from jax.experimental.pjit import PartitionSpec as P
from jax.sharding import MeshPspecSharding
from functools import partial
import numpy as np
num_layers = 48
num_heads = 24
head_size = 64
embed_size = 1536
batch = 512
t = 8
qkv_sharding = P(None, None, 'data', None)  # params sharded along the data axis (FSDP/ZeRO-style)
x_sharding = P('data', None, 'model')       # activations: batch along data, embed along model
o_sharding = P(None, None, 'model')
qkv = jnp.ones((num_layers, num_heads, head_size, embed_size), dtype = jnp.bfloat16)
o = jnp.ones((num_layers, num_heads, head_size, embed_size), dtype=jnp.bfloat16)
x = jnp.ones((batch, t, embed_size), dtype=jnp.bfloat16)
dp, mp = 8, 1
devices = np.reshape(jax.local_devices(), (dp, mp))
mesh = Mesh(devices, ('data', 'model'))
x = jax.device_put(x, MeshPspecSharding(mesh, x_sharding))
qkv = jax.device_put(qkv, MeshPspecSharding(mesh, qkv_sharding))
o = jax.device_put(o, MeshPspecSharding(mesh, o_sharding))
params = (qkv, o)
VMAP_MICROBATCH = False  # flip to True to vmap per example inside the microbatch (triggers the huge all-to-all)
def fwd(params, x):
@jax.checkpoint
def layer(x, params):
qkv, o = params
if VMAP_MICROBATCH:
y = jnp.einsum('te,hde->thd', x, qkv)
z = jnp.einsum('thd,hde->te', y, o)
else:
y = jnp.einsum('bte,hde->bthd', x, qkv)
z = jnp.einsum('bthd,hde->bte', y, o)
# no ffn
return z, None
x, _ = jax.lax.scan(layer, x, params)
return x
def loss_fn(params, x):
x = fwd(params, x)
l = jnp.mean(x)
return l
def grad_fn(params, x):
loss, grad = jax.value_and_grad(loss_fn)(params, x)
return loss, grad
def accumulate_gradients_sharded(params,
x,
f,
per_device_parallelism,
data_axis_size):
batch_size = jnp.shape(x)[0] # 512
microbatch_size = data_axis_size * per_device_parallelism # 8 * 4 = 32
num_micro_steps = batch_size // microbatch_size # 512 // 32 = 16
assert num_micro_steps * microbatch_size == batch_size
loss = jnp.zeros(())
grad = jax.tree_util.tree_map(jnp.zeros_like, params)
x = x.reshape((num_micro_steps, microbatch_size) + x.shape[1:])
    # keep each microbatch sharded along the data axis (the micro-step axis stays replicated)
    x = with_sharding_constraint(x, P(None, 'data', *(None,) * (len(x.shape) - 2)))
# compute microbatches
def loop(accum, microbatch):
with jax.named_scope('microbatch'):
loss, grad = accum
if VMAP_MICROBATCH:
# vmap as code is written for single examples
this_loss, this_grad = jax.vmap(f, in_axes=(None, 0))(params, microbatch)
# reduce along microbatch dimension
this_loss = jnp.mean(this_loss)
mean_along_microbatch = partial(jnp.mean, axis = 0)
this_grad = jax.tree_map(mean_along_microbatch, this_grad)
else:
this_loss, this_grad = f(params, microbatch)
with jax.named_scope('accumulate'):
return (this_loss + loss, jax.tree_map(jnp.add, grad, this_grad)), None
# loops over microbatches, accumulates
accum = (loss, grad)
accum, _ = jax.lax.scan(loop, accum, x)
loss, grad = accum
return loss/num_micro_steps, jax.tree_map(lambda x: x / num_micro_steps, grad)
pjit_fn = partial(accumulate_gradients_sharded,
f=grad_fn,
per_device_parallelism=4,
data_axis_size=dp)
with mesh:
loss, grad = pjit(pjit_fn)(params, x)
loss.block_until_ready()
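For completeness, the unoptimised HLO requested above could be dumped for this repro roughly as follows; this is a sketch that assumes the pjit-wrapped function exposes the same .lower(...).as_text() ahead-of-time API as jit:

with mesh:
    # print the pre-optimisation HLO/StableHLO text for inspection
    print(pjit(pjit_fn).lower(params, x).as_text())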