Defrag hangs when using devicearray-based dataloader #12620
lucaslingle asked this question in Q&A
Hello,
I've noticed an interesting phenomenon when trying to run `jax.lib.xla_bridge.get_backend().defragment()` on a TPU v3-8, as suggested here, and was wondering if anyone here might have any insights as to what caused it.

Dataloader 1
I have a dataloader for sequences that uses jax for shuffling, which can be simplified to roughly the following:
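In sketch form (`data`, `batch_size`, `rng`, and the batch dictionary layout are placeholders; the relevant part is that the shuffle runs through `jax.random.permutation`, so the permuted tensors come back as DeviceArrays):

```python
import jax
import jax.numpy as jnp


def epoch_iter_v1(rng, data, batch_size):
    # data: token ids with shape [num_sequences, seq_len + 1].
    # Shuffle on-device; tensor_inp / tensor_tgt are jax DeviceArrays.
    perm = jax.random.permutation(rng, data.shape[0])
    shuffled = jnp.asarray(data)[perm]
    tensor_inp = shuffled[:, :-1]
    tensor_tgt = shuffled[:, 1:]
    num_batches = tensor_inp.shape[0] // batch_size
    for i in range(num_batches):
        lo, hi = i * batch_size, (i + 1) * batch_size
        yield {"inputs": tensor_inp[lo:hi], "targets": tensor_tgt[lo:hi]}


def data_iter_v1(rng, data, batch_size):
    # Start a fresh epoch generator each time the previous one is exhausted.
    while True:
        rng, subkey = jax.random.split(rng)
        yield from epoch_iter_v1(subkey, data, batch_size)
```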
When I use it, my training loop eventually runs into an issue with memory fragmentation at the beginning of the second epoch (judging by the step number). To resolve this, I rewrote my training loop to call defrag in a try/except wrapped around a pmapped `train_op`, roughly as follows:
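A simplified version of the loop (reusing `data_iter_v1`, `rng`, `data`, and `batch_size` from above; `train_op`, `state`, `num_steps`, and the exception type caught here are stand-ins for the real training code):

```python
import jax
import jax.numpy as jnp
from flax import jax_utils
from flax.training import common_utils


def train_op(state, batch):
    # Stand-in for the real per-device train step (loss, grads, optimizer update).
    return state + 1, {"loss": jnp.mean(batch["inputs"])}


p_train_op = jax.pmap(train_op, axis_name="batch")
state = jax_utils.replicate(jnp.zeros(()))  # placeholder replicated train state
num_steps = 100_000                         # placeholder

for step, batch in zip(range(num_steps), data_iter_v1(rng, data, batch_size)):
    sharded = common_utils.shard(batch)  # host batch -> [num_devices, ...] on the accelerators
    try:
        state, metrics = p_train_op(state, sharded)
    except RuntimeError:
        # On a fragmentation / RESOURCE_EXHAUSTED error, defragment and retry the step once.
        # (The exception type raised on OOM may differ across jax / jaxlib versions.)
        jax.lib.xla_bridge.get_backend().defragment()
        state, metrics = p_train_op(state, sharded)
```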
Unfortunately, this call to defrag causes the script to hang for upwards of half an hour. Defrag calls placed elsewhere run in about 30 seconds, but they do not resolve this issue, which occurs on the first training step after `data_iter_v1` calls `epoch_iter_v1` to create a generator for the second epoch (see the first code snippet).

Dataloader 2
I wrote another version of my dataloader that calls `np.asarray` after permuting, thus converting the permuted tensors from `jax.DeviceArray`s back to numpy arrays:
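Sketch (same placeholders as before; the only change from `epoch_iter_v1` is the `np.asarray` call right after the permutation):

```python
import jax
import jax.numpy as jnp
import numpy as np


def epoch_iter_v2(rng, data, batch_size):
    # Same shuffle as v1, but immediately pull the permuted tensors back to
    # host memory as numpy arrays, so no DeviceArrays outlive this call.
    perm = jax.random.permutation(rng, data.shape[0])
    shuffled = np.asarray(jnp.asarray(data)[perm])
    tensor_inp = shuffled[:, :-1]
    tensor_tgt = shuffled[:, 1:]
    num_batches = tensor_inp.shape[0] // batch_size
    for i in range(num_batches):
        lo, hi = i * batch_size, (i + 1) * batch_size
        yield {"inputs": tensor_inp[lo:hi], "targets": tensor_tgt[lo:hi]}
```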
In this case, defrag is not even needed, and my script runs fine. However, I am wondering why this made such a difference. The two main DeviceArrays created by `epoch_iter_v1`, namely `tensor_inp` and `tensor_tgt`, presumably reside in TPU VM host memory, not accelerator memory, so I am not sure why the defrag time was so high. In my understanding, the only additional tensors in accelerator memory would be those moved there by `common_utils.shard(batch)`.
I was also monitoring the total number of "live buffers" and their total size with `jax.lib.xla_bridge.get_backend().live_buffers()`. These numbers stay constant in the former case prior to the script hanging, as well as in the latter case for the entirety of training. However, I observed four fewer live buffers in the latter case.
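For reference, the check was roughly this (I'm assuming the buffer objects expose `.shape` and `.dtype`; the exact attributes may vary across jaxlib versions):

```python
import jax
import numpy as np

backend = jax.lib.xla_bridge.get_backend()
bufs = backend.live_buffers()
# Count the live buffers and estimate their total size from shape and dtype.
total_bytes = sum(np.prod(b.shape) * np.dtype(b.dtype).itemsize for b in bufs)
print(f"live buffers: {len(bufs)}, total size: {total_bytes / 2**20:.1f} MiB")
```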
So my questions are: are `tensor_inp` and `tensor_tgt` somehow being passed to accelerator memory, and if not, why is the defrag so slow in the former case?

Thanks in advance for any insights you can share!