Is it possible to simultaneously perform computations on a TPU device and transfer new data onto it? #13011

ayaka14732 · 2022-10-27T16:39:22Z

ayaka14732
Oct 27, 2022

Previously in https://twitter.com/ayaka14732/status/1585629469938569219.

Imagine that there is a training loop like this (pseudo code):

for batch_cpu in batches_cpu:
    batch_tpu = device_put(batch_cpu, device_tpu)  # <- alpha
    loss = foward(params, batch_tpu)  # <- beta
    ...

It would be more efficient if we do this:

Is it possible to achieve this in JAX?

cgarciae · 2022-10-27T20:21:48Z

cgarciae
Oct 27, 2022
Collaborator

My intuition is that since JAX works async, you will get case 2 as a result if nothing blocks after forward.

0 replies

ayaka14732 · 2022-10-28T01:34:56Z

ayaka14732
Oct 28, 2022
Author

@cgarciae Thank you! However, I realised that there might be some blocking operations:

@jax.jit
def train_step(params, opt_states, batch_tpu):
    ...
    return params, opt_states, loss

...

for batch_cpu in batches_cpu:
    batch_tpu = device_put(batch_cpu, device_tpu)  # <- alpha
    params, opt_states, loss = train_step(params, opt_states, batch_tpu)  # <- beta
    ...
    wandb.log({'loss': loss.item()})  # <- blocking

1 reply

sharadmv Nov 7, 2022
Collaborator

A quick-and-dirty way to accomplish this might be to do something like:

jax.debug.callback(lambda loss: wandb.log({'loss': loss.item()}), loss)

since callbacks are executed on separate threads on TPU.

ayaka14732 · 2022-10-28T07:02:51Z

ayaka14732
Oct 28, 2022
Author

Also @mattjj @YouJiacheng

4 replies

mattjj Oct 28, 2022
Maintainer

Thanks for the question!

My understanding is that unless the code does something to cause blocking, we get the latter trace's behavior due to both JAX's async dispatch (meaning the main Python thread isn't blocked during dispatch) and the runtime's ability to transfer and execute computations at the same time (meaning a transfer doesn't block a computation and vice-versa).

However, as you wrote in your recent message, doing something like converting a JAX array (which with async dispatch really represents a 'future') to a Python builtin numeric type (like loss.item() would), which requires blocking the Python thread until we pull the value back to the host (and hence requires blocking until the value is computed), will cause blocking (in the Python thread).

I'm not sure what the best practice here is, and I'm not familiar with wandb.log, but one way to avoid blocking the main Python thread this way is to kick off a separate Python thread to wait for such values to be computed (without blocking the main Python thread) and only pull back to host and update the log it when it's finished. For example, we could have a logger thread which reads from a threadsafe queue, specifically by popping JAX arrays off it (blocking until a new one is available), then calling block_until_ready on the JAX array, then converting to a Python value and applying wandb.log to log the value. The main Python thread can then just enqueue things on the threadsafe queue.

WDYT?

mattjj Oct 28, 2022
Maintainer

(I suspect there's a standard way to do this, so I'm going to ask around and update here when I learn more.)

YouJiacheng Oct 28, 2022

I guess simply call loss.item() in a host callback can make it not block main thread, with data dependency handled by JAX runtime?

ayaka14732 Nov 6, 2022
Author

@YouJiacheng This is not possible because loss.item() results in a concrete Python native float value.

ayaka14732 · 2022-10-28T07:42:28Z

ayaka14732
Oct 28, 2022
Author

Besides, @Sea-Snell suggested https://flax.readthedocs.io/en/latest/_modules/flax/jax_utils.html#prefetch_to_device.

It seems that this is exactly what I want to achieve. However, the documentation also says:

This utility is mostly useful for GPUs, for TPUs and CPUs it should not be necessary -- the TPU & CPU memory allocators (normally) don't pick a memory location that isn't free yet so they don't block. Instead those allocators OOM.

I don't quite understand this paragraph. From my understanding, as long as the buffer size is reasonable (e.g. 2) and not too large, it will not cause OOM, so the function is still useful for TPU.

3 replies

young-geng Nov 10, 2022

My experience is that prefetch_to_device makes no difference at all if your training loop is non-blocking, and by default JAX host to device transfer is already non-blocking. If you take a look at the source code of that function, it is basically a thin wrapper around a synchronous python buffer and JAX's device_put_sharded, without any async implementation.

ayaka14732 Nov 10, 2022
Author

@young-geng This is an example of the training loop is blocking:

@jax.jit
def train_step(params, opt_states, batch_tpu):
    ...
    return params, opt_states, loss

...

for batch_cpu in batches_cpu:
    batch_tpu = device_put(batch_cpu, device_tpu)
    params, opt_states, loss = train_step(params, opt_states, batch_tpu)
    ...
    wandb.log({'loss': loss.item()})  # <- blocking
    # or
    print(loss.item())  # <- blocking

Do you mean that the best practise is that we should avoid blocking operations in the training loop? I am asking because this approach is different with implementing an data loader which handles prefetch_to_device automatically.

young-geng Nov 10, 2022

Yes, prefetch_to_device won't save you if you have blocking operations in the training loop. In my use cases, I only log every 500 iterations because of this. If you have to log every iteration, you might want to store the metrics in a buffer and only transfer once when you accumulate a large batch.

Is it possible to simultaneously perform computations on a TPU device and transfer new data onto it? #13011

Uh oh!

Uh oh!

ayaka14732 Oct 27, 2022

Replies: 4 comments · 8 replies

Uh oh!

cgarciae Oct 27, 2022 Collaborator

Uh oh!

ayaka14732 Oct 28, 2022 Author

Uh oh!

Uh oh!

sharadmv Nov 7, 2022 Collaborator

Uh oh!

ayaka14732 Oct 28, 2022 Author

Uh oh!

mattjj Oct 28, 2022 Maintainer

Uh oh!

mattjj Oct 28, 2022 Maintainer

Uh oh!

YouJiacheng Oct 28, 2022

Uh oh!

ayaka14732 Nov 6, 2022 Author

Uh oh!

Uh oh!

ayaka14732 Oct 28, 2022 Author

Uh oh!

young-geng Nov 10, 2022

Uh oh!

Uh oh!

ayaka14732 Nov 10, 2022 Author

Uh oh!

young-geng Nov 10, 2022

ayaka14732
Oct 27, 2022

Replies: 4 comments 8 replies

cgarciae
Oct 27, 2022
Collaborator

ayaka14732
Oct 28, 2022
Author

sharadmv Nov 7, 2022
Collaborator

ayaka14732
Oct 28, 2022
Author

mattjj Oct 28, 2022
Maintainer

mattjj Oct 28, 2022
Maintainer

ayaka14732 Nov 6, 2022
Author

ayaka14732
Oct 28, 2022
Author

ayaka14732 Nov 10, 2022
Author