Fastest way to "transpose" list of jax arrays on gpu #9848

frmetz · 2022-03-11T15:09:31Z

Hi all,

I want to convert a list of lists of jax arrays into a list of jax arrays where the first two dimensions are swapped, i.e. something like new_list[i][j] = old_list[j][i]. I have a working code snippet (see version_1 below) which runs quite fast on a CPU. However, it is very slow on a GPU, so I am looking for alternative implementations that run as fast (or faster) on the GPU. So far the best I have come up with is version_3, which doesn't look pretty, and I am wondering whether there is something smarter I can do.

Here are the details:

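import time
import numpy as np
import jax
import jax.numpy as jnp
from jax import jit  # needed for the @jit decorators below

D = 8
batch_size = 32  # not defined in the original post; 32 is assumed here to match B in the answer below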
tensor_list = [jnp.ones((2**i,2,2**(i+1))) for i in range(D)] + [jnp.ones((2**(i+1),2,2**i)) for i in reversed(range(D))]

# this is the list I start from (note that each jax array in this list of lists is a ndim=3 tensor, however with different shapes)
batch_list = [tensor_list] * batch_size

# this is the list I want to end up with
final_list = [jnp.ones((batch_size, t.shape[0], t.shape[1], t.shape[2]), dtype=t.dtype) for t in batch_list[0]]

# original approach which is fast on cpu, but slow on gpu because presumably I copy from gpu (jax array) to cpu (numpy array)
def version_1(batch_list, final_list):
    for i,t_list in enumerate(batch_list):
        for j,t in enumerate(t_list):
            final_list[j][i] = t
            # equivalent to:
            # final_list[j][i] = batch_list[i][j]
    return final_list

# what I would naively/intuitively do, but it's still quite slow on the gpu
@jit
def version_2(batch_list, final_list):
    for i,t_list in enumerate(batch_list):
        for j,t in enumerate(t_list):
            final_list[j] = final_list[j].at[i, :].set(t)
    return final_list

# fastest on the gpu, but I don't think it's pretty
@jit
def version_3(batch_list, final_list):
    for i,t_list in enumerate(batch_list):
        for j,t in enumerate(t_list):
            final_list[j][i] = t
    for j,t in enumerate(final_list):
        final_list[j] = jnp.array(t)
    return final_list

When comparing the individual approaches I get the following times:

final_list = [np.empty((batch_size, t.shape[0], t.shape[1], t.shape[2]), dtype=t.dtype) for t in batch_list[0]]
start_time = time.time()
final_list = version_1(batch_list, final_list)
print(f"Version 1: {time.time()-start_time}")


final_list = [jnp.empty((batch_size, t.shape[0], t.shape[1], t.shape[2]), dtype=t.dtype) for t in batch_list[0]]
final_list1 = version_2(batch_list, final_list) # for compiling
start_time = time.time()
final_list1 = version_2(batch_list, final_list)
final_list1[0].block_until_ready()
print(f"Version 2: {time.time()-start_time}")

final_list = [[jnp.empty((t.shape[0], t.shape[1], t.shape[2]), dtype=t.dtype)]*batch_size for t in batch_list[0]]
final_list1 = version_3(batch_list, final_list) # for compiling
start_time = time.time()
final_list1 = version_3(batch_list, final_list)
final_list1[0].block_until_ready()
print(f"Version 3: {time.time()-start_time}")

returns

Version 1: 0.005630 (on cpu)
Version 1: 0.039899 (on gpu)

Version 2: 0.008874 (on gpu)

Version 3: 0.002829 (on gpu)

Is there something better/faster I can do than version 3?
Thanks a lot!!

Answered by YouJiacheng

YouJiacheng · 2022-03-11T16:14:58Z

Could you clarify what exactly you want to do?
Do you want to convert an x: list[list[ndarray]] with

assert (
    len(x) == B
    and all(len(x[i]) == 2 * D for i in range(B))
    and all(x[i][j].ndim == 3 for i in range(B) for j in range(2 * D))
    and all(x[i][k].shape == x[j][k].shape for i in range(B) for j in range(B) for k in range(2 * D))
)

to y: list[ndarray] with

assert len(y) == 2 * D and all(y[i].ndim == 4 and y[i].shape[0] == B for i in range(2 * D))

And do you need to write into an existing y, or do you just need to construct such a list?
Given this specification, I think it is natural to write a function similar to your version 3:
First transpose the nested list in Python without touching the arrays, which has (nearly) zero runtime cost since it only rearranges references to the arrays.
Then stack each batch into one array.
I think this is natural and pythonic.

import jax
import jax.numpy as jnp
B = 32
D = 8

x = [[*(jnp.ones((2**i,2,2**(i+1))) for i in range(D)), *(jnp.ones((2**(i+1),2,2**i)) for i in reversed(range(D)))] for _ in range(B)]

@jax.jit
def transpose(x: list[list[jnp.ndarray]]) -> list[jnp.ndarray]:
    z = [t for t in zip(*x)] # list[tuple[jnp.ndarray]]
    assert len(z) == 2 * D and len(z[0]) == B
    return [jnp.stack(t) for t in z]

@jax.jit
def transpose_simplified(x: list[list[jnp.ndarray]]) -> list[jnp.ndarray]:
    return [jnp.stack(t) for t in zip(*x)]

The second version is a one-liner, WDYT?
BTW, I found that your version 3 is not compatible with your final_list as input.
I modified final_list to compare performance:

final_list = [[None for _ in range(B)] for _ in range(2 * D)]
@jax.jit
def version_3(batch_list, final_list):
    for i,t_list in enumerate(batch_list):
        for j,t in enumerate(t_list):
            final_list[j][i] = t
    for j,t in enumerate(final_list):
        final_list[j] = jnp.array(t)
    return final_list

And the result is (on a V100 GPU, average seconds per call):
transpose: 0.0007875030040740967
transpose_simplified: 0.0007875461101531983
version_3: 0.0008109297752380371
Benchmark code:

from time import time
from typing import Any, Callable

def timer(f: Callable[[], Any]):
    f()  # warmup (includes compilation)
    t = time()
    for _ in range(5000):
        f()
    print((time() - t) / 5000)  # average seconds per call

timer(lambda: jax.block_until_ready(transpose(x)))
timer(lambda: jax.block_until_ready(transpose_simplified(x)))
timer(lambda: jax.block_until_ready(version_3(x, final_list)))
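Before timing anything, it may be worth checking that the one-liner really satisfies y[j][i] == x[i][j]. The following is a minimal sketch under assumed small sizes (B = 4, D = 3) and with distinct fill values per array, neither of which comes from the thread:

```python
import jax
import jax.numpy as jnp

B, D = 4, 3  # small sizes chosen only for this check (assumption)

# give x[i][j] a distinct constant value so that mix-ups are detectable
shapes = [(2**j, 2, 2**(j + 1)) for j in range(D)]
x = [[jnp.full(s, 10.0 * i + j) for j, s in enumerate(shapes)] for i in range(B)]

@jax.jit
def transpose_simplified(x):
    # zip(*x) regroups the arrays by position j; stack adds the batch axis
    return [jnp.stack(t) for t in zip(*x)]

y = transpose_simplified(x)

# y is a list over j; each entry has a leading batch axis of size B
assert len(y) == D
assert all(y[j].shape == (B, *shapes[j]) for j in range(D))
# y[j][i] must equal x[i][j]
assert all(bool(jnp.array_equal(y[j][i], x[i][j])) for i in range(B) for j in range(D))
```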

frmetz · 2022-03-12T10:54:14Z

Thanks so much YouJiacheng!
You understood my problem correctly and your one-line solution definitely beats what I had before. This was exactly what I was looking for!

I have a quick follow-up question: Imagine I now have two of these lists (x1, x2) which I want to transpose separately. They have equal lengths/shapes:

assert (
len(x1) == len(x2) and len(x1[0]) == len(x2[0])
and all(x1[i][k].shape == x2[i][k].shape for i in range(len(x1)) for k in range(len(x1[0])))
)

Can I do better than calling transpose_simplified(x) twice, i.e. for each of these lists separately? Or is there a faster way to transpose them both "at the same time" by only having one loop?

Thanks again!

YouJiacheng Mar 12, 2022

I believe there is (nearly) zero overhead in transposing each list separately.
For a jitted function, JAX first flattens the pytree (nested list/tuple/dict etc.) into a flat list of leaves and then calls the flattened, compiled version of the function. Thus the difference in runtime cost (as opposed to compile cost) is "flatten 2 lists at the same time, dispatch (call the function) once" vs. "flatten 2 lists separately, dispatch twice". Although the dispatch cost in JAX is relatively high, I don't think you need to worry about it.
Moreover, you can put all transpose_simplified calls (without jit, or jitted with inline=True) and your other operations into one function, and jit this all-in-one function.
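A minimal sketch of the all-in-one variant, assuming small made-up inputs; transpose_both is a hypothetical name, and the helper is left unjitted so that it is traced as part of the single outer jit:

```python
import jax
import jax.numpy as jnp

def transpose_simplified(x):
    # plain Python helper (no jit here); it is traced as part of the caller
    return [jnp.stack(t) for t in zip(*x)]

@jax.jit
def transpose_both(x1, x2):
    # one compiled function, hence one dispatch, handles both lists
    return transpose_simplified(x1), transpose_simplified(x2)

B, D = 4, 3  # sizes assumed for illustration
x1 = [[jnp.ones((2**j, 2)) for j in range(D)] for _ in range(B)]
x2 = [[jnp.zeros((2**j, 2)) for j in range(D)] for _ in range(B)]

# each nested input flattens to B * D leaf arrays
n_leaves = len(jax.tree_util.tree_leaves(x1))

y1, y2 = transpose_both(x1, x2)
```

Further operations on y1 and y2 could go inside transpose_both as well, so that they are compiled together with the stacking.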

YouJiacheng · 2022-03-12T12:02:27Z

@frmetz
Oh, I forgot a more general one-line solution: it is applicable not only to list[list[ndarray]] -> list[ndarray], but also to list[Pytree[ndarray]] -> Pytree[ndarray]. (A pytree can be a nested list/tuple/dict.)
That is because what you need is actually to stack a batch of lists/pytrees into one list/pytree whose elements/leaves have a leading batch axis:

@jax.jit
def stack(x: list):
    return jax.tree_map(lambda *xs: jnp.stack(xs), *x)

Thus, for your follow-up question, if you can easily zip your inputs (maybe not literally zip them, just construct them in that form), you can simply use:

y1, y2 = stack(list(zip(x1, x2)))
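To make the pytree generality concrete, here is a small sketch that stacks a batch of dicts and then the zipped pair of lists. The dict keys ("obs", "reward") and all shapes are assumptions for illustration; jax.tree_util.tree_map is used as the equivalent spelling of jax.tree_map that also exists in newer JAX releases:

```python
import jax
import jax.numpy as jnp

@jax.jit
def stack(x):
    # same idea as above, written with jax.tree_util.tree_map
    return jax.tree_util.tree_map(lambda *xs: jnp.stack(xs), *x)

B = 4  # batch size assumed for illustration

# hypothetical per-sample pytree: a dict with an observation and a scalar reward
batch = [{"obs": jnp.full((3, 2), float(i)), "reward": jnp.asarray(float(i))}
         for i in range(B)]
stacked = stack(batch)  # a dict whose leaves carry a leading batch axis

# the follow-up case: two nested lists with matching structure
x1 = [[jnp.ones((2,)), jnp.ones((3,))] for _ in range(B)]
x2 = [[jnp.zeros((2,)), jnp.zeros((3,))] for _ in range(B)]
y1, y2 = stack(list(zip(x1, x2)))  # each sample is the tuple (x1[i], x2[i])
```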

BTW, I think you should change your data pattern and directly produce a tree of arrays with a batch axis, since compiling a function with a large list as input will be painfully slow. (Batch size 256 takes 11 seconds to compile on my device.)
You can use jax.vmap when the data in a batch are independent, or jax.lax.scan when the dependency graph of the data in a batch is a DAG (which can be turned into a sequence via a topological order).

For example:

def generate_one_data(z: float):
    return [z * jnp.ones((2 ** i, 2)) for i in range(D)]

generate_batch_data = jax.vmap(generate_one_data)
xs = generate_batch_data(jnp.ones((B,)))

assert len(xs) == D and all(xs[i].shape == (B, 2 ** i, 2) for i in range(D))

def generate_one_data_with_dependency(carry: float, z: float):
    return carry + z, [carry * jnp.ones((2 ** i, 2)) for i in range(D)]

def generate_batch_data_with_dependency(init: float, zs):
    return jax.lax.scan(generate_one_data_with_dependency, init, zs)[1]

ys = generate_batch_data_with_dependency(0.0, jnp.ones((B,)))

assert len(ys) == D and all(ys[i].shape == (B, 2 ** i, 2) for i in range(D))
assert all(y[i][0][0] == i for y in ys for i in range(B))

frmetz Mar 12, 2022
Author

Thanks YouJiacheng!!

Unfortunately I think the last two tricks (using vmap and lax.scan) are not that straightforward in my case. I am doing reinforcement learning (deep Q-Learning) and the data points in my list correspond to different RL states that I continuously collect during training. I put each RL state (of the type list[ndarray]) into a cyclic replay buffer so essentially another list, i.e. my buffer is a list[list[ndarray]]. When I want to train my Q-network, I randomly sample a batch of those states which gives me a smaller list[list[ndarray]]. And this is the data structure I then want to stack together, so that I can use it for training.

I was also thinking of putting my RL states already into the right form when I add them to the replay buffer (similar to here). So instead of just having a simple list where I store all my states, I directly insert them into a replay buffer of the form:

xs_buffer = [jnp.zeros((buffer_size, 2 ** i, 2)) for i in range(D)]

and after the random sampling step, I automatically end up with what I want (where batch_size << buffer_size)

xs = [jnp.zeros((batch_size, 2 ** i, 2)) for i in range(D)]

But here I had the problem that adding many individual states to the xs_buffer one after another was extremely slow...

YouJiacheng Mar 12, 2022

@frmetz "adding many individual states to the xs_buffer one after another was extremely slow": I would expect that cost to be much smaller than one step of RL simulation. Or do you use parallel RL environments and produce many states in each step?
In that case, I think you can try a cyclic buffer of shape (buffer_size // num_envs, num_envs, ...) and stack the states into shape (num_envs, ...) before storing them in the cyclic buffer. I am not sure whether this change improves performance.
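A rough sketch of that layout, with hypothetical sizes and names (num_envs, slots, state_shape, store_step) and constant arrays standing in for real environment states. The store is jitted with donate_argnums so that the update can reuse the input buffer:

```python
from functools import partial
import jax
import jax.numpy as jnp

# hypothetical sizes; buffer_size would be slots * num_envs
num_envs, slots, state_shape = 4, 8, (3, 2)

@partial(jax.jit, donate_argnums=0)
def store_step(buffer, ptr, states):
    # states has shape (num_envs, *state_shape): one row per environment
    return buffer.at[ptr].set(states)

buffer = jnp.zeros((slots, num_envs, *state_shape))
ptr = 0
for step in range(10):
    # stand-in for one step of num_envs parallel environments
    per_env_states = [jnp.full(state_shape, float(step)) for _ in range(num_envs)]
    buffer = store_step(buffer, ptr, jnp.stack(per_env_states))
    ptr = (ptr + 1) % slots  # wrap around once all slots are used
```

With this layout there is one device update per environment step rather than one per state.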
Important! Please make sure to use the cyclic buffer inside a jit function: JAX operations outside jit are never in-place, so "add one" results in a full copy of the buffer! If you update the cyclic buffer outside jit, it will be extremely slow.
BTW, index_update in your link is deprecated. You should use the .at[...] syntax instead.
If you want to use the cyclic buffer from outside jit, you may follow this pattern:

from functools import partial

# Important! donate_argnums tells JAX that the input buffer will not be used again, which permits an in-place update
@partial(jax.jit, donate_argnums=0)
def inplace_store(buffer, ptr, value):
    buffer = buffer.at[ptr].set(value)
    return buffer

class CyclicBuffer:
    def __init__(self, maxsize: int, shape: tuple[int, ...]):
        self.buffer = jnp.zeros((maxsize, *shape))
        self.ptr, self.size, self.maxsize = 0, 0, maxsize

    def store(self, value):
        self.buffer = inplace_store(self.buffer, self.ptr, value)
        self.ptr = (self.ptr + 1) % self.maxsize
        self.size = min(self.size + 1, self.maxsize)
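A short usage sketch of this pattern, assuming a tiny buffer (maxsize=3) and constant test values so that the wrap-around is easy to follow; the class is repeated here only to keep the example self-contained:

```python
from functools import partial
import jax
import jax.numpy as jnp

@partial(jax.jit, donate_argnums=0)
def inplace_store(buffer, ptr, value):
    return buffer.at[ptr].set(value)

class CyclicBuffer:
    def __init__(self, maxsize, shape):
        self.buffer = jnp.zeros((maxsize, *shape))
        self.ptr, self.size, self.maxsize = 0, 0, maxsize

    def store(self, value):
        # rebind self.buffer to the result; the donated input must not be reused
        self.buffer = inplace_store(self.buffer, self.ptr, value)
        self.ptr = (self.ptr + 1) % self.maxsize
        self.size = min(self.size + 1, self.maxsize)

buf = CyclicBuffer(maxsize=3, shape=(2,))
for k in range(5):
    buf.store(jnp.full((2,), float(k)))

# writes 0, 1, 2 fill slots 0..2; writes 3 and 4 wrap around and overwrite slots 0 and 1
contents = buf.buffer[:, 0].tolist()
```

One thing to watch: because the input is donated, any other reference to the old buffer array should be treated as invalid after store; on backends that honor donation, reading it may raise an error.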

frmetz Mar 13, 2022
Author

Ok! Given that my values/states are lists of jax arrays, I would have modified the Cyclic buffer above as follows:

@partial(jax.jit, donate_argnums=0)
def inplace_store(buffer, ptr, value):
    return [buffer[l].at[ptr].set(value[l]) for l in range(len(value))]

class CyclicBuffer:
    def __init__(self, maxsize, example_value, batch_size):
        self.rng = jax.random.PRNGKey(11)
        self.batch_size = batch_size
        self.buffer = [jnp.empty((maxsize, s.shape[0], s.shape[1]), dtype=np.complex64) for s in example_value]
        self.ptr, self.maxsize = 0, maxsize

    def store(self, value):
        self.buffer = inplace_store(self.buffer, self.ptr, value)
        self.ptr = (self.ptr + 1) % self.maxsize

    @partial(jax.jit, static_argnums=(0,))
    def sample(self, buffer, rng):
        rng, rng_input = jax.random.split(rng)
        indexes = jax.random.randint(rng_input, shape=(self.batch_size,),
                                 minval=0, maxval=self.maxsize)
        return [jnp.stack(t[indexes]) for t in buffer], rng

In contrast, before I had:

from collections import deque
import random

@jit
def stack(x):
    return jax.tree_map(lambda *xs: jnp.stack(xs), *x)

class CyclicBuffer_before:
    def __init__(self, maxsize, state, batch_size):
        self.batch_size = batch_size
        self.buffer = deque(maxlen=maxsize)

    def store(self, value):
        self.buffer.append(value)

    def sample(self):
        batch = random.sample(self.buffer, self.batch_size)
        states = stack(batch)
        return states

In the first example, storing and sampling are about equally fast on my GPU (~0.5 ms per step). In the second example, storing comes at essentially zero cost, but sampling is much more expensive (~2 ms). So depending on how often I sample vs. store data, one wins over the other.

BTW, is there a nicer expression for [buffer[l].at[ptr].set(value[l]) for l in range(len(value))]?

Also, I am not having parallel RL environments at the moment, but I keep your suggestion in mind for later.

And thank you! I really appreciate your help and insights!

YouJiacheng Mar 13, 2022

Update: BTW, I think you don't need jnp.stack in sample in your first version, since t[indexes] already has the batch axis. (jit should optimize it away, but it is confusing.)

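@partial(jax.jit, static_argnums=(0,))
def sample(batch_size, buffer_tree, key):
    key, subkey = jax.random.split(key)
    # this is a free function, so there is no self.maxsize here;
    # read the buffer length from the leading axis of the first leaf instead
    maxsize = jax.tree_util.tree_leaves(buffer_tree)[0].shape[0]
    indices = jax.random.randint(subkey, shape=(batch_size,),
                                 minval=0, maxval=maxsize)
    return jax.tree_map(lambda t: t[indices], buffer_tree), key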

You can use jax.tree_map again for inplace_store, though it is neither simpler nor faster than a list comprehension when the values are organized in a plain list. (One exception: if you accidentally pass mismatched structures, a list comprehension may not raise an error.)
But I think RL practitioners may want to use a nested dict rather than a plain list, e.g. a dict that separates observations, actions and rewards, to make the code more readable and less bug-prone.
Thus the more general tree_map version may be preferred.

@partial(jax.jit, donate_argnums=0)
def inplace_store(buffer_tree, ptr, value_tree):
    return jax.tree_map(lambda buffer, value: buffer.at[ptr].set(value), buffer_tree, value_tree)

class CyclicBuffer:
    def __init__(self, maxsize, example_tree, batch_size):
        self.rng = jax.random.PRNGKey(11)
        self.batch_size = batch_size
        self.buffer = jax.tree_map(lambda example: jnp.zeros_like(example, shape=(maxsize, *example.shape)), example_tree)
        # in JAX, jnp.empty also returns a zero-filled array, so zeros is used explicitly
        self.ptr, self.maxsize = 0, maxsize
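Putting the pieces together for the nested-dict case, a minimal sketch might look as follows. The transition layout (keys "obs", "action", "reward"), the sizes, and the fill values are all assumptions for illustration, and jax.tree_util.tree_map is used as the equivalent of jax.tree_map:

```python
from functools import partial
import jax
import jax.numpy as jnp

tree_map = jax.tree_util.tree_map

@partial(jax.jit, donate_argnums=0)
def inplace_store(buffer_tree, ptr, value_tree):
    # write one transition into row ptr of every leaf
    return tree_map(lambda b, v: b.at[ptr].set(v), buffer_tree, value_tree)

@partial(jax.jit, static_argnums=0)
def sample(batch_size, buffer_tree, key):
    key, subkey = jax.random.split(key)
    maxsize = jax.tree_util.tree_leaves(buffer_tree)[0].shape[0]
    indices = jax.random.randint(subkey, shape=(batch_size,), minval=0, maxval=maxsize)
    # the same indices are applied to every leaf, so rows stay aligned
    return tree_map(lambda t: t[indices], buffer_tree), key

maxsize, batch_size = 16, 4
# hypothetical layout of one transition
example = {"obs": jnp.zeros((3, 2)),
           "action": jnp.zeros((), jnp.int32),
           "reward": jnp.zeros(())}
buffer = tree_map(lambda e: jnp.zeros((maxsize, *e.shape), e.dtype), example)

for ptr in range(maxsize):
    transition = {"obs": jnp.full((3, 2), float(ptr)),
                  "action": jnp.asarray(ptr % 2, jnp.int32),
                  "reward": jnp.asarray(float(ptr))}
    buffer = inplace_store(buffer, ptr, transition)

batch, key = sample(batch_size, buffer, jax.random.PRNGKey(0))
```

Note that sample draws from the whole buffer, so slots that were never written would come back as zeros; the sketch fills every slot first to avoid that. Restricting the range to the number of stored entries is left out here.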

frmetz Mar 14, 2022
Author

This was extremely helpful! (Changing the replay buffer according to your suggestion reduced my whole training time by more than a factor of 2!)
Thank you so much @YouJiacheng !!
