single input is slower than vmap? #15134

dui1234 · 2023-03-22T10:51:07Z


I have the following code that operates on a tree-like data structure, trees_Ns in my code:

import jax.numpy as jnp
import jax
from jax import jit, vmap, lax, random
from jax.experimental import checkify
from functools import partial
import time

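@jit
def stack_p(params):
    # jax.tree_util.tree_map is the stable spelling; jax.tree_map was deprecated and later removed.
    return jax.tree_util.tree_map(lambda *x: jnp.stack(x), *params)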

def construct_tree(size, init_value, alpha: float = 0.6):
    tree_capacity = 1
    while tree_capacity < size: tree_capacity *= 2
    return [jnp.full(shape = (2 * tree_capacity), fill_value=init_value),
            tree_capacity, 
            size, 
            alpha, 
            1.0, 
            0] 

def construct_trees(size, alpha: float = 0.6):
    return [construct_tree(size, 0.0, alpha), construct_tree(size, float("inf"), alpha)]

@partial(jit,static_argnums = (0))
def set_tree_value(function,tree,idx,value):
    idx += tree[1]
    tree[0] = tree[0].at[idx].set(value)
    idx //= 2
    cond_fun = lambda x: x[0]>= 1
    def body_fun(idx_t):
        tree = idx_t[1].at[idx_t[0]].set(function(idx_t[1][2 * idx_t[0]], idx_t[1][2 * idx_t[0] + 1]))
        idx = idx_t[0]//2
        return [idx,tree]
    dmy = lax.while_loop(cond_fun, body_fun, [idx,tree[0]])
    tree[0] = dmy[1]
    return tree

@jit
def set_trees_value(trees):
    trees[0] = set_tree_value(lax.add,trees[0],trees[0][-1],trees[0][-2]**trees[0][3])
    trees[0][-1] = (trees[0][-1] + 1) % trees[0][2]
    trees[1] = set_tree_value(lax.min,trees[1],trees[1][-1],trees[1][-2]**trees[1][3])
    trees[1][-1] = (trees[1][-1] + 1) % trees[1][2]
    return trees

jv_set_trees_value = jit(vmap(set_trees_value))

trees = [construct_trees(1000, 0.2)]
trees_Ns = stack_p(trees)
for i in range(2000): trees_Ns = vmap(set_trees_value)(trees_Ns)

seed = 999
key = random.PRNGKey(seed)
priority_dmy = random.normal(key,(1, 100, 1))
indx = [jnp.sort(random.choice(key,jnp.arange(1000),(100,)))]

@jit
def update_priorities2(trees, indices, priorities):
    f2 = lambda x:x>0
    checked_f2 = checkify.checkify(f2, errors=checkify.all_checks)
    err = vmap(checked_f2)(priorities)
    max_prior = jnp.max(lax.max(priorities,trees[0][4]))
    trees[0][-2] = max_prior
    trees[1][-2] = max_prior
    
    for idx, priority in zip(indices, priorities.reshape(-1)):
        trees[0] = set_tree_value(lax.add,trees[0],idx,priority**trees[0][3])
        trees[1] = set_tree_value(lax.min,trees[1],idx,priority**trees[1][3])
    
    return trees 

v_update_tree = vmap(update_priorities2)
j_update_tree = jit(vmap(update_priorities2))

tree_dmy = [jax.tree_util.tree_map(lambda x:x[i], trees_Ns) for i in range(1)]
a = time.time()
update_priorities2(tree_dmy[0],stack_p(indx)[0],priority_dmy[0])
time.time() - a
#results 0.005

c = time.time()
j_update_tree(trees_Ns,stack_p(indx),priority_dmy)
time.time() - c
#results 0.001

d = time.time()
v_update_tree(trees_Ns,stack_p(indx),priority_dmy)
time.time() - d
#results 0.007

My questions are:

Why is the non-vmapped function, update_priorities2, slower than the vmapped one, j_update_tree?
Are there smarter ways to accelerate update_priorities2, which contains a for loop? The loop is needed because the tree's values must be updated in order, following the indices. The function's runtime also scales strongly with the length of indices and priorities. Is there an alternative way to construct the function so that it does not scale with the input length, for example using vmap?

jakevdp · 2023-03-22T20:19:58Z

jakevdp
Mar 22, 2023
Maintainer

First off, for running microbenchmarks in JAX, be sure to follow the recommendations at FAQ: Benchmarking JAX Code. I found the following timings for your calls on a Colab CPU:

_ = update_priorities2(tree_dmy[0],stack_p(indx)[0],priority_dmy[0])  # compile
%timeit jax.block_until_ready(update_priorities2(tree_dmy[0],stack_p(indx)[0],priority_dmy[0]))
# 1.49 ms ± 244 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

_ = j_update_tree(trees_Ns,stack_p(indx),priority_dmy)  # compile
%timeit jax.block_until_ready(j_update_tree(trees_Ns,stack_p(indx),priority_dmy))
# 312 µs ± 12.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

_ = v_update_tree(trees_Ns,stack_p(indx),priority_dmy)  # compile
%timeit jax.block_until_ready(v_update_tree(trees_Ns,stack_p(indx),priority_dmy))
# 2.96 ms ± 119 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
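
The measurement pattern used above (one warm-up call so that compilation is excluded, then timing with `jax.block_until_ready` so that asynchronous dispatch is included) can be reproduced outside IPython. The following is only a minimal sketch: the `benchmark` helper and the `example` function are hypothetical stand-ins, not part of the original code.

```python
import time

import jax
import jax.numpy as jnp


def benchmark(fn, *args, repeats=100):
    """Return the mean wall-clock seconds per call, excluding compilation."""
    # Warm-up call: triggers tracing and compilation for these argument shapes.
    jax.block_until_ready(fn(*args))
    start = time.perf_counter()
    for _ in range(repeats):
        # Block so that the asynchronously dispatched work is actually measured.
        jax.block_until_ready(fn(*args))
    return (time.perf_counter() - start) / repeats


@jax.jit
def example(x):  # hypothetical stand-in for a function such as update_priorities2
    return jnp.sum(jnp.sin(x) ** 2)


x = jnp.arange(1000.0)
seconds = benchmark(example, x)
print(f"{seconds * 1e6:.1f} µs per call")
```

Timing a single call with `time.time()` and no warm-up, as in the question, mostly measures tracing and compilation rather than steady-state execution.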

So it looks like the vmap version truly is faster. Looking at the code, my guess is the reason this is faster is that the while loop is vmapped, meaning that there is only a single while loop over batched operations rather than multiple while loops called in sequence.
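
To illustrate what "the while loop is vmapped" means, here is a small sketch, assuming a toy data-dependent loop (`count_halvings` is a hypothetical example, not taken from the code above). Applying `vmap` to a function that contains `lax.while_loop` yields one loop over the whole batch, which keeps iterating until every batch element's condition is false, instead of one loop per element.

```python
import jax
import jax.numpy as jnp
from jax import lax


def count_halvings(n):
    """Count how many times n can be halved before reaching zero."""
    def cond(state):
        i, _ = state
        return i >= 1

    def body(state):
        i, steps = state
        return i // 2, steps + 1

    _, steps = lax.while_loop(cond, body, (n, jnp.int32(0)))
    return steps


ns = jnp.array([1, 5, 8, 100], dtype=jnp.int32)

# One separate while loop per input, executed in sequence.
one_by_one = jnp.stack([count_halvings(n) for n in ns])

# A single batched while loop covering all inputs.
batched = jax.vmap(count_halvings)(ns)

print(one_by_one, batched)
```

Both variants give the same values; `jax.make_jaxpr(jax.vmap(count_halvings))(ns)` can be printed to inspect the batched loop.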


dui1234 Mar 23, 2023
Author

Thanks for your reply. One more question: I replaced the for loop with lax.scan and ran the test, but there seems to be no difference in performance. What should I do in this case? My actual code suffers a massive slowdown as the length of indices and priorities increases.

@jit
def update_priorities2_new(trees, indices, priorities):
    f2 = lambda x:x>0
    checked_f2 = checkify.checkify(f2, errors=checkify.all_checks)
    err = vmap(checked_f2)(priorities)
    max_prior = jnp.max(lax.max(priorities,trees[0][4]))
    trees[0][-2] = max_prior
    trees[1][-2] = max_prior

    def dmy_update(trees, idx_pri):
        idx, priority = idx_pri
        trees[0] = set_tree_value(lax.add,trees[0],idx,priority**trees[0][3])
        trees[1] = set_tree_value(lax.min,trees[1],idx,priority**trees[1][3])
        return trees, trees
    
    r,a = lax.scan(dmy_update,init = trees, xs = (indices,priorities.reshape(-1)))
    return r

v_update_tree_new = vmap(update_priorities2_new)
j_update_tree_new = jit(vmap(update_priorities2_new))

_ = update_priorities2_new(tree_dmy[0],stack_p(indx)[0],priority_dmy[0])  # compile
%timeit jax.block_until_ready(update_priorities2(tree_dmy[0],stack_p(indx)[0],priority_dmy[0]))
#1.63 ms ± 526 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

_ = j_update_tree_new(trees_Ns,stack_p(indx),priority_dmy)  # compile
%timeit jax.block_until_ready(j_update_tree(trees_Ns,stack_p(indx),priority_dmy))
#462 µs ± 76.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

_ = v_update_tree_new(trees_Ns,stack_p(indx),priority_dmy)  # compile
%timeit jax.block_until_ready(v_update_tree(trees_Ns,stack_p(indx),priority_dmy))
#3.37 ms ± 98.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Very much obliged for your help!
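
Two details in the snippet above may be worth checking. The `%timeit` lines call `update_priorities2`, `j_update_tree` and `v_update_tree`, which are the loop-based originals, so the reported timings may not reflect the `lax.scan` variants. In addition, `dmy_update` returns `trees` as its per-step output, which `lax.scan` stacks across all steps. The following is a minimal sketch, not the author's code, of a scan-based sequential update on a simplified sum tree; `CAPACITY`, `set_value` and `update_scan` are hypothetical names, and the per-step output is `None` so that nothing is stacked.

```python
import jax
import jax.numpy as jnp
from jax import lax

CAPACITY = 8  # hypothetical number of leaves (a power of two)


def set_value(tree, idx, value):
    """Write `value` at leaf `idx`, then recompute the sums on the path to the root."""
    idx = idx + CAPACITY
    tree = tree.at[idx].set(value)

    def cond(state):
        i, _ = state
        return i >= 1

    def body(state):
        i, t = state
        t = t.at[i].set(t[2 * i] + t[2 * i + 1])
        return i // 2, t

    _, tree = lax.while_loop(cond, body, (idx // 2, tree))
    return tree


@jax.jit
def update_scan(tree, indices, values):
    def step(t, idx_val):
        i, v = idx_val
        # Returning None as the second element avoids stacking the tree at every step.
        return set_value(t, i, v), None

    tree, _ = lax.scan(step, tree, (indices, values))
    return tree


tree = jnp.zeros(2 * CAPACITY)
indices = jnp.array([0, 3, 3, 7])       # leaf 3 is written twice, in order
values = jnp.array([1.0, 2.0, 5.0, 4.0])
out = update_scan(tree, indices, values)
print(out[1])  # root of the sum tree
```

Because `lax.scan` traces the body once instead of unrolling it, it mainly reduces compile time for long inputs; the updates still run sequentially, so runtime should still be expected to grow with the number of indices.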
