You can explore the underlying computations used by each of these approaches with `jax.make_jaxpr`:

```python
with jax.disable_jit():  # for cleaner jaxprs
    print("gram1:")
    print(jax.make_jaxpr(gram1)(X, Y))
    print("\ngram2:")
    print(jax.make_jaxpr(gram2)(X, Y))
    print("\ngram3:")
    print(jax.make_jaxpr(gram3)(X, Y))
```

Output:
You can see in the jaxprs that the first two approaches materialize large intermediate arrays of pairwise differences, while the third does not. Now, it's true that in general XLA compilation may rearrange these kinds of computations to avoid unnecessary allocation, but it appears that for this particular computation the compiler does not automatically reduce the operation to the more efficient form. I suspect the reason this kind of rewrite is not built into the compiler is that the two forms are not equivalent in floating-point arithmetic. For example:

```python
import numpy as np

x = np.float64(1E8)
y = x + 1
print((y - x) * (y - x))
# 1.0
print(y ** 2 + x ** 2 - 2 * y * x)
# 0.0
```
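A practical consequence of this non-equivalence (my own illustration, not from the thread): the expanded form `||x||^2 + ||y||^2 - 2 x.y` can suffer the same cancellation inside a distance computation, occasionally even producing tiny negative "squared distances", which is why it is commonly clamped before taking a square root:

```python
# Illustration (assumed example, not from the thread): the expanded
# squared-distance formula can lose precision where the direct
# difference does not.
import numpy as np

x = np.array([1e4, 1.0], dtype=np.float32)
y = np.array([1e4 + 1.0, 1.0], dtype=np.float32)

# Direct form: subtract first, then square and sum.
direct = np.sum((x - y) ** 2)

# Expanded form: large terms of similar magnitude cancel.
expanded = np.sum(x ** 2) + np.sum(y ** 2) - 2 * np.dot(x, y)

print(direct)    # 1.0
print(expanded)  # may differ from 1.0 due to cancellation

# Common guard before a square root:
safe = np.maximum(expanded, 0.0)
```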
Hello,
while trying to compute Gram matrices for kernel methods with JAX, I realized that the computation time can vary considerably, even for simple L2 distances.
Here are three different ways to compute the same thing:
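(The original code blocks did not survive extraction; the following is a plausible sketch of three such implementations. Only the names `gram1`, `gram2`, and `gram3` come from the thread; the exact bodies are assumptions.)

```python
import jax
import jax.numpy as jnp

def gram1(X, Y):
    # Nested vmap over rows: squared L2 distance for each (x, y) pair.
    return jax.vmap(lambda x: jax.vmap(lambda y: jnp.sum((x - y) ** 2))(Y))(X)

def gram2(X, Y):
    # Broadcasting: materializes an (N, M, D) array of pairwise differences.
    return jnp.sum((X[:, None, :] - Y[None, :, :]) ** 2, axis=-1)

def gram3(X, Y):
    # Expanded form ||x||^2 + ||y||^2 - 2 x.y: only matmul-sized intermediates.
    return (jnp.sum(X ** 2, axis=1)[:, None]
            + jnp.sum(Y ** 2, axis=1)[None, :]
            - 2 * X @ Y.T)
```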
Timing these three implementations shows that the last one is by far the fastest. Does anyone have any idea why there are such large differences in computation time (x40 between the slowest and the fastest)?