Reduce functionality for vmap #9505
-
I was trying to compute a Gauss-Newton matrix and had some difficulties aggregating the outer products of the Jacobians in an efficient manner, i.e., a sum of outer products generated by large parameter vectors, over the entire dataset. Essentially, what this came down to was: I was looking for some aggregation (reduce) functionality for the outputs of `jax.vmap`. Of course, I know that I can split this computation up into batches and use a loop, which will be equivalent. But I was actually wondering how I could implement this as a more principled map-reduce procedure in JAX. I've tried playing around with `jax.lax.reduce`, without much luck. Below is some code that summarizes what I would like to do.

Unviable solution:

```python
import jax
import jax.numpy as jnp

n = 200
d = 1000
a = jnp.arange(n * d).reshape((n, d))

# Works for slightly larger d
jnp.outer(a[0], a[0])

# Fails for slightly larger d: the full (n, d, d) stack of outer products is materialized
res = jax.vmap(jnp.outer)(a, a)
print(res.sum(axis=0))
```

Naive viable solution:

```python
@jax.jit
def outer_sum(arr):
    result = 0
    for a in arr:
        result += jnp.outer(a, a)
    return result

incremental_outer_sum = outer_sum(a)
print(incremental_outer_sum)
```

Python map-reduce (the fastest of the three so far):

```python
import functools

map_reduce = functools.reduce(jnp.add, map(jnp.outer, a, a))
print(map_reduce)
```

So, could a JAX primitive-based implementation pose a speed-up over the Python map-reduce approach? And how would this work?
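For completeness, here is a minimal sketch of the "split into batches and use a loop" option mentioned above. This is my own sketch, not part of the original post; the chunk size and the assumption that `n` is divisible by it are arbitrary choices for illustration.

```python
import jax
import jax.numpy as jnp

def chunked_outer_sum(a, chunk_size=50):
    # Assumes n is divisible by chunk_size for simplicity.
    n, d = a.shape
    total = jnp.zeros((d, d), a.dtype)
    for start in range(0, n, chunk_size):
        chunk = a[start:start + chunk_size]
        # Only a (chunk_size, d, d) intermediate is materialized per iteration.
        total = total + jax.vmap(jnp.outer)(chunk, chunk).sum(axis=0)
    return total

n, d = 200, 1000
a = jnp.arange(n * d, dtype=jnp.float32).reshape((n, d))
print(chunked_outer_sum(a))
```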
-
I just figured out that I can also be using `jnp.einsum`:

```python
# For the 2D `a` above, this contracts over the batch axis,
# i.e. it computes the sum of the per-row outer products directly.
einsum_result = jnp.einsum('nc,nd->cd', a, a)
print(einsum_result)
```

Still, a reduce functionality would come in handy for other use cases.
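As a quick sanity check (my own addition, not from the discussion), for the 2D example the einsum contraction agrees with the vmap-plus-sum result and with a plain matrix product:

```python
import jax
import jax.numpy as jnp

n, d = 200, 100  # small enough that the (n, d, d) vmap intermediate fits in memory
a = jnp.arange(n * d, dtype=jnp.float32).reshape((n, d))

via_vmap = jax.vmap(jnp.outer)(a, a).sum(axis=0)
via_einsum = jnp.einsum('nc,nd->cd', a, a)
via_matmul = a.T @ a

print(jnp.allclose(via_vmap, via_einsum), jnp.allclose(via_einsum, via_matmul))
```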
-
You just need to jit it, then XLA will optimize the computation for you. (However, this cannot be observed with `jax.xla_computation`.)

```python
n = 200
d = 5000
a = jnp.ones((n, d), jnp.float32)

def f(x):
    return jnp.sum(jax.vmap(jnp.outer)(x, x), axis=0)

print(f(a))           # fails with n = 5000, d = 1000 or n = 200, d = 5000 (on 16G V100)
print(jax.jit(f)(a))  # succeeds with n = 2000000, d = 1000 or n = 200, d = 60000 (on 16G V100)
```

Comparison with explicit map-reduce semantics:

```python
n = 200
d = 5000
a = jnp.ones((n, d), jnp.float32)

def f(x):
    return jnp.sum(jax.vmap(jnp.outer)(x, x), axis=0)

def g(xs):  # explicit map-reduce, succeeds even without jit
    def scan_fun(carry, x):
        return carry + jnp.outer(x, x), None
    out = jax.eval_shape(jnp.outer, xs[0], xs[0])
    return jax.lax.scan(scan_fun, jnp.zeros(out.shape, out.dtype), xs)[0]

print(jax.xla_computation(f)(a).as_hlo_text())
print(jax.xla_computation(g)(a).as_hlo_text())
```

You will see that f's unoptimized HLO still contains the large (n, d, d) intermediate, while g's carries only the (d, d) accumulator through the scan.
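If you do want to see what XLA actually compiles, reasonably recent JAX versions expose the post-optimization HLO through the ahead-of-time lowering API. This is my own sketch, not from the thread, and it assumes a JAX version where `Lowered.compile()` and `Compiled.as_text()` are available:

```python
import jax
import jax.numpy as jnp

n, d = 200, 5000
a = jnp.ones((n, d), jnp.float32)

def f(x):
    return jnp.sum(jax.vmap(jnp.outer)(x, x), axis=0)

compiled = jax.jit(f).lower(a).compile()
print(compiled.as_text())  # post-optimization HLO, where the fusion can be inspected
```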
`jax.lax.reduce`'s semantic is just a reduce over its operands, so the large intermediate array will not be eliminated without optimization. `jax.lax.scan` has a map-reduce "semantic", but optimization may break it for higher parallelism when the intermediate array should be stored (for examp…
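To make the `jax.lax.reduce` point concrete, here is a minimal sketch of my own (not from the thread): the reduction is expressed over an operand that has already been built, so the (n, d, d) stack still appears as an intermediate in the unoptimized computation. The small sizes are chosen so it runs without jit.

```python
import jax
import jax.numpy as jnp

n, d = 200, 100  # small sizes so the intermediate fits in memory
a = jnp.ones((n, d), jnp.float32)

def reduce_version(x):
    outers = jax.vmap(jnp.outer)(x, x)  # (n, d, d) operand is built first
    # lax.reduce only reduces this operand over axis 0; it does not
    # fuse the map and the reduce by itself.
    return jax.lax.reduce(outers, jnp.float32(0), jax.lax.add, (0,))

print(reduce_version(a).shape)  # (d, d)
```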