For this specific problem, I have found 4 solutions using forward-mode autodiff (f, h, ff, and gg below succeed; the fully broadcast variant g is the one that fails):

```python
import jax
import jax.numpy as jnp

def sq_d(xi, xj):
    # pairwise squared distances between the rows of xi and xj
    xi, xj = jnp.expand_dims(xi, -2), jnp.expand_dims(xj, -3)
    return jnp.sum(jnp.square(xi - xj), -1)

def rbf_mvm(xi, xj, vj):
    # RBF kernel block K(xi, xj) applied to vj
    return jnp.exp(-0.5 * sq_d(xi, xj)) @ vj

train_x = jax.random.normal(jax.random.PRNGKey(0), (50000, 3), jnp.float32)
probes = jax.random.normal(jax.random.PRNGKey(5), (50000, 10), jnp.float32)
chunks = 50

def f(scaling):  # success: plain Python loop over chunks
    x_scaled = train_x / scaling
    x_chunked = jnp.split(x_scaled, chunks)
    v_chunked = jnp.split(probes, chunks)
    kv = 0.0
    for i in range(chunks):
        kv = kv + rbf_mvm(x_scaled, x_chunked[i], v_chunked[i])
    return jnp.sum(probes.T @ kv)

def chunk(x: jnp.ndarray, chunks):
    return jnp.reshape(x, (chunks, -1, *x.shape[1:]))

def g(scaling):  # fail: broadcasts over all chunks at once
    x_scaled = train_x / scaling
    x_chunked = chunk(x_scaled, chunks)
    v_chunked = chunk(probes, chunks)
    kv = jnp.sum(rbf_mvm(x_scaled, x_chunked, v_chunked), 0)
    return jnp.sum(probes.T @ kv)

def h(scaling):  # success: fori_loop over chunks
    x_scaled = train_x / scaling
    x_chunked = chunk(x_scaled, chunks)
    v_chunked = chunk(probes, chunks)
    def body_fun(i, val):
        return val + rbf_mvm(x_scaled, x_chunked[i], v_chunked[i])
    out = jax.eval_shape(rbf_mvm, x_scaled, x_chunked[0], v_chunked[0])
    kv = jax.lax.fori_loop(0, chunks, body_fun, jnp.zeros(out.shape, out.dtype))
    return jnp.sum(probes.T @ kv)

def ff(scaling):  # success: scan over chunks, accumulating the result
    x_scaled = train_x / scaling
    x_chunked = chunk(x_scaled, chunks)
    v_chunked = chunk(probes, chunks)
    def scan_fun(carry, xv):
        return carry + rbf_mvm(x_scaled, xv[0], xv[1]), None
    out = jax.eval_shape(rbf_mvm, x_scaled, x_chunked[0], v_chunked[0])
    kv = jax.lax.scan(scan_fun, jnp.zeros(out.shape, out.dtype), [x_chunked, v_chunked])[0]
    return jnp.sum(probes.T @ kv)

def gg(scaling):  # success: scan over row chunks, stacking the per-chunk outputs
    x_scaled = train_x / scaling
    x_chunked = chunk(x_scaled, chunks)
    v_chunked = chunk(probes, chunks)
    def scan_fun(i, x_i):
        return i + 1, jnp.sum(rbf_mvm(x_i, x_chunked, v_chunked), 0)
    val: jnp.ndarray = jax.lax.scan(scan_fun, 0, x_chunked)[1]
    kv = jnp.reshape(val, (-1, *val.shape[2:]))
    return jnp.sum(probes.T @ kv)

print(jax.jvp(f, (3.0,), (1.0,)))  # use forward-mode autodiff to save memory
print(jax.jvp(h, (3.0,), (1.0,)))
print(jax.jvp(ff, (3.0,), (1.0,)))
print(jax.jvp(gg, (3.0,), (1.0,)))
```

I think it may not be the forward pass itself that causes the OOM, since …
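For a rough sense of why the fully broadcast variant g blows up while the chunked ones fit, here is a back-of-the-envelope estimate. It is not from the original post; the shapes follow from sq_d and rbf_mvm above, and it assumes XLA does not fuse the broadcasted intermediate away:

```python
# Hedged estimate of the intermediates that variant g asks XLA to materialize.
# Shapes follow from sq_d/rbf_mvm above; sizes assume float32 and no fusion
# of the broadcasted (xi - xj) tensor.
import math

N, d, n_chunks = 50_000, 3, 50
chunk_size = N // n_chunks
bytes_f32 = 4

diff_shape = (n_chunks, N, chunk_size, d)   # broadcasted (xi - xj) inside sq_d
sqd_shape = (n_chunks, N, chunk_size)       # squared-distance / kernel blocks

print(math.prod(diff_shape) * bytes_f32 / 1e9)  # ~30 GB, about the allocation reported in the question
print(math.prod(sqd_shape) * bytes_f32 / 1e9)   # ~10 GB
```

The chunked variants only ever hold one (N, N // chunks) kernel block at a time (roughly 0.2 GB here, ~0.6 GB for its broadcasted difference), which is why they stay within memory.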
Hi,
I'm trying to implement some large matrix-vector products (where the full matrix is too large to fit in memory) and then compute the gradient with respect to the inputs, but I am running into memory errors in the forward pass when using both jax.lax.scan and explicit broadcasting. The error I get is an out-of-memory error requesting 30 GB.
Is there a more JAX-like way to perform these kinds of matrix-vector products?
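The snippet and error text referred to above are not shown here, so as a stand-in, below is a minimal sketch of the kind of chunked RBF kernel matrix-vector product under discussion. The names (rbf_mvm_chunked, lengthscale, n_chunks) and shapes are illustrative, not taken from the original code. It uses jax.lax.map so that only one row block of the kernel exists at a time, and it stays compatible with jax.jvp:

```python
# Hypothetical sketch, not the poster's actual code: a chunked RBF kernel
# matrix-vector product K(x, x) @ v that never materializes the full (N, N)
# kernel. Names and shapes are illustrative.
import jax
import jax.numpy as jnp

def rbf_mvm_chunked(x, v, lengthscale, n_chunks):
    xs = x / lengthscale
    # Split the rows of K into n_chunks blocks; lax.map evaluates one
    # (chunk, N) kernel block at a time instead of the whole (N, N) matrix.
    x_blocks = jnp.reshape(xs, (n_chunks, -1, xs.shape[-1]))

    def one_block(x_blk):
        sq_d = jnp.sum((x_blk[:, None, :] - xs[None, :, :]) ** 2, -1)
        return jnp.exp(-0.5 * sq_d) @ v           # (chunk, v.shape[-1])

    kv_blocks = jax.lax.map(one_block, x_blocks)  # (n_chunks, chunk, ...)
    return jnp.reshape(kv_blocks, (x.shape[0], -1))

x = jax.random.normal(jax.random.PRNGKey(0), (50_000, 3))
v = jax.random.normal(jax.random.PRNGKey(1), (50_000, 10))
kv, kv_dot = jax.jvp(lambda s: rbf_mvm_chunked(x, v, s, 50), (3.0,), (1.0,))
```

Each step materializes only a (chunk, N) block, a few hundred MB with these shapes, rather than the full (N, N) kernel.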
System reference:
jax = '0.2.26'
jaxlib = '0.1.75'
cuda = '11.3'
OS: Linux