jax.lax.scan & jax.lax.map going OOM under gradient computation #10131
Hello team, I have a use-case where I need to perform multi-Gumbel-sampling on each row of a matrix while computing the gradient. I have been trying to use jax.lax.scan and jax.lax.map to keep the memory footprint down, but everything goes OOM as soon as the gradient is involved. For example, taking a [NUM_ROWS x NUM_COLS] matrix of logits, I want to sum NUM_SAMPLES_PER_ROW Gumbel-softmax samples for each row. Hopefully the below code will make it easier to understand and reproduce:

import jax
import jax.numpy as jnp
jax.config.update('jax_enable_x64', True)
NUM_SAMPLES_PER_ROW = 1000
NUM_ROWS = 3000
NUM_COLS = 1500
GUMBEL_TAU = 0.7
@jax.jit
def multi_gumbel_sample(input, key):
    # This function performs gumbel sampling on n-rows together, once
    gumbel_sample_once = jax.jit(lambda logits, key: jax.nn.softmax(
        (logits + jax.lax.stop_gradient(jax.random.gumbel(key, logits.shape))) / GUMBEL_TAU
    ))

    # Make all the keys we would need, together
    all_keys = jax.random.split(key, num = NUM_SAMPLES_PER_ROW + 1)

    # Will do gumbel sampling `NUM_SAMPLES_PER_ROW` times for each row in a given matrix
    def final_function(args):
        x_raw, keys = args

        # Loop using scan -- Memory intensive
        # Memory needed is [ NUM_BATCHES x ROW_BATCH_SIZE x NUM_SAMPLES_PER_ROW x NUM_COLS ]
        ret = jax.lax.scan(
            lambda x, step: (x + gumbel_sample_once(x_raw, keys[step + 1]), None),
            gumbel_sample_once(x_raw, keys[1]),
            jnp.arange(NUM_SAMPLES_PER_ROW - 1),
            length = NUM_SAMPLES_PER_ROW - 1,
        )[0]

        # Manual expanding & reduce -- Still memory intensive
        # ret = jnp.sum(gumbel_sample_once(x_raw[jnp.newaxis, ...].repeat(NUM_SAMPLES_PER_ROW, axis = 0), keys[1]), axis = 0)

        # Loop using fori_loop -- Still memory intensive
        # def loop_body(i, sum_till_now): return sum_till_now + gumbel_sample_once(x_raw, keys[i])
        # ret = jax.lax.fori_loop(0, NUM_SAMPLES_PER_ROW, loop_body, jnp.zeros(x_raw.shape))

        return ret # [ ROW_BATCH_SIZE x NUM_COLS ]

    ROW_BATCH_SIZE = 100
    # Do batching on rows & use jax.lax.map over the given matrix to save memory
    num_batches = NUM_ROWS // ROW_BATCH_SIZE
    send_keys = jax.lax.stop_gradient(all_keys[jnp.newaxis, :].repeat(num_batches, axis = 0)) # [ NUM_BATCHES x NUM_SAMPLES_PER_ROW x 2 ]
    send_input = input.reshape(num_batches, ROW_BATCH_SIZE, NUM_COLS) # [ NUM_BATCHES x ROW_BATCH_SIZE x NUM_COLS ] ~~ [ NUM_ROWS x NUM_COLS ]
    final = jax.lax.map(final_function, (send_input, send_keys)).reshape(input.shape) # [ NUM_ROWS x NUM_COLS ]
    return final, all_keys[-1]
if __name__ == "__main__":
    key = jax.random.PRNGKey(0)
    rand_input = jax.random.normal(key, shape = (NUM_ROWS, NUM_COLS)) # [ NUM_ROWS x NUM_COLS ]

    # Taking the sum of the samples (a proxy for the entropy in the input matrix) as a dummy loss-function;
    # with has_aux=True the loss must return (scalar, aux), so the next PRNG key is passed through as aux
    def dummy_loss(input, key):
        samples, next_key = multi_gumbel_sample(input, key)
        return jnp.sum(samples), next_key

    grad_fn = jax.grad(dummy_loss, has_aux=True)
    # PASSES
    output, key = multi_gumbel_sample(rand_input, key) # [ NUM_ROWS x NUM_COLS ]
    print(output.shape)

    # FAILS -- OOM
    output, key = grad_fn(rand_input, key) # [ NUM_ROWS x NUM_COLS ]
    print(output.shape)

And below are some OOM Debugging stats:

BufferAssignment stats:
  parameter allocation: 34.33MiB
  constant allocation: 84B
  maybe_live_out allocation: 33.61GiB
  preallocated temp allocation: 1.12GiB
  preallocated temp fragmentation: 700B (0.00%)
  total allocation: 34.76GiB
  total fragmentation: 2.9KiB (0.00%)

Peak buffers:
  Buffer 1:
      Size: 33.49GiB
      Operator: op_name="jit(jvp(multi_gumbel_sample))/jit(main)/broadcast_in_dim[shape=(30, 999, 100, 1500) broadcast_dimensions=()]" source_file="/home/noveens/test_gumbel.py" source_line=52
      XLA Label: broadcast
      Shape: f64[30,999,100,1500]
      ==========================

  Buffer 2:
      Size: 1.12GiB
      XLA Label: parameter
      Shape: f64[999,100,1500]
      ==========================
  . . .

Thanks in advance for the help!
Hi! Vanilla reverse-mode autodiff (the jax.grad default) needs to store all intermediate values of the forward pass, which is what causes the OOM. You can see this in the peak buffer above: f64[30, 999, 100, 1500] is one [ROW_BATCH_SIZE x NUM_COLS] intermediate kept for every scan step of every batch, about 33.5 GiB in float64. You can leverage jax.checkpoint to reduce memory consumption, at the cost of extra computation.

How to (for the loop in final_function, not the loop over batches; it can actually work for the loop over batches too, but there is a more efficient way for that):

- Split the N-step scan into sqrt(N) chunks of sqrt(N) steps each.
- Wrap each chunk with jax.checkpoint (in practice this can be implemented as a nested scan, with the inner scan checkpointed).

If you want to divide the large data into some batches, then you shouldn't loop over batches inside the function you differentiate. BTW, I find that the loop in the …
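A minimal sketch of the sqrt(N) scheme above, assuming the per-row sample count is divisible by the chunk size; the names gumbel_softmax_sample, sum_gumbel_samples_checkpointed and the chunk_size argument are illustrative, not from the original post:

import math
import jax
import jax.numpy as jnp

GUMBEL_TAU = 0.7

def gumbel_softmax_sample(logits, key):
    # One Gumbel-softmax sample; the noise itself carries no gradient
    noise = jax.lax.stop_gradient(jax.random.gumbel(key, logits.shape))
    return jax.nn.softmax((logits + noise) / GUMBEL_TAU)

def sum_gumbel_samples_checkpointed(logits, keys, chunk_size):
    # keys: [num_samples, 2] PRNG keys; num_samples must be divisible by chunk_size
    num_chunks = keys.shape[0] // chunk_size
    chunked_keys = keys.reshape((num_chunks, chunk_size) + keys.shape[1:])

    @jax.checkpoint
    def run_chunk(acc, chunk_keys):
        # Inner scan over one chunk of keys; jax.checkpoint makes reverse-mode
        # save only this chunk's inputs and recompute its interior on the
        # backward pass, so residual memory scales with chunk_size, not N
        def step(carry, key):
            return carry + gumbel_softmax_sample(logits, key), None
        acc, _ = jax.lax.scan(step, acc, chunk_keys)
        return acc, None

    # Outer scan over the chunks
    total, _ = jax.lax.scan(run_chunk, jnp.zeros_like(logits), chunked_keys)
    return total

# Usage: 900 samples per row in 30 checkpointed chunks of 30 steps, under grad
logits = jax.random.normal(jax.random.PRNGKey(0), (100, 1500))
keys = jax.random.split(jax.random.PRNGKey(1), 900)
loss = lambda l: jnp.sum(sum_gumbel_samples_checkpointed(l, keys, math.isqrt(900)))
grads = jax.jit(jax.grad(loss))(logits)
print(grads.shape)  # (100, 1500)

With chunk_size near sqrt(N), the backward pass keeps roughly sqrt(N) chunk boundaries plus one chunk's interior instead of all N intermediates, in exchange for recomputing each chunk's forward pass once.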
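As for the "more efficient way" to handle the loop over batches, one option, assuming the loss stays a plain sum over rows, is ordinary gradient accumulation: each row-batch's gradient is then independent, so you can loop over batches in regular Python outside jax.grad and concatenate the per-batch gradients. A hedged sketch reusing sum_gumbel_samples_checkpointed from above; per_batch_grads and batch_loss are illustrative names, not from the original discussion:

def per_batch_grads(batch_loss, inputs, keys, batch_size):
    # Differentiate one row-batch at a time instead of mapping over batches
    # inside the function handed to jax.grad; only one batch's intermediates
    # are alive on the device at any time
    grad_fn = jax.jit(jax.grad(batch_loss))
    num_batches = inputs.shape[0] // batch_size
    grads = []
    for i in range(num_batches):
        rows = inputs[i * batch_size:(i + 1) * batch_size]
        grads.append(grad_fn(rows, keys))
    return jnp.concatenate(grads, axis = 0)

# Usage: the total loss is a sum over rows, so concatenating the per-batch
# gradients gives exactly the gradient of the full loss
inputs = jax.random.normal(jax.random.PRNGKey(0), (300, 1500))
keys = jax.random.split(jax.random.PRNGKey(1), 900)
batch_loss = lambda rows, keys: jnp.sum(sum_gumbel_samples_checkpointed(rows, keys, 30))
full_grads = per_batch_grads(batch_loss, inputs, keys, batch_size = 100)
print(full_grads.shape)  # (300, 1500)

jax.jit compiles the per-batch gradient once and reuses it for every batch, so the Python loop adds little overhead.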