What is the correct way to compute per-sample gradients for an RNN? The per-sample-gradient recipe in the JAX documentation requires calling `grad` before `vmap` (i.e., `vmap(grad(f))`), but in many cases recurrent models already contain calls to `vmap`, so this `vmap(grad(f))` wrapping is not possible. See below for a working example: I would like `gb.shape` to be `(5, 3)`, not `(3,)`, without rewriting `linear_rnn`.
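(The working example did not survive the page extraction. As a stand-in, here is a minimal sketch of the kind of setup being described, assuming a toy `linear_rnn` that scans over time and uses an internal `vmap` over the batch; the tanh cell, the squared-error loss, and all shapes are assumptions — only the names `linear_rnn`, `loss`, `params`, `x`, `y`, and `gb` come from the discussion.)

```python
import jax
import jax.numpy as jnp


def linear_rnn(params, x):
    # x: (batch, time, features). The batch dimension is consumed by an
    # internal vmap, which is what prevents wrapping the whole model in
    # vmap(grad(f)) the way the JAX per-sample-gradient recipe assumes.
    W, b = params

    def single_example(seq):
        def step(h, xt):
            h = jnp.tanh(h @ W + xt + b)
            return h, None

        h_last, _ = jax.lax.scan(step, jnp.zeros_like(b), seq)
        return h_last

    return jax.vmap(single_example)(x)


def loss(params, x, y):
    return jnp.mean((linear_rnn(params, x) - y) ** 2)


key = jax.random.PRNGKey(0)
params = (jax.random.normal(key, (3, 3)), jnp.zeros(3))  # (W, b)
x = jax.random.normal(key, (5, 4, 3))  # batch of 5 sequences
y = jnp.zeros((5, 3))

_, gb = jax.grad(loss)(params, x, y)
print(gb.shape)  # (3,) -- summed over the batch; the per-sample version would be (5, 3)
```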
-
From the expected shape, it sounds like you want the gradient with respect to the inputs: `gx = jax.grad(loss, argnums=1)(params, x, y)`. If that's not what you have in mind, I'm unclear on where the batch size of 5 would come from.
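For concreteness, with the (assumed) shapes from the sketch above, that call differentiates with respect to the second positional argument, `x`, so its result naturally carries the batch dimension:

```python
# Gradient with respect to the second positional argument, x, so the
# result has x's shape and keeps the leading batch dimension of 5.
gx = jax.grad(loss, argnums=1)(params, x, y)
print(gx.shape)  # (5, 4, 3) with the sketch above
```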