-
How would you suggest computing the per-example gradient on models that contain BatchNorm? The documentation suggests applying … For example, in the resnet50 example, I would like to replace the return value of …
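For readers following along, here is a minimal sketch of the standard per-sample-gradient recipe this question is presumably referring to (assuming the torch.func API; the toy model, loss, and data below are my own placeholders, not taken from the resnet50 example). With BatchNorm layers in training mode this recipe runs into trouble, because each example's output depends on the statistics of the whole batch:

```python
import torch
from torch import nn
from torch.func import functional_call, grad, vmap

# Hypothetical model and data, only to illustrate the recipe being discussed.
model = nn.Sequential(nn.Linear(10, 10), nn.ReLU(), nn.Linear(10, 2))
params = {k: v.detach() for k, v in model.named_parameters()}
buffers = {k: v.detach() for k, v in model.named_buffers()}

def compute_loss(params, buffers, sample, target):
    # Treat the single sample as a batch of size 1.
    batch = sample.unsqueeze(0)
    targets = target.unsqueeze(0)
    preds = functional_call(model, (params, buffers), (batch,))
    return nn.functional.cross_entropy(preds, targets)

# One gradient per example: vmap over the batch dimension of (data, targets).
per_sample_grads = vmap(grad(compute_loss), in_dims=(None, None, 0, 0))

data, targets = torch.randn(16, 10), torch.randint(0, 2, (16,))
grads = per_sample_grads(params, buffers, data, targets)
```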
-
Can you write out, mathematically and precisely, what the quantity you'd like to compute is? I think that what you're asking for might not exist. I believe the subtle point with batch norm is that the inputs in a batch are all combined, such that the gradient of the loss with respect to the parameters, given a vector of inputs, … You can construct a loss function which would take a single input …
This would only help in the cases where you're trying to compute a gradient with respect to an individual input.
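To make the "inputs in a batch are all combined" point concrete, here is a small sketch (my own illustrative setup, not code from the thread): with a BatchNorm layer in training mode, the gradient of example 0's loss with respect to the parameters changes when the other examples in the batch change, so a batch-independent "per-example gradient" is not well defined.

```python
import torch
from torch import nn

torch.manual_seed(0)
# Tiny hypothetical model containing BatchNorm, used only to illustrate the point.
model = nn.Sequential(nn.Linear(4, 4), nn.BatchNorm1d(4), nn.Linear(4, 1))

def grad_of_example_0(batch):
    model.zero_grad()
    out = model(batch)        # BatchNorm normalizes with statistics of the whole batch
    out[0].sum().backward()   # "loss" restricted to example 0 only
    return model[0].weight.grad.clone()

x = torch.randn(8, 4)
g1 = grad_of_example_0(x)
# Same example 0, different companions in the batch:
g2 = grad_of_example_0(torch.cat([x[:1], torch.randn(7, 4)]))
print(torch.allclose(g1, g2))  # False: example 0's gradient depends on the rest of the batch
```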
-
Great, your example makes things clearer.
In your example, $A$ represents the parameters of the network, $x$ represents a batch of inputs, and $f(x)$ represents a batch of outputs. So $f(x)$ is the function mapping an input batch to an output batch. (I think it makes sense to assume $m = n$ here.)
Essentially, the quantity you want to compute is the diagonal of the Jacobian matrix. In the general case, computing a Jacobian will take $\mathcal{O}(n)$ Jacobian-vector or vector-Jacobian products, each of which is roughly the same order of cost as evaluating the network on the batch, and e.g. $\mathcal{O}(n^2)$ memory, since each batch has $n$ inputs.
The key point is that for a typical network without batchnorm, the matrix … I'm not an expert on autodiff upper/lower bounds, but I believe that the fact that the Jacobian is known to be diagonal in the non-batchnorm case is what allows the speed-up in computing the per-example gradients. You have to store …
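Here is a small sketch of that "diagonal Jacobian" point (my own illustrative check, not from the thread; the two tiny networks are hypothetical). Treating the network as a map from an input batch to an output batch, the cross-example blocks $\partial f(x)_i / \partial x_j$ with $i \neq j$ are exactly zero for a network without batchnorm, and become nonzero once a BatchNorm layer is inserted:

```python
import torch
from torch import nn
from torch.autograd.functional import jacobian

torch.manual_seed(0)
x = torch.randn(4, 3)  # a batch of n = 4 inputs

# Two small hypothetical networks, identical apart from the BatchNorm layer.
plain = nn.Sequential(nn.Linear(3, 3), nn.Tanh(), nn.Linear(3, 1))
with_bn = nn.Sequential(nn.Linear(3, 3), nn.BatchNorm1d(3), nn.Linear(3, 1))

def off_diagonal_mass(net, inputs):
    # Jacobian of the batch-to-batch map f: (n, d) -> (n, 1); result shape (n, 1, n, d).
    jac = jacobian(net, inputs)
    n = inputs.shape[0]
    # Sum |d f(x)_i / d x_j| over pairs i != j; zero iff the examples never interact.
    mask = 1.0 - torch.eye(n).view(n, 1, n, 1)
    return (jac.abs() * mask).sum().item()

print(off_diagonal_mass(plain, x))    # 0.0 -> block-diagonal Jacobian, examples independent
print(off_diagonal_mass(with_bn, x))  # > 0 -> BatchNorm couples every example to the batch
```

This is why, without batchnorm, per-example gradients can be computed in roughly the cost of one backward pass over the batch, whereas with batchnorm there is no batch-independent per-example quantity to extract in the first place.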