Thanks for the question! Actually, the autodiff cookbook talks about reverse-over-reverse too: that's the first example in the section you linked.

To compute the Hessian of $f = f_2 \circ f_1$, you can certainly compute Jacobians and Hessians of $f_2$ and $f_1$ at the appropriate points and then contract them (i.e. multiply them) appropriately. But that's just the second-order version of computing the Jacobian of $f$ by computing the Jacobians of $f_2$ and $f_1$ and multiplying. That's not usually a good idea, because forming dense Jacobians that way typically throws away sparsity structure that the program represents directly as data dependence (i.e. not all outputs of $f_1$ depend on all inputs to $f_1$, and not all outputs of $f_2$ depend on all inputs to $f_2$).

I'm not sure exactly what you mean by a 'pass' here, but we can't use a single application of forward mode, because we've got to compute second derivatives of primitives somehow. That is, if the first derivative of the primitive $g_0$ involves applications of the primitive $g_1$, and the first derivative of $g_1$ involves applications of the primitive $g_2$, then we've got to generate $g_2$ somehow. In other words, if we take $f_1$ and $f_2$ to be primitives in your example, how would a single application of forward mode compute $H_{f_1}$ and $H_{f_2}$?
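For concreteness, here is a minimal sketch of the "two applications" point in JAX; the toy function `f` and input below are made up for illustration. One application of reverse mode produces a gradient *function*, and a second application of forward mode differentiates that function, which is the forward-over-reverse recipe from the cookbook:

```python
import jax
import jax.numpy as jnp

def f(x):
    return jnp.sum(jnp.sin(x) ** 2)   # toy scalar function, just for illustration

x = jnp.arange(3.0)

grad_f = jax.grad(f)          # first application: reverse mode gives a gradient *function*
hess_f = jax.jacfwd(grad_f)   # second application: forward mode differentiates that function

print(hess_f(x).shape)                              # (3, 3)
print(jnp.allclose(hess_f(x), jax.hessian(f)(x)))   # True: matches jax.hessian
```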
I'm not sure exactly what you mean by 'symbolic derivatives', but maybe it's helpful to write out these functions in Python-like syntax, something like:

```python
from math import sin, cos

# JVP (forward-mode) rule for the primitive sin
def jvp_sin(x, xdot):
    y = sin(x)
    ydot = cos(x) * xdot
    return y, ydot

# Linearization rule for sin: primal output plus a linear map on tangents
def lin_sin(x):
    y = sin(x)
    cos_x = cos(x)
    return y, lambda xdot: cos_x * xdot

# VJP (reverse-mode) rule for sin: primal output plus the transposed linear map
def vjp_sin(x):
    y = sin(x)
    cos_x = cos(x)
    return y, lambda ybar: cos_x * ybar

# Composition rules: given the rules for f and g, build the rules for f . g
def jvp_compose(jvp_f, jvp_g):
    def jvp_fg(x, xdot):
        y, ydot = jvp_g(x, xdot)
        z, zdot = jvp_f(y, ydot)
        return z, zdot
    return jvp_fg

def lin_compose(lin_f, lin_g):
    def lin_fg(x):
        y, g_lin = lin_g(x)
        z, f_lin = lin_f(y)
        return z, lambda xdot: f_lin(g_lin(xdot))
    return lin_fg

def vjp_compose(vjp_f, vjp_g):
    def vjp_fg(x):
        y, g_vjp = vjp_g(x)
        z, f_vjp = vjp_f(y)
        return z, lambda zbar: g_vjp(f_vjp(zbar))
    return vjp_fg
```

WDYT?
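As a quick sanity check of the rules sketched above (using the helper names from the sketch, which are not JAX APIs), composing the `sin` rule with itself recovers the value and first derivative of `sin(sin(x))`:

```python
from math import sin, cos

x = 0.7
z, fg_vjp = vjp_compose(vjp_sin, vjp_sin)(x)
print(z, sin(sin(x)))                      # same primal value
print(fg_vjp(1.0), cos(sin(x)) * cos(x))   # same first derivative
```

To go one order higher you would then need rules for `cos` and multiplication as well, which is exactly the "we've got to generate $g_2$ somehow" point above.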
I understand how to use JAX to compute the Hessian using automatic differentiation; however, I am having difficulty understanding how it works. In particular, I don't understand why we need two ~~passes~~ applications of automatic differentiation (i.e., reverse then forward or forward then reverse)*.

Assume that $f$ is defined as a composition of functions:

$$f = f_2 \circ f_1$$

where $f_1:\mathbb{R}^n \rightarrow \mathbb{R}^{m_1}$ and $f_2:\mathbb{R}^{m_1} \rightarrow \mathbb{R}$. We can compute the Jacobian $J_f(x)$ using the chain rule:
$$J_f(x) = J_{f_2}(f_1(x))J_{f_1}(x)$$
and we can also compute the Hessian $H_f(x)$:
$$H_f(x) = J_{f_1}(x)^TH_{f_2}(f_1(x))J_{f_1}(x) + J_{f_2}(f_1(x))H_{f_1}(x)$$
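For what it's worth, here is a quick numerical check of that formula in JAX; the particular $f_1$, $f_2$, and $x$ below are made up for illustration (with $f_2$ scalar-valued, so every term is a matrix):

```python
import jax
import jax.numpy as jnp

def f1(x):                                  # R^3 -> R^2, chosen arbitrarily
    return jnp.array([jnp.sin(x[0]) * x[1], x[1] * x[2] ** 2])

def f2(y):                                  # R^2 -> R, chosen arbitrarily
    return jnp.exp(y[0]) + y[1] ** 3

f = lambda x: f2(f1(x))
x = jnp.array([0.3, -1.2, 0.8])

J1 = jax.jacfwd(f1)(x)                      # (2, 3)
J2 = jax.grad(f2)(f1(x))                    # (2,)
H1 = jax.hessian(f1)(x)                     # (2, 3, 3)
H2 = jax.hessian(f2)(f1(x))                 # (2, 2)

H_manual = J1.T @ H2 @ J1 + jnp.tensordot(J2, H1, axes=1)
print(jnp.allclose(H_manual, jax.hessian(f)(x)))   # True
```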
The JAX Autodiff Cookbook describes two methods for computing the Hessian using automatic differentiation: forward over reverse and reverse over forward.

Why can't we use a single forward-mode ~~pass~~ application to compute $f_1(x)$, $J_{f_1}(x)$, and $H_{f_1}(x)$ in the same way we compute $f_1(x)$ and $J_{f_1}(x)$ for the Jacobian? We could then compute $f_2(f_1(x))$, $J_{f_2}(f_1(x))$, and $H_{f_2}(f_1(x))$, and then put everything together.

\* If, for example, we choose to do forward mode and then reverse mode, then the output of the forward mode must include ~~symbolic derivatives~~ some function, not just the Jacobian at a specific point. Otherwise, how would the reverse mode work? Perhaps I am missing something, so it would be helpful if someone could provide a simple example.
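Regarding the footnote, one concrete illustration (the function `g` and the point `x` here are just an example): in JAX, an application of forward mode can indeed return a function rather than a Jacobian matrix, and reverse mode can then act on that function by transposing it.

```python
import jax
import jax.numpy as jnp

def g(x):
    return jnp.sin(x)

x = jnp.array(0.5)

y, g_lin = jax.linearize(g, x)        # y = g(x); g_lin is the linear map v -> J_g(x) v
print(g_lin(jnp.ones_like(x)))        # cos(0.5)

g_vjp = jax.linear_transpose(g_lin, jnp.ones_like(x))   # reverse mode = transpose of the linear map
print(g_vjp(jnp.ones_like(y)))        # (cos(0.5),)
```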