Best practice of `experimental.jet` to evaluate laplacian of an scalar valued MLP #9598

YouJiacheng · 2022-02-16T15:32:50Z

YouJiacheng
Feb 16, 2022

This paper show that jet can accelerate high order differentiation of Two-layer MLP with exp non-linearities.
But I cannot find the example code of this paper. How can I use jet to accelerate laplacian computation of MLP?
Even if the non-linearities are not exp?
@mattjj

pnkraemer · 2022-02-18T17:06:48Z

pnkraemer
Feb 18, 2022

Hi @YouJiacheng! I'll let the experts answer as well, but let me add some quick pointers, maybe this is already helpful :)

The documentation introduced in the pull request Include experimental.jet into docs #9546 displays an example of using jet for a simple 1d function. It might be a good starting point to work with the snippets in there, and to try to extend them to multivariate functions.
While you can certainly use jet to compute Hessians, in my experience, jet truly shines for third and (much) higher order derivatives; say, 10th order derivatives. @mattjj mentions similar experiences in another discussion, which also links to some applications of jet (Document experimental.jet #3730 (comment)). But someone else might well have had different experiences.

7 replies

YouJiacheng Feb 28, 2022
Author

@pnkraemer
Oh, thanks! For the first snippet, I suppose that jet will not recognize the dimension of input/output (jacfwd/jacrev do this by std_basis), it just transforms the computation using a tracer. Thus jet.jet(f, (1.0,), ((x, 0.0),)) is equivalent to jet.jet(f, (x,), ((x, 0.0),)).
For the second snippet, can I think jet for high dimension input function is exactly the taylor expansion? For jet.jet(f, (x,), ((v, jnp.zeros_like(x)),)) the first term is f(x), the second term is grad(f)(x) @ v, the third term is (hessian(f)(x) @ v) @ v.

pnkraemer Feb 28, 2022

Sounds reasonable:

>>> x = jnp.ones((2,))
>>> v = jnp.arange(1., 3.)
>>> zeros = jnp.zeros_like(v)
>>>
>>> f = lambda s: jnp.square(jnp.sum(s))
>>>
>>> jet.jet(f, (x,), ((v,zeros,zeros),))
(DeviceArray(3.999998, dtype=float32), [DeviceArray(11.999994, dtype=float32), DeviceArray(17.99999, dtype=float32), DeviceArray(0., dtype=float32)])
>>>
>>> df, ddf, d3f = jax.jacfwd(f), jax.hessian(f), jax.jacfwd(jax.hessian(f))
>>> f(x), df(x) @ v, (ddf(x) @ v) @ v, ((d3f(x) @ v) @ v) @ v
(DeviceArray(4., dtype=float32), DeviceArray(12., dtype=float32), DeviceArray(18., dtype=float32), DeviceArray(0., dtype=float32))

YouJiacheng Feb 28, 2022
Author

I have seen these notes before, but it is the first time I understand it! Thanks!🥰
BTW, In this repo there are iterative use of jet, but does not change the primals. What is the iteration compute for? Just curious.

YouJiacheng Feb 28, 2022
Author

😭I using this method to compute laplacian of a mlp(exp activation) as the paper suggest, but it is about 1.4x slower than jnp.trace(jax.hessian(mlp)(x)). But the paper said it is 10x faster when the order of differentiation is 2. How can that be!😭😭😭😭😭😭😭😭😭

def laplacian_1(fun, x: jnp.ndarray):
    @jax.vmap
    def hvv(v):
        return jet.jet(fun, (x,), ((v, jnp.zeros_like(x)),))[1][1]
    return jnp.sum(hvv(jnp.eye(x.shape[0], dtype=x.dtype)))

def laplacian_2(fun, x: jnp.ndarray):
    in_tangents = jnp.eye(x.shape[0], dtype=x.dtype)
    pushfwd = partial(jax.jvp, jax.grad(fun), (x,))
    _, hessian = jax.vmap(pushfwd, out_axes=(None, 0))((in_tangents,))
    return jnp.trace(hessian)

def laplacian_3(fun, x: jnp.ndarray):
    return jnp.trace(jax.hessian(fun)(x))

pnkraemer Mar 1, 2022

What is the iteration compute for? Just curious

It implements the computation of high order derivatives of ODE solutions recursively. See Section 4 in the paper you linked.

YouJiacheng · 2022-02-28T12:59:56Z

YouJiacheng
Feb 28, 2022
Author

This paper said it is 10x faster when the order of differentiation is 2.

I don't know if it is laplacian-like differentiation operator(result is a scalar), or it is hessian-like operator(result is a high order tensor). Since the (k+1)-th term of jet is equal to
(...jacfwd(...jacfwd(f)...) @ v...) @ v (k jacfwd and k @ v), I think it is laplacian-like differentiation operator.

I use following method to compute laplacian of a mlp(exp activation, as the paper suggested), it is about 1.4x slower than jnp.trace(jax.hessian(mlp)(x)) for 64+1 layers MLP.
jet based laplacian is 10x faster than jnp.trace(jax.jacfwd(jax.jacfwd(mlp))(x)), and enjoy the same advantage that the memory consumption is constant w.r.t. computation depth, while jnp.trace(jax.hessian(mlp)(x)) does not.
I find that 10x faster is true if len(ws) == 1, which means 1+1 layers MLP, as the paper suggested.
I wonder why jet based laplacian is 10x faster for 1+1 layers MLP but 1.4x slower for 64+1 layers MLP, comparing with jnp.trace(jax.hessian(mlp)(x)).

from functools import partial

import jax
import jax.numpy as jnp
from jax.experimental import jet
# jet.fact = lambda n: jax.lax.prod(range(1, n + 1))

def f(ws, wo, x):
    for w in ws:
        x = jax.lax.exp(x @ w)
    return jnp.reshape(x @ wo, ())

@jax.jit
@partial(jax.vmap, in_axes=(None, None, 0))
def laplacian_1(ws, wo, x):
    fun = partial(f, ws, wo)
    @jax.vmap
    def hvv(v):
        return jet.jet(fun, (x,), ((v, jnp.zeros_like(x)),))[1][1]
    return jnp.sum(hvv(jnp.eye(x.shape[0], dtype=x.dtype)))

@jax.jit
@partial(jax.vmap, in_axes=(None, None, 0))
def laplacian_2(ws, wo, x):
    fun = partial(f, ws, wo)
    in_tangents = jnp.eye(x.shape[0], dtype=x.dtype)
    pushfwd = partial(jax.jvp, jax.grad(fun), (x,))
    _, hessian = jax.vmap(pushfwd, out_axes=(None, 0))((in_tangents,))
    return jnp.trace(hessian)

@jax.jit
@partial(jax.vmap, in_axes=(None, None, 0))
def laplacian_3(ws, wo, x):
    fun = partial(f, ws, wo)
    return jnp.trace(jax.hessian(fun)(x))

def timer(f):
    from time import time
    f() # compile
    t = time()
    for _ in range(3):
        f()
    print((time() - t) / 3)

d = 256
ws = [jnp.zeros((d, d)) for _ in range(64)]
wo = jnp.zeros((d, 1))
x = jnp.zeros((512, d))

timer(lambda : jax.block_until_ready(laplacian_1(ws, wo, x)))
timer(lambda : jax.block_until_ready(laplacian_2(ws, wo, x)))
timer(lambda : jax.block_until_ready(laplacian_3(ws, wo, x)))

2 replies

mattjj Mar 14, 2022
Maintainer

@jessebett do you remember more details about this?

I would suspect jet only becomes faster at orders 3 and higher, and even then it's an asymptotic behavior and may not predict the behavior on particular examples.

YouJiacheng Mar 14, 2022
Author

@mattjj
I find that though jet isn't faster for deep MLP, it saves lots of memory (can use >4x larger batch for d=256, 64+1 layers MLP).
It seems that memory cost of jet not scale with computation depth, since taylor-mode is a generalization of forward-mode
And jet is 10x faster than nested jacfwd. I think that the plot in the paper is jet v.s. nested jacfwd.
In addition, jet is 10x faster than jacfwd(jacrev(fun)) for 1+1 layers MLP

Best practice of experimental.jet to evaluate laplacian of an scalar valued MLP #9598

Uh oh!

Uh oh!

YouJiacheng Feb 16, 2022

Replies: 2 comments · 9 replies

Uh oh!

pnkraemer Feb 18, 2022

Uh oh!

Uh oh!

YouJiacheng Feb 28, 2022 Author

Uh oh!

pnkraemer Feb 28, 2022

Uh oh!

YouJiacheng Feb 28, 2022 Author

Uh oh!

Uh oh!

YouJiacheng Feb 28, 2022 Author

Uh oh!

pnkraemer Mar 1, 2022

Uh oh!

Uh oh!

YouJiacheng Feb 28, 2022 Author

Uh oh!

mattjj Mar 14, 2022 Maintainer

Uh oh!

Uh oh!

YouJiacheng Mar 14, 2022 Author

Best practice of `experimental.jet` to evaluate laplacian of an scalar valued MLP #9598

YouJiacheng
Feb 16, 2022

Replies: 2 comments 9 replies

pnkraemer
Feb 18, 2022

YouJiacheng Feb 28, 2022
Author

YouJiacheng Feb 28, 2022
Author

YouJiacheng Feb 28, 2022
Author

YouJiacheng
Feb 28, 2022
Author

mattjj Mar 14, 2022
Maintainer

YouJiacheng Mar 14, 2022
Author