Sorry to be dense. Could you give a general mathematical formula for your problem?

```python
import jax
import jax.numpy as jnp

w = jnp.ones((50, 1))

def f(x):
    assert x.shape == (1,)
    return w @ x          # shape (50,)

xs = jnp.ones((1000, 1))

# per-example Jacobian of f, computed with forward mode and batched with vmap
grads = jax.vmap(jax.jacfwd(f))(xs)
assert grads.shape == (1000, 50, 1)
```
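(jax.jacfwd is the natural choice here: with a one-dimensional input, forward mode produces the whole Jacobian in a single pass, whereas reverse mode's cost grows with the 50-dimensional output.)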
Hi all, apologies if this isn't the best place to ask this; if it isn't, I'd really appreciate a pointer to a better platform for this question. For my work I mainly look at function approximation and PDE approximation problems using neural networks, and I need to compute quantities like (for simplicity, consider a network with a single hidden layer, where W_i and b_i are the i-th components of the weight and bias vectors)

d/dx sigma(x * W_i + b_i)

for all i, evaluated at every entry of some array x. I could never find a satisfactory or efficient way to do this in TensorFlow, and I know this isn't something most users need to do.
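For concreteness, a minimal sketch of that single-hidden-layer quantity in JAX (the width, activation, and parameter values below are placeholders, not taken from the original post):

```python
import jax
import jax.numpy as jnp

width = 50                                   # hypothetical layer width
key = jax.random.PRNGKey(0)
W = jax.random.normal(key, (width,))         # weights for a 1-D input
b = jnp.zeros(width)

def hidden(x):
    # activations sigma(x * W_i + b_i) for a scalar input x
    return jnp.tanh(x * W + b)               # shape (width,)

xs = jnp.linspace(-1.0, 1.0, 1000)           # array of evaluation points
# d/dx sigma(x * W_i + b_i) for every i and every x in xs
dh_dx = jax.vmap(jax.jacfwd(hidden))(xs)     # shape (1000, width)
```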
When there's only a single hidden layer, it's easy enough to do this manually, without automatic differentiation. Things get pretty messy with more than one layer, though, so I was hoping to leverage JAX to do it for me. I've included a hopefully minimal example below with just two hidden layers, one-dimensional input, and the same number of neurons in each layer. The main issue is that this ends up being quite slow, especially if the array x is large and/or the width of each layer is relatively large, and I need to do this computation more or less every epoch. Is there a more efficient way to do this computation with JAX?
Here's a quick vectorized version using vmap, if it helps anyone.
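That vectorized snippet isn't reproduced in this excerpt. Purely as a sketch of what a vmap-based version of the two-hidden-layer setup described above could look like (the widths, activation, and parameter names here are placeholders, not the poster's actual code):

```python
import jax
import jax.numpy as jnp

width = 50                                    # hypothetical width of both hidden layers
k1, k2 = jax.random.split(jax.random.PRNGKey(0))
W1 = jax.random.normal(k1, (width,))          # 1-D input -> first hidden layer
b1 = jnp.zeros(width)
W2 = jax.random.normal(k2, (width, width))    # first hidden layer -> second hidden layer
b2 = jnp.zeros(width)

def activations(x):
    # all hidden-unit activations for a scalar input x
    h1 = jnp.tanh(x * W1 + b1)                # shape (width,)
    h2 = jnp.tanh(h1 @ W2 + b2)               # shape (width,)
    return h1, h2

xs = jnp.linspace(-1.0, 1.0, 1000)            # evaluation points
# d/dx of every hidden unit in both layers, for every x in xs
dh1_dx, dh2_dx = jax.vmap(jax.jacfwd(activations))(xs)
# dh1_dx.shape == (1000, width), dh2_dx.shape == (1000, width)
```

Because the input is one-dimensional, jax.jacfwd recovers all of these derivatives from a single forward-mode pass per point, and jax.vmap batches that over the whole array of inputs; wrapping the computation in jax.jit would typically help further when it is repeated every epoch.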