Placement of JIT and performance relative to Pytorch #6769

LawrenceMMStewart · 2021-05-17T19:44:47Z

LawrenceMMStewart
May 17, 2021

Hello everyone, I hope you are all well.

I have just started using Jax and have written a simple Multi-Layer Perceptron following the structure provided in the documentation tutorial:

https://jax.readthedocs.io/en/latest/notebooks/neural_network_with_tfds_data.html

Whilst I indeed found that jitting a gradient step of my neural network results in a much faster performance, I found that the same gradient step in Pytorch runs significantly faster (approx 3.6 times faster for GPU / 2.7 times faster for CPU).

Is this to do with the way I am using jax.jit? From what I have read from the documentation, one wishes to jit a function that combines the maximum number of jax operations together for optimal performance. Hence, I have combined the forward and backward pass into a single function, which is then jitted.

I have attached the code below to recreate the results. The experiment consists of performing 1000 gradient steps for a MLP implemented in Jax and another MLP implemented in Torch. My code was ran from a Google Colab Session.

import torch
import numpy as np
import jax.numpy as jnp
import jax.random as random
from jax import jit, grad
from functools import partial
import math
import torch.nn as nn
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# MLP IN JAX
class jax_MLP():
    def __init__(self,H,key = random.PRNGKey(123)):
        keys = random.split(key,num=3)
        scale = 1/ math.sqrt(H)
        self.a = random.uniform(keys[0],minval = -1,maxval = 1,shape=(1,H)) * scale
        self.b = random.uniform(keys[1],minval = -1,maxval = 1,shape=(H,)) * scale
        self.w = random.uniform(keys[2],minval = -1,maxval = 1,shape=(H,1)) * scale

    def get_params(self):
        return [self.a,self.b,self.w]

    def forward(self,x,params):
        """
        params = [a,b,w]
        """
        a,b,w = params
        x1  =  jnp.dot(x,a) + b #first linear map with bias
        x2 = jnp.maximum(0,x1) #ReLU
        x3 = jnp.dot(x2,w) #second linear map no bias
        return x3

    #forward pass and calculate L2 Loss
    def forward_pass(self,x,y,params):
        preds = self.forward(x,params)
        return jnp.mean((preds-y)**2)

    @partial(jit,static_argnums=(0,))
    def forward_backward(self,x,y,params):
        loss = self.forward_pass(x,y,params)
        grads = grad(self.forward_pass,argnums=[2])(x,y,params)
        return loss,grads

    def update(self,grads,lr):
        da,db,dw = grads[0]

        self.a -= lr*da
        self.b -= lr*db
        self.w -= lr*dw

# MLP IN TORCH
class torch_MLP(nn.Module): 
    """
    Multi layer perceptron
    """
    def __init__(self,H=100,C = 1):
        super(torch_MLP, self).__init__()
        self.H = H 
        self.linear1 =  nn.Linear(1,H,bias=True)
        self.relu = nn.ReLU()

        self.linear2 =  nn.Linear(H,C,bias=False)
 
    def forward(self,x):
        y = self.linear1(x)
        y = self.relu(y)
        y = self.linear2(y)
        return y

tmlp = torch_MLP(H=1000).to(device)
jmlp = jax_MLP(1000)

#Learning Rate
LR = 1e-4 

#PYTORCH params
loss_fn = torch.nn.MSELoss(reduction='mean')
optimizer = torch.optim.SGD(tmlp.parameters(),lr=LR)

# Define a single step of an MLP for both Jax and Torch

def torch_step(x,y):
  pred = tmlp(x)
  loss = loss_fn(pred,y)
  optimizer.zero_grad()
  loss.backward()
  optimizer.step()

def jax_step(x,y):
  params = jmlp.get_params()
  loss,grads = jmlp.forward_backward(x,y,params)
  jmlp.update(grads,LR)

# Functions to run 1000 epochs of gradient steps on random data

def epochs_torch():
  for i in range(1000):
    x = np.random.randn(100,1)
    y = np.random.randn(100,1)

    x = torch.Tensor(x).to(device)
    y = torch.Tensor(y).to(device)
    torch_step(x,y)


def epochs_jax():
    for i in range(1000):
      x = np.random.randn(100,1)
      y = np.random.randn(100,1)

      x = jnp.array(x)
      y = jnp.array(y)
      jax_step(x,y)

%timeit epochs_torch()
%timeit epochs_jax()

The results for running on CPU and GPU with a Google Colab session are as follows:

CPU

Pytorch MLP: 1 loop, best of 5: 962 ms per loop
Jax MLP: 1 loop, best of 5: 2.67 s per loop

Hence on CPU Torch is approximately 2.7 times faster

GPU

Pytorch MLP: 1 loop, best of 5: 753 ms per loop
Jax MLP: 1 loop, best of 5: 2.76 s per loop

Hence on GPU torch is approximately 3.6 times faster.

Many thanks for any help / pointers.

Answered by C-J-Cundy

May 17, 2021

I think the issue you have is that use of static_argnums with forwards_backwards. This re-compiles the function whenever the static input changes, which I think will change every time you update your instance of jax_MLP.

What I would recommend is to follow the philosophy followed by haiku and similar packages, where the parameters are passed around explicitly in the optimization loop, instead of updating a model object. This follows the jax philosophy of making functions functional.

Explicitly what I would recommend is changing update to

def update(params,grads,lr):
  a, b, w = params
  da,db,dw = grads[0]
  a -= lr*da
  b -= lr*db
  w -= lr*dw
  return a, b, w

unjitting forwards_backwards…

View full answer

C-J-Cundy · 2021-05-17T20:25:25Z

C-J-Cundy
May 17, 2021

I think the issue you have is that use of static_argnums with forwards_backwards. This re-compiles the function whenever the static input changes, which I think will change every time you update your instance of jax_MLP.

What I would recommend is to follow the philosophy followed by haiku and similar packages, where the parameters are passed around explicitly in the optimization loop, instead of updating a model object. This follows the jax philosophy of making functions functional.

Explicitly what I would recommend is changing update to

def update(params,grads,lr):
  a, b, w = params
  da,db,dw = grads[0]
  a -= lr*da
  b -= lr*db
  w -= lr*dw
  return a, b, w

unjitting forwards_backwards, and changing jax_step to

@jit
def jax_step(x,y, params):
  #params = jmlp.get_params()
  loss, grads = jmlp.forward_backward(x,y,params)
  return update(params, grads, LR)

which you can then jit, and then the actual timing loop would be

params = jmlp.get_params()
def epochs_jax():
    for i in range(1000):
      x = np.random.randn(100,1)
      y = np.random.randn(100,1)

      x = jnp.array(x)
      y = jnp.array(y)
      out_params = jax_step(x,y, params)
      out_params[0].block_until_ready()

Here I've added in a block_until_ready to make sure the timing is correct: see here for details

On my colab instance, this is now slightly faster than the pytorch implementation.

3 replies

LawrenceMMStewart May 17, 2021
Author

Thank you very much for this response, it was very clear!

With regards to the haiku philosophy of passing the params around explicitly, do users of Jax normally define a model class to store the layers? Or do they rather just have a single "train step" function that takes the parameter and data and runs all the operations in sequence?

Once again thank you for the help, I really do appreciate it!

C-J-Cundy May 17, 2021

The Haiku docs are pretty good, I'd recommend taking a look at them.

At a high level (ignoring complexities with state that you need for things like batchnorm) the way it works is you have a class which stores the layers which you can define in a similar way to a torch nn.Module or similar in tensorflow:

class MyModule(hk.Module):

  def __call__(self, x):

    w = hk.get_parameter("w", [], init=jnp.zeros)

    return x + w

you then call a function transform which gives you an apply function and an init function. The init function takes in an rng key and dummy input and gives you parameters. Calling apply with parameters gives you the output.

At a high level (you would have to wrap this with some without_state stuff for this to be valid haiku) the structure of most of my training code looks like this:

apply_fn, init_fn = transform(MyModule)
params = init_fn(rng, dummy_input)

def loss(params, x, y):
    prediction = apply_fn(params, x)
    return jnp.mean(prediction - y**2)

@jit
def step(params, x, y):
    grads = grad(loss)(params, x, y)
    return update(params, grads)

LawrenceMMStewart May 18, 2021
Author

Thank you very much Chris, I will take a look at the Haiku docs. Have a nice week!

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Placement of JIT and performance relative to Pytorch #6769

Uh oh!

{{title}}

Uh oh!

Replies: 1 comment 3 replies

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{editor}}'s edit

{{editor}}'s edit

Uh oh!

Uh oh!

{{title}}

Uh oh!

Select a reply

Uh oh!

Placement of JIT and performance relative to Pytorch #6769

Uh oh!

LawrenceMMStewart May 17, 2021

CPU

GPU

Replies: 1 comment · 3 replies

Uh oh!

C-J-Cundy May 17, 2021

Uh oh!

LawrenceMMStewart May 17, 2021 Author

Uh oh!

Uh oh!

C-J-Cundy May 17, 2021

Uh oh!

LawrenceMMStewart May 18, 2021 Author

LawrenceMMStewart
May 17, 2021

Replies: 1 comment 3 replies

C-J-Cundy
May 17, 2021

LawrenceMMStewart May 17, 2021
Author

LawrenceMMStewart May 18, 2021
Author