-
I am running a benchmark between a Python / NumPy implementation and a JAX implementation of simple linear regression. The implementation iterates the weight updates with gradient descent.
The JAX implementation is pretty much the same; the iteration is just wrapped into a closure function.
It is staggering that the JAX implementation runs much slower than the NumPy one. For example, with a training size of 10k and a regression dimension of 100, the most intensive operation is a matrix multiplication of (10k, 100) by (100, 1).
The JAX runtime (13s on average) is about 4 times slower than the NumPy runtime (3s on average) on CPU. The JAX version used is 0.3.14, and the benchmark Colab notebook can be found below: https://colab.research.google.com/drive/17FY2z3Og7a36Ub4pal2vSdQ7FIzP9I34?usp=sharing
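For context, a minimal NumPy sketch of the kind of loop being benchmarked (the sizes follow the description above; the variable names are illustrative, not taken from the notebook):

import numpy as np

# Illustrative sizes from the description: 10k samples, 100 features
N, d = 10_000, 100
rng = np.random.default_rng(0)
x_train = rng.normal(size=(N, d))
y_train = rng.normal(size=(N, 1))

W = np.zeros((d, 1))
b = 0.0
learning_rate = 1e-3
num_epochs = 1000

for _ in range(num_epochs):
    # Forward pass: (N, d) @ (d, 1) -> (N, 1)
    error = y_train - (x_train @ W + b)
    # Gradients of the mean squared error
    dW = -(2 / N) * (x_train.T @ error)
    db = -(2 / N) * np.sum(error)
    # Gradient-descent update
    W -= learning_rate * dW
    b -= learning_rate * db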
-
Alternatively, I reimplemented the JAX one with
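The reimplemented code is not shown here; going by the later replies, it used jax.lax.fori_loop (or lax.scan), so a hypothetical sketch, reusing the arrays and hyperparameters from the NumPy sketch above, might look like this:

import jax
import jax.numpy as jnp
from jax import lax

# Assumes x_train, y_train, N, d, learning_rate, num_epochs from the NumPy sketch above
x_train_j = jnp.asarray(x_train)
y_train_j = jnp.asarray(y_train)

@jax.jit
def train(W, b):
    def body(_, val):
        W, b = val
        error = y_train_j - (x_train_j @ W + b)
        dW = -(2 / N) * (x_train_j.T @ error)
        db = -(2 / N) * jnp.sum(error)
        return (W - learning_rate * dW, b - learning_rate * db)
    # All epochs run inside a single compiled XLA loop
    return lax.fori_loop(0, num_epochs, body, (W, b))

W_j, b_j = train(jnp.zeros((d, 1)), jnp.zeros(()))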
-
Have you read through the FAQ entry Is JAX faster than NumPy? The answer is not always yes, and there is some information there about JAX best practices, when you can expect JAX to be faster, and tips on getting accurate benchmarks.
-
Thanks for your prompt response. Yes, I am also aware of asynchronous dispatch in JAX. As the documentation recommends, my benchmark measures only the JAX runtime and does not include the transfer time or compilation time. Also, to amortize the JAX overhead on CPU, I increased the size of the matrices, but I struggle to understand why the runtime difference grows linearly with the size of the matrices.
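For what it's worth, the usual pattern for timing a jitted JAX function so that neither compilation nor asynchronous dispatch is included looks roughly like this (a sketch reusing the hypothetical train function above, not the notebook's code):

import time

# Warm-up call: triggers tracing and XLA compilation (excluded from the timing)
W_out, b_out = train(jnp.zeros((d, 1)), jnp.zeros(()))
W_out.block_until_ready()

start = time.perf_counter()
W_out, b_out = train(jnp.zeros((d, 1)), jnp.zeros(()))
W_out.block_until_ready()  # wait for the asynchronously dispatched work to finish
print(f"JAX runtime: {time.perf_counter() - start:.3f} s")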
-
Thanks so much for your detailed explanation! It makes total sense to me: I may actually be comparing different numerical libraries in the background. For example, the NumPy installed on the benchmark machine uses OpenBLAS, while TensorFlow uses Intel MKL (or Eigen?). Do you know how to find out which numerical library is used by JAX? Is there any documentation or source code I can refer to?
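A quick way to check the NumPy side, and to see which XLA backend JAX is running on, is the following sketch:

import numpy as np
import jax

np.show_config()              # prints the BLAS/LAPACK libraries NumPy was built against
print(jax.default_backend())  # "cpu", "gpu", or "tpu": the XLA backend in use
print(jax.devices())          # devices XLA dispatches to

As far as I understand, jaxlib does not link a BLAS of its own on CPU; the matmul is compiled by XLA (which historically uses Eigen kernels there), so the comparison is effectively XLA vs. OpenBLAS rather than BLAS vs. BLAS.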
-
I wonder if this is related to this issue. I find that just using a Python for-loop over the jitted _body can improve the performance (2.18s):

import jax.numpy as jnp
from jax import jit

@jit
def _body(_, val):
    W, b = val
    # Forward pass: [N x 1] · [1 x 1] = [N x 1]
    error = y_train - (x_train @ W + b)
    # Loss (not used in the update below)
    loss = (error.T @ error) / N
    # Backpropagation: gradients of the loss w.r.t. W and b
    dW = -(2 / N) * (x_train.T @ error)
    db = -(2 / N) * jnp.sum(error)
    # Update weights
    W += -learning_rate * dW
    b += -learning_rate * db
    return (W, b)

for _ in range(num_epochs):
    W, b = _body(_, (W, b))
-
Thanks @anh-tong, and I confirm the performance is much better without jax fori_loop / scan. It seems that is the root cause.
The implementation suggested by @anh-tong does not require them.
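For completeness, the scan-based variant being compared against would look roughly like this (hypothetical sketch, reusing x_train_j, y_train_j and the hyperparameters from the sketches above):

import jax
import jax.numpy as jnp
from jax import lax

@jax.jit
def train_scan(W, b):
    def step(carry, _):
        W, b = carry
        error = y_train_j - (x_train_j @ W + b)
        dW = -(2 / N) * (x_train_j.T @ error)
        db = -(2 / N) * jnp.sum(error)
        return (W - learning_rate * dW, b - learning_rate * db), None
    # One compiled XLA loop over all epochs
    (W, b), _ = lax.scan(step, (W, b), None, length=num_epochs)
    return W, b

W_s, b_s = train_scan(jnp.zeros((d, 1)), jnp.zeros(()))

With a plain Python loop, each epoch is a separate call to a small compiled function, whereas fori_loop / scan compiles the whole loop into one XLA program; which is faster on CPU apparently depends on the workload, as this thread shows.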